* [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
@ 2007-04-13 20:21 Ingo Molnar
  2007-04-13 20:27 ` Bill Huey
                   ` (14 more replies)
  0 siblings, 15 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:

   http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.

[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
  new scheduler will be active by default and all tasks will default
  to the new SCHED_FAIR interactive scheduling class. ]

Highlights are:

 - the introduction of Scheduling Classes: an extensible hierarchy of
   scheduler modules. These modules encapsulate scheduling policy
   details and are handled by the scheduler core without the core
   code assuming too much about them.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a
   replacement for the vanilla scheduler's SCHED_OTHER interactivity
   code.

   i'd like to give credit to Con Kolivas for the general approach here:
   he has proven via RSDL/SD that 'fair scheduling' is possible and that
   it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation
   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
   that of RSDL/SD, which is a high standard to meet :-) Testing
   feedback is welcome to decide this one way or another. [ and, in any
   case, all of SD's logic could be added via a kernel/sched_sd.c module
   as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a
   time-ordered rbtree to build a 'timeline' of future task execution,
   and thus has no 'array switch' artifacts (by which both the vanilla
   scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any
   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
   'timeslices' and has no heuristics whatsoever. There is only one
   central tunable:

         /proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low
   latencies) to 'server' (good batching) workloads. It defaults to a
   setting suitable for desktop workloads. SCHED_BATCH is handled by the
   CFS scheduler module too.

   due to its design, the CFS scheduler is not prone to any of the
   'attacks' that exist today against the heuristics of the stock
   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
   work fine and do not impact interactivity and produce the expected
   behavior.

   the CFS scheduler has a much stronger handling of nice levels and
   SCHED_BATCH: both types of workloads should be isolated much more
   aggressively than under the vanilla scheduler.

   ( another detail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler i have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
   way than the vanilla scheduler does. It uses 100 runqueues (for all
   100 RT priority levels, instead of 140 in the vanilla scheduler)
   and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking
   assumptions are gone from the load-balancing code now, and
   iterators of the scheduling modules are used. The balancing code got
   quite a bit simpler as a result.
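
   A minimal sketch of the 'timeline' idea from the sched_fair.c item
   above, assuming a plain unbalanced binary search tree as a stand-in
   for the rbtree. The names tl_task, key_ns, tl_insert and tl_first are
   hypothetical and are not taken from the patch; the only point shown is
   that the next task to run is the leftmost (smallest-key) node:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task node; field names are illustrative, not the patch's. */
struct tl_task {
	uint64_t key_ns;		/* time key, in nanoseconds */
	struct tl_task *left, *right;
};

/* Insert into a time-ordered binary search tree (no rebalancing here,
 * which a real rbtree would add on top of this). */
static void tl_insert(struct tl_task **root, struct tl_task *t)
{
	assert(t != NULL);
	t->left = t->right = NULL;
	while (*root)
		root = (t->key_ns < (*root)->key_ns) ? &(*root)->left
						     : &(*root)->right;
	*root = t;
}

/* The next task on the timeline is the leftmost node: the smallest key. */
static struct tl_task *tl_first(struct tl_task *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}
```

   Because every runnable task sits in one ordered structure, there is no
   second array to switch to when the first one drains.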

the core scheduler got smaller by more than 700 lines:

 kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
 1 file changed, 372 insertions(+), 1082 deletions(-)

and even adding all the scheduling modules, the total size impact is
relatively small:

 18 files changed, 1454 insertions(+), 1133 deletions(-)

most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:

   text    data     bss     dec     hex filename
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  24159    2705      56   26920    6928 kernel/sched.o.CFS

(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)

thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.

as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
@ 2007-04-13 20:27 ` Bill Huey
  2007-04-13 20:55   ` Ingo Molnar
  2007-04-13 21:50 ` Ingo Molnar
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Bill Huey @ 2007-04-13 20:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Bill Huey (hui)

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
... 
>    The CFS patch uses a completely different approach and implementation
>    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>    that of RSDL/SD, which is a high standard to meet :-) Testing
>    feedback is welcome to decide this one way or another. [ and, in any
>    case, all of SD's logic could be added via a kernel/sched_sd.c module
>    as well, if Con is interested in such an approach. ]

Ingo,

Con has been asking for module support for years if I understand your patch
correctly. You'll also need this for -rt as well with regards to bandwidth
scheduling. Good to see that you're moving in this direction.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:27 ` Bill Huey
@ 2007-04-13 20:55   ` Ingo Molnar
  2007-04-13 21:21     ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:55 UTC (permalink / raw)
  To: Bill Huey
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> Con has been asking for module support for years if I understand your 
> patch correctly. [...]

Yeah. Note that there are some subtle but crucial differences between 
PlugSched (which Con used, and which i opposed in the past) and this 
approach.

PlugSched cuts the interfaces at a high level in a monolithic way and 
introduces kernel/scheduler.c that uses one pluggable scheduler 
(represented via the 'scheduler' global template) at a time.

while in this CFS patchset i'm using modularization ('scheduler 
classes') to simplify the _existing_ multi-policy implementation of the 
scheduler. These 'scheduler classes' are in a hierarchy and are stacked 
on top of each other. They are all in use at once. Currently there are two of 
them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately the 
performance impact is minimal.

So scheduler classes are mainly a simplification of the design of the 
scheduler - not just a mere facility to select multiple schedulers. 
Their ability to also facilitate easier experimentation with schedulers 
is 'just' a happy side-effect. So all in all: it's a fairly different 
model from PlugSched (and that's why i didn't reuse PlugSched) - but 
there's indeed overlap.
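
A rough sketch of what "stacked" classes that are all in use at once
could look like, under stated assumptions: only the names sched_ops_rt
and sched_ops_fair come from the text above; the struct layout, the
single 'ready' slot standing in for a per-class queue, and the helper
names are hypothetical and not the patch's interface.

```c
#include <assert.h>
#include <stddef.h>

struct task { int id; };

/* Hypothetical class descriptor, not the patch's struct. */
struct sched_class_sketch {
	const char *name;
	struct task *ready;			/* stand-in for the class's queue */
	struct sched_class_sketch *next;	/* class stacked below this one */
};

/* Stack 'upper' on top of 'lower'. */
static void stack_on_top(struct sched_class_sketch *upper,
			 struct sched_class_sketch *lower)
{
	assert(upper != lower);
	upper->next = lower;
}

/* Walk the stack from the top: the first class with a runnable task wins. */
static struct task *pick_next_task(struct sched_class_sketch *top)
{
	for (struct sched_class_sketch *c = top; c; c = c->next)
		if (c->ready)
			return c->ready;
	return NULL;	/* nothing runnable */
}
```

With an "rt" instance stacked on top of a "fair" one, a runnable RT
task is always picked first, and the fair class is consulted only when
the RT class has nothing to run. That differs from a model where one
whole scheduler is selected at a time.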

> [...] You'll also need this for -rt as well with regards to bandwidth 
> scheduling.

yeah.

scheduler classes are also useful for other purposes like containers and 
virtualization, hierarchical/group scheduling, security encapsulation, 
etc. - features that can be layered on demand, and which we don't 
necessarily want to have enabled all the time.

> [...] Good to see that you're moving in this direction.

thanks! :)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:55   ` Ingo Molnar
@ 2007-04-13 21:21     ` William Lee Irwin III
  2007-04-13 21:35       ` Bill Huey
  2007-04-13 21:39       ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-13 21:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> Yeah. Note that there are some subtle but crucial differences between 
> PlugSched (which Con used, and which i opposed in the past) and this 
> approach.
> PlugSched cuts the interfaces at a high level in a monolithic way and 
> introduces kernel/scheduler.c that uses one pluggable scheduler 
> (represented via the 'scheduler' global template) at a time.

What I originally did did so for a good reason, which was that it was
intended to support far more radical reorganizations, for instance,
things that changed the per-cpu runqueue affairs for gang scheduling.
I wrote a top-level driver that did support scheduling classes in a
similar fashion, though it didn't survive others maintaining the patches.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 21:21     ` William Lee Irwin III
@ 2007-04-13 21:35       ` Bill Huey
  2007-04-13 21:39       ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-13 21:35 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Bill Huey (hui)

On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crucial differences between 
> > PlugSched (which Con used, and which i opposed in the past) and this 
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and 
> > introduces kernel/scheduler.c that uses one pluggable scheduler 
> > (represented via the 'scheduler' global template) at a time.
> 
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.

Also, gang scheduling is needed to solve virtualization issues regarding
spinlocks in a guest image. You could potentially be spinning on a thread
that isn't currently running which, needless to say, is very bad.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 21:21     ` William Lee Irwin III
  2007-04-13 21:35       ` Bill Huey
@ 2007-04-13 21:39       ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 21:39 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> What I originally did did so for a good reason, which was that it was 
> intended to support far more radical reorganizations, for instance, 
> things that changed the per-cpu runqueue affairs for gang scheduling. 
> I wrote a top-level driver that did support scheduling classes in a 
> similar fashion, though it didn't survive others maintaining the 
> patches.

yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
  2007-04-13 20:27 ` Bill Huey
@ 2007-04-13 21:50 ` Ingo Molnar
  2007-04-13 21:57 ` Michal Piotrowski
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 21:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> and even adding all the scheduling modules, the total size impact is 
> relatively small:
> 
>  18 files changed, 1454 insertions(+), 1133 deletions(-)
> 
> most of the increase is due to extensive comments. The kernel size 
> impact is in fact a small negative:
> 
>    text    data     bss     dec     hex filename
>   23366    4001      24   27391    6aff kernel/sched.o.vanilla
>   24159    2705      56   26920    6928 kernel/sched.o.CFS

update: these were older numbers, here are the stats redone with the 
latest patch:

     text    data     bss     dec     hex filename
    23366    4001      24   27391    6aff kernel/sched.o.vanilla
    23671    4548      24   28243    6e53 kernel/sched.o.sd.v40
    23349    2705      24   26078    65de kernel/sched.o.cfs

so CFS is now a win both for text and for data size :)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
  2007-04-13 20:27 ` Bill Huey
  2007-04-13 21:50 ` Ingo Molnar
@ 2007-04-13 21:57 ` Michal Piotrowski
  2007-04-13 22:15 ` Daniel Walker
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 712+ messages in thread
From: Michal Piotrowski @ 2007-04-13 21:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 

Friday the 13th, my lucky day :).

/mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers


make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'.  Stop.
make[2]: *** [linux] Error 2
make[1]: *** [headers_check] Error 2
make: *** [vmlinux] Error 2

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>

--- linux-msc-cfs-clean/include/linux/Kbuild	2007-04-13 23:52:47.000000000 +0200
+++ linux-msc-cfs/include/linux/Kbuild	2007-04-13 23:49:41.000000000 +0200
@@ -133,6 +133,7 @@ header-y += quotaio_v1.h
 header-y += quotaio_v2.h
 header-y += radeonfb.h
 header-y += raw.h
+header-y += rbtree.h
 header-y += resource.h
 header-y += rose.h
 header-y += smbno.h


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-04-13 21:57 ` Michal Piotrowski
@ 2007-04-13 22:15 ` Daniel Walker
  2007-04-13 22:30   ` Ingo Molnar
  2007-04-13 22:21 ` William Lee Irwin III
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Daniel Walker @ 2007-04-13 22:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.

I'm not in love with the current or other schedulers, so I'm indifferent
to this change. However, I was reviewing your release notes and the
patch and found myself wondering what the logarithmic complexity of this
new scheduler is. I assumed it would also be constant time, but
__enqueue_task_fair doesn't appear to be constant time (rbtree insert
complexity). Maybe that's not a critical path, but I thought I would
at least comment on it.

Daniel



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-04-13 22:15 ` Daniel Walker
@ 2007-04-13 22:21 ` William Lee Irwin III
  2007-04-13 22:52   ` Ingo Molnar
  2007-04-14 22:38   ` Davide Libenzi
  2007-04-13 22:31 ` Willy Tarreau
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-13 22:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
>   new scheduler will be active by default and all tasks will default
>   to the new SCHED_FAIR interactive scheduling class. ]

A pleasant surprise, though I did see it coming.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> Highlights are:
>  - the introduction of Scheduling Classes: an extensible hierarchy of
>    scheduler modules. These modules encapsulate scheduling policy
>    details and are handled by the scheduler core without the core
>    code assuming too much about them.

It probably needs further clarification that they're things on the order
of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst
the classes is furthermore assumed, and so on. They're not quite
capable of being full-blown alternative policies, though quite a bit
can be crammed into them.

There are issues with the per-scheduling-class data not being very
well-abstracted. A union for per-class data might help, if not a
dynamically allocated scheduling-class-private structure. Getting
an alternative policy floating around that actually clashes a little
with the stock data in the task structure would help clarify what's
needed.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>  - sched_fair.c implements the 'CFS desktop scheduler': it is a
>    replacement for the vanilla scheduler's SCHED_OTHER interactivity
>    code.
>    i'd like to give credit to Con Kolivas for the general approach here:
>    he has proven via RSDL/SD that 'fair scheduling' is possible and that
>    it results in better desktop scheduling. Kudos Con!

Bob Mullens banged out a virtual deadline interactive task scheduler
for Multics back in 1976 or thereabouts. ISTR the name Ferranti in
connection with deadline task scheduling for UNIX in particular. I've
largely seen deadline schedulers as a realtime topic, though. In any
event, it's not so radical as to lack a fair number of precedents.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>    The CFS patch uses a completely different approach and implementation
>    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>    that of RSDL/SD, which is a high standard to meet :-) Testing
>    feedback is welcome to decide this one way or another. [ and, in any
>    case, all of SD's logic could be added via a kernel/sched_sd.c module
>    as well, if Con is interested in such an approach. ]
>    CFS's design is quite radical: it does not use runqueues, it uses a
>    time-ordered rbtree to build a 'timeline' of future task execution,
>    and thus has no 'array switch' artifacts (by which both the vanilla
>    scheduler and RSDL/SD are affected).

A binomial heap would likely serve your purposes better than rbtrees.
It's faster to have the next item to dequeue at the root of the tree
structure rather than a leaf, for one. There are, of course, other
priority queue structures (e.g. van Emde Boas) able to exploit the
limited precision of the priority key for faster asymptotics, though
actual performance is an open question.

Another advantage of heaps is that they support decreasing priorities
directly, so that instead of removal and reinsertion, a less invasive
movement within the tree is possible. This nets additional constant
factor improvements beyond those for the next item to dequeue for the
case where a task remains runnable, but is preempted and its priority
decreased while it remains runnable.
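
To illustrate the in-place movement described above, here is a small
sketch that assumes an array-based binary min-heap rather than the
binomial heap actually suggested (a deliberate simplification; the two
differ in structure, but both allow a key to be changed without a full
remove and reinsert). All names are hypothetical. A smaller key means
"runs sooner", so lowering a task's priority means raising its key and
sifting it down:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAX 64

struct heap {
	uint64_t key[HEAP_MAX];	/* key[0] is always the minimum */
	size_t n;
};

static void heap_swap(struct heap *h, size_t a, size_t b)
{
	uint64_t t = h->key[a];

	h->key[a] = h->key[b];
	h->key[b] = t;
}

/* Move slot i towards the root while it is smaller than its parent. */
static void sift_up(struct heap *h, size_t i)
{
	while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {
		heap_swap(h, i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
}

/* Move slot i towards the leaves while a child is smaller than it. */
static void sift_down(struct heap *h, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->n && h->key[l] < h->key[m])
			m = l;
		if (r < h->n && h->key[r] < h->key[m])
			m = r;
		if (m == i)
			return;
		heap_swap(h, i, m);
		i = m;
	}
}

static void heap_push(struct heap *h, uint64_t key)
{
	assert(h->n < HEAP_MAX);
	h->key[h->n] = key;
	h->n++;
	sift_up(h, h->n - 1);
}

/* Change the key at slot i in place: no removal and reinsertion. */
static void heap_update(struct heap *h, size_t i, uint64_t key)
{
	uint64_t old;

	assert(i < h->n);
	old = h->key[i];
	h->key[i] = key;
	if (key > old)
		sift_down(h, i);
	else
		sift_up(h, i);
}
```

Whether this beats an rbtree in practice is not something the sketch
can show; it only demonstrates the operation being discussed.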


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>    CFS uses nanosecond granularity accounting and does not rely on any
>    jiffies or other HZ detail. Thus the CFS scheduler has no notion of
>    'timeslices' and has no heuristics whatsoever. There is only one
>    central tunable:
>          /proc/sys/kernel/sched_granularity_ns
>    which can be used to tune the scheduler from 'desktop' (low
>    latencies) to 'server' (good batching) workloads. It defaults to a
>    setting suitable for desktop workloads. SCHED_BATCH is handled by the
>    CFS scheduler module too.

I like not relying on timeslices. Timeslices ultimately get you into
2.4.x-like epoch expiry scenarios and therefore introduce a number of
RR-esque artifacts.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>    due to its design, the CFS scheduler is not prone to any of the
>    'attacks' that exist today against the heuristics of the stock
>    scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
>    work fine and do not impact interactivity and produce the expected
>    behavior.

I'm always suspicious of these claims. A moderately formal regression
test suite needs to be assembled and the testcases rather seriously
cleaned up so they e.g. run for a deterministic period of time, have
their parameters passable via command-line options instead of editing
and recompiling, don't need Lindenting to be legible, and so on. With
that in hand, a battery of regression tests can be run against scheduler
modifications to verify their correctness and to detect any disturbance
in scheduling semantics they might cause.

A very serious concern is that while a fresh scheduler may pass all
these tests, later modifications may cause failures that go unnoticed
because no one's doing the regression tests and there's no obvious
test suite for testing types to latch onto. Another is that the
testcases themselves may bitrot if they're not maintainable code.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>    the CFS scheduler has a much stronger handling of nice levels and
>    SCHED_BATCH: both types of workloads should be isolated much more
>    aggressively than under the vanilla scheduler.

Speaking of regression tests, let's please at least state intended
nice semantics and get a regression test for CPU bandwidth distribution
by nice levels going.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>    ( another detail: due to nanosec accounting and timeline sorting,
>      sched_yield() support is very simple under CFS, and in fact under
>      CFS sched_yield() behaves much better than under any other
>      scheduler i have tested so far. )

And there's another one. sched_yield() semantics need a regression test
more transparent than VolanoMark or other macrobenchmarks.

At some point we really need to decide what our sched_yield() is
intended to do and get something out there to detect whether it's
behaving as intended.


On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
>  - reworked/sanitized SMP load-balancing: the runqueue-walking
>    assumptions are gone from the load-balancing code now, and
>    iterators of the scheduling modules are used. The balancing code got
>    quite a bit simpler as a result.

The SMP load balancing class operations strike me as unusual and likely
to trip over semantic issues in alternative scheduling classes. Getting
some alternative scheduling classes out there to clarify the issues
would help here, too.


A more general question here is what you mean by "completely fair;"
there doesn't appear to be inter-tgrp, inter-pgrp, inter-session,
or inter-user fairness going on, though one might argue those are
relatively obscure notions of fairness. Complete fairness arguably
precludes static prioritization by nice levels, so there is also
that. There is also the issue of what a fair CPU bandwidth distribution
between tasks of varying desired in-isolation CPU utilization might be.
I suppose my thorniest point is where the demonstration of fairness is
as, say, a testcase. Perhaps it's fair now; when will we find out when
that fairness has been disturbed?

What these things mean when there are multiple CPUs to schedule across
may also be of concern.

I propose the following two testcases:
(1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels
	Create a number of tasks at varying nice levels. Measure the
	CPU bandwidth allocated to each.
	Success depends on intent: we decide up-front that a given nice
	level should correspond to a given share of CPU bandwidth.
	Check to see how far from the intended distribution of CPU
	bandwidth according to those decided-up-front shares the actual
	distribution of CPU bandwidth is for the test.

(2) CPU bandwidth distribution of tasks with varying CPU demands
	Create a number of tasks that would in isolation consume
	varying %cpu. Measure the CPU bandwidth allocated to each.
	Success depends on intent here, too. Decide up-front that a
	given %cpu that would be consumed in isolation should
	correspond to a given share of CPU bandwidth and check the
	actual distribution of CPU bandwidth vs. what was intended.
	Note that the shares need not linearly correspond to the
	%cpu; various sorts of things related to interactivity will
	make this nonlinear.
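
The pass/fail check shared by both testcases could be sketched roughly
as follows. This assumes the intended shares are given as relative
weights and the measurement is CPU time per task; the function name,
parameter names and units are hypothetical, and how the CPU time is
collected is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest absolute gap between the intended and the measured share of
 * CPU bandwidth over n tasks.
 *   weight[i]: decided-up-front relative share of task i
 *   cpu_ns[i]: CPU time task i actually received during the test
 * Returns a fraction of the total CPU bandwidth (0.0 = perfect match).
 */
static double max_share_error(const double *weight, const double *cpu_ns,
			      size_t n)
{
	double wsum = 0.0, tsum = 0.0, worst = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		wsum += weight[i];
		tsum += cpu_ns[i];
	}
	assert(wsum > 0.0 && tsum > 0.0);

	for (i = 0; i < n; i++) {
		double err = cpu_ns[i] / tsum - weight[i] / wsum;

		if (err < 0.0)
			err = -err;
		if (err > worst)
			worst = err;
	}
	return worst;
}
```

A testcase would then fail when the returned value exceeds some
tolerance decided up-front, e.g. two equally weighted tasks that end up
with 75% and 25% of the CPU give an error of 0.25.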

A third testcase for sched_yield() should be brewed up.

These testcases are oblivious to SMP. This will demand that a scheduling
policy integrate with load balancing to the extent that load balancing
occurs for the sake of distributing CPU bandwidth according to nice level.
Some explicit decision should be made regarding that.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:15 ` Daniel Walker
@ 2007-04-13 22:30   ` Ingo Molnar
  2007-04-13 22:37     ` Willy Tarreau
  2007-04-13 23:59     ` Daniel Walker
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 22:30 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Daniel Walker <dwalker@mvista.com> wrote:

> I'm not in love with the current or other schedulers, so I'm 
> indifferent to this change. However, I was reviewing your release 
> notes and the patch and found myself wondering what the logarithmic 
> complexity of this new scheduler is .. I assumed it would also be 
> constant time , but the __enqueue_task_fair doesn't appear to be 
> constant time (rbtree insert complexity).. [...]

i've been worried about that myself and i've done extensive measurements 
before choosing this implementation. The rbtree turned out to be a quite 
compact data structure: we get it quite cheaply as part of the task 
structure cachemisses - which have to be touched anyway. For 1000 tasks 
it's a loop of ~10 - that's still very fast and bound in practice.
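
The "~10" figure can be checked with a few lines of arithmetic. The
helper below is a hypothetical illustration, not code from the patch: it
counts the levels of a perfectly balanced binary tree holding n nodes,
i.e. floor(log2(n)) + 1.

```c
#include <assert.h>

/* Levels in a perfectly balanced binary tree with n nodes:
 * floor(log2(n)) + 1, computed by counting the bits of n. */
static unsigned int balanced_levels(unsigned long n)
{
	unsigned int levels = 0;

	assert(n > 0);
	while (n) {
		levels++;
		n >>= 1;
	}
	return levels;
}
```

balanced_levels(1000) is 10. A red-black tree is not perfectly
balanced, but its height is bounded by 2*log2(n+1), so even the worst
case for 1000 tasks stays at roughly 20 steps.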

here's a test i did under CFS. Let's take some ridiculous load: 1000 
infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted 
into the same rbtree), and let's run lat_ctx:

  neptune:~/l> uptime
  22:51:23 up 8 min,  2 users,  load average: 713.06, 254.64, 91.51

  neptune:~/l> ./lat_ctx -s 0 2
  "size=0k ovr=1.61
  2 1.41

let's stop the 1000 tasks and only have ~2 tasks in the runqueue:

  neptune:~/l> ./lat_ctx -s 0 2

  "size=0k ovr=1.70
  2 1.16

so the overhead is 0.25 usecs (1.41 vs. 1.16). Considering the load (1000 
tasks thrash the cache like crazy already), this is more than acceptable.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (4 preceding siblings ...)
  2007-04-13 22:21 ` William Lee Irwin III
@ 2007-04-13 22:31 ` Willy Tarreau
  2007-04-13 23:18   ` Ingo Molnar
  2007-04-13 23:07 ` Gabriel C
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-13 22:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Hi Ingo,

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
(...)
>    CFS's design is quite radical: it does not use runqueues, it uses a
>    time-ordered rbtree to build a 'timeline' of future task execution,
>    and thus has no 'array switch' artifacts (by which both the vanilla
>    scheduler and RSDL/SD are affected).

I have high confidence this will work better: I've been using
time-ordered trees in userland projects for several years, and never
found anything better. To be honest, I never understood the concept
behind the array switch, but as I never felt brave enough to hack
something in this kernel area, I simply preferred to shut up (not
enough knowledge and not enough time).

However, I have been using a very fast struct timeval-ordered RADIX
tree. I found generic rbtree code to generally be slower, most likely
because of the call to a function with arguments on every node. Both
trees are O(log(n)), the rbtree being balanced and the radix tree
being unbalanced. If you're interested, I can try to see how that
would fit (but not this weekend).

Also, I had spent much time in the past doing paper work on how to
improve fairness between interactive tasks and batch tasks. I came
to the conclusion that, ideally, tasks should not be ordered by their
expected wakeup time, but by their expected completion time, which
automatically takes into account their allocated and used timeslice.
It would also allow both types of workloads to share equal CPU time,
with better responsiveness for the most interactive one through the
reallocation of a "credit" to the tasks which have not consumed all of
their timeslices. I remember we had discussed this with Mike about one
year ago when he fixed lots of problems in the mainline scheduler.
The downside is that I never found how to make this algo fit in
O(log(n)). I always ended up with something like O(n.log(n)) IIRC.

But maybe this is overkill for real life anyway. Given that a basic
two-array switch (which I never understood) was sufficient for many
people, a basic tree will probably be an order of magnitude better.

>    CFS uses nanosecond granularity accounting and does not rely on any
>    jiffies or other HZ detail. Thus the CFS scheduler has no notion of
>    'timeslices' and has no heuristics whatsoever. There is only one
>    central tunable:
> 
>          /proc/sys/kernel/sched_granularity_ns
> 
>    which can be used to tune the scheduler from 'desktop' (low
>    latencies) to 'server' (good batching) workloads. It defaults to a
>    setting suitable for desktop workloads. SCHED_BATCH is handled by the
>    CFS scheduler module too.

I find this useful, but to be fair to Mike and Con, they both have
proposed similar tuning knobs in the past and you said you did not want
to add that complexity for admins. People can sometimes be demotivated
by seeing their proposals finally used by people who first rejected them.
And since Mike and Con have both done a wonderful job in that area,
we need their experience and continued active participation more than ever.

>    due to its design, the CFS scheduler is not prone to any of the
>    'attacks' that exist today against the heuristics of the stock
>    scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
>    work fine and do not impact interactivity and produce the expected
>    behavior.

I'm very pleased to read this because, as I have already said, my major
concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the
basic problems for me to the point that I switched the default lilo entry
on my notebook to 2.6 ! I hope that whatever the next scheduler will be,
we'll definitely get rid of any heuristics. Heuristics are good in 95% of
situations and extremely bad in the remaining 5%. I prefer something
reasonably good in 100% of situations.

>    the CFS scheduler has a much stronger handling of nice levels and
>    SCHED_BATCH: both types of workloads should be isolated much more
>    aggressively than under the vanilla scheduler.
> 
>    ( another detail: due to nanosec accounting and timeline sorting,
>      sched_yield() support is very simple under CFS, and in fact under
>      CFS sched_yield() behaves much better than under any other
>      scheduler i have tested so far. )
> 
>  - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
>    way than the vanilla scheduler does. It uses 100 runqueues (for all
>    100 RT priority levels, instead of 140 in the vanilla scheduler)
>    and it needs no expired array.
> 
>  - reworked/sanitized SMP load-balancing: the runqueue-walking
>    assumptions are gone from the load-balancing code now, and
>    iterators of the scheduling modules are used. The balancing code got
>    quite a bit simpler as a result.

Will this have any impact on NUMA/HT/multi-core/etc... ?

> the core scheduler got smaller by more than 700 lines:

Well done !

Cheers,
Willy


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:30   ` Ingo Molnar
@ 2007-04-13 22:37     ` Willy Tarreau
  2007-04-13 23:59     ` Daniel Walker
  1 sibling, 0 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-13 22:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Walker, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote:
> 
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > I'm not in love with the current or other schedulers, so I'm 
> > indifferent to this change. However, I was reviewing your release 
> > notes and the patch and found myself wondering what the logarithmic 
> > complexity of this new scheduler is .. I assumed it would also be 
> > constant time , but the __enqueue_task_fair doesn't appear to be 
> > constant time (rbtree insert complexity).. [...]
> 
> i've been worried about that myself and i've done extensive measurements 
> before choosing this implementation. The rbtree turned out to be a quite 
> compact data structure: we get it quite cheaply as part of the task 
> structure cachemisses - which have to be touched anyway. For 1000 tasks 
> it's a loop of ~10 - that's still very fast and bound in practice.

I'm not worried at all by O(log(n)) algorithms, and generally prefer a
smart O(log(n)) to a dumb O(1).

In a userland TCP stack I started to write 2 years ago, I used a comparable
scheduler and could reach a sustained rate of 145000 connections/s at 4
million concurrent connections. And yes, each time a packet was sent or
received, a task was queued/dequeued (so about 450k/s with 4 million tasks,
on an athlon 1.5 GHz). So that seems much higher than what we currently need.

Regards,
Willy


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:21 ` William Lee Irwin III
@ 2007-04-13 22:52   ` Ingo Molnar
  2007-04-13 23:30     ` William Lee Irwin III
  2007-04-14 22:38   ` Davide Libenzi
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 22:52 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
> > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
> >   new scheduler will be active by default and all tasks will default
> >   to the new SCHED_FAIR interactive scheduling class. ]
> 
> A pleasant surprise, though I did see it coming.

hey ;)

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> > Highlights are:
> >  - the introduction of Scheduling Classes: an extensible hierarchy of
> >    scheduler modules. These modules encapsulate scheduling policy
> >    details and are handled by the scheduler core without the core
> >    code assuming about them too much.
> 
> It probably needs further clarification that they're things on the 
> order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization 
> amongst the classes is furthermore assumed, and so on. [...]

yep - they are linked via sched_ops->next pointer, with NULL delimiting 
the last one.
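
roughly, the chaining can be pictured like this. This is a stripped-down
userspace sketch, not the code from the patch: only the sched_ops->next
link is taken from the description above, while the pick_next_task
member, the one-slot "queues" and the rt_ops/fair_ops instances are
assumptions made up for the illustration:

```c
#include <stddef.h>

struct task {
	int pid;
};

/* Hypothetical, minimal scheduling class: just enough to show ->next. */
struct sched_ops {
	struct task *(*pick_next_task)(void);	/* NULL: nothing runnable */
	const struct sched_ops *next;		/* lower-priority class, NULL at the end */
};

/* Toy one-slot "queues" standing in for the real per-class state. */
struct task *rt_slot;
struct task *fair_slot;

struct task *pick_rt(void)
{
	return rt_slot;
}

struct task *pick_fair(void)
{
	return fair_slot;
}

/* RT class first, fair class second, NULL terminates the list. */
const struct sched_ops fair_ops = { pick_fair, NULL };
const struct sched_ops rt_ops = { pick_rt, &fair_ops };

/*
 * Walk the classes from the highest-priority one down and return the
 * first task any class offers, or NULL if every class is empty.
 */
struct task *pick_next(const struct sched_ops *highest)
{
	const struct sched_ops *ops;

	for (ops = highest; ops != NULL; ops = ops->next) {
		struct task *t = ops->pick_next_task();

		if (t != NULL)
			return t;
	}
	return NULL;
}
```

with both slots filled, pick_next(&rt_ops) returns the RT task; the fair
class is only consulted once the RT class has nothing to run.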

> [...] They're not quite capable of being full-blown alternative 
> policies, though quite a bit can be crammed into them.

yeah, they are not full-blown: i extended them on-demand, for the 
specific purposes of sched_fair.c and sched_rt.c. More can be done too.

> There are issues with the per- scheduling class data not being very 
> well-abstracted. [...]

yes. It's on my TODO list: i'll work more on extending the cleanups to 
those fields too.

> A binomial heap would likely serve your purposes better than rbtrees. 
> It's faster to have the next item to dequeue at the root of the tree 
> structure rather than a leaf, for one. There are, of course, other 
> priority queue structures (e.g. van Emde Boas) able to exploit the 
> limited precision of the priority key for faster asymptotics, though 
> actual performance is an open question.

i'm caching the leftmost leaf, which serves as an alternate, task-pick 
centric root in essence.
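
the idea of the cached leftmost node can be sketched as follows. To keep
it short this uses a plain, unbalanced binary search tree instead of an
rbtree, and all names (struct timeline, timeline_insert(),
timeline_pop_first()) are hypothetical, so treat it as an illustration of
the caching trick only:

```c
#include <stddef.h>

struct node {
	unsigned long long key;		/* e.g. position on the timeline */
	struct node *left, *right, *parent;
};

struct timeline {
	struct node *root;
	struct node *leftmost;		/* cached smallest-key node, NULL if empty */
};

/* Insert n; it becomes the new leftmost only if we went left at every step. */
void timeline_insert(struct timeline *tl, struct node *n)
{
	struct node **link = &tl->root;
	struct node *parent = NULL;
	int is_leftmost = 1;

	while (*link != NULL) {
		parent = *link;
		if (n->key < parent->key) {
			link = &parent->left;
		} else {
			link = &parent->right;
			is_leftmost = 0;
		}
	}
	n->left = n->right = NULL;
	n->parent = parent;
	*link = n;
	if (is_leftmost)
		tl->leftmost = n;
}

/* Remove and return the smallest node without descending from the root. */
struct node *timeline_pop_first(struct timeline *tl)
{
	struct node *n = tl->leftmost;
	struct node *next;

	if (n == NULL)
		return NULL;

	/* The leftmost node has no left child, so its successor is either
	 * the leftmost node of its right subtree or its parent. */
	if (n->right != NULL) {
		next = n->right;
		while (next->left != NULL)
			next = next->left;
	} else {
		next = n->parent;
	}

	/* Splice n out: its right child (possibly NULL) takes its place. */
	if (n->parent != NULL)
		n->parent->left = n->right;
	else
		tl->root = n->right;
	if (n->right != NULL)
		n->right->parent = n->parent;

	tl->leftmost = next;
	return n;
}
```

picking the next task is then just a read of tl->leftmost; the tree is
only walked when a node is inserted or when the cache has to be advanced.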

> Another advantage of heaps is that they support decreasing priorities 
> directly, so that instead of removal and reinsertion, a less invasive 
> movement within the tree is possible. This nets additional constant 
> factor improvements beyond those for the next item to dequeue for the 
> case where a task remains runnable, but is preempted and its priority 
> decreased while it remains runnable.

yeah. (Note that in CFS i'm not decreasing priorities anywhere though - 
all the priority levels in CFS stay constant, fairness is not achieved 
via rotating priorities or similar, it is achieved via the accounting 
code.)

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> >    due to its design, the CFS scheduler is not prone to any of the
> >    'attacks' that exist today against the heuristics of the stock
> >    scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> >    work fine and do not impact interactivity and produce the expected
> >    behavior.
> 
> I'm always suspicious of these claims.  [...]

hey, sure - but please give it a go nevertheless, i _did_ test all these 
;)

> A moderately formal regression test suite needs to be assembled [...]

by all means feel free! ;)

> A more general question here is what you mean by "completely fair;"

by that i mean the most common-sense definition: with N tasks running 
each gets 1/N CPU time if observed for a reasonable amount of time. Now 
extend this to arbitrary scheduling patterns, the end result should 
still be completely fair, according to the fundamental 1/N(time) rule 
individually applied to all the small scheduling patterns that the 
scheduling patterns give. (this assumes that the scheduling patterns are 
reasonably independent of each other - if they are not then there's no 
reasonable definition of fairness that makes sense, and we might as well 
use the 1/N rule for those cases too.)

> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or 
> inter-user fairness going on, though one might argue those are 
> relatively obscure notions of fairness. [...]

sure, i mainly concentrated on what we have in Linux today. The things 
you mention are add-ons that i can see handling via new scheduling 
classes: all the CKRM and containers type of CPU time management 
facilities.

> What these things mean when there are multiple CPU's to schedule 
> across may also be of concern.

that is handled by the existing smp-nice load balancer, that logic is 
preserved under CFS.

> These testcases are oblivious to SMP. This will demand that a 
> scheduling policy integrate with load balancing to the extent that 
> load balancing occurs for the sake of distributing CPU bandwidth 
> according to nice level. Some explicit decision should be made 
> regarding that.

this should already work reasonably fine with CFS: try massive_intr.c on 
an SMP box.

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (5 preceding siblings ...)
  2007-04-13 22:31 ` Willy Tarreau
@ 2007-04-13 23:07 ` Gabriel C
  2007-04-13 23:25   ` Ingo Molnar
  2007-04-14  2:04 ` Nick Piggin
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Gabriel C @ 2007-04-13 23:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
>   
[...]
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,
>   

Compile error here.

...

CC kernel/sched.o
kernel/sched.c: In function '__rq_clock':
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c:219: warning: type defaults to 'int' in declaration of 
'__ret_warn_once'
kernel/sched.c:219: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'rq_clock':
kernel/sched.c:230: error: 'struct rq' has no member named 'cpu'
kernel/sched.c: In function 'sched_init':
kernel/sched.c:6013: warning: unused variable 'j'
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2
==> ERROR: Build Failed. Aborting...

...

Here is the config:

http://frugalware.org/~crazy/other/kernel/config

> 	Ingo
> -
>
>   

Regards,

Gabriel

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:31 ` Willy Tarreau
@ 2007-04-13 23:18   ` Ingo Molnar
  2007-04-14 18:48     ` Bill Huey
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:18 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> >    central tunable:
> > 
> >          /proc/sys/kernel/sched_granularity_ns
> > 
> >    which can be used to tune the scheduler from 'desktop' (low
> >    latencies) to 'server' (good batching) workloads. It defaults to a
> >    setting suitable for desktop workloads. SCHED_BATCH is handled by the
> >    CFS scheduler module too.
> 
> I find this useful, but to be fair with Mike and Con, they both have 
> proposed similar tuning knobs in the past and you said you did not 
> want to add that complexity for admins. [...]

yeah. [ Note that what i opposed in the past was mostly the 'export all 
the zillions of sched.c knobs to /sys and let people mess with them' kind 
of patches which did exist and still exist. A _single_ knob, which 
represents basically the totality of parameters within sched_fair.c is 
less of a problem. I don't think i ever objected to this knob within 
staircase/SD. (If i did then i was dead wrong.) ]

> [...] People can sometimes be demotivated by seeing their proposals 
> finally used by people who first rejected them. And since both Mike 
> and Con both have done a wonderful job in that area, we need their 
> experience and continued active participation more than ever.

very much so! Both Con and Mike have contributed regularly to upstream 
sched.c:

 $ git-log kernel/sched.c | grep 'by: Con Kolivas' | wc -l
 19

 $ git-log kernel/sched.c | grep 'by: Mike' | wc -l
 6

and i'd very much like both counts to increase steadily in the future 
too :)

> >  - reworked/sanitized SMP load-balancing: the runqueue-walking
> >    assumptions are gone from the load-balancing code now, and 
> >    iterators of the scheduling modules are used. The balancing code 
> >    got quite a bit simpler as a result.
> 
> Will this have any impact on NUMA/HT/multi-core/etc... ?

it will inevitably have some sort of effect - and if it's negative, i'll 
try to fix it.

I got rid of the explicit cache-hot tracking code and replaced it with a 
more natural pure 'pick the next-to-run task first, that is likely the 
most cache-cold one' logic. That just derives naturally from the rbtree 
approach.

> > the core scheduler got smaller by more than 700 lines:
> 
> Well done !

thanks :)

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:07 ` Gabriel C
@ 2007-04-13 23:25   ` Ingo Molnar
  2007-04-13 23:39     ` Gabriel C
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:25 UTC (permalink / raw)
  To: Gabriel C
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Gabriel C <nix.or.die@googlemail.com> wrote:

> > as usual, any sort of feedback, bugreports, fixes and suggestions 
> > are more than welcome,
> 
> Compile error here.

ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also 
updated the full patch at the cfs-scheduler URL)

	Ingo

----------------------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [cfs] fix !CONFIG_SMP build

fix the !CONFIG_SMP build error reported by Gabriel C

Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl
 	return rq->rq_clock;
 }
 
-static inline unsigned long long rq_clock(struct rq *rq)
-{
-	int this_cpu = smp_processor_id();
-
-	if (this_cpu == rq->cpu)
-		return __rq_clock(rq);
-
-	return rq->rq_clock;
-}
-
 static inline int cpu_of(struct rq *rq)
 {
 #ifdef CONFIG_SMP
@@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq)
 #endif
 }
 
+static inline unsigned long long rq_clock(struct rq *rq)
+{
+	int this_cpu = smp_processor_id();
+
+	if (this_cpu == cpu_of(rq))
+		return __rq_clock(rq);
+
+	return rq->rq_clock;
+}
+
 /*
  * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
  * See detach_destroy_domains: synchronize_sched for details.
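
for anyone skimming the diff: the build error quoted above shows that
'struct rq' has no ->cpu member on !CONFIG_SMP, so rq_clock() has to go
through cpu_of(), and cpu_of() has to be defined before it. The reduced
sketch below compiles in userspace; the struct layout, the body of
cpu_of() and the smp_processor_id()/__rq_clock() stand-ins are
simplifying assumptions, not the kernel definitions:

```c
/* Build with -DCONFIG_SMP to get the SMP variant. */

struct rq {
	unsigned long long rq_clock;
#ifdef CONFIG_SMP
	int cpu;		/* assumed to exist on SMP builds only */
#endif
};

/* Userspace stand-in: pretend we always run on CPU 0. */
int smp_processor_id(void)
{
	return 0;
}

/* Stand-in for the real __rq_clock(); here it just reads the cached value. */
unsigned long long __rq_clock(struct rq *rq)
{
	return rq->rq_clock;
}

/* The only place that touches rq->cpu, guarded by the config option. */
int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
	return rq->cpu;
#else
	(void)rq;
	return 0;
#endif
}

/* Same shape as in the patch: no direct rq->cpu access any more. */
unsigned long long rq_clock(struct rq *rq)
{
	int this_cpu = smp_processor_id();

	if (this_cpu == cpu_of(rq))
		return __rq_clock(rq);

	return rq->rq_clock;
}
```

this builds both with and without -DCONFIG_SMP, which is the property the
fix is after.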

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:52   ` Ingo Molnar
@ 2007-04-13 23:30     ` William Lee Irwin III
  2007-04-13 23:44       ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-13 23:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A binomial heap would likely serve your purposes better than rbtrees. 
>> It's faster to have the next item to dequeue at the root of the tree 
>> structure rather than a leaf, for one. There are, of course, other 
>> priority queue structures (e.g. van Emde Boas) able to exploit the 
>> limited precision of the priority key for faster asymptotics, though 
>> actual performance is an open question.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> i'm caching the leftmost leaf, which serves as an alternate, task-pick 
> centric root in essence.

I noticed that, yes. It seemed a better idea to me to use a data
structure that has what's needed built-in, but I suppose it's not gospel.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Another advantage of heaps is that they support decreasing priorities 
>> directly, so that instead of removal and reinsertion, a less invasive 
>> movement within the tree is possible. This nets additional constant 
>> factor improvements beyond those for the next item to dequeue for the 
>> case where a task remains runnable, but is preempted and its priority 
>> decreased while it remains runnable.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> yeah. (Note that in CFS i'm not decreasing priorities anywhere though - 
> all the priority levels in CFS stay constant, fairness is not achieved 
> via rotating priorities or similar, it is achieved via the accounting 
> code.)

Sorry, "priority" here would be from the POV of the queue data
structure. From the POV of the scheduler it would be resetting the
deadline or whatever the nomenclature cooked up for things is, most
obviously in requeue_task_fair() and task_tick_fair().


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> I'm always suspicious of these claims.  [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> hey, sure - but please give it a go nevertheless, i _did_ test all these 
> ;)

The suspicion essentially centers around how long the state of affairs
will hold up because comprehensive re-testing is not noticeably done
upon updates to scheduling code or kernel point releases.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A moderately formal regression test suite needs to be assembled [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by all means feel free! ;)

I can only do so much, but I have done work to clean up other testcases
going around. I'm mostly looking at testcases as I go over them or
develop some interest in the subject and rewriting those that already
exist or hammering out new ones as I need them. The main contribution
toward this is that I've sort of made a mental note to stash the results
of the effort somewhere and pass them along to those who do regular
testing on kernels or otherwise import test suites into their collections.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> A more general question here is what you mean by "completely fair;"

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> by that i mean the most common-sense definition: with N tasks running 
> each gets 1/N CPU time if observed for a reasonable amount of time. Now 
> extend this to arbitrary scheduling patterns, the end result should 
> still be completely fair, according to the fundamental 1/N(time) rule 
> individually applied to all the small scheduling patterns that the 
> scheduling patterns give. (this assumes that the scheduling patterns are 
> reasonably independent of each other - if they are not then there's no 
> reasonable definition of fairness that makes sense, and we might as well 
> use the 1/N rule for those cases too.)

I'd start with identically-behaving CPU-bound tasks here. It's easy
enough to hammer out a testcase that starts up N CPU-bound tasks, runs
them for a few minutes, stops them, collects statistics on their
runtime, and gives us an idea of whether 1/N came out properly. I'll
get around to that at some point.

Where it gets complex is when the behavior patterns vary, e.g. they're
not entirely CPU-bound and their desired in-isolation CPU utilization
varies, or when nice levels vary, or both vary. I went on about
testcases for those in particular in the prior post, though not both
at once. The nice level one in particular needs an up-front goal for
distribution of CPU bandwidth in a mixture of competing tasks with
varying nice levels.

There are different ways to define fairness, but a uniform distribution
of CPU bandwidth across a set of identical competing tasks is a good,
testable definition.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or 
>> inter-user fairness going on, though one might argue those are 
>> relatively obscure notions of fairness. [...]

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> sure, i mainly concentrated on what we have in Linux today. The things 
> you mention are add-ons that i can see handling via new scheduling 
> classes: all the CKRM and containers type of CPU time management 
> facilities.

At some point the CKRM and container people should be pinged to see
what (if anything) they need to achieve these sorts of things. It's
not clear to me that the specific cases I cited are considered
relevant to anyone. I presume that if they are, someone will pipe
up with a feature request. It was more a sort of catalogue of different
notions of fairness that could arise than any sort of suggestion.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> What these things mean when there are multiple CPU's to schedule 
>> across may also be of concern.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> that is handled by the existing smp-nice load balancer, that logic is 
> preserved under CFS.

Given the things going wrong, I'm curious as to whether that works, and
if so, how well. I'll drop that into my list of testcases that should be
arranged for, though I won't guarantee that I'll get to it myself in any
sort of timely fashion.

What this ultimately needs is specifying the semantics of nice levels
so that we can say that a mixture of competing tasks with varying nice
levels should have an ideal distribution of CPU bandwidth to check for.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> These testcases are oblivious to SMP. This will demand that a 
>> scheduling policy integrate with load balancing to the extent that 
>> load balancing occurs for the sake of distributing CPU bandwidth 
>> according to nice level. Some explicit decision should be made 
>> regarding that.

On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote:
> this should already work reasonably fine with CFS: try massive_intr.c on 
> an SMP box.

Where is massive_intr.c, BTW?


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:25   ` Ingo Molnar
@ 2007-04-13 23:39     ` Gabriel C
  0 siblings, 0 replies; 712+ messages in thread
From: Gabriel C @ 2007-04-13 23:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> * Gabriel C <nix.or.die@googlemail.com> wrote:
>
>   
>>> as usual, any sort of feedback, bugreports, fixes and suggestions 
>>> are more than welcome,
>>>       
>> Compile error here.
>>     
>
> ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also 
> updated the full patch at the cfs-scheduler URL)
>   

Yes, it does, thx :) Only the "warning: unused variable 'j'" is left.

> 	Ingo
>   

Regards,

Gabriel


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:30     ` William Lee Irwin III
@ 2007-04-13 23:44       ` Ingo Molnar
  2007-04-13 23:58         ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:44 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]


* William Lee Irwin III <wli@holomorphy.com> wrote:

> Where it gets complex is when the behavior patterns vary, e.g. they're 
> not entirely CPU-bound and their desired in-isolation CPU utilization 
> varies, or when nice levels vary, or both vary. [...]

yes. I tested things like 'massive_intr.c' (attached, written by Satoru 
Takeuchi), which starts N tasks that each work for 8 msec and then sleep 
for 1 msec.

in its output, the second column is the CPU time each thread got: the 
more even, the fairer the scheduling. On vanilla i get:

 mercury:~> ./massive_intr 10 10
 024873  00000150
 024874  00000123
 024870  00000069
 024868  00000068
 024866  00000051
 024875  00000206
 024872  00000093
 024869  00000138
 024867  00000078
 024871  00000223

on CFS i get:

 neptune:~> ./massive_intr 10 10
 002266  00000112
 002260  00000113
 002261  00000112
 002267  00000112
 002269  00000112
 002265  00000112
 002262  00000113
 002268  00000113
 002264  00000112
 002263  00000113

so it is quite a bit more even ;)
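
to put a number on "more even", one simple measure is the gap between
the largest and smallest second-column value of a run. The snippet is
just a sketch: spread() is a made-up helper, and the two arrays are
copied from the outputs quoted above:

```c
#include <stddef.h>

/*
 * Difference between the largest and the smallest value in a run.
 * 0 would mean that every task got exactly the same amount.
 * Hypothetical helper, not part of massive_intr.c.
 */
unsigned int spread(const unsigned int *vals, size_t n)
{
	unsigned int min, max;
	size_t i;

	if (n == 0)
		return 0;

	min = max = vals[0];
	for (i = 1; i < n; i++) {
		if (vals[i] < min)
			min = vals[i];
		if (vals[i] > max)
			max = vals[i];
	}
	return max - min;
}

/* Second column of the vanilla and CFS runs quoted above. */
const unsigned int vanilla_run[10] = {
	150, 123, 69, 68, 51, 206, 93, 138, 78, 223
};
const unsigned int cfs_run[10] = {
	112, 113, 112, 112, 112, 112, 113, 113, 112, 113
};
```

spread(vanilla_run, 10) is 172 (223 vs. 51), spread(cfs_run, 10) is 1
(113 vs. 112).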

another related test-utility is one i wrote:

  http://people.redhat.com/mingo/scheduler-patches/ring-test.c

this is a ring of 100 tasks each doing work for 100 msecs and then 
sleeping for 1 msec. I usually test this by also running a CPU hog in 
parallel to it, and checking whether it gets ~50.0% of CPU time under 
CFS. (it does)

	Ingo

[-- Attachment #2: massive_intr.c --]
[-- Type: text/plain, Size: 9833 bytes --]


#if 0

Hi Ingo and all,

When I was executing a massive number of interactive processes, I found that
some of them occupy CPU time and the others hardly run.

It seems that the processes which occupy CPU time always have max effective
prio (default+5) and the others have max - 1. What happens here is...

 1. If there is a moderate number of max interactive processes, they can be
    re-inserted into the active queue again and again without their priority
    dropping.
 2. In this case, the others seldom run, and can't get max effective priority
    the next time their timeslice is exhausted because the scheduler considers
    them to have slept too long.
 3. Goto 1, OOPS!

Unfortunately I haven't been able to write a patch that resolves this problem
yet. Any ideas?

I also attach the test program which easily recreates this problem.

Test program flow:

  1. The first process starts child processes and waits for 5 minutes.
  2. Each child process executes a "work 8 msec and sleep 1 msec" loop
     continuously.
  3. After 3 minutes have passed, each child process prints the # of loops
     which it executed.

What is expected:

  Each child process executes a nearly equal # of loops.

Test environment:

  - kernel:                 2.6.20(*1)
  - # of CPUs:                 1 or 2
  - # of child processes:  200 or 400
  - nice value:            0 or 20(*2)

*1) I confirmed that 2.6.21-rc5 has no change regarding this problem.
*2) If a process has nice 20, the scheduler never regards it as interactive.

Test results:

-----------+----------------+------+------------------------------------
 # of CPUs | # of processes | nice |              result
-----------+----------------+------+------------------------------------
           |                |   20 | looks good
   1(i386) |                +------+------------------------------------
           |                |    0 | 4 processes occupy 98% of CPU time
-----------+            200 +------+------------------------------------
           |                |   20 | looks good
           |                +------+------------------------------------
           |                |    0 | 8 processes occupy 72% of CPU time
   2(ia64) +----------------+------+------------------------------------
           |            400 |   20 | looks good
           |                +------+------------------------------------
           |                |    0 | 8 processes occupy 98% of CPU time
-----------+----------------+------+------------------------------------

FYI, 2.6.21-rc3-mm1 (with the RSDL scheduler enabled) works fine in all cases :-)

Thanks,

Satoru

-------------------------------------------------------------------------------
#endif
/*
 * massive_intr - run @nproc interactive processes and print the number of
 *		  loops(*1) each process executes in @runtime secs.
 *
 *		  *1) "work 8 msec and sleep 1msec" loop
 *
 *	Usage:  massive_intr <nproc> <runtime>
 *
 *		 @nproc:   number of processes
 *		 @runtime: execute time[sec]
 *
 *	ex) If you want to run 300 processes for 5 mins, issue the
 *	    command as follows:
 *
 *		$ massive_intr 300 300
 *
 *	How to build:
 *
 *		cc -o massive_intr massive_intr.c -lrt
 *
 *
 *  Copyright (C) 2007  Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or (at
 *  your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful, but
 *  WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 *  General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License along
 *  with this program; if not, write to the Free Software Foundation, Inc.,
 *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 */

#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <err.h>

#define WORK_MSECS	8
#define SLEEP_MSECS	1

#define MAX_PROC	1024
#define SAMPLE_COUNT	1000000000
#define USECS_PER_SEC	1000000
#define USECS_PER_MSEC	1000
#define NSECS_PER_MSEC	1000000
#define SHMEMSIZE	4096

static const char *shmname = "/sched_interactive_shmem";
static void *shmem;
static sem_t *printsem;
static int nproc;
static int runtime;
static int fd;
static time_t *first;
static pid_t pid[MAX_PROC];
static int return_code;

static void cleanup_resources(void)
{
	if (sem_destroy(printsem) < 0)
		warn("sem_destroy() failed");
	if (munmap(shmem, SHMEMSIZE) < 0)
		warn("munmap() failed");
	if (close(fd) < 0)
		warn("close() failed");
}	

static void abnormal_exit(void)
{
	/* tell the parent that a child failed, then terminate this child */
	if (kill(getppid(), SIGUSR2) < 0)
		err(EXIT_FAILURE, "kill() failed");
	exit(EXIT_FAILURE);
}

static void sighandler(int signo)
{
}

static void sighandler2(int signo)
{
	return_code = EXIT_FAILURE;
}

static void loopfnc(int nloop)
{
	int i;
	for (i = 0; i < nloop; i++)
		;
}

static int loop_per_msec(void)
{
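	struct timeval tv[2];
	long long elapsed_usecs;

	if (gettimeofday(&tv[0], NULL) < 0)
		return -1;
	loopfnc(SAMPLE_COUNT);
	if (gettimeofday(&tv[1], NULL) < 0)
		return -1;
	/* subtract before scaling so the intermediate value cannot overflow */
	elapsed_usecs = (long long)(tv[1].tv_sec - tv[0].tv_sec)*USECS_PER_SEC
			+ (tv[1].tv_usec - tv[0].tv_usec);
	/* avoid dividing by zero if the calibration loop took no measurable time */
	if (elapsed_usecs <= 0)
		return -1;

	return (int)(SAMPLE_COUNT/elapsed_usecs*USECS_PER_MSEC);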
}

static void *test_job(void *arg)
{
	int l = (int)(long)arg;	/* arg carries the loops-per-msec value */
	int count = 0;
	time_t current;
	sigset_t sigset;
	struct sigaction sa;
	struct timespec ts = { 0, NSECS_PER_MSEC*SLEEP_MSECS};

	sa.sa_handler = sighandler;
	if (sigemptyset(&sa.sa_mask) < 0) {
		warn("sigemptyset() failed");
		abnormal_exit();
	}
	sa.sa_flags = 0;
	if (sigaction(SIGUSR1, &sa, NULL) < 0) {
		warn("sigaction() failed");
		abnormal_exit();
	}
	if (sigemptyset(&sigset) < 0) {
		warn("sigemptyset() failed");
		abnormal_exit();
	}
	sigsuspend(&sigset);
	if (errno != EINTR) {
		warn("sigsuspend() failed");
		abnormal_exit();
	}
	/* main loop */
	do {
		loopfnc(WORK_MSECS*l);
		if (nanosleep(&ts, NULL) < 0) {
			warn("nanosleep() failed");
			abnormal_exit();
		}
		count++;
		if (time(&current) == -1) {
			warn("time() failed");
			abnormal_exit();
		}
	} while (difftime(current, *first) < runtime);

	if (sem_wait(printsem) < 0) {
		warn("sem_wait() failed");
		abnormal_exit();
	}
	printf("%06d\t%08d\n", getpid(), count);
	if (sem_post(printsem) < 0) {
		warn("sem_post() failed");
		abnormal_exit();
	}
	exit(EXIT_SUCCESS);
}

static void usage(void)
{
	fprintf(stderr,
		"Usage : massive_intr <nproc> <runtime>\n"
		"\t\tnproc  : number of processes\n"
		"\t\truntime   : execute time[sec]\n");
	exit(EXIT_FAILURE);
}

int main(int argc, char **argv)
{
	int i, j;
	int status;
	sigset_t sigset;
	struct sigaction sa;
	int c;

	if (argc != 3)
		usage();

	errno = 0;	/* strtol() only sets errno on failure */
	nproc = strtol(argv[1], NULL, 10);
	if (errno || nproc < 1 || nproc > MAX_PROC)
		err(EXIT_FAILURE, "invalid nproc");
	errno = 0;
	runtime = strtol(argv[2], NULL, 10);
	if (errno || runtime <= 0)
		err(EXIT_FAILURE, "invalid runtime");

	sa.sa_handler = sighandler2;
	if (sigemptyset(&sa.sa_mask) < 0)
		err(EXIT_FAILURE, "sigemptyset() failed");
	sa.sa_flags = 0;
	if (sigaction(SIGUSR2, &sa, NULL) < 0)
		err(EXIT_FAILURE, "sigaction() failed");
	if (sigemptyset(&sigset) < 0)
		err(EXIT_FAILURE, "sigemptyset() failed");
	if (sigaddset(&sigset, SIGUSR1) < 0)
		err(EXIT_FAILURE, "sigaddset() failed");
	if (sigaddset(&sigset, SIGUSR2) < 0)
		err(EXIT_FAILURE, "sigaddset() failed");
	if (sigprocmask(SIG_BLOCK, &sigset, NULL) < 0)
		err(EXIT_FAILURE, "sigprocmask() failed");

	/* setup shared memory */
	if ((fd = shm_open(shmname, O_CREAT | O_RDWR, 0644)) < 0)
		err(EXIT_FAILURE, "shm_open() failed");
	if (shm_unlink(shmname) < 0) {
		warn("shm_unlink() failed");
		goto err_close;
	}
	if (ftruncate(fd, SHMEMSIZE) < 0) {
		warn("ftruncate() failed");
		goto err_close;
	}
	shmem = mmap(NULL, SHMEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (shmem == MAP_FAILED) {
		warn("mmap() failed");
		goto err_close;	/* nothing was mapped, so skip munmap() */
	}
	printsem = shmem;
	/* arithmetic on void * is a GNU extension, so go through char * */
	first = (time_t *)((char *)shmem + sizeof(*printsem));

	/* initialize semaphore */
	if ((sem_init(printsem, 1, 1)) < 0) {
		warn("sem_init() failed");
		goto err_unmap;
	}

	if ((c = loop_per_msec()) < 0) {
		fprintf(stderr, "loop_per_msec() failed\n");
		goto err_sem;
	}

	for (i = 0; i < nproc; i++) {
		pid[i] = fork();
		if (pid[i] == -1) {
			warn("fork() failed");
			for (j = 0; j < i; j++)
				if (kill(pid[j], SIGKILL) < 0)
					warn("kill() failed");
			goto err_sem;
		}
		if (pid[i] == 0)
			test_job((void *)(long)c);
	}

	if (sigemptyset(&sigset) < 0) {
		warn("sigemptyset() failed");
		goto err_proc;
	}
	if (sigaddset(&sigset, SIGUSR2) < 0) {
		warn("sigaddset() failed");
		goto err_proc;
	}
	if (sigprocmask(SIG_UNBLOCK, &sigset, NULL) < 0) {
		warn("sigprocmask() failed");
		goto err_proc;
	}
	if (time(first) == (time_t)-1) {
		warn("time() failed");
		goto err_proc;
	}
	if ((kill(0, SIGUSR1)) == -1) {
		warn("kill() failed");
		goto err_proc;
	}
	for (i = 0; i < nproc; i++) {
		if (wait(&status) < 0) {
			warn("wait() failed");
			goto err_proc;
		}
	}
	cleanup_resources();
	exit(return_code);
 err_proc:
	for (i = 0; i < nproc; i++)
		if (kill(pid[i], SIGKILL) < 0)
			if (errno != ESRCH)
				warn("kill() failed");
 err_sem:
	if (sem_destroy(printsem) < 0)
		warn("sem_destroy() failed");
 err_unmap:
	if (munmap(shmem, SHMEMSIZE) < 0)
		warn("munmap() failed");
 err_close:
	if (close(fd) < 0)
		warn("close() failed");	
	exit(EXIT_FAILURE);
}
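
The program above calibrates an empty loop once and then assumes that
8*c iterations cost 8 msec of CPU. A different technique, shown here only
as a sketch, is to spin until the process's own CPU-time clock has advanced
by the wanted amount, so no calibration is needed. cpu_time_ns(),
burn_cpu_ms() and work_sleep_loop() are hypothetical helpers, not part of
massive_intr.c. The polling itself costs CPU, so part of the "work" may
show up as system time rather than user time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSECS_PER_SEC	1000000000LL
#define NSECS_PER_MSEC	1000000LL

/* CPU time consumed so far by this process, in nanoseconds (-1 on error). */
static long long cpu_time_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) < 0)
		return -1;
	return (long long)ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
}

/* Spin until this process has consumed work_ms more msecs of CPU time. */
static int burn_cpu_ms(long work_ms)
{
	long long start = cpu_time_ns(), now;

	assert(work_ms >= 0);
	if (start < 0)
		return -1;
	do {
		now = cpu_time_ns();
		if (now < 0)
			return -1;
	} while (now - start < work_ms * NSECS_PER_MSEC);
	return 0;
}

/* Run "work, then sleep" iterations; returns how many completed. */
static int work_sleep_loop(int iterations, long work_ms, long sleep_ms)
{
	struct timespec ts;
	int count = 0;

	ts.tv_sec = sleep_ms / 1000;
	ts.tv_nsec = (sleep_ms % 1000) * NSECS_PER_MSEC;
	while (count < iterations) {
		if (burn_cpu_ms(work_ms) < 0 || nanosleep(&ts, NULL) < 0)
			break;
		count++;
	}
	return count;
}
```

work_sleep_loop(n, 8, 1) would correspond to the "work 8 msec and sleep
1 msec" pattern. On older C libraries, clock_gettime() may need -lrt, as in
the build line above.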

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:44       ` Ingo Molnar
@ 2007-04-13 23:58         ` William Lee Irwin III
  0 siblings, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-13 23:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Where it gets complex is when the behavior patterns vary, e.g. they're 
>> not entirely CPU-bound and their desired in-isolation CPU utilization 
>> varies, or when nice levels vary, or both vary. [...]

On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote:
> yes. I tested things like 'massive_intr.c' (attached, written by Satoru 
> Takeuchi) which starts N tasks which each work for 8msec then sleep 
> 1msec:
[...]
> another related test-utility is one i wrote:
>   http://people.redhat.com/mingo/scheduler-patches/ring-test.c
> this is a ring of 100 tasks each doing work for 100 msecs and then 
> sleeping for 1 msec. I usually test this by also running a CPU hog in 
> parallel to it, and checking whether it gets ~50.0% of CPU time under 
> CFS. (it does)

These are both tremendously useful. The code is also in rather good
shape so only minimal modifications (for massive_intr.c I'm not even
sure if any are needed at all) are needed to plug them into the test
harness I'm aware of. I'll queue them both for me to adjust and send
over to testers I don't want to burden with hacking on testcases I
myself am asking them to add to their suites.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:30   ` Ingo Molnar
  2007-04-13 22:37     ` Willy Tarreau
@ 2007-04-13 23:59     ` Daniel Walker
  2007-04-14 10:55       ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Daniel Walker @ 2007-04-13 23:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


One other thing, what happens in the case of slow, frequency changing,
and/or inaccurate clocks .. Is the old sched_clock behavior still
tolerated?

Daniel 


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (6 preceding siblings ...)
  2007-04-13 23:07 ` Gabriel C
@ 2007-04-14  2:04 ` Nick Piggin
  2007-04-14  6:32   ` Ingo Molnar
  2007-04-14 15:09 ` S.Çağlar Onur
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-14  2:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

Always good to see another contender ;)

> 
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> 
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
>   new scheduler will be active by default and all tasks will default
>   to the new SCHED_FAIR interactive scheduling class. ]

I don't know why there is such noise about fairness right now... I
thought fairness was one of the fundamental properties of a good CPU
scheduler, and my scheduler definitely always aims for that above most
other things. Why not just keep SCHED_OTHER?


> Highlights are:
> 
>  - the introduction of Scheduling Classes: an extensible hierarchy of
>    scheduler modules. These modules encapsulate scheduling policy
>    details and are handled by the scheduler core without the core
>    code assuming about them too much.

Don't really like this, but anyway...


>  - sched_fair.c implements the 'CFS desktop scheduler': it is a
>    replacement for the vanilla scheduler's SCHED_OTHER interactivity
>    code.
> 
>    i'd like to give credit to Con Kolivas for the general approach here:
>    he has proven via RSDL/SD that 'fair scheduling' is possible and that
>    it results in better desktop scheduling. Kudos Con!

I guess the 2.4 and earlier scheduler kind of did that as well.


>    The CFS patch uses a completely different approach and implementation
>    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>    that of RSDL/SD, which is a high standard to meet :-) Testing
>    feedback is welcome to decide this one way or another. [ and, in any
>    case, all of SD's logic could be added via a kernel/sched_sd.c module
>    as well, if Con is interested in such an approach. ]

Comment about the code: shouldn't you be requeueing the task in the rbtree
wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick
look so far).


>    CFS's design is quite radical: it does not use runqueues, it uses a
>    time-ordered rbtree to build a 'timeline' of future task execution,
>    and thus has no 'array switch' artifacts (by which both the vanilla
>    scheduler and RSDL/SD are affected).
> 
>    CFS uses nanosecond granularity accounting and does not rely on any
>    jiffies or other HZ detail. Thus the CFS scheduler has no notion of
>    'timeslices' and has no heuristics whatsoever.

Well, I guess there is still some mechanism to decide which process is most
eligible to run? ;) Considering that question has no "right" answer for
SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted
they are obviously fairly elegant in contrast to the O(1) scheduler ;)
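
As a rough illustration of the "timeline" idea being discussed, the toy
model below always runs whichever task has received the least CPU so far
and charges it one granularity of runtime. It is only a sketch under
strong assumptions: a linear scan stands in for the leftmost node of the
rbtree, all tasks are CPU hogs at the same nice level, and toy_task,
pick_next(), simulate() and runtime_spread() are made-up names, not the
code in sched_fair.c.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: only tracks how much CPU time it has received. */
struct toy_task {
	long long runtime_ns;
};

/*
 * Pick the task that has run the least so far (lowest index wins ties).
 * A time-ordered tree would give the same answer without scanning.
 */
static size_t pick_next(const struct toy_task *tasks, size_t n)
{
	size_t i, best = 0;

	assert(tasks != NULL && n > 0);
	for (i = 1; i < n; i++)
		if (tasks[i].runtime_ns < tasks[best].runtime_ns)
			best = i;
	return best;
}

/* Make `slices` scheduling decisions, each granting granularity_ns. */
static void simulate(struct toy_task *tasks, size_t n, long slices,
		     long long granularity_ns)
{
	long s;

	for (s = 0; s < slices; s++)
		tasks[pick_next(tasks, n)].runtime_ns += granularity_ns;
}

/* Largest difference in received runtime between any two tasks. */
static long long runtime_spread(const struct toy_task *tasks, size_t n)
{
	long long min, max;
	size_t i;

	assert(tasks != NULL && n > 0);
	min = max = tasks[0].runtime_ns;
	for (i = 1; i < n; i++) {
		if (tasks[i].runtime_ns < min)
			min = tasks[i].runtime_ns;
		if (tasks[i].runtime_ns > max)
			max = tasks[i].runtime_ns;
	}
	return max - min;
}
```

In this toy model, tasks that start out equal never drift more than one
granularity apart, and two equal hogs simply alternate every granularity.
Whether the real scheduler behaves the same way is exactly the question
raised below.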


> There is only one
>    central tunable:
> 
>          /proc/sys/kernel/sched_granularity_ns

Suppose you have 2 CPU hogs running, is sched_granularity_ns the
frequency at which they will context switch?


>    ( another detail: due to nanosec accounting and timeline sorting,
>      sched_yield() support is very simple under CFS, and in fact under
>      CFS sched_yield() behaves much better than under any other
>      scheduler i have tested so far. )

What is better behaviour for sched_yield?

Thanks,
Nick

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  2:04 ` Nick Piggin
@ 2007-04-14  6:32   ` Ingo Molnar
  2007-04-14  6:43     ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14  6:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Nick Piggin <npiggin@suse.de> wrote:

> >    The CFS patch uses a completely different approach and implementation
> >    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> >    that of RSDL/SD, which is a high standard to meet :-) Testing
> >    feedback is welcome to decide this one way or another. [ and, in any
> >    case, all of SD's logic could be added via a kernel/sched_sd.c module
> >    as well, if Con is interested in such an approach. ]
> 
> Comment about the code: shouldn't you be requeueing the task in the 
> rbtree wherever you change wait_runtime? eg. task_new_fair? [...]

yes: the task's position within the rbtree is updated every time 
wait_runtime is changed. task_new_fair is the method called during new 
task creation, but indeed i forgot to requeue the parent. I've fixed 
this in my tree (see the delta patch below) - thanks!

	Ingo

----------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [cfs] fix parent's rbtree position

Nick noticed that upon fork we change parent->wait_runtime but we do not 
requeue it within the rbtree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq,
 
 	p->wait_runtime = parent->wait_runtime/2;
 	parent->wait_runtime /= 2;
+	requeue_task_fair(rq, parent);
+
 	/*
 	 * For the first timeslice we allow child threads
 	 * to move their parent-inherited fairness back

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  6:32   ` Ingo Molnar
@ 2007-04-14  6:43     ` Ingo Molnar
  2007-04-14  8:08       ` Willy Tarreau
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14  6:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> Nick noticed that upon fork we change parent->wait_runtime but we do 
> not requeue it within the rbtree.

this fix is not complete - because the child runqueue is locked here, 
not the parent's. I've fixed this properly in my tree and have uploaded 
a new sched-modular+cfs.patch. (the effects of the original bug are 
mostly harmless, the rbtree position gets corrected the first time the 
parent reschedules. The fix might improve heavy forker handling.)

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  6:43     ` Ingo Molnar
@ 2007-04-14  8:08       ` Willy Tarreau
  2007-04-14  8:36         ` Willy Tarreau
  2007-04-14 10:36         ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Nick noticed that upon fork we change parent->wait_runtime but we do 
> > not requeue it within the rbtree.
> 
> this fix is not complete - because the child runqueue is locked here, 
> not the parent's. I've fixed this properly in my tree and have uploaded 
> a new sched-modular+cfs.patch. (the effects of the original bug are 
> mostly harmless, the rbtree position gets corrected the first time the 
> parent reschedules. The fix might improve heavy forker handling.)

It looks like it did not reach your public dir yet.

BTW, I've given it a try. It seems pretty usable. I have also tried
the usual meaningless "glxgears" test with 12 of them at the same time,
and they rotate very smoothly, there is absolutely no pause in any of
them. But they don't all run at the same speed, and top reports their CPU
load varying from 3.4 to 10.8%, with what looks like more CPU
assigned to the first processes, and less CPU to the last ones. But
this is just a rough observation on a stupid test, I would not call
that one scientific in any way (and X has its share in the test too).

I'll perform other tests when I can rebuild with your fixed patch.

Cheers,
Willy


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  8:08       ` Willy Tarreau
@ 2007-04-14  8:36         ` Willy Tarreau
  2007-04-14 10:53           ` Ingo Molnar
  2007-04-14 19:48           ` William Lee Irwin III
  2007-04-14 10:36         ` Ingo Molnar
  1 sibling, 2 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14  8:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote:
> > 
> > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > Nick noticed that upon fork we change parent->wait_runtime but we do 
> > > not requeue it within the rbtree.
> > 
> > this fix is not complete - because the child runqueue is locked here, 
> > not the parent's. I've fixed this properly in my tree and have uploaded 
> > a new sched-modular+cfs.patch. (the effects of the original bug are 
> > mostly harmless, the rbtree position gets corrected the first time the 
> > parent reschedules. The fix might improve heavy forker handling.)
> 
> It looks like it did not reach your public dir yet.
> 
> BTW, I've given it a try. It seems pretty usable. I have also tried
> the usual meaningless "glxgears" test with 12 of them at the same time,
> and they rotate very smoothly, there is absolutely no pause in any of
> them. But they don't all run at same speed, and top reports their CPU
> load varying from 3.4 to 10.8%, with what looks like more CPU is
> assigned to the first processes, and less CPU for the last ones. But
> this is just a rough observation on a stupid test, I would not call
> that one scientific in any way (and X has its share in the test too).

Follow-up: I think this is mostly X-related. I've started 100 scheddos,
and all get the same CPU percentage. Interestingly, mpg123 running in
parallel never skips at all, because it needs well under 1% CPU and gets
its fair share at a load of 112. Xterms are slow to respond to typing
with the 12 gears and 100 scheddos, and as expected it was X which was
starving. Renicing it to -5 restores a normal feeling, with very slow
but smooth gear rotations. Leaving X niced at 0 and killing the gears
also restores normal behaviour.

All in all, it seems logical that processes which serve many others
become a bottleneck for them.

Forking becomes very slow above a load of 100 it seems. Sometimes,
the shell takes 2 or 3 seconds to return to prompt after I run
"scheddos &"

Those are very promising results, I observe nearly the same responsiveness
as I had on Solaris 10 with 10k running processes on a bigger machine.

I would be curious what a mysql test result would look like now.

Regards,
Willy


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  8:08       ` Willy Tarreau
  2007-04-14  8:36         ` Willy Tarreau
@ 2007-04-14 10:36         ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:36 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> > this fix is not complete - because the child runqueue is locked 
> > here, not the parent's. I've fixed this properly in my tree and have 
> > uploaded a new sched-modular+cfs.patch. (the effects of the original 
> > bug are mostly harmless, the rbtree position gets corrected the 
> > first time the parent reschedules. The fix might improve heavy 
> > forker handling.)
> 
> It looks like it did not reach your public dir yet.

oops, forgot to do the last step - should be fixed now.

> BTW, I've given it a try. It seems pretty usable. I have also tried 
> the usual meaningless "glxgears" test with 12 of them at the same 
> time, and they rotate very smoothly, there is absolutely no pause in 
> any of them. But they don't all run at same speed, and top reports 
> their CPU load varying from 3.4 to 10.8%, with what looks like more 
> CPU is assigned to the first processes, and less CPU for the last 
> ones. But this is just a rough observation on a stupid test, I would 
> not call that one scientific in any way (and X has its share in the 
> test too).

ok, i'll try that too - there should be nothing particularly special 
about glxgears.

there's another tweak you could try:

	echo 500000 > /proc/sys/kernel/sched_granularity_ns

note that this causes preemption to be done as fast as the scheduler can 
do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the 
best results a CONFIG_HZ of 1000 is useful.)

plus there's an add-on to CFS at:

  http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch

this makes the 'CPU usage history cutoff' configurable and sets it to a 
default of 100 msecs. This means that CPU hogs (tasks which actively 
kept other tasks from running) will be remembered, for up to 100 msecs 
of their 'hogness'.

Setting this limit back to 0 gives the 'vanilla' CFS scheduler's 
behavior:

     echo 0 > /proc/sys/kernel/sched_max_hog_history_ns

(So when trying this you don't have to reboot with this patch 
applied/unapplied, just set this value.)
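
For a test harness that wants to flip these knobs between runs, the two
echo commands above can be mirrored in C. This is only a sketch:
write_tunable() and read_tunable() are hypothetical helpers, the /proc
files exist only on a kernel carrying the CFS patch (and, for
sched_max_hog_history_ns, the sched-fair-hog add-on), and writing them
needs root.

```c
#include <assert.h>
#include <stdio.h>

/* Write a decimal value to a sysctl-style file; 0 on success, -1 on error. */
static int write_tunable(const char *path, long long value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	ok = fprintf(f, "%lld\n", value) > 0;
	/* the write may only be reported as failed at close time */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read a decimal value back; 0 on success, -1 on error. */
static int read_tunable(const char *path, long long *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	ok = fscanf(f, "%lld", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, write_tunable("/proc/sys/kernel/sched_granularity_ns", 500000)
would be the equivalent of the first echo, and passing 0 with the
sched_max_hog_history_ns path the equivalent of the second.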

> I'll perform other tests when I can rebuild with your fixed patch.

cool, thanks!

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  8:36         ` Willy Tarreau
@ 2007-04-14 10:53           ` Ingo Molnar
  2007-04-14 13:01             ` Willy Tarreau
  2007-04-14 15:17             ` Mark Lord
  2007-04-14 19:48           ` William Lee Irwin III
  1 sibling, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:53 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> Forking becomes very slow above a load of 100 it seems. Sometimes, the 
> shell takes 2 or 3 seconds to return to prompt after I run "scheddos 
> &"

this might be changed/impacted by the parent-requeue fix that is in the 
updated (for real, promise! ;) patch. Right now on CFS a forking parent 
shares its own run stats with the child 50%/50%. This means that heavy 
forkers are indeed penalized. Another logical choice would be 100%/0%: a 
child has to earn its own right.

i kept the 50%/50% rule from the old scheduler, but maybe it's a more 
pristine (and smaller/faster) approach to just not give new children any 
stats history to begin with. I've implemented an add-on patch that 
implements this, you can find it at:

    http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch
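
The two fork policies can be written down side by side. fork_split_half()
mirrors the two wait_runtime lines in the task_new_fair() hunk posted
earlier in the thread; fork_no_inherit() is only one reading of the
"100%/0%" description, since the add-on patch itself is not quoted here.
The struct and function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the one task field that matters here. */
struct toy_fork_task {
	long long wait_runtime;
};

/* 50%/50%: parent and child each end up with half of the parent's stats. */
static void fork_split_half(struct toy_fork_task *parent,
			    struct toy_fork_task *child)
{
	assert(parent != NULL && child != NULL);
	child->wait_runtime = parent->wait_runtime / 2;
	parent->wait_runtime /= 2;
}

/*
 * 100%/0% (assumed): the parent keeps everything and the child starts
 * with no history, so it "has to earn its own right".
 */
static void fork_no_inherit(const struct toy_fork_task *parent,
			    struct toy_fork_task *child)
{
	assert(parent != NULL && child != NULL);
	(void)parent;	/* deliberately left untouched */
	child->wait_runtime = 0;
}
```

Under the first policy every fork changes the parent's own wait_runtime,
which is how a heavy forker ends up penalized; under the second the
parent's value is left alone.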

> Those are very promising results, I nearly observe the same 
> responsiveness as I had on a solaris 10 with 10k running processes on 
> a bigger machine.

cool and thanks for the feedback! (Btw., as another test you could also 
try to renice "scheddos" to +19. While that does not push the scheduler 
nearly as hard as nice 0, it is perhaps more indicative of how a truly 
abusive many-tasks workload would be run in practice.)

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:59     ` Daniel Walker
@ 2007-04-14 10:55       ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 10:55 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Daniel Walker <dwalker@mvista.com> wrote:

> One other thing, what happens in the case of slow, frequency changing, 
> and/or inaccurate clocks .. Is the old sched_clock behavior still 
> tolerated?

yeah, good question. Yesterday i did a quick testboot with that too, and 
it seemed to behave pretty OK with the low-res [jiffies based] 
sched_clock() too. Although in that case things are much more of an 
approximation and rounding/arithmetic artifacts are possible. CFS works 
best with a high-resolution cycle counter.
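
To see where such rounding artifacts come from, consider a clock that
only advances once per tick. The toy model below rounds timestamps down
to the tick boundary and computes how much runtime would be charged for a
given run. It is a sketch of the arithmetic only: quantize_ns() and
charged_ns() are hypothetical, and this is not how sched_clock() is
implemented.

```c
#include <assert.h>

#define NSEC_PER_SEC	1000000000LL

/* Round a timestamp down to the last tick of a clock ticking hz times/sec. */
static long long quantize_ns(long long t_ns, int hz)
{
	long long tick_ns;

	assert(hz > 0 && t_ns >= 0);
	tick_ns = NSEC_PER_SEC / hz;
	return t_ns - (t_ns % tick_ns);
}

/* Runtime charged for running from start_ns to end_ns, as seen by that clock. */
static long long charged_ns(long long start_ns, long long end_ns, int hz)
{
	assert(end_ns >= start_ns);
	return quantize_ns(end_ns, hz) - quantize_ns(start_ns, hz);
}
```

With hz = 100 (10 msec ticks), a task running from 3 msec to 11 msec is
charged 10 msec for 8 msec of work, while one running from 12 msec to
19 msec is charged nothing. With hz = 1000 the first case comes out at
exactly 8 msec.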

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 10:53           ` Ingo Molnar
@ 2007-04-14 13:01             ` Willy Tarreau
  2007-04-14 13:27               ` Willy Tarreau
                                 ` (2 more replies)
  2007-04-14 15:17             ` Mark Lord
  1 sibling, 3 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 13:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > Forking becomes very slow above a load of 100 it seems. Sometimes, the 
> > shell takes 2 or 3 seconds to return to prompt after I run "scheddos 
> > &"
> 
> this might be changed/impacted by the parent-requeue fix that is in the 
> updated (for real, promise! ;) patch. Right now on CFS a forking parent 
> shares its own run stats with the child 50%/50%. This means that heavy 
> forkers are indeed penalized. Another logical choice would be 100%/0%: a 
> child has to earn its own right.
> 
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more 
> pristine (and smaller/faster) approach to just not give new children any 
> stats history to begin with. I've implemented an add-on patch that 
> implements this, you can find it at:
> 
>     http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch

Not tried yet, but it already looks better with the update and sched-fair-hog.
Now xterms open "instantly" even with 1000 running processes.

> > Those are very promising results, I nearly observe the same 
> > responsiveness as I had on a solaris 10 with 10k running processes on 
> > a bigger machine.
> 
> cool and thanks for the feedback! (Btw., as another test you could also 
> try to renice "scheddos" to +19. While that does not push the scheduler 
> nearly as hard as nice 0, it is perhaps more indicative of how a truly 
> abusive many-tasks workload would be run in practice.)

Good idea. The machine I'm typing from now has 1000 scheddos running at +19,
and 12 gears at nice 0. Top keeps reporting different cpu usages for all gears,
but I'm pretty sure that it's a top artifact now because the cumulated times
are roughly identical :

 14:33:13  up 13 min,  7 users,  load average: 900.30, 443.75, 177.70
1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped
CPU0 states:  56.0% user  43.0% system   23.0% nice   0.0% iowait   0.0% idle
CPU1 states:  94.0% user   5.0% system    0.0% nice   0.0% iowait   0.0% idle
Mem:  1034764k av,  223788k used,  810976k free,       0k shrd,    7192k buff
       104400k active,              51904k inactive
Swap:  497972k av,       0k used,  497972k free                   68020k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1325 root      20   0 69240 9400  3740 R    27.6  0.9   4:46   1 X
 1412 willy     20   0  6284 2552  1740 R    14.2  0.2   1:09   1 glxgears
 1419 willy     20   0  6256 2384  1612 R    10.7  0.2   1:09   1 glxgears
 1409 willy     20   0  2824 1940   788 R     8.9  0.1   0:25   1 top
 1414 willy     20   0  6280 2544  1728 S     8.9  0.2   1:08   0 glxgears
 1415 willy     20   0  6256 2376  1600 R     8.9  0.2   1:07   1 glxgears
 1417 willy     20   0  6256 2384  1612 S     8.9  0.2   1:05   1 glxgears
 1420 willy     20   0  6284 2552  1740 R     8.9  0.2   1:07   1 glxgears
 1410 willy     20   0  6256 2372  1600 S     7.1  0.2   1:11   1 glxgears
 1413 willy     20   0  6260 2388  1612 S     7.1  0.2   1:08   0 glxgears
 1416 willy     20   0  6284 2544  1728 S     6.2  0.2   1:06   0 glxgears
 1418 willy     20   0  6252 2384  1612 S     6.2  0.2   1:09   0 glxgears
 1411 willy     20   0  6280 2548  1740 S     5.3  0.2   1:15   1 glxgears
 1421 willy     20   0  6280 2536  1728 R     5.3  0.2   1:05   1 glxgears

From time to time, one of the 12 aligned gears will quickly perform a full
quarter turn while the others slowly turn by a few degrees. In fact, while
I don't know this process's CPU usage pattern, there's something useful in
it: it allows me to visually see when processes accelerate/decelerate. What
would be best would be just a clock requiring low X resources and eating
vast amounts of CPU between movements. It would help visually monitor CPU
distribution without being too much impacted by X.

I've just added another 100 scheddos at nice 0, and the system is still
amazingly usable. I just tried exchanging a 1-byte token between 188 "dd"
processes which communicate through circular pipes. The context switch
rate is rather high but this has no impact on the rest :

willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo

   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
1105  0  1      0 781108   8364  68180    0    0     0    12    5 82187 59 41  0
1114  0  1      0 781108   8364  68180    0    0     0     0    0 81528 58 42  0
1112  0  1      0 781108   8364  68180    0    0     0     0    1 80899 58 42  0
1113  0  1      0 781108   8364  68180    0    0     0     0   26 83466 58 42  0
1106  0  2      0 781108   8376  68168    0    0     0     8   91 83193 58 42  0
1107  0  1      0 781108   8376  68180    0    0     0     4    7 79951 58 42  0
1106  0  1      0 781108   8376  68180    0    0     0     0   46 80939 57 43  0
1114  0  1      0 781108   8376  68180    0    0     0     0   21 82019 56 44  0
1116  0  1      0 781108   8376  68180    0    0     0     0   16 85134 56 44  0
1114  0  3      0 781108   8388  68168    0    0     0    16   20 85871 56 44  0
1112  0  1      0 781108   8388  68168    0    0     0     0   15 80412 57 43  0
1112  0  1      0 781108   8388  68180    0    0     0     0  101 83002 58 42  0
1113  0  1      0 781108   8388  68180    0    0     0     0   25 82230 56 44  0
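
For anyone wanting to reproduce the token ring above without typing out the
whole pipeline, here is a minimal sketch that generates it. The helper name
build_ring and its arguments are invented for illustration and are not part of
the original test; it uses printf rather than echo -n to inject the one-byte
token, and it assumes the FIFO has already been created with mkfifo.

```shell
#!/bin/sh
# build_ring N [FIFO]: print a pipeline of N dd processes (N >= 3) that pass
# single bytes around a ring closed by a named pipe.
# Hypothetical helper for illustration only.
build_ring() {
    n=$1
    fifo=${2:-/tmp/fifo}

    # First stage reads from the FIFO, which closes the ring.
    cmd="dd if=$fifo bs=1"

    # N-3 plain forwarding stages in the middle.
    i=0
    while [ "$i" -lt "$((n - 3))" ]; do
        cmd="$cmd | dd bs=1"
        i=$((i + 1))
    done

    # The second-to-last stage injects the 1-byte token and then forwards;
    # the last stage writes back into the FIFO.
    cmd="$cmd | (printf a; dd bs=1) | dd bs=1 of=$fifo"
    printf '%s\n' "$cmd"
}
```

With that in place, something like `mkfifo /tmp/fifo && eval "$(build_ring 188)"`
should start a ring of 188 dd processes comparable to the one described above.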

Playing with sched_max_hog_history_ns does not seem to change anything.
Maybe it's useful for other workloads. Anyway, I have nothing to complain
about, because it's not common for me to be able to type a mail normally on
a system with more than 1000 running processes ;-)

Also, mixed with this load, I have started injecting HTTP requests between
two local processes. The load is stable at 7700 req/s (11800 when alone),
and what I was interested in is the response time. It's perfectly stable
between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Response
times varied a lot under the stock scheduler, with some sessions sometimes
pausing for seconds. (RSDL fixed this, though.)

Well, I'll stop heating the room for now as I get out of ideas about how
to defeat it. I'm convinced. I'm eager to read Mike's feedback
on his workload, which behaves strangely on RSDL. If it works OK here,
it will be proof that heuristics should not be needed.

Congrats !
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 13:01             ` Willy Tarreau
@ 2007-04-14 13:27               ` Willy Tarreau
  2007-04-14 14:45                 ` Willy Tarreau
  2007-04-14 16:19                 ` Ingo Molnar
  2007-04-15  7:54               ` Mike Galbraith
  2007-04-19  9:01               ` Ingo Molnar
  2 siblings, 2 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 13:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> 
> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it.

Ah, I found something nasty.
If I start large batches of processes like this :

$ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done

the ramp up slows down after 700-800 processes, but something very
strange happens. If I'm under X, I can switch the focus to all xterms
(the WM is still alive) but all xterms are frozen. On the console,
after one moment I simply cannot switch to another VT anymore while
I can still start commands locally. But "chvt 2" simply blocks.
SysRq-K killed everything and restored full control. Dmesg shows lots
of :
SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.

I wonder if part of the problem would be too many processes bound to
the same tty :-/

I'll investigate a bit.

Willy





* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 13:27               ` Willy Tarreau
@ 2007-04-14 14:45                 ` Willy Tarreau
  2007-04-14 16:14                   ` Ingo Molnar
  2007-04-14 16:19                 ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 14:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote:
> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> > 
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
> 
> Ah, I found something nasty.
> If I start large batches of processes like this :
> 
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
> 
> the ramp up slows down after 700-800 processes, but something very
> strange happens. If I'm under X, I can switch the focus to all xterms
> (the WM is still alive) but all xterms are frozen. On the console,
> after one moment I simply cannot switch to another VT anymore while
> I can still start commands locally. But "chvt 2" simply blocks.
> SysRq-K killed everything and restored full control. Dmesg shows lots
> of :
> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
> 
> I wonder if part of the problem would be too many processes bound to
> the same tty :-/

This does not seem easy to reproduce. It looks like some resource pools are
kept pre-allocated after a first run, because if I kill scheddos during
the ramp up and then start it again, it can go further. The problem happens
when the parent is forking. Also, I modified scheddos to close fds 0, 1 and 2
and to perform the forks itself, and it does not cause any problem, even
with 4000 processes running. So I really suspect that the problem I
encountered above was tty-related.
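
The effect of closing the standard descriptors can be approximated from the
shell without touching the program. A minimal sketch, assuming that pointing
stdin/stdout/stderr at /dev/null is an acceptable stand-in for closing them
(it is not identical to close()); spawn_quiet is a hypothetical helper name:

```shell
#!/bin/sh
# spawn_quiet COUNT CMD [ARGS...]: start COUNT background copies of CMD with
# fds 0, 1 and 2 redirected to /dev/null, so none of them holds the tty open.
# Prints one PID per started process. Hypothetical helper for illustration.
spawn_quiet() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" </dev/null >/dev/null 2>&1 &
        echo "$!"
        i=$((i + 1))
    done
}
```

For example, `spawn_quiet 1000 ./scheddos2 4000 4000`. Note that the children
are still forked by the shell and still belong to its session, so this only
removes the open tty descriptors, not the controlling terminal.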

BTW, I've tried your fork patch. It definitely helps forking because it
takes below one second to create 4000 processes, then the load slowly
increases. As you said, the children have to earn their share, and I
find that it makes it easier to conserve control of the whole system's
stability.

Regards,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (7 preceding siblings ...)
  2007-04-14  2:04 ` Nick Piggin
@ 2007-04-14 15:09 ` S.Çağlar Onur
  2007-04-14 16:09   ` Ingo Molnar
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 15:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


On Fri, 13 Apr 2007, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:

I have been running Linus's current git + your extra patches + CFS for a while.
Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I seek
forward/backward while it's playing a video under some workload (checking out
SVN repositories, compiling something). Stopping the other processes didn't help
Kaffeine, so it stays frozen until I kill it.

I'm not sure whether it's a xine-lib or Kaffeine bug (because mplayer doesn't have
that problem), but I can't reproduce this with mainline or mainline + sd-0.39.

[1] http://cekirdek.pardus.org.tr/~caglar/psaux
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 10:53           ` Ingo Molnar
  2007-04-14 13:01             ` Willy Tarreau
@ 2007-04-14 15:17             ` Mark Lord
  1 sibling, 0 replies; 712+ messages in thread
From: Mark Lord @ 2007-04-14 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

Ingo Molnar wrote:
> i kept the 50%/50% rule from the old scheduler, but maybe it's a more 
> pristine (and smaller/faster) approach to just not give new children any 
> stats history to begin with. I've implemented an add-on patch that 
> implements this, you can find it at:
> 
>     http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch

I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper)
with the new CFS for much of this morning now, with the odd switch back to
the stock scheduler for comparison.

Here, CFS really works and feels better than the stock scheduler.

Even with a "make -j2" kernel rebuild happening (no manual renice, either!)
things "just work" about as smoothly as ever.  That's something which RSDL
never achieved for me, though I have not retested RSDL beyond v0.34 or so.

Well done, Ingo!  I *want* this as my default scheduler.

Things seemed slightly less smooth when I had the CPU hogs
and fair-fork extension patches both applied.

I'm going to try again now with just the fair-fork added on.

Cheers

Mark


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 15:09 ` S.Çağlar Onur
@ 2007-04-14 16:09   ` Ingo Molnar
  2007-04-14 16:59     ` S.Çağlar Onur
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:09 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> > i'm pleased to announce the first release of the "Modular Scheduler 
> > Core and Completely Fair Scheduler [CFS]" patchset:
> 
> I have been running Linus's current git + your extra patches + CFS for
> a while. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I
> seek forward/backward while it's playing a video under some workload
> (checking out SVN repositories, compiling something). Stopping the other
> processes didn't help Kaffeine, so it stays frozen until I kill
> it.

hm, could you try to strace it and/or attach gdb to it and figure out 
what's wrong? (perhaps involving the Kaffeine developers too?) As long 
as it's not a kernel level crash i cannot see how the scheduler could 
directly cause this - other than by accident creating a scheduling 
pattern that triggers a user-space bug more often than with other 
schedulers.

> [1] http://cekirdek.pardus.org.tr/~caglar/psaux

looks quite weird!

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 14:45                 ` Willy Tarreau
@ 2007-04-14 16:14                   ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:14 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> BTW, I've tried your fork patch. It definitely helps forking because 
> it takes below one second to create 4000 processes, then the load 
> slowly increases. As you said, the children have to earn their share, 
> and I find that it makes it easier to conserve control of the whole 
> system's stability.

ok, thanks for testing this out, i think i'll integrate this one back 
into the core. (I'm still unsure about the cpu-hog one.) And it saves 
some code-size too:

   text    data     bss     dec     hex filename
  23349    2705      24   26078    65de kernel/sched.o.cfs-v1
  23189    2705      24   25918    653e kernel/sched.o.cfs-before
  23052    2705      24   25781    64b5 kernel/sched.o.cfs-after

  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 13:27               ` Willy Tarreau
  2007-04-14 14:45                 ` Willy Tarreau
@ 2007-04-14 16:19                 ` Ingo Molnar
  2007-04-14 17:15                   ` Eric W. Biederman
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:19 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Eric W. Biederman, Jiri Slaby, Alan Cox


* Willy Tarreau <w@1wt.eu> wrote:

> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
> > 
> > Well, I'll stop heating the room for now as I get out of ideas about how
> > to defeat it.
> 
> Ah, I found something nasty.
> If I start large batches of processes like this :
> 
> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
> 
> the ramp up slows down after 700-800 processes, but something very 
> strange happens. If I'm under X, I can switch the focus to all xterms 
> (the WM is still alive) but all xterms are frozen. On the console, 
> after one moment I simply cannot switch to another VT anymore while I 
> can still start commands locally. But "chvt 2" simply blocks. SysRq-K 
> killed everything and restored full control. Dmesg shows lots of :

> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
> 
> I wonder if part of the problem would be too many processes bound to 
> the same tty :-/

hm, that's really weird. I've Cc:-ed the tty experts (Eric, Jiri, Alan),
maybe this description rings a bell with them?

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 16:09   ` Ingo Molnar
@ 2007-04-14 16:59     ` S.Çağlar Onur
  2007-04-15 16:13       ` Kaffeine problem with CFS Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 16:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


On Sat, 14 Apr 2007, Ingo Molnar wrote:
> hm, could you try to strace it and/or attach gdb to it and figure out
> what's wrong? (perhaps involving the Kaffeine developers too?) As long
> as it's not a kernel level crash i cannot see how the scheduler could
> directly cause this - other than by accident creating a scheduling
> pattern that triggers a user-space bug more often than with other
> schedulers.

...
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
+++ killed by SIGINT +++

is where the freeze occurs. The full log can be found at [1].

> > [1] http://cekirdek.pardus.org.tr/~caglar/psaux
>
> looks quite weird!

:)

[1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 16:19                 ` Ingo Molnar
@ 2007-04-14 17:15                   ` Eric W. Biederman
  2007-04-14 17:29                     ` Willy Tarreau
  0 siblings, 1 reply; 712+ messages in thread
From: Eric W. Biederman @ 2007-04-14 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

Ingo Molnar <mingo@elte.hu> writes:

> * Willy Tarreau <w@1wt.eu> wrote:
>
>> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote:
>> > 
>> > Well, I'll stop heating the room for now as I get out of ideas about how
>> > to defeat it.
>> 
>> Ah, I found something nasty.
>> If I start large batches of processes like this :
>> 
>> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done
>> 
>> the ramp up slows down after 700-800 processes, but something very 
>> strange happens. If I'm under X, I can switch the focus to all xterms 
>> (the WM is still alive) but all xterms are frozen. On the console, 
>> after one moment I simply cannot switch to another VT anymore while I 
>> can still start commands locally. But "chvt 2" simply blocks. SysRq-K 
>> killed everything and restored full control. Dmesg shows lots of :
>
>> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.

This.  Yes. SAK is noisy and tells you everything it kills.

>> I wonder if part of the problem would be too many processes bound to 
>> the same tty :-/
>
> hm, that's really weird. I've Cc:-ed the tty experts (Eric, Jiri, Alan),
> maybe this description rings a bell with them?

Is there any swapping going on?

I'm inclined to suspect that it is a problem that has more to do with the
number of processes and has nothing to do with ttys.

Anyway you can easily rule out ttys by having your startup program
detach from a controlling tty before you start everything.
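
One way to do that from the shell, as a rough sketch: setsid(1) from
util-linux runs a command in a new session, which starts out without a
controlling terminal. The wrapper name run_detached is made up for
illustration, and taking stdin from /dev/null is an extra assumption so that
the job does not read from the tty through an inherited descriptor.

```shell
#!/bin/sh
# run_detached CMD [ARGS...]: run CMD in a new session (no controlling tty),
# with stdin taken from /dev/null. Hypothetical wrapper for illustration;
# stdout and stderr are left for the caller to redirect.
run_detached() {
    setsid "$@" </dev/null
}
```

The batch start from earlier in the thread could then be launched as
`run_detached sh -c 'for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done' >/dev/null 2>&1`.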

I'm more inclined to guess something is reading /proc a lot, or doing
something that holds the tasklist lock, a lot or something like that,
if the problem isn't that you are being kicked into swap.

Eric




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 17:15                   ` Eric W. Biederman
@ 2007-04-14 17:29                     ` Willy Tarreau
  2007-04-14 17:44                       ` Eric W. Biederman
  2007-04-14 17:50                       ` Linus Torvalds
  0 siblings, 2 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 17:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

Hi Eric,

[...]
> >> the ramp up slows down after 700-800 processes, but something very 
> >> strange happens. If I'm under X, I can switch the focus to all xterms 
> >> (the WM is still alive) but all xterms are frozen. On the console, 
> >> after one moment I simply cannot switch to another VT anymore while I 
> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K 
> >> killed everything and restored full control. Dmesg shows lots of :
> >
> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
> 
> This.  Yes. SAK is noisy and tells you everything it kills.

OK, that's what I suspected, but I did not know if the fact that it talked
about the session was systematic or related to any particular state when it
killed the task.

> >> I wonder if part of the problem would be too many processes bound to 
> >> the same tty :-/
> >
> > hm, that's really weird. I've Cc:-ed the tty experts (Eric, Jiri, Alan),
> > maybe this description rings a bell with them?
> 
> Is there any swapping going on?

Not at all.

> I'm inclined to suspect that it is a problem that has more to do with the
> number of processes and has nothing to do with ttys.

It is clearly possible. What I found strange is that I could still fork
processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
It first happened under X with frozen xterms but a perfectly usable WM,
then I reproduced it on pure console to rule out any potential X problem.

> Anyway you can easily rule out ttys by having your startup program
> detach from a controlling tty before you start everything.
> 
> I'm more inclined to guess something is reading /proc a lot, or doing
> something that holds the tasklist lock, a lot or something like that,
> if the problem isn't that you are being kicked into swap.

Oh I'm sorry you were invited into the discussion without a first description
of the context. I was giving a try to Ingo's new scheduler, and trying to
reach corner cases with lots of processes competing for CPU.

I simply used a "for" loop in bash to fork 1000 processes, and this problem
happened between 700-800 children. The program only uses a busy loop and a
pause. I then changed my program to close 0,1,2 and perform the fork itself,
and the problem vanished. So there are two differences here :

  - bash not forking anymore
  - far less FDs on /dev/tty1

At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
in this area.
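
A count like that can be taken from /proc. A minimal sketch (count_fds_on is
a hypothetical helper name; it only sees processes whose fd directories the
caller is allowed to read, so a full count needs root):

```shell
#!/bin/sh
# count_fds_on PATH: count open file descriptors, across all readable
# processes, whose /proc/PID/fd/N symlink resolves to PATH.
# Hypothetical helper for illustration only.
count_fds_on() {
    target=$1
    n=0
    for fd in /proc/[0-9]*/fd/*; do
        # Processes and fds may vanish while we scan; ignore readlink errors.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Running `count_fds_on /dev/tty1` before and after starting the batch would
show how many descriptors the children add.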

I agree that this is not normal usage at all, I'm just trying to attack
Ingo's scheduler to ensure it is more robust than the stock one. But
sometimes brute force methods can make other sleeping problems pop up.

Thinking about it, I don't know if there are calls to schedule() while
switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
simply blocked. It would have been possible that a schedule() call
somewhere got starved due to the load, I don't know.

Thanks,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 17:29                     ` Willy Tarreau
@ 2007-04-14 17:44                       ` Eric W. Biederman
  2007-04-14 17:54                         ` Ingo Molnar
  2007-04-14 17:50                       ` Linus Torvalds
  1 sibling, 1 reply; 712+ messages in thread
From: Eric W. Biederman @ 2007-04-14 17:44 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

Willy Tarreau <w@1wt.eu> writes:

> Hi Eric,
>
> [...]
>> >> the ramp up slows down after 700-800 processes, but something very 
>> >> strange happens. If I'm under X, I can switch the focus to all xterms 
>> >> (the WM is still alive) but all xterms are frozen. On the console, 
>> >> after one moment I simply cannot switch to another VT anymore while I 
>> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K 
>> >> killed everything and restored full control. Dmesg shows lots of :
>> >
>> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session.
>> 
>> This.  Yes. SAK is noisy and tells you everything it kills.
>
> OK, that's what I suspected, but I did not know if the fact that it talked
> about the session was systematic or related to any particular state when it
> killed the task.
>
>> >> I wonder if part of the problem would be too many processes bound to 
>> >> the same tty :-/
>> >
>> > hm, that's really weird. I've Cc:-ed the tty experts (Eric, Jiri, Alan),
>> > maybe this description rings a bell with them?
>> 
>> Is there any swapping going on?
>
> Not at all.
>
>> I'm inclined to suspect that it is a problem that has more to do with the
>> number of processes and has nothing to do with ttys.
>
> It is clearly possible. What I found strange is that I could still fork
> processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.
> It first happened under X with frozen xterms but a perfectly usable WM,
> then I reproduced it on pure console to rule out any potential X problem.
>
>> Anyway you can easily rule out ttys by having your startup program
>> detach from a controlling tty before you start everything.
>> 
>> I'm more inclined to guess something is reading /proc a lot, or doing
>> something that holds the tasklist lock, a lot or something like that,
>> if the problem isn't that you are being kicked into swap.
>
> Oh I'm sorry you were invited into the discussion without a first description
> of the context. I was giving a try to Ingo's new scheduler, and trying to
> reach corner cases with lots of processes competing for CPU.
>
> I simply used a "for" loop in bash to fork 1000 processes, and this problem
> happened between 700-800 children. The program only uses a busy loop and a
> pause. I then changed my program to close 0,1,2 and perform the fork itself,
> and the problem vanished. So there are two differences here :
>
>   - bash not forking anymore
>   - far less FDs on /dev/tty1

Yes.  But with /dev/tty1 being the controlling terminal in both cases,
as you haven't dropped your session, or disassociated your tty.

The bash problem may have something to do with setpgid or scheduling effects.
Hmm.  I just looked and setpgid does grab the tasklist lock for
writing so we may possibly have some contention there.

> At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
> in this area.
>
> I agree that this is not normal usage at all, I'm just trying to attack
> Ingo's scheduler to ensure it is more robust than the stock one. But
> sometimes brute force methods can make other sleeping problems pop up.

Yep.  If we can narrow it down to one that would be interesting.  Of course
that also means that when we start finding other possibly sleeping problems, people
are working in areas of code they don't normally touch, so we must investigate.

> Thinking about it, I don't know if there are calls to schedule() while
> switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
> simply blocked. It would have been possible that a schedule() call
> somewhere got starved due to the load, I don't know.

It looks like there is a call to schedule_work.

There are two pieces of the path. If you are switching in and out of a tty
controlled by something like X, user space has to grant permission before
the operation happens.  Where there isn't a gatekeeper I know it is cheaper,
but I don't know by how much; I suspect there is still a schedule happening
in there.

Eric


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 17:29                     ` Willy Tarreau
  2007-04-14 17:44                       ` Eric W. Biederman
@ 2007-04-14 17:50                       ` Linus Torvalds
  1 sibling, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-14 17:50 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric W. Biederman, Ingo Molnar, Nick Piggin, linux-kernel,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox



On Sat, 14 Apr 2007, Willy Tarreau wrote:
> 
> It is clearly possible. What I found strange is that I could still fork
> processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore.

Considering the patches in question, it's almost definitely just a CPU 
scheduling problem with starvation.

The VT switching is obviously done by the kernel, but the kernel will 
signal and wait for the "controlling process" for the VT. The most obvious 
case of that is X, of course, but even in text mode I think gpm will 
have taken control of the VT's it runs on (all of them), which means that 
when you initiate a VT switch, the kernel will actually signal the 
controlling process (gpm), and wait for it to acknowledge the switch.

If gpm doesn't get a timeslice for some reason (and it sounds like there 
may be some serious unfairness after "fork()"), your behaviour is 
explainable.

(NOTE! I've never actually looked at gpm sources or what it really does, 
so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and 
something else is going on, but quite frankly, it sounds like the obvious 
candidate for this bug. Explaining it with some non-scheduler-related 
thing sounds unlikely, considering the patch in question).

		Linus


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 17:44                       ` Eric W. Biederman
@ 2007-04-14 17:54                         ` Ingo Molnar
  2007-04-14 18:18                           ` Willy Tarreau
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-14 17:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Thinking about it, I don't know if there are calls to schedule() 
> > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and 
> > "chvt 2" simply blocked. It would have been possible that a 
> > schedule() call somewhere got starved due to the load, I don't know.
> 
> It looks like there is a call to schedule_work.

so this goes over keventd, right?

> There are two pieces of the path. If you are switching in and out of a 
> tty controlled by something like X.  User space has to grant 
> permission before the operation happens.  Where there isn't a gate 
> keeper I know it is cheaper but I don't know by how much, I suspect 
> there is still a schedule happening in there.

Could keventd perhaps be starved? Willy, to exclude this possibility, 
could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then 
the command to set it to SCHED_FIFO:50 would be:

  chrt -f -p 50 5

but ... events/0 is reniced to -5 by default, so it should definitely 
not be starved.
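
For anyone without chrt at hand, the same change can be sketched with the
POSIX sched_setscheduler() call. The helper names are made up for
illustration, and the call normally needs root:

```c
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Return 1 if prio is a legal SCHED_FIFO priority on this system. */
int fifo_prio_valid(int prio)
{
    return prio >= sched_get_priority_min(SCHED_FIFO) &&
           prio <= sched_get_priority_max(SCHED_FIFO);
}

/* Rough equivalent of "chrt -f -p <prio> <pid>".
 * Returns 0 on success, -1 on failure. */
int set_fifo(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (!fifo_prio_valid(prio))
        return -1;
    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

With events/0 at PID 5, set_fifo(5, 50) would correspond to the chrt
command above.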

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 17:54                         ` Ingo Molnar
@ 2007-04-14 18:18                           ` Willy Tarreau
  2007-04-14 18:40                             ` Eric W. Biederman
  2007-04-15 17:55                             ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 18:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
> 
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> > > Thinking about it, I don't know if there are calls to schedule() 
> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and 
> > > "chvt 2" simply blocked. It would have been possible that a 
> > > schedule() call somewhere got starved due to the load, I don't know.
> > 
> > It looks like there is a call to schedule_work.
> 
> so this goes over keventd, right?
> 
> > There are two pieces of the path. If you are switching in and out of a 
> > tty controlled by something like X.  User space has to grant 
> > permission before the operation happens.  Where there isn't a gate 
> > keeper I know it is cheaper but I don't know by how much, I suspect 
> > there is still a schedule happening in there.
> 
> Could keventd perhaps be starved? Willy, to exclude this possibility, 
> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then 
> the command to set it to SCHED_FIFO:50 would be:
> 
>   chrt -f -p 50 5
> 
> but ... events/0 is reniced to -5 by default, so it should definitely 
> not be starved.

Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
bash forks 1000 processes, then progressively execs scheddos, but it
takes some time). So I'm rebuilding right now. But I think that Linus
has an interesting clue about GPM and notification before switching
the terminal. I think it was enabled in console mode. I don't know
how that translates to frozen xterms, but let's attack the problems
one at a time.

Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 18:18                           ` Willy Tarreau
@ 2007-04-14 18:40                             ` Eric W. Biederman
  2007-04-14 19:01                               ` Willy Tarreau
  2007-04-15 17:55                             ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Eric W. Biederman @ 2007-04-14 18:40 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

Willy Tarreau <w@1wt.eu> writes:

> On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
>> 
>> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>> 
>> > > Thinking about it, I don't know if there are calls to schedule() 
>> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and 
>> > > "chvt 2" simply blocked. It would have been possible that a 
>> > > schedule() call somewhere got starved due to the load, I don't know.
>> > 
>> > It looks like there is a call to schedule_work.
>> 
>> so this goes over keventd, right?
>> 
>> > There are two pieces of the path. If you are switching in and out of a 
>> > tty controlled by something like X.  User space has to grant 
>> > permission before the operation happens.  Where there isn't a gate 
>> > keeper I know it is cheaper but I don't know by how much, I suspect 
>> > there is still a schedule happening in there.
>> 
>> Could keventd perhaps be starved? Willy, to exclude this possibility, 
>> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then 
>> the command to set it to SCHED_FIFO:50 would be:
>> 
>>   chrt -f -p 50 5
>> 
>> but ... events/0 is reniced to -5 by default, so it should definitely 
>> not be starved.
>
> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know
> how that translates to frozen xterms, but let's attack the problems
> one at a time.

I think it is a good clue.  However the intention of the mechanism is
that only processes that change the video mode on a VT are supposed to
use it.  So I really don't think gpm is the culprit.  However it easily could
be something else that has similar characteristics.

I just realized we do have proof that schedule_work is actually working
because SAK works, and we can't sanely do SAK from interrupt context
so we call schedule_work.

Eric


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:18   ` Ingo Molnar
@ 2007-04-14 18:48     ` Bill Huey
  0 siblings, 0 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-14 18:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Bill Huey (hui)

On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream 
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when an
idea gets proposed, like SchedPlug, only to have people be hostile to it
and then suddenly turn around and adopt this idea. It gives the impression
that you, in this specific case, were more interested in controlling a
situation and the track of development instead of actually being inclusive
of the development process with discussion and serious consideration, etc...

This is how the Linux community can be perceived as elitist. The old guard
would serve the community better if people were more mindful and sensitive
to developer issues. There was a particular speech that I was turned off by
at OLS 2006 that was pretty much pandering to the "old guard's" needs over
newer developers. Since I'm a somewhat established engineer in -rt (being
the only other person that mapped the lock hierarchy out for full
preemptibility), I had the confidence to pretty much ignore it, while
previously this could have really upset me and been highly discouraging to
a relatively new developer.

As Linux gets larger and larger this is going to be an increasing problem
when folks come into the community with new ideas and the community will
need to change if it intends to integrate these folks. IMO, a lot of
these flame wars wouldn't need to exist if folks listened to each other
better and permitted co-ownership of code like the scheduler, since it needs
multiple hands in it to adapt to new loads and situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 18:40                             ` Eric W. Biederman
@ 2007-04-14 19:01                               ` Willy Tarreau
  0 siblings, 0 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 19:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote:
> Willy Tarreau <w@1wt.eu> writes:
> 
> > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote:
> >> 
> >> * Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> 
> >> > > Thinking about it, I don't know if there are calls to schedule() 
> >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and 
> >> > > "chvt 2" simply blocked. It would have been possible that a 
> >> > > schedule() call somewhere got starved due to the load, I don't know.
> >> > 
> >> > It looks like there is a call to schedule_work.
> >> 
> >> so this goes over keventd, right?
> >> 
> >> > There are two pieces of the path. If you are switching in and out of a 
> >> > tty controlled by something like X.  User space has to grant 
> >> > permission before the operation happens.  Where there isn't a gate 
> >> > keeper I know it is cheaper but I don't know by how much, I suspect 
> >> > there is still a schedule happening in there.
> >> 
> >> Could keventd perhaps be starved? Willy, to exclude this possibility, 
> >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then 
> >> the command to set it to SCHED_FIFO:50 would be:
> >> 
> >>   chrt -f -p 50 5
> >> 
> >> but ... events/0 is reniced to -5 by default, so it should definitely 
> >> not be starved.
> >
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> > bash forks 1000 processes, then progressively execs scheddos, but it
> > takes some time). So I'm rebuilding right now. But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know
> > how that translates to frozen xterms, but let's attack the problems
> > one at a time.
> 
> I think it is a good clue.  However the intention of the mechanism is
> that only processes that change the video mode on a VT are supposed to
> use it.  So I really don't think gpm is the culprit.  However it easily could
> be something else that has similar characteristics.
> 
> I just realized we do have proof that schedule_work is actually working
> because SAK works, and we can't sanely do SAK from interrupt context
> so we call schedule_work.

Eric,

I can say that Linus, Ingo and you all got on the right track.
I could reproduce, I got a hung tty around 1400 running processes.
Fortunately, it was the one with the root shell which was reniced
to -19.

I could strace chvt 2 :

20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000>
20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002>
20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000>
20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...>
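
The two ioctls in that trace are, to my understanding, essentially all that
chvt does. A minimal sketch, assuming the definitions from <linux/vt.h>
(strace prints "VIDIOC_G_COMP or ..." only because the request numbers
collide); switch_vt is an illustrative name:

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/vt.h>

/* Ask the kernel to switch to VT n, then block until it is active.
 * fd must refer to a console device. Returns 0 on success, -1 on error. */
int switch_vt(int fd, int n)
{
    if (ioctl(fd, VT_ACTIVATE, n) < 0)
        return -1;
    /* This is the call that never completed in the trace. */
    return ioctl(fd, VT_WAITACTIVE, n);
}
```

VT_ACTIVATE only requests the switch and returns at once; the wait in
VT_WAITACTIVE ends only when the switch has really been carried out.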

Then I applied Ingo's suggestion about changing keventd prio :

root@pcw:~# ps auxw|grep event
root         8  0.0  0.0     0    0 ?        SW<  20:31   0:00 [events/0]
root         9  0.0  0.0     0    0 ?        RW<  20:31   0:00 [events/1]

root@pcw:~# rtprio -s 1 -p 50 8 9     (I don't have chrt but it does the same)

My VT immediately switched as soon as I hit Enter. Everything's
working fine again now. So the good news is that it's not a bug
in the tty code, nor a deadlock.

Now, maybe keventd should get a higher prio ? It seems worrying to
me that it may starve when it seems so sensitive.

Also, that may explain why I couldn't reproduce with the fork patch.
Since all new processes got no runtime at first, their impact on
existing ones must have been lower. But I think that if I had waited
longer, I would have had the problem again (though I did not see it
even under a load of 7800).

Regards,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  8:36         ` Willy Tarreau
  2007-04-14 10:53           ` Ingo Molnar
@ 2007-04-14 19:48           ` William Lee Irwin III
  2007-04-14 20:12             ` Willy Tarreau
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-14 19:48 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> Forking becomes very slow above a load of 100 it seems. Sometimes,
> the shell takes 2 or 3 seconds to return to prompt after I run
> "scheddos &"
> Those are very promising results, I nearly observe the same responsiveness
> as I had on a solaris 10 with 10k running processes on a bigger machine.
> I would be curious what a mysql test result would look like now.

Where is scheddos?


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 19:48           ` William Lee Irwin III
@ 2007-04-14 20:12             ` Willy Tarreau
  0 siblings, 0 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-14 20:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> > Forking becomes very slow above a load of 100 it seems. Sometimes,
> > the shell takes 2 or 3 seconds to return to prompt after I run
> > "scheddos &"
> > Those are very promising results, I nearly observe the same responsiveness
> > as I had on a solaris 10 with 10k running processes on a bigger machine.
> > I would be curious what a mysql test result would look like now.
> 
> Where is scheddos?

I will send it to you off-list. I've been avoiding publishing it for a long
time because the stock scheduler was *very* sensitive to trivial attacks
(freezes larger than 30s, impossible to log in). It's very basic, and I
have no problem sending it to anyone who requests it, it's just that as
long as some distros ship early 2.6 kernels I do not want it to appear in
mailing list archives for anyone to grab it and annoy their admins for free.

Cheers,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 22:21 ` William Lee Irwin III
  2007-04-13 22:52   ` Ingo Molnar
@ 2007-04-14 22:38   ` Davide Libenzi
  2007-04-14 23:26     ` Davide Libenzi
                       ` (2 more replies)
  1 sibling, 3 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-14 22:38 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Fri, 13 Apr 2007, William Lee Irwin III wrote:

> On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> >    The CFS patch uses a completely different approach and implementation
> >    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> >    that of RSDL/SD, which is a high standard to meet :-) Testing
> >    feedback is welcome to decide this one way or another. [ and, in any
> >    case, all of SD's logic could be added via a kernel/sched_sd.c module
> >    as well, if Con is interested in such an approach. ]
> >    CFS's design is quite radical: it does not use runqueues, it uses a
> >    time-ordered rbtree to build a 'timeline' of future task execution,
> >    and thus has no 'array switch' artifacts (by which both the vanilla
> >    scheduler and RSDL/SD are affected).
> 
> A binomial heap would likely serve your purposes better than rbtrees.
> It's faster to have the next item to dequeue at the root of the tree
> structure rather than a leaf, for one. There are, of course, other
> priority queue structures (e.g. van Emde Boas) able to exploit the
> limited precision of the priority key for faster asymptotics, though
> actual performance is an open question.

Haven't looked at the scheduler code yet, but for a similar problem I use 
a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are 
queued - in my case they were some sort of timers), and it has a current 
base index (Ib), a current base time (Tb) and a time granularity (Tg). It 
also has a bitmap with bits telling you which slots contain queued tasks. 
An item (task) that has to be scheduled at time T, will be queued in the slot:

S = Ib + min((T - Tb) / Tg, Ns - 1);

Items with T longer than Ns*Tg will be scheduled in the relative last slot 
(choosing a proper Ns and Tg can minimize this).
Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to 
suit your needs.
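
A minimal sketch of the ring bookkeeping described above, with the slot
index wrapped modulo Ns as a ring requires. The struct layout, the function
names, the fixed Ns of 64 and the handling of overdue items are assumptions
made for illustration, not taken from smart-queue.c:

```c
#include <assert.h>
#include <stdint.h>

#define NS 64u  /* number of slots: a power of two that fits the bitmap */

struct time_ring {
    uint64_t bitmap;  /* bit i set => slot i holds queued items */
    unsigned ib;      /* current base index (Ib) */
    uint64_t tb;      /* current base time (Tb) */
    uint64_t tg;      /* time granularity (Tg), must be non-zero */
};

/* Slot for an item due at t: (Ib + min((T - Tb) / Tg, Ns - 1)) mod Ns. */
unsigned tr_slot(const struct time_ring *r, uint64_t t)
{
    uint64_t d;

    assert(r->tg != 0);
    /* Assumption: overdue items (t <= Tb) go into the base slot. */
    d = (t > r->tb) ? (t - r->tb) / r->tg : 0;
    if (d > NS - 1)
        d = NS - 1;  /* far-future items share the relative last slot */
    return (unsigned)((r->ib + d) & (NS - 1));
}

/* O(1) queueing step: mark the slot as occupied and return its index. */
unsigned tr_mark(struct time_ring *r, uint64_t t)
{
    unsigned s = tr_slot(r, t);

    r->bitmap |= UINT64_C(1) << s;
    return s;
}

/* O(Ns) scan from the base index for the first occupied slot; -1 if none. */
int tr_first(const struct time_ring *r)
{
    for (unsigned i = 0; i < NS; i++) {
        unsigned s = (r->ib + i) & (NS - 1);

        if (r->bitmap & (UINT64_C(1) << s))
            return (int)s;
    }
    return -1;
}
```

The per-slot lists are left out; only the index arithmetic and the bitmap
are shown.
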
This is a simple bench between time-ring (TR) and CFS queueing:

http://www.xmailserver.org/smart-queue.c

In my box (Dual Opteron 252):

davide@alien:~$ ./smart-queue -n 8            
CFS = 142.21 cycles/loop
TR  = 72.33 cycles/loop
davide@alien:~$ ./smart-queue -n 16
CFS = 188.74 cycles/loop
TR  = 83.79 cycles/loop
davide@alien:~$ ./smart-queue -n 32
CFS = 221.36 cycles/loop
TR  = 75.93 cycles/loop
davide@alien:~$ ./smart-queue -n 64
CFS = 242.89 cycles/loop
TR  = 81.29 cycles/loop



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 22:38   ` Davide Libenzi
@ 2007-04-14 23:26     ` Davide Libenzi
  2007-04-15  4:01     ` William Lee Irwin III
  2007-04-15 23:09     ` Pavel Pisa
  2 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-14 23:26 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Sat, 14 Apr 2007, Davide Libenzi wrote:

> Haven't looked at the scheduler code yet, but for a similar problem I use 
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are 
> queued - in my case they were some sort of timers), and it has a current 
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It 
> also has a bitmap with bits telling you which slots contain queued tasks. 
> An item (task) that has to be scheduled at time T, will be queued in the slot:
> 
> S = Ib + min((T - Tb) / Tg, Ns - 1);

... mod Ns, of course ;)


- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (8 preceding siblings ...)
  2007-04-14 15:09 ` S.Çağlar Onur
@ 2007-04-15  3:27 ` Con Kolivas
  2007-04-15  5:16   ` Bill Huey
                     ` (2 more replies)
  2007-04-15 12:29 ` Esben Nielsen
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-15  3:27 UTC (permalink / raw)
  To: Ingo Molnar, ck list, Peter Williams, Bill Huey
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.

The casual observer will be completely confused by what on earth has happened 
here so let me try to demystify things for them.

1. I tried in vain some time ago to push a working extensible pluggable cpu 
scheduler framework (based on wli's work) for the linux kernel. It was 
perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as 
being absolutely the wrong approach and that we should never do that. Oddly 
enough the linux-kernel-mailing list was -dead- at the time and the 
discussion did not make it to the mailing list. Every time I've tried to 
forward it to the mailing list the spam filter decided to drop it so most 
people have not even seen this original veto-forever discussion.

2. Since then I've been thinking/working on a cpu scheduler design that takes 
all the guesswork out of scheduling and gives very predictable, as fair 
as possible, cpu distribution and latency while preserving as solid 
interactivity as possible within those confines. For weeks now, Ingo has said 
that the interactivity regressions were showstoppers and we should address 
them, never mind the fact that the so-called regressions were purely "it 
slows down linearly with load" which to me is perfectly desirable behaviour. 
While this was not perma-vetoed, I predicted pretty accurately your intent 
was to veto it based on this.

People kept claiming scheduling problems were few and far between but what was 
really happening is users were terrified of lkml and instead used 1. windows 
and 2. 2.4 kernels. The problems were there.

So where are we now? Here is where your latest patch comes in.

As a solution to the many scheduling problems we finally all agree exist, you 
propose a patch that adds 1. a limited pluggable framework and 2. a fairness 
based cpu scheduler policy... o_O

So I should be happy at last now that the things I was promoting you are also 
promoting, right? Well I'll fill in the rest of the gaps and let other people 
decide how I should feel.

> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,

In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and 
having trips in and out of hospitals for my condition. I appreciate greatly 
the sympathy and patience from people in this regard. However at one stage I 
virtually begged for support with my attempts and help with the code. Dmitry 
Adamushko is the only person who actually helped me with the code in the 
interim, while others poked sticks at it. Sure the sticks helped at times but 
the sticks always seemed to have their ends kerosene doused and flaming for 
reasons I still don't get. No other help was forthcoming.

Now that you're agreeing my direction was correct you've done the usual Linux 
kernel thing - ignore all my previous code and write your own version. Oh 
well, that I've come to expect; at least you get a copyright notice in the 
bootup and somewhere in the comments give me credit for proving it's 
possible. Let's give some other credit here too. William Lee Irwin provided 
the major architecture behind plugsched at my request and I simply finished 
the work and got it working. He is also responsible for many IRC discussions 
I've had about cpu scheduling fairness, designs, programming history and code 
help. Even though he did not contribute code directly to SD, his comments 
have been invaluable.

So let's look at the code.

kernel/sched.c
kernel/sched_fair.c
kernel/sched_rt.c

It turns out this is not a pluggable cpu scheduler framework at all, and I 
guess you didn't really promote it as such. It's a "modular scheduler core". 
Which means you moved code from sched.c into sched_fair.c and sched_rt.c. 
This abstracts out each _scheduling policy's_ functions into struct 
sched_class and allows each scheduling policy's functions to be in a separate 
file etc.

Ok so what it means is that instead of whole cpu schedulers being able to be 
plugged into this framework we can plug in only cpu scheduling policies.... 
hrm... So let's look on

-#define SCHED_NORMAL		0

Ok once upon a time we renamed SCHED_OTHER, which is what every other unix 
calls the standard policy that 99.9% of applications use, to a more meaningful 
name, SCHED_NORMAL. That's fine since all it did was change the description 
internally for those reading the code. Let's see what you've done now:

+#define SCHED_FAIR		0

You've renamed it again. This is, I don't know what exactly to call it, but an 
interesting way of making it look like there is now more choice. Well, 
whatever you call it, everything in linux spawned from init without 
specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed 
SCHED_NORMAL and now SCHED_FAIR.

You encouraged me to create a sched_sd.c to add onto your design as well. 
Well, what do I do with that? I need to create another scheduling policy for 
that code to even be used. A separate scheduling policy requires a userspace 
change to even benefit from it. Even if I make that sched_sd.c patch, people 
cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to 
read SCHED_SD 0 or similar. The same goes for original staircase cpusched, 
nicksched, zaphod, spa_ws, ebs and so on.
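
The point about policy 0 can be sketched in a few lines of plain C. This is
a hypothetical, heavily simplified stand-in for the patch's struct
sched_class, not its real definition; every name below is invented for
illustration, and only the policy numbers (0, 1, 2) are the usual Linux
values:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a per-policy class of hooks. */
struct sched_class_sketch {
    const char *name;
    int (*pick_next)(void);  /* placeholder hook, returns a fake task id */
};

int fair_pick_next(void) { return 1; }
int rt_pick_next(void)   { return 2; }

const struct sched_class_sketch fair_class = { "fair", fair_pick_next };
const struct sched_class_sketch rt_class   = { "rt",   rt_pick_next };

/* Policy numbers as userspace passes them in. */
enum { POLICY_OTHER = 0, POLICY_FIFO = 1, POLICY_RR = 2 };

/* The mapping is fixed in the source: policy 0 always lands on the one
 * class wired in here, whatever name the policy number is given. */
const struct sched_class_sketch *class_for_policy(int policy)
{
    switch (policy) {
    case POLICY_OTHER: return &fair_class;
    case POLICY_FIFO:
    case POLICY_RR:    return &rt_class;
    default:           return NULL;
    }
}
```

In such a layout, a sched_sd.c class would only become the default by
editing the POLICY_OTHER case, which is the "hack SCHED_FAIR 0" complaint
above.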

So what you've achieved with your patch is - replaced the current scheduler 
with another one and moved it into another file. There is no choice, and no 
pluggability, just code trumping. 

Do I support this? In this form.... no.

It's not that I don't like your new scheduler. Heck it's beautiful like most 
of your _serious_ code. It even comes with a catchy name that's bound to give 
people hard-ons (even though many schedulers aim to be completely fair, yours 
has been named that for maximum selling power). The complaint I have is that 
you are not providing quite what you advertise (on the modular front), or 
perhaps you're advertising it as such to make it look more appealing; I'm not 
sure.

Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything 
different, and that this is anything other than your NIH (Not Invented Here) 
cpu scheduling policy rewrite which will probably end up taking its position 
in mainline after yet another truckload of regression/performance tests and 
so on. I haven't seen an awful lot of comparisons with SD yet, just people 
jumping on your bandwagon which is fine I guess. Maybe a few tiny tests 
showing less than 5% variation in their fairness from what I can see. Either 
way, I already feel you've killed off SD... like pretty much everything else 
I've done lately. At least I no longer have to try and support my code mostly 
by myself.

In the interest of putting aside any ego concerns since this is about linux 
and not me...

Because...  You are a hair's breadth away from producing something that I 
would support, which _does_ do what you say and produces the pluggability 
we're all begging for with only tiny changes to the code you've already done. 
Make Kconfig let you choose which sched_*.c gets built into the kernel, and 
make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig 
and even choose one of the alternative built in ones with boot parameters. Your 
code has more clout than mine will (ie do exactly what plugsched does). Then 
we can have 7 schedulers in linux kernel within a few weeks. Oh no! This is 
the very thing Linus didn't want in specialisation with the cpu schedulers! 
Does this mean this idea will be vetoed yet again? In all likelihood, yes.

I guess I have lots to put into -ck still... sigh.

> 	Ingo

-- 
-ck


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 22:38   ` Davide Libenzi
  2007-04-14 23:26     ` Davide Libenzi
@ 2007-04-15  4:01     ` William Lee Irwin III
  2007-04-15  4:18       ` Davide Libenzi
  2007-04-15 23:09     ` Pavel Pisa
  2 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15  4:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Fri, 13 Apr 2007, William Lee Irwin III wrote:
>> A binomial heap would likely serve your purposes better than rbtrees.
[...]

On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use 
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks are 
> queued - in my case they were some sort of timers), and it has a current 
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It 
> also has a bitmap with bits telling you which slots contain queued tasks. 
> An item (task) that has to be scheduled at time T, will be queued in the slot:
> S = Ib + min((T - Tb) / Tg, Ns - 1);
> Items with T longer than Ns*Tg will be scheduled in the relative last slot 
> (choosing a proper Ns and Tg can minimize this).
> Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to 
> suit your needs.

I used a similar sort of queue in the virtual deadline scheduler I
wrote in 2003 or thereabouts. CFS uses queue priorities with too high
a precision to map directly to this (queue priorities are marked as
"key" in the cfs code and should not be confused with task priorities).
The elder virtual deadline scheduler used millisecond resolution and a
rather different calculation for its equivalent of ->key, which
explains how it coped with a limited priority space.

The two basic attacks on such large priority spaces are the near future
vs.  far future subdivisions and subdividing the priority space into
(most often regular) intervals. Subdividing the priority space into
intervals is the most obvious; you simply use some O(lg(n)) priority
queue as the bucket discipline in the "time ring," queue by the upper
bits of the queue priority in the time ring, and by the lower bits in
the O(lg(n)) bucket discipline. The near future vs. far future
subdivision is maintaining the first N tasks in a low-constant-overhead
structure like a sorted list and the remainder in some other sort of
queue structure intended to handle large numbers of elements gracefully.
The distribution of queue priorities strongly influences which of the
methods is most potent, though it should be clear the methods can be
used in combination.
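
The interval subdivision can be sketched as follows. The bit widths, type
and function names are illustrative assumptions, and a sorted list stands
in for the O(lg(n)) bucket discipline purely to keep the example short:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOW_BITS 20u  /* illustrative: key bits resolved inside a bucket */
#define NBUCKETS 64u  /* illustrative: ring size, a power of two */

struct item {
    uint64_t key;       /* high-precision queue priority */
    struct item *next;
};

/* The upper bits of the key choose the ring bucket. */
unsigned bucket_of(uint64_t key)
{
    return (unsigned)((key >> LOW_BITS) & (NBUCKETS - 1));
}

/* Keep a bucket ordered by key. Items sharing the same upper bits are
 * thereby ordered by their lower bits. A sorted list is used only for
 * brevity; a real bucket would use an O(lg(n)) priority queue. */
void bucket_insert(struct item **head, struct item *it)
{
    while (*head && (*head)->key <= it->key)
        head = &(*head)->next;
    it->next = *head;
    *head = it;
}

/* Two-level insert: ring slot by upper bits, then ordered in the bucket. */
void ring_insert(struct item *ring[NBUCKETS], struct item *it)
{
    bucket_insert(&ring[bucket_of(it->key)], it);
}
```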


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  4:01     ` William Lee Irwin III
@ 2007-04-15  4:18       ` Davide Libenzi
  0 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-15  4:18 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Sat, 14 Apr 2007, William Lee Irwin III wrote:

> The two basic attacks on such large priority spaces are the near future
> vs.  far future subdivisions and subdividing the priority space into
> (most often regular) intervals. Subdividing the priority space into
> intervals is the most obvious; you simply use some O(lg(n)) priority
> queue as the bucket discipline in the "time ring," queue by the upper
> bits of the queue priority in the time ring, and by the lower bits in
> the O(lg(n)) bucket discipline.

Sure. If you really need sub-millisecond precision, you can replace the 
bucket's list_head with an rb_root. It may not be necessary, though, for a 
cpu scheduler (still, I haven't read Ingo's code yet).
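
A rough userspace sketch of the "time ring" with an ordered bucket
discipline, under stated assumptions: the upper bits of the key pick the
ring slot and a small binary heap orders entries inside the slot. The heap
is only a stand-in for the kernel's rb_root, which is not available outside
the kernel; all names and sizes (struct time_ring, ring_insert, ring_pop,
RING_BITS, LOW_BITS, BUCKET_MAX) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_BITS  4                 /* 16 ring slots (assumed size) */
#define RING_SIZE  (1u << RING_BITS)
#define LOW_BITS   8                 /* key bits resolved inside a bucket */
#define BUCKET_MAX 32                /* per-bucket capacity (assumed) */

struct bucket {
	unsigned int key[BUCKET_MAX];    /* binary min-heap */
	size_t n;
};

struct time_ring {
	struct bucket b[RING_SIZE];
	unsigned int base;               /* upper-bits index of the oldest slot */
};

/* O(lg n) bucket discipline: sift up. */
static void bucket_push(struct bucket *b, unsigned int key)
{
	size_t i;

	assert(b->n < BUCKET_MAX);
	i = b->n++;
	while (i > 0 && b->key[(i - 1) / 2] > key) {
		b->key[i] = b->key[(i - 1) / 2];
		i = (i - 1) / 2;
	}
	b->key[i] = key;
}

/* O(lg n) bucket discipline: remove the minimum, sift down. */
static unsigned int bucket_pop(struct bucket *b)
{
	unsigned int min, last;
	size_t i = 0;

	assert(b->n > 0);
	min = b->key[0];
	last = b->key[--b->n];
	for (;;) {
		size_t c = 2 * i + 1;

		if (c >= b->n)
			break;
		if (c + 1 < b->n && b->key[c + 1] < b->key[c])
			c++;
		if (b->key[c] >= last)
			break;
		b->key[i] = b->key[c];
		i = c;
	}
	b->key[i] = last;
	return min;
}

/* Queue by the upper bits in the ring, by the full key in the bucket.
 * Returns -1 if the key is in the past, beyond the ring's horizon,
 * or the bucket is full. */
int ring_insert(struct time_ring *r, unsigned int key)
{
	unsigned int idx = key >> LOW_BITS;
	struct bucket *b;

	if (idx < r->base || idx - r->base >= RING_SIZE)
		return -1;
	b = &r->b[idx % RING_SIZE];
	if (b->n == BUCKET_MAX)
		return -1;
	bucket_push(b, key);
	return 0;
}

/* Pop the smallest key: first non-empty slot from base, then its minimum. */
int ring_pop(struct time_ring *r, unsigned int *out)
{
	unsigned int i;

	for (i = 0; i < RING_SIZE; i++) {
		struct bucket *b = &r->b[(r->base + i) % RING_SIZE];

		if (b->n > 0) {
			r->base += i;        /* the ring only moves forward */
			*out = bucket_pop(b);
			return 0;
		}
	}
	return -1;
}
```

With a plain FIFO list in each slot, ordering is only as fine as one slot
(2^LOW_BITS key units here); the ordered bucket is what buys the finer
precision, at O(lg n) per operation within a slot.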


- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
@ 2007-04-15  5:16   ` Bill Huey
  2007-04-15  8:44     ` Ingo Molnar
  2007-04-15 16:11     ` Bernd Eckenfels
  2007-04-15  6:43   ` Mike Galbraith
  2007-04-15 15:05   ` Ingo Molnar
  2 siblings, 2 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-15  5:16 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote:
...
> Now that you're agreeing my direction was correct you've done the usual Linux 
> kernel thing - ignore all my previous code and write your own version. Oh 
> well, that I've come to expect; at least you get a copyright notice in the 
> bootup and somewhere in the comments give me credit for proving it's 
> possible. Let's give some other credit here too. William Lee Irwin provided 
> the major architecture behind plugsched at my request and I simply finished 
> the work and got it working. He is also responsible for many IRC discussions 
> I've had about cpu scheduling fairness, designs, programming history and code 
> help. Even though he did not contribute code directly to SD, his comments 
> have been invaluable.

Hello folks,

I think the main failure I see here is that Con wasn't included in this design
or, privately, in the review process. There could have been better co-ownership
of the code. This could also have been done openly on lkml (since this is kind
of what this medium is about, to a significant degree) so that consensus can
happen (Con can be reasoned with). It would have achieved the same thing but
probably more smoothly if folks just listened, considered an idea and then, in
this case, created something that would allow for experimentation from
outsiders in a fluid fashion.

If these issues aren't fixed, you're going to be stuck with the same kind of
creeping elitism that has gradually killed the FreeBSD project and other BSDs.
I can't comment on the code implementation. I'm focused on other things now
that I'm at NetApp and I can't help out as much as I could. Being former BSDi,
I had a first-hand account of these issues as they played out.

A development process like this is likely to exclude smart people from wanting
to contribute to Linux and folks should be conscious of these issues. It's
basically a lot of code and concept that at least two individuals have worked
on (wli and con) only to have it be rejected and then suddenly replaced by
code from a community gatekeeper. In this case, this results in both Con and
Bill Irwin being woefully underutilized.

If I were one of these people, I'd be mighty pissed.
 
bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
  2007-04-15  5:16   ` Bill Huey
@ 2007-04-15  6:43   ` Mike Galbraith
  2007-04-15  8:36     ` Bill Huey
  2007-04-17  0:06     ` Peter Williams
  2007-04-15 15:05   ` Ingo Molnar
  2 siblings, 2 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15  6:43 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Peter Williams, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner

On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote:
> On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > [CFS]
> >
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >
> >    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> >
> > This project is a complete rewrite of the Linux task scheduler. My goal
> > is to address various feature requests and to fix deficiencies in the
> > vanilla scheduler that were suggested/found in the past few years, both
> > for desktop scheduling and for server scheduling workloads.
> 
> The casual observer will be completely confused by what on earth has happened 
> here so let me try to demystify things for them.

[...]

Demystify what?   The casual observer need only read either your attempt
at writing a scheduler, or my attempts at fixing the one we have, to see
that it was high time for someone with the necessary skills to step in.
Now progress can happen, which was _not_ happening before.

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 13:01             ` Willy Tarreau
  2007-04-14 13:27               ` Willy Tarreau
@ 2007-04-15  7:54               ` Mike Galbraith
  2007-04-15  8:58                 ` Ingo Molnar
  2007-04-19  9:01               ` Ingo Molnar
  2 siblings, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15  7:54 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner

On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:

> Well, I'll stop heating the room for now as I get out of ideas about how
> to defeat it. I'm convinced. I'm impatient to read about Mike's feedback
> with his workload which behaves strangely on RSDL. If it works OK here,
> it will be the proof that heuristics should not be needed.

You mean the X + mp3 player + audio visualization test?  X+Gforce
visualization have problems getting half of my box in the presence of
two other heavy cpu using tasks.  Behavior is _much_ better than
RSDL/SD, but the synchronous nature of X/client seems to be a problem.  

With this scheduler, renicing X/client does cure it, whereas with SD it
did not help one bit.  (I know a trivial way to cure that, and this
framework makes that possible without dorking up fairness as a general
policy.)

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  6:43   ` Mike Galbraith
@ 2007-04-15  8:36     ` Bill Huey
  2007-04-15  8:45       ` Mike Galbraith
                         ` (2 more replies)
  2007-04-17  0:06     ` Peter Williams
  1 sibling, 3 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-15  8:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner, Bill Huey (hui)

On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> [...]
> 
> Demystify what?   The casual observer need only read either your attempt

Here's the problem. You're a casual observer and obviously not paying
attention.

> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.
> Now progress can happen, which was _not_ happening before.

I think that's inaccurate and there are plenty of folks that have that
technical skill and background. The scheduler code isn't a deep mystery
and there are plenty of good kernel hackers out here across many
communities.  Ingo isn't the only person on this planet to have deep
scheduler knowledge. Priority heaps are not new and Solaris has had a
pluggable scheduler framework for years.

Con's characterization is something that I'm more prone to believe about
how Linux kernel development works versus your view. I think it's a great
shame to have folks like Bill Irwin and Con waste time trying to
do something right only to have their ideas attacked, then copied and held
up as the solution for this kind of technical problem, in a complete reversal
of technical opinion as it suits the moment. This is just wrong in so many
ways.

It outlines the problems with Linux kernel development and questionable
elitism regarding ownership of certain sections of the kernel code.

I call it "churn squat" and instances like this only support that view,
which I would rather be completely wrong and inaccurate.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  5:16   ` Bill Huey
@ 2007-04-15  8:44     ` Ingo Molnar
  2007-04-15  9:51       ` Bill Huey
  2007-04-15 16:11     ` Bernd Eckenfels
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15  8:44 UTC (permalink / raw)
  To: Bill Huey
  Cc: Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> Hello folks,
> 
> I think the main failure I see here is that Con wasn't included in 
> this design or privately in review process. There could have been 
> better co-ownership of the code. This could also have been done openly 
> on lkml [...]

Bill, you come from a BSD background and you are still relatively new to 
Linux development, so i dont at all fault you for misunderstanding this 
situation, and fortunately i have a really easy resolution for your 
worries: i did exactly that! :)

i wrote the first line of code of the CFS patch this week, 8am Wednesday 
morning, and released it to lkml 62 hours later, 10pm on Friday. (I've 
listed the file timestamps of my backup patches further below, for all 
the fine details.)

I prefer such early releases to lkml _alot_ more than any private review 
process. I released the CFS code about 6 hours after i thought "okay, 
this looks pretty good" and i spent those final 6 hours on testing it 
(making sure it doesnt blow up on your box, etc.), in the final 2 hours 
i showed it to two folks i could reach on IRC (Arjan and Thomas) and on 
various finishing touches. It doesnt get much faster than that and i 
definitely didnt want to sit on it even one day longer because i very 
much thought that Con and others should definitely see this work!

And i very much credited (and still credit) Con for the whole fairness 
angle:

||  i'd like to give credit to Con Kolivas for the general approach here:
||  he has proven via RSDL/SD that 'fair scheduling' is possible and that
||  it results in better desktop scheduling. Kudos Con!

the 'design consultation' phase you are talking about is _NOW_! :)

I got the v1 code out to Con, to Mike and to many others ASAP. That's 
how you are able to comment on this thread and be part of the 
development process to begin with, in a 'private consultation' setup 
you'd not have had any opportunity to see _any_ of this.

In the BSD space there seem to be more 'political' mechanisms for 
development, but Linux is truly about doing things out in the open, and 
doing it immediately.

Okay? ;-)

Here's the timestamps of all my backups of the patch, from its humble 4K 
beginnings to the 100K first-cut v1 result:

-rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch

i hope this helps :)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  8:36     ` Bill Huey
@ 2007-04-15  8:45       ` Mike Galbraith
  2007-04-15  9:06       ` Ingo Molnar
  2007-04-15 16:25       ` Arjan van de Ven
  2 siblings, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15  8:45 UTC (permalink / raw)
  To: Bill Huey
  Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner

On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote:
> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> > 
> > Demystify what?   The casual observer need only read either your attempt
> 
> Here's the problem. You're a casual observer and obviously not paying
> attention.
> 
> > at writing a scheduler, or my attempts at fixing the one we have, to see
> > that it was high time for someone with the necessary skills to step in.
> > Now progress can happen, which was _not_ happening before.
> 
> I think that's inaccurate and there are plenty of folks that have that
> technical skill and background. The scheduler code isn't a deep mystery
> and there are plenty of good kernel hackers out here across many
> communities.  Ingo isn't the only person on this planet to have deep
> scheduler knowledge.

Ok <shrug>, I'm not paying attention, and you can't read.  We're even.
Have a nice life.

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  7:54               ` Mike Galbraith
@ 2007-04-15  8:58                 ` Ingo Molnar
  2007-04-15  9:11                   ` Mike Galbraith
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15  8:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner


* Mike Galbraith <efault@gmx.de> wrote:

> On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote:
> 
> > Well, I'll stop heating the room for now as I get out of ideas about 
> > how to defeat it. I'm convinced. I'm impatient to read about Mike's 
> > feedback with his workload which behaves strangely on RSDL. If it 
> > works OK here, it will be the proof that heuristics should not be 
> > needed.
> 
> You mean the X + mp3 player + audio visualization test?  X+Gforce 
> visualization have problems getting half of my box in the presence of 
> two other heavy cpu using tasks.  Behavior is _much_ better than 
> RSDL/SD, but the synchronous nature of X/client seems to be a problem.
> 
> With this scheduler, renicing X/client does cure it, whereas with SD 
> it did not help one bit. [...]

thanks for testing it! I was quite worried about your setup - two tasks 
using up 50%/50% of CPU time, pitted against a kernel rebuild workload 
seems to be a hard workload to get right.

> [...] (I know a trivial way to cure that, and this framework makes 
> that possible without dorking up fairness as a general policy.)

great! Please send patches so i can add them (once you are happy with 
the solution) - i think your workload isnt special in any way and could 
hit other people too.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  8:36     ` Bill Huey
  2007-04-15  8:45       ` Mike Galbraith
@ 2007-04-15  9:06       ` Ingo Molnar
  2007-04-16 10:00         ` Ingo Molnar
  2007-04-15 16:25       ` Arjan van de Ven
  2 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15  9:06 UTC (permalink / raw)
  To: Bill Huey
  Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Arjan van de Ven, Thomas Gleixner


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote:
> > [...]
> > 
> > Demystify what?  The casual observer need only read either your 
> > attempt
> 
> Here's the problem. You're a casual observer and obviously not paying 
> attention.

guys, please calm down. Judging by the number of contributions to 
sched.c the main folks who are not 'observers' here and who thus have an 
unalienable right to be involved in a nasty flamewar about scheduler 
interactivity are Con, Mike, Nick and me ;-) Everyone else is just a 
happy bystander, ok? ;-)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  8:58                 ` Ingo Molnar
@ 2007-04-15  9:11                   ` Mike Galbraith
  0 siblings, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15  9:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner

On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:

> > [...] (I know a trivial way to cure that, and this framework makes 
> > that possible without dorking up fairness as a general policy.)
> 
> great! Please send patches so i can add them (once you are happy with 
> the solution) - i think your workload isnt special in any way and could 
> hit other people too.

I'll give it a shot.  (have to read and actually understand your new
code first though, then see if it's really viable)

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  8:44     ` Ingo Molnar
@ 2007-04-15  9:51       ` Bill Huey
  2007-04-15 10:39         ` Pekka Enberg
  0 siblings, 1 reply; 712+ messages in thread
From: Bill Huey @ 2007-04-15  9:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote:
> I prefer such early releases to lkml _alot_ more than any private review 
> process. I released the CFS code about 6 hours after i thought "okay, 
> this looks pretty good" and i spent those final 6 hours on testing it 
> (making sure it doesnt blow up on your box, etc.), in the final 2 hours 
> i showed it to two folks i could reach on IRC (Arjan and Thomas) and on 
> various finishing touches. It doesnt get much faster than that and i 
> definitely didnt want to sit on it even one day longer because i very 
> much thought that Con and others should definitely see this work!
> 
> And i very much credited (and still credit) Con for the whole fairness 
> angle:
> 
> ||  i'd like to give credit to Con Kolivas for the general approach here:
> ||  he has proven via RSDL/SD that 'fair scheduling' is possible and that
> ||  it results in better desktop scheduling. Kudos Con!
> 
> the 'design consultation' phase you are talking about is _NOW_! :)
> 
> I got the v1 code out to Con, to Mike and to many others ASAP. That's 
> how you are able to comment on this thread and be part of the 
> development process to begin with, in a 'private consultation' setup 
> you'd not have had any opportunity to see _any_ of this.
> 
> In the BSD space there seem to be more 'political' mechanisms for 
> development, but Linux is truly about doing things out in the open, and 
> doing it immediately.

I can't even begin to talk about how screwed up BSD development is. Maybe
another time privately.

Ok, Linux development and inclusiveness can be improved. I'm not trying
to "call you out" (slang for accusing you with the sole intention to call
you crazy in a highly confrontational manner). This is discussed publicly
here to bring this issue to light and to open a communication channel as a
means to resolve it.

> Okay? ;-)

It's cool. We're still getting to know each other professionally and it's
okay to a certain degree to have a communication disconnect but only as
long as it clears. Your productivity is amazing BTW. But here's the
problem, there's this perception that NIH is the default mentality here
in Linux.

Con feels that this kind of action is intentional and has a malicious
quality to it as a means of "churn squatting" sections of the kernel tree.
The perception here is that there is this expectation that
sections of the Linux kernel are intentionally "churn squatted" to prevent
any other ideas from creeping in other than those of the owner of that subsystem
(VM, scheduling, etc...) because of lack of modularity in the kernel.
This isn't an API question but a question possibly of general code quality
and how maintenance () of it can .

This was predicted by folks and then this perception was *realized* when
you wrote the equivalent kind of code that has technical overlap with SDL
(this is just one dry example). To a person that is writing new code for
Linux, having one of the old guard write equivalent code to that of a
newcomer has the effect of displacing that person both with regard to
the code and the responsibility that goes with it. When this happens over
and over again and folks get annoyed by it, Linux development starts to
seem elitist.

I know this because I heard (read) Con's IRC chats about these matters
all of the time. This is not just his view but a view of
other kernel folks that have differing views as to. The closing talk at OLS
2006 was highly disturbing in many ways. It went "Christoph is right,
everybody else is wrong," which sends a highly negative message to new
kernel developers that, say, don't work for RH directly or any of the
other mainstream Linux companies. After a while, it starts seeming like
this kind of behavior is completely intentional and that Linux is
full of arrogant bastards.

What I would have done here was to contact Peter Williams, Bill Irwin
and Con about what you're doing and reach a common consensus about how
to create something that would be inclusive of all of their ideas.
Discussions can get technically heated but that's ok, the discussion is
happening and it brings down the wall of this perception. Bill and
Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra.
It might be very useful, it might not be. Folks are all stubborn
about their ideas and hold on to them for dear life. Effective
leaders can deconstruct this hostility and animosity. I don't claim
to be one.

Because of past hostility to something like schedplugin, the hostility
and terseness of responses can be perceived simply as "I'm right,
you're wrong" which is condescending. This affects discussion and
outright destroys a constructive process if this happens continually
since it reinforces that view of "You're an outsider, we don't care
about you". Nobody is listening to each other at that point, folks get
pissed. Then they think about "I'm going to NIH this person with patch
X because he/she did the same here" which is dysfunctional.

Oddly enough, sometimes you're the best person to get a new idea
into the tree. What's not happening here is communication. That takes
sensitivity, careful listening which is a difficult skill, and then
understanding of the characters involved to unify creative energies.

That's a very difficult thing to do for folks that are used to working
solo. It takes time to develop trust in those relationships so that
a true collaboration can happen. I know that there is a lot of
creativity in folks like Con and Bill. It would be wise to develop a
dialog with them to see if they can offload some of your work for you
(we all know you're really busy) yet have you be a key facilitator of
their and your ideas. That's a really tough thing to do and it requires
practice. Just imagine (assuming they can follow through) what could
have positively happened if their collective knowledge was leveraged
better. It's not all clear and rosy, but I think these people are more
on your side than you might realize and it might be a good thing to
discover that.

This is tough because I know the personalities involved and I know
kind of how people function and malfunction in this discussion on a
personal basis.

[We can continue privately. This is not just about you but applicable
to open source development in general]

The tone of this email is intellectually critical (not meant as a
personal attack) and calm. If I'm otherwise, then I'm a bastard. :)

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  9:51       ` Bill Huey
@ 2007-04-15 10:39         ` Pekka Enberg
  2007-04-15 12:45           ` Willy Tarreau
  2007-04-15 15:16           ` Gene Heskett
  0 siblings, 2 replies; 712+ messages in thread
From: Pekka Enberg @ 2007-04-15 10:39 UTC (permalink / raw)
  To: hui Bill Huey
  Cc: Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> The perception here is that there is this expectation that
> sections of the Linux kernel are intentionally "churn squatted" to prevent
> any other ideas from creeping in other than those of the owner of that subsystem

Strangely enough, my perception is that Ingo is simply trying to
address the issues Mike's testing discovered in RSDL and SD. It's not
surprising Ingo made it a separate patch set as Con has repeatedly
stated that the "problems" are in fact by design and won't be fixed.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (9 preceding siblings ...)
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
@ 2007-04-15 12:29 ` Esben Nielsen
  2007-04-15 13:04   ` Ingo Molnar
  2007-04-15 22:49 ` Ismail Dönmez
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 712+ messages in thread
From: Esben Nielsen @ 2007-04-15 12:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, 13 Apr 2007, Ingo Molnar wrote:

> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
>   http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
>
> [...]

I took a brief look at it. Have you tested priority inheritance?
As far as I can see, rt_mutex_setprio() doesn't have much effect on
SCHED_FAIR/SCHED_BATCH. I am looking for the place where such a task changes
scheduler class when boosted in rt_mutex_setprio().

Esben



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 10:39         ` Pekka Enberg
@ 2007-04-15 12:45           ` Willy Tarreau
  2007-04-15 13:08             ` Pekka J Enberg
                               ` (2 more replies)
  2007-04-15 15:16           ` Gene Heskett
  1 sibling, 3 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-15 12:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote:
> On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> >The perception here is that there is this expectation that
> >sections of the Linux kernel are intentionally "churn squatted" to prevent
> >any other ideas from creeping in other than those of the owner of that subsystem
> 
> Strangely enough, my perception is that Ingo is simply trying to
> address the issues Mike's testing discovered in RSDL and SD. It's not
> surprising Ingo made it a separate patch set as Con has repeatedly
> stated that the "problems" are in fact by design and won't be fixed.

That's not exactly the problem. There are people who work very hard to
try to improve some areas of the kernel. They progress slowly, and
acquire more and more skills. Sometimes they feel like they need to
change some concepts and propose those changes which are required for
them to go further, or to develop faster. Those are rejected. So they
are constrained to work in a delimited perimeter from which it is
difficult for them to escape.

Then, the same person who rejected their changes comes with something
shiny new, better and which took him far less time. But he sort of
broke the rules because what was forbidden to the first persons is
suddenly permitted. Maybe for very good reasons, I'm not discussing
that. The good reason should have been valid the first time too.

The fact is that when changes are rejected, we should not simply say
"no", but explain why and define what would be acceptable. Some people
here have excellent teaching skills for this, but most others do not.
Anyway, the rules should be the same for everybody.

Also, there is what can be perceived as marketing here. Con worked
on his idea with convictions, he took time to write some generous
documentation, but he hit a wall where his concept was suboptimal on
a given workload. But at least, all the work was oriented on a technical
basis : design + code + doc.

Then, Ingo comes in with something looking amazingly better, with
virtually no documentation, an appealing announcement, and a shiny
advertising at boot. All this implemented without the constraints
other people had to respect. It already looks like definitive work
which will be merged as-is without many changes except a few bugfixes.

If those were two companies, the first one would simply have accused
the second one of not having respected contracts and having employed
heavy marketing to take the first place.

People here do not code for a living, they do it at least because they
believe in what they are doing, and some of them want a bit of gratitude
for their work. I've met people who were proud to say they implement
this or that feature in the kernel, so it is something important for
them. And being cited in an email is nothing compared to advertising
at boot time.

When the discussion was blocked between Con and Mike concerning the
design problems, it is where a new discussion should have taken place.
Ingo could have publicly spoken with them about his ideas of killing
the O(1) scheduler and replacing it with an rbtree-based one, and using
part of Bill's work to speed up development.

It is far easier to resign when people explain what concepts are wrong
and how they think they will do than when they suddenly present something
out of nowhere which is already better.

And it's not specific to Ingo (though I think his ability to work that
fast alone makes him tend to practise this more often than others).

Imagine if Con had worked another full week on his scheduler with better
results on Mike's workload, but still not as good as Ingo's, and they both
published at the same time. You certainly can imagine he would have preferred
to be informed first that it was pointless to continue in that direction.

Now I hope he and Bill will get over this and accept to work on improving
this scheduler, because I really find it smarter than a dumb O(1). I even
agree with Mike that we now have a solid basis for future work. But for
this, maybe a good starting point would be to remove the selfish printk
at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
and improve the documentation a bit so that people can work together on
the new design, without feeling like their work will only serve to
promote X or Y.

Regards,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 12:29 ` Esben Nielsen
@ 2007-04-15 13:04   ` Ingo Molnar
  2007-04-16  7:16     ` Esben Nielsen
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 13:04 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Esben Nielsen <nielsen.esben@googlemail.com> wrote:

> I took a brief look at it. Have you tested priority inheritance?

yeah, you are right, it's broken at the moment, i'll fix it. But the 
good news is that i think PI could become cleaner via scheduling 
classes.

> As far as I can see rt_mutex_setprio doesn't have much effect on 
> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task 
> changes scheduler class when boosted in rt_mutex_setprio().

i think via scheduling classes we dont have to do the p->policy and 
p->prio based gymnastics anymore, we can just have a clean look at 
p->sched_class and stack the original scheduling class into 
p->real_sched_class. It would probably also make sense to 'privatize' 
p->prio into the scheduling class. That way PI would be a pure property 
of sched_rt, and the PI scheduler would be driven purely by 
p->rt_priority, not by p->prio. That way all the normal_prio() kind of 
complications and interactions with SCHED_OTHER/SCHED_FAIR would be 
eliminated as well. What do you think?
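
A minimal user-space sketch of the stacking idea above, not code from the
CFS patch. The struct layouts, pi_boost(), pi_unboost() and the
real_rt_priority field are hypothetical names chosen for illustration; only
p->sched_class, p->real_sched_class and p->rt_priority come from the
description. It assumes a boost always moves the task into the RT class and
that unboosting restores whatever class was saved:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct sched_class {
	const char *name;
};

static const struct sched_class rt_class   = { "rt" };
static const struct sched_class fair_class = { "fair" };

struct task {
	const struct sched_class *sched_class;      /* class currently in effect */
	const struct sched_class *real_sched_class; /* saved class while boosted, else NULL */
	int rt_priority;                            /* priority used inside rt_class */
	int real_rt_priority;                       /* saved rt_priority while boosted */
};

/* Boost: stash the original class once, then run the task in the RT class. */
static void pi_boost(struct task *p, int rt_priority)
{
	if (!p->real_sched_class) {
		p->real_sched_class = p->sched_class;
		p->real_rt_priority = p->rt_priority;
	}
	/* A later boost only takes effect if it raises the priority. */
	if (p->sched_class != &rt_class || rt_priority > p->rt_priority)
		p->rt_priority = rt_priority;
	p->sched_class = &rt_class;
}

/* Unboost: restore the class and priority saved by the first boost. */
static void pi_unboost(struct task *p)
{
	if (!p->real_sched_class)
		return;
	p->sched_class = p->real_sched_class;
	p->rt_priority = p->real_rt_priority;
	p->real_sched_class = NULL;
}
```

In this model the boost path never looks at p->policy or p->prio: the
decision is carried entirely by the class pointer and p->rt_priority.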

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 12:45           ` Willy Tarreau
@ 2007-04-15 13:08             ` Pekka J Enberg
  2007-04-15 17:32               ` Mike Galbraith
  2007-04-15 15:26             ` William Lee Irwin III
  2007-04-15 15:39             ` Ingo Molnar
  2 siblings, 1 reply; 712+ messages in thread
From: Pekka J Enberg @ 2007-04-15 13:08 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, 15 Apr 2007, Willy Tarreau wrote:
> Ingo could have publicly spoken with them about his ideas of killing
> the O(1) scheduler and replacing it with an rbtree-based one, and using
> part of Bill's work to speed up development.

He did exactly that and he did it with a patch. Nothing new here. This is 
how development on LKML proceeds when you have two or more competing 
designs. There's absolutely no need to get upset or hurt your feelings 
over it. It's not malicious, it's how we do Linux development.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
  2007-04-15  5:16   ` Bill Huey
  2007-04-15  6:43   ` Mike Galbraith
@ 2007-04-15 15:05   ` Ingo Molnar
  2007-04-15 20:05     ` Matt Mackall
  2007-04-16  5:16     ` Con Kolivas
  2 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 15:05 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Con Kolivas <kernel@kolivas.org> wrote:

[ i'm quoting this bit out of order: ]

> 2. Since then I've been thinking/working on a cpu scheduler design 
> that takes away all the guesswork out of scheduling and gives very 
> predictable, as fair as possible, cpu distribution and latency while 
> preserving as solid interactivity as possible within those confines.

yeah. I think you were right on target with this call. I've applied the 
sched.c change attached at the bottom of this mail to the CFS patch, if 
you dont mind. (or feel free to suggest some other text instead.)

> 1. I tried in vain some time ago to push a working extensible 
> pluggable cpu scheduler framework (based on wli's work) for the linux 
> kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he 
> didn't like it) as being absolutely the wrong approach and that we 
> should never do that. [...]

i partially replied to that point to Will already, and i'd like to make 
it clear again: yes, i rejected plugsched 2-3 years ago (which already 
drifted away from wli's original codebase) and i would still reject it 
today.

First and foremost, please dont take such rejections too personally - i 
had my own share of rejections (and in fact, as i mentioned it in a 
previous mail, i had a fair number of complete project throwaways: 
4g:4g, in-kernel Tux, irqrate and many others). I know that they can 
hurt and can demoralize, but if i dont like something it's my job to 
tell that.

Can i sum up your argument as: "you rejected plugsched, but then why on 
earth did you modularize portions of the scheduler in CFS? Isnt your 
position thus woefully inconsistent?" (i'm sure you would never put it 
this impolitely though, but i guess i can flame myself with impunity ;)

While having an inconsistent position isnt a terminal sin in itself, 
please realize that the scheduler classes code in CFS is quite different 
from plugsched: it was a result of what i saw to be technological 
pressure for _internal modularization_. (This internal/policy 
modularization aspect is something that Will said was present in his 
original plugsched code, but which aspect i didnt see in the plugsched 
patches that i reviewed.)

That possibility never even occurred to me until 3 days ago. You never 
raised it either AFAIK. No patches to simplify the scheduler that way 
were ever sent. Plugsched doesnt even touch the core load-balancer for 
example, and most of the time i spent with the modularization was to get 
the load-balancing details right. So it's really apples to oranges.

My view about plugsched: first please take a look at the latest 
plugsched code:

  http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch

  26 files changed, 8951 insertions(+), 1495 deletions(-)

As an experiment i've removed all the add-on schedulers (both the core 
and the include files, only kept the vanilla one) from the plugsched 
patch (and the makefile and kconfig complications, etc), to see the 
'infrastructure cost', and it still gave:

  12 files changed, 1933 insertions(+), 1479 deletions(-)

that's the extra complication i didnt like 3 years ago and which i still 
dont like today. What the current plugsched code does is that it 
simplifies the adding of new experimental schedulers, but it doesnt 
really do what i wanted: to simplify the _scheduler itself_. Personally 
i'm still not primarily interested in having a large selection of 
schedulers, i'm mainly interested in a good and maintainable scheduler 
that works for people.

so the rejection was on these grounds, and i still very much stand by 
that position here and today: i didnt want to see the Linux scheduler 
landscape balkanized and i saw no technological reasons for the 
complication that external modularization brings.

the new scheduling classes code in the CFS patch was not a result of "oh, 
i want to write a new scheduler, lets make schedulers pluggable" kind of 
thinking. That result was just a side-effect of it. (and as you 
correctly noted it, the CFS related modularization is incomplete).

Btw., the thing that triggered the scheduling classes code wasnt even 
plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he 
fixed it within the framework of the existing scheduler, and the end 
result behaved quite well when i threw various testloads on it.

But i felt a bit uncomfortable that it added another few hundred lines 
of code to an already complex sched.c. This felt unnatural so i mailed 
Mike that i'd attempt to clean these infrastructure aspects of sched.c 
up a bit so that it becomes more hackable to him. Thus 3 days ago, 
without having made up my mind about anything, i started this experiment 
(which ended up in the modularization and in the CFS scheduler) to 
simplify the code and to enable Mike to fix such itches in an easier 
way. By your logic Mike should in fact be quite upset about this: if the 
new code works out and proves to be useful then it obsoletes a whole lot 
of code of him!

> For weeks now, Ingo has said that the interactivity regressions were 
> showstoppers and we should address them, never mind the fact that the 
> so-called regressions were purely "it slows down linearly with load" 
> which to me is perfectly desirable behaviour. [...]

yes. For me the first thing when considering a large scheduler patch is: 
"does a patch do what it claims" and "does it work". If those goals are 
met (and if it's a complete scheduler i actually try it quite 
extensively) then i look at the code cleanliness issues. Mike's patch 
was the first one that seemed to meet that threshold in my own humble 
testing, and CFS was a direct result of that.

note that i tried the same workloads with CFS and while it wasnt as good 
as mainline, it handled them better than SD. Mike reported the same, and 
Mark Lord (who too reported SD interactivity problems) reported success 
yesterday too.

(but .. CFS is a mere 2 days old so we cannot really tell anything with 
certainty yet.)

> [...] However at one stage I virtually begged for support with my 
> attempts and help with the code. Dmitry Adamushko is the only person 
> who actually helped me with the code in the interim, while others 
> poked sticks at it. Sure the sticks helped at times but the sticks 
> always seemed to have their ends kerosene doused and flaming for 
> reasons I still don't get. No other help was forthcoming.

i'm really sorry you got that impression.

in 2004 i had a good look at the staircase scheduler and said:

  http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html

   "But in general i'm quite positive about the staircase scheduler."

and even tested it and gave you feedback:

   http://lwn.net/Articles/96562/

i think i even told Andrew that i dont really like pluggable schedulers 
and if there's any replacement for the current scheduler then that would 
be a full replacement, and it would be the staircase scheduler.

Hey, i told this to you as recently as 1 month ago as well:

   http://lkml.org/lkml/2007/3/8/54

   "cool! I like this even more than i liked your original staircase 
    scheduler from 2 years ago :)"

	Ingo

----------->
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -16,6 +16,7 @@
  *		by Davide Libenzi, preemptible kernel bits by Robert Love.
  *  2003-09-03	Interactivity tuning by Con Kolivas.
  *  2004-04-02	Scheduler domains code by Nick Piggin
+ *  2007-04-15	Con Kolivas was dead right: fairness matters! :)
  */
 


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 10:39         ` Pekka Enberg
  2007-04-15 12:45           ` Willy Tarreau
@ 2007-04-15 15:16           ` Gene Heskett
  2007-04-15 16:43             ` Con Kolivas
  1 sibling, 1 reply; 712+ messages in thread
From: Gene Heskett @ 2007-04-15 15:16 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sunday 15 April 2007, Pekka Enberg wrote:
>On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
>> The perception here is that there is this expectation that
>> sections of the Linux kernel are intentionally "churn squatted" to prevent
>> any other ideas from creeping in other than those of the owner of that subsystem
>
>Strangely enough, my perception is that Ingo is simply trying to
>address the issues Mike's testing discovered in RSDL and SD. It's not
>surprising Ingo made it a separate patch set as Con has repeatedly
>stated that the "problems" are in fact by design and won't be fixed.

I won't get into the middle of this just yet, not having decided which dog I 
should bet on yet.  I've been running 2.6.21-rc6 + Con's 0.40 patch for about 
24 hours, it's been generally usable, but gzip still causes lots of 5 to 10+ 
second lags when it's running.  I'm coming to the conclusion that gzip simply 
doesn't play well with others...  

Amazing to me, the cpu it's using stays generally below 80%, and often below 
60%, even while the kmail composer has a full sentence in its buffer that it 
still hasn't shown me when I switch to the htop screen to check, and back to 
the kmail screen to see if it's updated yet.  The screen switch doesn't seem 
to lag so I don't think renicing X would be helpful.  Those are the obvious 
lags, and I'll build & reboot to the CFS patch at some point this morning 
(what's left of it, that is :).  And report in due time of course.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
knot in cables caused data stream to become twisted and kinked


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 12:45           ` Willy Tarreau
  2007-04-15 13:08             ` Pekka J Enberg
@ 2007-04-15 15:26             ` William Lee Irwin III
  2007-04-16 15:55               ` Chris Friesen
  2007-04-15 15:39             ` Ingo Molnar
  2 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15 15:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote:
> Now I hope he and Bill will get over this and accept to work on improving
> this scheduler, because I really find it smarter than a dumb O(1). I even
> agree with Mike that we now have a solid basis for future work. But for
> this, maybe a good starting point would be to remove the selfish printk
> at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind)
> and improve the documentation a bit so that people can work together on
> the new design, without feeling like their work will only serve to
> promote X or Y.

While I appreciate people coming to my defense, or at least the good
intentions behind such, my only actual interest in pointing out
4-year-old work is getting some acknowledgment of having done something
relevant at all. Sometimes it has "I told you so" value. At other times
it's merely clarifying what went on when people refer to it since in a
number of cases the patches are no longer extant, so they can't
actually look at it to get an idea of what was or wasn't done. At other
times I'm miffed about not being credited, whether I should've been or
whether dead and buried code has an implementation of the same idea
resurfacing without the author(s) having any knowledge of my prior work.

One should note that in this case, the first work of mine this trips
over (scheduling classes) was never publicly posted as it was only a
part of the original plugsched (an alternate scheduler implementation
devised to demonstrate plugsched's flexibility with respect to
scheduling policies), and a part that was dropped by subsequent
maintainers. The second work of mine this trips over, a virtual deadline
scheduler named "vdls," was also never publicly posted. Both are from
around the same time period, which makes them approximately 4 years dead.
Neither of the codebases are extant, having been lost in a transition
between employers, though various people recall having been sent them
privately, and plugsched survives in a mutated form as maintained by
Peter Williams, who's been very good about acknowledging my original
contribution.

If I care to become a direct participant in scheduler work, I can do so
easily enough.

I'm not entirely sure what this is about a basis for future work. By
and large one should alter the APIs and data structures to fit the
policy being implemented. While the array swapping was nice for
algorithmically improving 2.4.x -style epoch expiry, most algorithms
not based on the 2.4.x scheduler (in however mutated a form) should use
a different queue structure, in fact, one designed around their
policy's specific algorithmic needs. IOW, when one alters the scheduler,
one should also alter the queue data structure appropriately. I'd not
expect the priority queue implementation in cfs to continue to be used
unaltered as it matures, nor would I expect any significant modification
of the scheduler to necessarily use a similar one.

By and large I've been mystified as to why there is such a penchant for
preserving the existing queue structures in the various scheduler
patches floating around. I am now every bit as mystified at the point
of view that seems to be emerging that a change of queue structure is
particularly significant. These are all largely internal changes to
sched.c, and as such, rather small changes in and of themselves. While
they do tend to have user-visible effects, from this point of view
even changing out every line of sched.c is effectively a micropatch.

Something more significant might be altering the schedule() API to
take a mandatory description of the intention of the call to it, or
breaking up schedule() into several different functions to distinguish
between different sorts of uses of it to which one would then respond
differently. Also more significant would be adding a new state beyond
TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some
tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE
users to use the new state and handle those fatal signals. While not
quite as ostentatious in their user-visible effects as SCHED_OTHER
policy affairs, they are tremendously more work than switching out the
implementation of a single C file, and so somewhat more respectable.

Even as scheduling semantics go, these are micropatches. So SCHED_OTHER
changes a little. Where are the gang schedulers? Where are the batch
schedulers (SCHED_BATCH is not truly such)? Where are the isochronous
(frame) schedulers? I suppose there is some CKRM work that actually has
a semantic impact despite being largely devoted to SCHED_OTHER, and
there's some spufs gang scheduling going on, though not all that much.
And to reiterate a point from other threads, even as SCHED_OTHER
patches go, I see precious little verification that things like the
semantics of nice numbers or other sorts of CPU bandwidth allocation
between competing tasks of various natures are staying the same while
other things are changed, or at least being consciously modified in
such a fashion as to improve them. I've literally only seen one or two
tests (and rather inflexible ones with respect to sleep and running
time mixtures) with any sort of quantification of how CPU bandwidth is
distributed get run on all this.

So from my point of view, there's a lot of churn and craziness going on
in one tiny corner of the kernel and people don't seem to have a very
solid grip on what effects their changes have or how they might
potentially break userspace. So I've developed a sudden interest in
regression testing of the scheduler in order to ensure that various sorts
of semantics on which userspace relies are not broken, and am trying to
spark more interest in general in nailing down scheduling semantics and
verifying that those semantics are honored and remain honored by whatever
future scheduler implementations might be merged.

Thus far, the laundry list of semantics I'd like to have nailed down
are specifically:

(1) CPU bandwidth allocation according to nice numbers
(2) CPU bandwidth allocation among mixtures of tasks with varying
	sleep/wakeup behavior e.g. that consume some percentage of cpu
	in isolation, perhaps also varying the granularity of their
	sleep/wakeup patterns
(3) sched_yield(), so multitier userspace locking doesn't go haywire
(4) How these work with SMP; most people agree it should be mostly the
	same as it works on UP, but it's not being verified, as most
	testcases are barely SMP-aware if at all, and corner cases
	where proportionality breaks down aren't considered

The sorts of explicit decisions I'd like to see made for these are:
(1) In a mixture of tasks with varying nice numbers, a given nice number
	corresponds to some share of CPU bandwidth. Implementations
	should not have the freedom to change this arbitrarily according
	to some intention.
(2) A given scheduler _implementation_ intends to distribute CPU
	bandwidth among mixtures of tasks that would each consume some
	percentage of the CPU in isolation varying across tasks in some
	particular pattern. For example, maybe some scheduler
	implementation assigns a share of 1/%cpu to a task that would
	consume %cpu in isolation, for a CPU bandwidth allocation of
	(1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks
	(this is not to say that such a policy makes sense).
(3) sched_yield() is intended to result in some particular scheduling
	pattern in a given scheduler implementation. For instance, an
	implementation may intend that a set of CPU hogs calling
	sched_yield() between repeatedly printf()'ing their pid's will
	see their printf()'s come out in an approximately consistent
	order as the scheduler cycles between them.
(4) What an implementation intends to do with respect to SMP CPU
	bandwidth allocation when precise emulation of UP behavior is
	impossible, considering sched_yield() scheduling patterns when
	possible as well. For instance, perhaps an implementation
	intends to ensure equal CPU bandwidth among competing CPU-bound
	tasks of equal priority at all costs, and so triggers migration
	and/or load balancing to make it so. Or perhaps an implementation
	intends to ensure precise sched_yield() ordering at all costs
	even on SMP. Some sort of specification of the intention, then
	a verification that the intention is carried out in a testcase.
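
The example allocation in (2) can be written out directly. What follows is
only an illustrative sketch: inverse_share() is a hypothetical helper, not
part of any scheduler or existing testcase, and it assumes each task's
standalone CPU usage is given as a fraction in (0, 1]. It computes
(1/%cpu(i)) / (sum over t of 1/%cpu(t)), the value a testcase would compare
against the measured share:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expected share of task i under the example policy in (2):
 *
 *     share(i) = (1/cpu[i]) / (sum over t of 1/cpu[t])
 *
 * cpu[] holds each task's standalone CPU usage as a fraction in (0, 1].
 * Returns -1.0 if i is out of range or any entry is outside (0, 1].
 */
static double inverse_share(const double *cpu, size_t n, size_t i)
{
	double sum = 0.0;
	size_t t;

	if (i >= n)
		return -1.0;

	for (t = 0; t < n; t++) {
		if (cpu[t] <= 0.0 || cpu[t] > 1.0)
			return -1.0;
		sum += 1.0 / cpu[t];
	}

	return (1.0 / cpu[i]) / sum;
}
```

For two tasks that would use 50% and 25% of the CPU in isolation, the weights
are 2 and 4, so the expected shares are 1/3 and 2/3 respectively. As the text
says, this is not to claim that such a policy makes sense, only that a stated
intention of this form is something a testcase can check.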

Also, if there's a semantic issue to be resolved, I want it to have
something describing it and verifying it. For instance, characterizing
whatever sort of scheduling artifacts queue-swapping causes in the
mainline scheduler and then a testcase to demonstrate the artifact and
its resolution in a given scheduler rewrite would be a good design
statement and verification. For instance, if someone wants to go back
to queue-swapping or other epoch expiry semantics, it would make them
(and hopefully everyone else) conscious of the semantic issue the
change raises, or possibly serve as a demonstration that the artifacts
can be mitigated in some implementation retaining epoch expiry semantics.

As I become aware of more potential issues I'll add more to my laundry
list, and I'll hammer out testcases as I go. My concern with the
scheduler is that this sort of basic functionality may be significantly
disturbed with no one noticing at all until a distro issues a prerelease
and benchmarks go haywire, and furthermore that changes to this kind of
basic behavior may be signs of things going awry, particularly as more
churn happens.

So now that I've clarified my role in all this to date and my point of
view on it, it should be clear that accepting something and working on
some particular scheduler implementation don't make sense as
suggestions to me.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 12:45           ` Willy Tarreau
  2007-04-15 13:08             ` Pekka J Enberg
  2007-04-15 15:26             ` William Lee Irwin III
@ 2007-04-15 15:39             ` Ingo Molnar
  2007-04-15 15:47               ` William Lee Irwin III
  2007-04-16  5:27               ` Peter Williams
  2 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 15:39 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> Ingo could have publicly spoken with them about his ideas of killing 
> the O(1) scheduler and replacing it with an rbtree-based one, [...]

yes, that's precisely what i did, via a patchset :)

[ I can even tell you when it all started: i was thinking about Mike's
  throttling patches while watching Manchester United beat the crap out
  of AS Roma (7 to 1 end result), Tuesday evening. I started coding it
  Wednesday morning and sent the patch Friday evening. I very much
  believe in low-latency when it comes to development too ;) ]

(if this had been done via a committee then today we'd probably still be 
trying to find a suitable timeslot for the initial conference call where 
we'd discuss the election of a chair who would be tasked with writing up 
an initial document of feature requests, on which we'd take a vote, 
possibly this year already, because the matter is really urgent you know 
;-)

> [...] and using part of Bill's work to speed up development.

ok, let me make this absolutely clear: i didnt use any bit of plugsched 
- in fact the most difficult bits of the modularization were for areas of 
sched.c that plugsched never even touched AFAIK. (the load-balancer for 
example.)

Plugsched simply does something else: in essence i modularized scheduling 
policies that have to cooperate with each other, while plugsched 
modularized complete schedulers which are compile-time or boot-time 
selected, with no runtime cooperation between them. (one has to be 
selected at a time)
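
To illustrate what "cooperating at runtime" can mean, here is a small
user-space sketch, not taken from the CFS patch. It assumes the classes are
chained from highest to lowest priority and that the core asks each one in
turn for a runnable task; struct rq, pick_rt(), pick_fair() and the field
names are hypothetical simplifications:

```c
#include <assert.h>
#include <stddef.h>

struct task {
	int pid;
};

/* Toy runqueue: at most one candidate task per policy. */
struct rq {
	struct task *rt_head;   /* next runnable RT task, or NULL */
	struct task *fair_head; /* next runnable fair task, or NULL */
};

struct sched_class {
	const struct sched_class *next;                /* next lower-priority class */
	struct task *(*pick_next_task)(struct rq *rq); /* NULL: nothing runnable */
};

static struct task *pick_rt(struct rq *rq)
{
	return rq->rt_head;
}

static struct task *pick_fair(struct rq *rq)
{
	return rq->fair_head;
}

/* Both classes are active at once, chained RT first, then fair. */
static const struct sched_class fair_class = { NULL, pick_fair };
static const struct sched_class rt_class   = { &fair_class, pick_rt };

/* The core knows no policy details; it only walks the chain. */
static struct task *pick_next_task(struct rq *rq)
{
	const struct sched_class *cls;
	struct task *p;

	for (cls = &rt_class; cls; cls = cls->next) {
		p = cls->pick_next_task(rq);
		if (p)
			return p;
	}
	return NULL; /* a real kernel would fall back to the idle task */
}
```

Under a plugsched-style design, by contrast, exactly one complete scheduler
would be selected at build or boot time, so no walk across policies like
this would take place.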

(and i have no trouble at all with crediting Will's work either: a few 
years ago i used Will's PID rework concepts for an NPTL related speedup 
and Will is very much credited for it in today's kernel/pid.c and he 
continued to contribute to it later on.)

(the tree walking bits of sched_fair.c were in fact derived from 
kernel/hrtimer.c, the rbtree code written by Thomas and me :-)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:39             ` Ingo Molnar
@ 2007-04-15 15:47               ` William Lee Irwin III
  2007-04-16  5:27               ` Peter Williams
  1 sibling, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15 15:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Willy Tarreau <w@1wt.eu> wrote:
>> [...] and using part of Bill's work to speed up development.

On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote:
> ok, let me make this absolutely clear: i didn't use any bit of plugsched 
> - in fact the most difficult bits of the modularization were for areas of 
> sched.c that plugsched never even touched AFAIK. (the load-balancer for 
> example.)
> Plugsched simply does something else: i modularized scheduling policies 
> in essence that have to cooperate with each other, while plugsched 
> modularized complete schedulers which are compile-time or boot-time 
> selected, with no runtime cooperation between them. (one has to be 
> selected at a time)
> (and i have no trouble at all with crediting Will's work either: a few 
> years ago i used Will's PID rework concepts for an NPTL related speedup 
> and Will is very much credited for it in today's kernel/pid.c and he 
> continued to contribute to it later on.)
> (the tree walking bits of sched_fair.c were in fact derived from 
> kernel/hrtimer.c, the rbtree code written by Thomas and me :-)

The extant plugsched patches have nothing to do with cfs; I suspect
what everyone else is going on about is terminological confusion. The
4-year-old sample policy with scheduling classes for the original
plugsched is something you had no way of knowing about, as it was never
publicly posted. There isn't really anything all that interesting going
on here, apart from pointing out that it's been done before.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  5:16   ` Bill Huey
  2007-04-15  8:44     ` Ingo Molnar
@ 2007-04-15 16:11     ` Bernd Eckenfels
  1 sibling, 0 replies; 712+ messages in thread
From: Bernd Eckenfels @ 2007-04-15 16:11 UTC (permalink / raw)
  To: linux-kernel

In article <20070415051645.GA28438@gnuppy.monkey.org> you wrote:
> A development process like this is likely to exclude smart people from wanting
> to contribute to Linux and folks should be conscious of this issue.

Nobody is excluded, you can always have a next iteration.

Gruss
Bernd


* Re: Kaffeine problem with CFS
  2007-04-14 16:59     ` S.Çağlar Onur
@ 2007-04-15 16:13       ` Ingo Molnar
  2007-04-15 16:25         ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:13 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Michael Lothian, Christophe Thommeret,
	Christoph Pfister, Jurgen Kofler


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> > hm, could you try to strace it and/or attach gdb to it and figure 
> > out what's wrong? (perhaps involving the Kaffeine developers too?) 
> > As long as it's not a kernel level crash i cannot see how the 
> > scheduler could directly cause this - other than by accident 
> > creating a scheduling pattern that triggers a user-space bug more 
> > often than with other schedulers.
> 
> ...
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = -1 EINTR (Interrupted system call)
> --- SIGINT (Interrupt) @ 0 (0) ---
> +++ killed by SIGINT +++
> 
> is where freeze occurs. Full log can be found at [1]

> [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine

thanks. This does have the appearance of a userspace race condition of 
some sort. Can you trigger this hang with the patch below applied to 
the vanilla tree as well? (with no CFS patch applied)

if yes then this would suggest that Kaffeine has some sort of problem 
that depends on child-runs-first ordering. (CFS changes that to 
parent-runs-first. Kaffeine starts a couple of threads and the futex 
calls are a sign of thread<->thread communication.)

[ i have also Cc:-ed the Kaffeine folks - maybe your strace gives them
  an idea about what the problem could be :) ]

	Ingo

---
 kernel/sched.c |   70 ++-------------------------------------------------------
 1 file changed, 3 insertions(+), 67 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,13 @@ void fastcall sched_fork(struct task_str
  */
 void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 {
-	struct rq *rq, *this_rq;
 	unsigned long flags;
-	int this_cpu, cpu;
+	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
-	this_cpu = smp_processor_id();
-	cpu = task_cpu(p);
-
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive. The parent
-	 * (current) is done further down, under its lock.
-	 */
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->prio = effective_prio(p);
-
-	if (likely(cpu == this_cpu)) {
-		if (!(clone_flags & CLONE_VM)) {
-			/*
-			 * The VM isn't cloned, so we're in a good position to
-			 * do child-runs-first in anticipation of an exec. This
-			 * usually avoids a lot of COW overhead.
-			 */
-			if (unlikely(!current->array))
-				__activate_task(p, rq);
-			else {
-				p->prio = current->prio;
-				p->normal_prio = current->normal_prio;
-				list_add_tail(&p->run_list, &current->run_list);
-				p->array = current->array;
-				p->array->nr_active++;
-				inc_nr_running(p, rq);
-			}
-			set_need_resched();
-		} else
-			/* Run child last */
-			__activate_task(p, rq);
-		/*
-		 * We skip the following code due to cpu == this_cpu
-	 	 *
-		 *   task_rq_unlock(rq, &flags);
-		 *   this_rq = task_rq_lock(current, &flags);
-		 */
-		this_rq = rq;
-	} else {
-		this_rq = cpu_rq(this_cpu);
-
-		/*
-		 * Not the local CPU - must adjust timestamp. This should
-		 * get optimised away in the !CONFIG_SMP case.
-		 */
-		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
-					+ rq->most_recent_timestamp;
-		__activate_task(p, rq);
-		if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
-
-		/*
-		 * Parent and child are on different CPUs, now get the
-		 * parent runqueue to update the parent's ->sleep_avg:
-		 */
-		task_rq_unlock(rq, &flags);
-		this_rq = task_rq_lock(current, &flags);
-	}
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-	task_rq_unlock(this_rq, &flags);
+	__activate_task(p, rq);
+	task_rq_unlock(rq, &flags);
 }
 
 /*


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  8:36     ` Bill Huey
  2007-04-15  8:45       ` Mike Galbraith
  2007-04-15  9:06       ` Ingo Molnar
@ 2007-04-15 16:25       ` Arjan van de Ven
  2007-04-16  5:36         ` Bill Huey
  2 siblings, 1 reply; 712+ messages in thread
From: Arjan van de Ven @ 2007-04-15 16:25 UTC (permalink / raw)
  To: Bill Huey
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Thomas Gleixner


> It outlines the problems with Linux kernel development and questionable
> elitism regarding ownership of certain sections of the kernel code.

I have to step in and disagree here....

Linux is not about who writes the code.

Linux is about getting the best solution for a problem. Who wrote which
line of the code is irrelevant in the big picture.

that often means that multiple implementations happen, and that a
darwinistic process decides that the best solution wins.

This darwinistic process often happens in the form of discussion, and
that discussion can happen with words or with code. In this case it
happened with a code proposal.

To make this specific: it has happened many times to me that when I
solved an issue with code, someone else stepped in and wrote a different
solution (although that was usually for smaller pieces). Was I upset
about that? No! I was happy because my *problem got solved* in the best
possible way.

Now this doesn't mean that people shouldn't be nice to each other,
shouldn't cooperate, or should steal credit, but I don't get the
impression that that is happening here. Ingo is taking part in the
discussion with a counter proposal for discussion *on the mailing list*.
What more do you want?? If you or anyone else can improve it or do
better, take part in this discussion and show what you mean either in
words or in code.

Your qualification of the discussion as an elitist takeover... I disagree
with that. It's a *discussion*. Now if you agree that Ingo's patch is
better technically, you and others should be happy about that because
your problem is getting solved better. If you don't agree that his patch
is better technically, take part in the technical discussion.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org



* Re: Kaffeine problem with CFS
  2007-04-15 16:13       ` Kaffeine problem with CFS Ingo Molnar
@ 2007-04-15 16:25         ` Ingo Molnar
  2007-04-15 16:55           ` Christoph Pfister
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:25 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Michael Lothian, Christophe Thommeret,
	Christoph Pfister, Jurgen Kofler


* Ingo Molnar <mingo@elte.hu> wrote:

> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
> 
> thanks. This does have the appearance of a userspace race condition of 
> some sort. Can you trigger this hang with the patch below applied to 
> the vanilla tree as well? (with no CFS patch applied)

oops, please use the patch below instead.

	Ingo

---
 kernel/sched.c |   69 ++++-----------------------------------------------------
 1 file changed, 5 insertions(+), 64 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,18 @@ void fastcall sched_fork(struct task_str
  */
 void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 {
-	struct rq *rq, *this_rq;
 	unsigned long flags;
-	int this_cpu, cpu;
+	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
-	this_cpu = smp_processor_id();
-	cpu = task_cpu(p);
-
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive. The parent
-	 * (current) is done further down, under its lock.
-	 */
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
 
 	p->prio = effective_prio(p);
+	__activate_task(p, rq);
+	if (TASK_PREEMPTS_CURR(p, rq))
+		resched_task(rq->curr);
 
-	if (likely(cpu == this_cpu)) {
-		if (!(clone_flags & CLONE_VM)) {
-			/*
-			 * The VM isn't cloned, so we're in a good position to
-			 * do child-runs-first in anticipation of an exec. This
-			 * usually avoids a lot of COW overhead.
-			 */
-			if (unlikely(!current->array))
-				__activate_task(p, rq);
-			else {
-				p->prio = current->prio;
-				p->normal_prio = current->normal_prio;
-				list_add_tail(&p->run_list, &current->run_list);
-				p->array = current->array;
-				p->array->nr_active++;
-				inc_nr_running(p, rq);
-			}
-			set_need_resched();
-		} else
-			/* Run child last */
-			__activate_task(p, rq);
-		/*
-		 * We skip the following code due to cpu == this_cpu
-	 	 *
-		 *   task_rq_unlock(rq, &flags);
-		 *   this_rq = task_rq_lock(current, &flags);
-		 */
-		this_rq = rq;
-	} else {
-		this_rq = cpu_rq(this_cpu);
-
-		/*
-		 * Not the local CPU - must adjust timestamp. This should
-		 * get optimised away in the !CONFIG_SMP case.
-		 */
-		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
-					+ rq->most_recent_timestamp;
-		__activate_task(p, rq);
-		if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
-
-		/*
-		 * Parent and child are on different CPUs, now get the
-		 * parent runqueue to update the parent's ->sleep_avg:
-		 */
-		task_rq_unlock(rq, &flags);
-		this_rq = task_rq_lock(current, &flags);
-	}
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-	task_rq_unlock(this_rq, &flags);
+	task_rq_unlock(rq, &flags);
 }
 
 /*


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:16           ` Gene Heskett
@ 2007-04-15 16:43             ` Con Kolivas
  2007-04-15 16:58               ` Gene Heskett
  0 siblings, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-15 16:43 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Monday 16 April 2007 01:16, Gene Heskett wrote:
> On Sunday 15 April 2007, Pekka Enberg wrote:
> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> >> The perception here is that there is this expectation that sections
> >> of the Linux kernel are intentionally "churn squatted" to prevent any
> >> other ideas from creeping in other than those of the owner of that
> >> subsystem
> >
> >Strangely enough, my perception is that Ingo is simply trying to
> >address the issues Mike's testing discovered in RSDL and SD. It's not
> >surprising Ingo made it a separate patch set as Con has repeatedly
> >stated that the "problems" are in fact by design and won't be fixed.
>
> I won't get into the middle of this just yet, not having decided which dog
> I should bet on yet.  I've been running 2.6.21-rc6 + Con's 0.40 patch for
> about 24 hours, it's been generally usable, but gzip still causes lots of 5
> to 10+ second lags when it's running.  I'm coming to the conclusion that
> gzip simply doesn't play well with others...

Actually Gene I think you're being bitten here by something I/O bound since 
the cpu usage never tops out. If that's the case and gzip is dumping 
truckloads of writes then you're suffering something that irks me even more 
than the scheduler in linux, and that's how much writes hurt just about 
everything else. Try your testcase with bzip2 instead (since that won't be 
i/o bound), or drop your dirty ratio to as low as possible which helps a 
little bit (5% is the minimum)

echo 5 > /proc/sys/vm/dirty_ratio

and finally try the braindead noop i/o scheduler as well.

echo noop > /sys/block/sda/queue/scheduler

(replace sda with your drive obviously).
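
To confirm that the switch took effect, the same sysfs file can be read back: it lists the available I/O schedulers with the active one in square brackets, for example `noop anticipatory deadline [cfq]`. The following is only a sketch of parsing that line; `active_io_scheduler()` is a hypothetical helper name, not an existing API, and reading the file itself is left to the caller:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name out of a line such as
 * "noop anticipatory deadline [cfq]" into out.
 * Returns 0 on success, -1 if no bracketed name is present or if it
 * does not fit into outlen bytes (including the terminating NUL).
 */
int active_io_scheduler(const char *line, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(line != NULL && out != NULL);

	lb = strchr(line, '[');
	if (lb == NULL)
		return -1;
	rb = strchr(lb + 1, ']');
	if (rb == NULL)
		return -1;

	len = (size_t)(rb - lb - 1);
	if (len == 0 || len >= outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

After `echo noop > /sys/block/sda/queue/scheduler`, feeding the file's contents to this helper should yield `noop`; anything else would mean the write did not take.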

I'd wager a big one that's what causes your gzip pain. If it wasn't for the 
fact that I've decided to all but give up ever trying to provide code for 
mainline again, trying my best to make writes hurt less on linux would be my 
next big thing [tm]. 

Oh and for the others watching, (points to vm hackers) I found a bug when 
playing with the dirty ratio code. If you modify it to allow it to drop below 
5% but still above the minimum in the vm code, stalls happen somewhere in the vm 
where nothing much happens for sometimes 20 or 30 seconds worst case 
scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be 
set ultra low because these stalls were gross.

> Amazing to me, the cpu it's using stays generally below 80%, and often below
> 60%, even while the kmail composer has a full sentence in its buffer that
> it still hasn't shown me when I switch to the htop screen to check, and
> back to the kmail screen to see if it's updated yet.  The screen switch
> doesn't seem to lag so I don't think renicing x would be helpful.  Those
> are the obvious lags, and I'll build & reboot to the CFS patch at some
> point this morning (what's left of it that is :).  And report in due time of
> course

-- 
-ck


* Re: Kaffeine problem with CFS
  2007-04-15 16:25         ` Ingo Molnar
@ 2007-04-15 16:55           ` Christoph Pfister
  2007-04-15 22:14             ` S.Çağlar Onur
                               ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Christoph Pfister @ 2007-04-15 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler

Hi,

2007/4/15, Ingo Molnar <mingo@elte.hu>:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine

Could you try xine-ui or gxine? Because I suspect rather xine-lib for
freezing issues. In any way I think a gdb backtrace would be much
nicer - but if you can't reproduce the freeze issue with other xine
based players and want to run kaffeine in gdb, you need to execute
"gdb --args kaffeine --nofork".

> > thanks. This does have the appearance of a userspace race condition of
> > some sort. Can you trigger this hang with the patch below applied to
> > the vanilla tree as well? (with no CFS patch applied)
>
> oops, please use the patch below instead.
>
>         Ingo
<snip>

Christoph


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 16:43             ` Con Kolivas
@ 2007-04-15 16:58               ` Gene Heskett
  2007-04-15 18:00                 ` Mike Galbraith
  0 siblings, 1 reply; 712+ messages in thread
From: Gene Heskett @ 2007-04-15 16:58 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sunday 15 April 2007, Con Kolivas wrote:
>On Monday 16 April 2007 01:16, Gene Heskett wrote:
>> On Sunday 15 April 2007, Pekka Enberg wrote:
>> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
>> >> The perception here is that there is this expectation that sections
>> >> of the Linux kernel are intentionally "churn squatted" to prevent
>> >> any other ideas from creeping in other than those of the owner of
>> >> that subsystem
>> >
>> >Strangely enough, my perception is that Ingo is simply trying to
>> >address the issues Mike's testing discovered in RSDL and SD. It's not
>> >surprising Ingo made it a separate patch set as Con has repeatedly
>> >stated that the "problems" are in fact by design and won't be fixed.
>>
>> I won't get into the middle of this just yet, not having decided which dog
>> I should bet on yet.  I've been running 2.6.21-rc6 + Con's 0.40 patch for
>> about 24 hours, it's been generally usable, but gzip still causes lots of 5
>> to 10+ second lags when it's running.  I'm coming to the conclusion that
>> gzip simply doesn't play well with others...
>
>Actually Gene I think you're being bitten here by something I/O bound since
>the cpu usage never tops out. If that's the case and gzip is dumping
>truckloads of writes then you're suffering something that irks me even more
>than the scheduler in linux, and that's how much writes hurt just about
>everything else. Try your testcase with bzip2 instead (since that won't be
>i/o bound), or drop your dirty ratio to as low as possible which helps a
>little bit (5% is the minimum)
>
>echo 5 > /proc/sys/vm/dirty_ratio
>
>and finally try the braindead noop i/o scheduler as well.
>
>echo noop > /sys/block/sda/queue/scheduler
>
>(replace sda with your drive obviously).
>
>I'd wager a big one that's what causes your gzip pain. If it wasn't for the
>fact that I've decided to all but give up ever trying to provide code for
>mainline again, trying my best to make writes hurt less on linux would be my
>next big thing [tm].

Chuckle, possibly but then I'm not anything even remotely close to an expert 
here Con, just reporting what I get.  And I just rebooted to 2.6.21-rc6 + 
sched-mike-5.patch for grins and giggles, or frowns and profanity as the case 
may call for.

>Oh and for the others watching, (points to vm hackers) I found a bug when
>playing with the dirty ratio code. If you modify it to allow it to drop below
> 5% but still above the minimum in the vm code, stalls happen somewhere in
> the vm where nothing much happens for sometimes 20 or 30 seconds worst case
> scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to
> be set ultra low because these stalls were gross.

I think I'd need a bit of tutoring on how to do that.  I recall that one other 
time, several weeks back, I thought I would try one of those famous echo this 
>/proc/that ideas that went by on this list, but even though I was root, 
apparently /proc was read-only AFAIWC.

>> Amazing to me, the cpu it's using stays generally below 80%, and often
>> below 60%, even while the kmail composer has a full sentence in its buffer
>> that it still hasn't shown me when I switch to the htop screen to check,
>> and back to the kmail screen to see if it's updated yet.  The screen switch
>> doesn't seem to lag so I don't think renicing x would be helpful.  Those
>> are the obvious lags, and I'll build & reboot to the CFS patch at some
>> point this morning (what's left of it that is :).  And report in due time
>> of course

And now I wonder if I applied the right patch.  This one feels good ATM, but I 
don't think it's the CFS thingy.  No, I'm sure of it now, none of the patches 
I've saved say a thing about CFS.  Backtrack up the list time I guess, ignore 
me for the nonce.


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Microsoft: Re-inventing square wheels

   -- From a Slashdot.org post


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 13:08             ` Pekka J Enberg
@ 2007-04-15 17:32               ` Mike Galbraith
  2007-04-15 17:59                 ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15 17:32 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Arjan van de Ven, Thomas Gleixner

On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:
> On Sun, 15 Apr 2007, Willy Tarreau wrote:
> > Ingo could have publicly spoken with them about his ideas of killing
> > the O(1) scheduler and replacing it with an rbtree-based one, and using
> > part of Bill's work to speed up development.
> 
> He did exactly that and he did it with a patch. Nothing new here. This is 
> how development on LKML proceeds when you have two or more competing 
> designs. There's absolutely no need to get upset or hurt your feelings 
> over it. It's not malicious, it's how we do Linux development.

Yes.  Exactly.  This is what it's all about, this is what makes it work.

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 18:18                           ` Willy Tarreau
  2007-04-14 18:40                             ` Eric W. Biederman
@ 2007-04-15 17:55                             ` Ingo Molnar
  2007-04-15 18:06                               ` Willy Tarreau
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 17:55 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox


* Willy Tarreau <w@1wt.eu> wrote:

> Well, since I merged the fair-fork patch, I cannot reproduce (in fact, 
> bash forks 1000 processes, then progressively execs scheddos, but it 
> takes some time). So I'm rebuilding right now. But I think that Linus 
> has an interesting clue about GPM and notification before switching 
> the terminal. I think it was enabled in console mode. I don't know how 
> that translates to frozen xterms, but let's attack the problems one at 
> a time.

to debug this, could you try to apply this add-on as well:

  http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch

with this patch applied you should have a /proc/sched_debug file that 
prints all runnable tasks and other interesting info from the runqueue. 

[ i've refreshed all the patches on the CFS webpage, so if this doesn't 
  apply cleanly to your current tree then you'll probably have to 
  refresh one of the patches.]

The output should look like this:

 Sched Debug Version: v0.01
 now at 226761724575 nsecs

 cpu: 0
   .nr_running            : 3
   .raw_weighted_load     : 384
   .nr_switches           : 13666
   .nr_uninterruptible    : 0
   .next_balance          : 4294947416
   .curr->pid             : 2179
   .rq_clock              : 241337421233
   .fair_clock            : 7503791206
   .wait_runtime          : 2269918379

 runnable tasks:
            task | PID | tree-key |   -delta |  waiting | switches
 -----------------------------------------------------------------
 +            cat  2179 7501930066   -1861140    1861140         2
      loop_silent  2149 7503010354    -780852          0       911
      loop_silent  2148 7503510048    -281158     280753       918

now for your workload the list should be considerably larger. If there's 
starvation going on then the 'switches' field (number of context 
switches) of one of the tasks would never increase while you have this 
'cannot switch consoles' problem.
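
The check described above can be sketched as a small userspace helper that takes two snapshots of the "runnable tasks" table and compares the 'switches' column per PID. This is not part of the patch: `struct task_row`, `parse_task_row()` and `looks_starved()` are hypothetical names, the column layout is taken only from the sample output shown earlier (optional leading '+', task name, PID, tree-key, -delta, waiting, switches), and task names containing whitespace are not handled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The fields of one "runnable tasks" row that the check needs. */
struct task_row {
	char comm[16];
	int pid;
	unsigned long switches;
};

/*
 * Parse one row of the task table. Returns 0 on success and -1 for
 * lines that are not task rows (header, separator, blank lines).
 */
int parse_task_row(const char *line, struct task_row *row)
{
	long long key, delta, waiting;

	assert(line != NULL && row != NULL);

	/* Skip indentation and the optional '+' marker seen in the sample. */
	while (*line == ' ' || *line == '\t')
		line++;
	if (*line == '+')
		line++;

	if (sscanf(line, "%15s %d %lld %lld %lld %lu",
		   row->comm, &row->pid, &key, &delta, &waiting,
		   &row->switches) != 6)
		return -1;
	return 0;
}

/*
 * A task is suspicious if it is runnable in both snapshots but its
 * context-switch count did not move in between.
 */
int looks_starved(const struct task_row *before, const struct task_row *after)
{
	return before->pid == after->pid &&
	       before->switches == after->switches;
}
```

One way to use it: read `/proc/sched_debug` twice, a few seconds apart, while the console switch is stuck, parse every line of both reads, and report the PIDs for which `looks_starved()` is true.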

maybe you'll have to unapply the fair-fork patch to make it trigger 
again. (fair-fork does not fix anything, so it probably just hides a 
real bug.)

(i'm meanwhile busy running your scheddos utilities to reproduce it 
locally as well :)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 17:32               ` Mike Galbraith
@ 2007-04-15 17:59                 ` Linus Torvalds
  2007-04-15 19:00                   ` Jonathan Lundell
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-15 17:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar,
	Con Kolivas, ck list, Peter Williams, linux-kernel,
	Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner



On Sun, 15 Apr 2007, Mike Galbraith wrote:

> On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote:
> > 
> > He did exactly that and he did it with a patch. Nothing new here. This is 
> > how development on LKML proceeds when you have two or more competing 
> > designs. There's absolutely no need to get upset or hurt your feelings 
> > over it. It's not malicious, it's how we do Linux development.
> 
> Yes.  Exactly.  This is what it's all about, this is what makes it work.

I obviously agree, but I will also add that one of the most motivating 
things there *is* in open source is "personal pride".

It's a really good thing, and it means that if somebody shows that your 
code is flawed in some way (by, for example, making a patch that people 
claim gets better behaviour or numbers), any *good* programmer that 
actually cares about his code will obviously suddenly be very motivated to 
out-do the out-doer!

Does this mean that there will be tension and rivalry? Hell yes. But 
that's kind of the point. Life is a game, and if you aren't in it to win, 
what the heck are you still doing here?

As long as it's reasonably civil (I'm not personally a huge believer in 
being too polite or "politically correct", so I think the "reasonably" is 
more important than the "civil" part!), and as long as the end result is 
judged on TECHNICAL MERIT, it's all good.

We don't want to play politics. But encouraging people's competitive 
feelings? Oh, yes. 

			Linus


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 16:58               ` Gene Heskett
@ 2007-04-15 18:00                 ` Mike Galbraith
  2007-04-16  0:18                   ` Gene Heskett
  0 siblings, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-15 18:00 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Arjan van de Ven, Thomas Gleixner

On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote:

> Chuckle, possibly but then I'm not anything even remotely close to an expert 
> here Con, just reporting what I get.  And I just rebooted to 2.6.21-rc6 + 
> sched-mike-5.patch for grins and giggles, or frowns and profanity as the case 
> may call for.

Erm, that patch is embarrassingly buggy, so profanity should dominate.

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 17:55                             ` Ingo Molnar
@ 2007-04-15 18:06                               ` Willy Tarreau
  2007-04-15 19:20                                 ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-15 18:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox

Hi Ingo,

On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, 
> > bash forks 1000 processes, then progressively execs scheddos, but it 
> > takes some time). So I'm rebuilding right now. But I think that Linus 
> > has an interesting clue about GPM and notification before switching 
> > the terminal. I think it was enabled in console mode. I don't know how 
> > that translates to frozen xterms, but let's attack the problems one at 
> > a time.
> 
> to debug this, could you try to apply this add-on as well:
> 
>   http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> 
> with this patch applied you should have a /proc/sched_debug file that 
> prints all runnable tasks and other interesting info from the runqueue. 

I don't know if you have seen my mail from yesterday evening (here). I
found that changing keventd prio fixed the problem. You may be interested
in the description. I sent it at 21:01 (+200).

> [ i've refreshed all the patches on the CFS webpage, so if this doesn't 
>   apply cleanly to your current tree then you'll probably have to 
>   refresh one of the patches.]

Fine, I'll have a look. I already had to rediff the sched-fair-fork
patch last time.

> The output should look like this:
> 
>  Sched Debug Version: v0.01
>  now at 226761724575 nsecs
> 
>  cpu: 0
>    .nr_running            : 3
>    .raw_weighted_load     : 384
>    .nr_switches           : 13666
>    .nr_uninterruptible    : 0
>    .next_balance          : 4294947416
>    .curr->pid             : 2179
>    .rq_clock              : 241337421233
>    .fair_clock            : 7503791206
>    .wait_runtime          : 2269918379
> 
>  runnable tasks:
>             task | PID | tree-key |   -delta |  waiting | switches
>  -----------------------------------------------------------------
>  +            cat  2179 7501930066   -1861140    1861140         2
>       loop_silent  2149 7503010354    -780852          0       911
>       loop_silent  2148 7503510048    -281158     280753       918

Nice.

> now for your workload the list should be considerably larger. If there's 
> starvation going on then the 'switches' field (number of context 
> switches) of one of the tasks would never increase while you have this 
> 'cannot switch consoles' problem.
> 
> maybe you'll have to unapply the fair-fork patch to make it trigger 
> again. (fair-fork does not fix anything, so it probably just hides a 
> real bug.)
> 
> (i'm meanwhile busy running your scheddos utilities to reproduce it 
> locally as well :)
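
[ A minimal sketch of the starvation check described above, assuming two
  snapshots of the "runnable tasks" table are taken a few seconds apart.
  The struct and function names are this sketch's own and are not part of
  the sched-fair-print patch; the column layout is taken from the sample
  output quoted above. A task whose 'switches' count is identical in both
  snapshots is a candidate for being starved. ]

```c
#include <assert.h>
#include <stdio.h>

/* One row of the "runnable tasks" table (hypothetical field names). */
struct task_row {
	char comm[32];
	int pid;
	long long tree_key, delta, waiting;
	unsigned long switches;
};

/*
 * Parse one table row. One row in the sample output carries a leading
 * '+'; it is skipped here. Returns 1 if all six columns were parsed.
 */
static int parse_row(const char *line, struct task_row *r)
{
	assert(line && r);
	while (*line == ' ' || *line == '+')
		line++;
	return sscanf(line, "%31s %d %lld %lld %lld %lu",
		      r->comm, &r->pid, &r->tree_key, &r->delta,
		      &r->waiting, &r->switches) == 6;
}

/* Same PID in both snapshots and no new context switches in between. */
static int looks_starved(const struct task_row *before,
			 const struct task_row *after)
{
	return before->pid == after->pid &&
	       before->switches == after->switches;
}
```

[ Header and separator lines fail to parse and can simply be skipped by
  the caller. ]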

I discovered I had the frame buffer enabled (I did not notice it at first
because I do not have the logo and the resolution is the same as text).
It's matroxfb with a G400, if that helps. It may be that it needs some
CPU time that it cannot get to clear the display before switching, I
don't know.

However I won't try this right now, I'm deep in userland at the moment.

Regards,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 17:59                 ` Linus Torvalds
@ 2007-04-15 19:00                   ` Jonathan Lundell
  2007-04-15 22:52                     ` Con Kolivas
  0 siblings, 1 reply; 712+ messages in thread
From: Jonathan Lundell @ 2007-04-15 19:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey,
	Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
	Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner

On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:

> It's a really good thing, and it means that if somebody shows that  
> your
> code is flawed in some way (by, for example, making a patch that  
> people
> claim gets better behaviour or numbers), any *good* programmer that
> actually cares about his code will obviously suddenly be very  
> motivated to
> out-do the out-doer!

"No one who cannot rejoice in the discovery of his own mistakes  
deserves to be called a scholar."

--Don Foster, "literary sleuth", on retracting his attribution of "A  
Funerall Elegye" to Shakespeare (it's more likely John Ford's work).


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 18:06                               ` Willy Tarreau
@ 2007-04-15 19:20                                 ` Ingo Molnar
  2007-04-15 19:35                                   ` William Lee Irwin III
  2007-04-15 19:37                                   ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:20 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox


* Willy Tarreau <w@1wt.eu> wrote:

> > to debug this, could you try to apply this add-on as well:
> > 
> >   http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> > 
> > with this patch applied you should have a /proc/sched_debug file 
> > that prints all runnable tasks and other interesting info from the 
> > runqueue.
> 
> I don't know if you have seen my mail from yesterday evening (here). I 
> found that changing keventd prio fixed the problem. You may be 
> interested in the description. I sent it at 21:01 (+200).

ah, indeed i missed that mail - the response to the patches was quite 
overwhelming (and i naively thought people dont do Linux hacking over 
the weekends anymore ;).

so Linus was right: this was caused by scheduler starvation. I can see 
one immediate problem already: the 'nice offset' is not divided by 
nr_running as it should be. The patch below should fix this but i have 
yet to test it properly; this change might well render nice levels 
unacceptably ineffective under high loads.

	Ingo

--------->
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / rq->nr_running;
 
 	p->fair_key = key;
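
[ A user-space sketch of what the change above does to the key, not the
  kernel code itself. The field names come from the patch; the two struct
  types are hypothetical stand-ins, and nr_running is kept signed here to
  keep the arithmetic simple. With the old formula the full nice offset is
  added no matter how many tasks are runnable; with the new one the offset
  shrinks as the runqueue grows. ]

```c
#include <assert.h>

/* Hypothetical stand-ins for the runqueue and task fields in the patch. */
struct rq_sketch {
	long long fair_clock;
	long nr_running;
};

struct task_sketch {
	long long wait_runtime;
	long long nice_offset;
};

/* Key before the patch: the nice offset is added undivided. */
static long long key_old(const struct rq_sketch *rq,
			 const struct task_sketch *p)
{
	return rq->fair_clock - p->wait_runtime + p->nice_offset;
}

/* Key after the patch: the nice offset is divided by nr_running. */
static long long key_new(const struct rq_sketch *rq,
			 const struct task_sketch *p)
{
	long long key = rq->fair_clock - p->wait_runtime;

	/* This variant divides by nr_running directly, so it must be > 0. */
	assert(rq->nr_running > 0);
	if (p->nice_offset)
		key += p->nice_offset / rq->nr_running;
	return key;
}
```

[ For example, with fair_clock = 1000, wait_runtime = 100 and
  nice_offset = 400, the old key is 1300 regardless of load, while the new
  key is 1300 with one runnable task and 1000 with four. ]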
 


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 19:20                                 ` Ingo Molnar
@ 2007-04-15 19:35                                   ` William Lee Irwin III
  2007-04-15 19:57                                     ` Ingo Molnar
  2007-04-15 19:37                                   ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15 19:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox

On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> so Linus was right: this was caused by scheduler starvation. I can see 
> one immediate problem already: the 'nice offset' is not divided by 
> nr_running as it should. The patch below should fix this but i have yet 
> to test it accurately, this change might as well render nice levels 
> unacceptably ineffective under high loads.

I've been suggesting testing CPU bandwidth allocation as influenced by
nice numbers for a while now for a reason.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 19:20                                 ` Ingo Molnar
  2007-04-15 19:35                                   ` William Lee Irwin III
@ 2007-04-15 19:37                                   ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:37 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, Jiri Slaby, Alan Cox


* Ingo Molnar <mingo@elte.hu> wrote:

> so Linus was right: this was caused by scheduler starvation. I can see 
> one immediate problem already: the 'nice offset' is not divided by 
> nr_running as it should. The patch below should fix this but i have 
> yet to test it accurately, this change might as well render nice 
> levels unacceptably ineffective under high loads.

erm, rather the updated patch below if you want to use this on a 32-bit 
system. But ... i think you should wait until i have all this re-tested.

	Ingo

---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {
 
 	s64 wait_runtime;
 	u64 exec_runtime, fair_key;
-	s64 nice_offset, hog_limit;
+	s32 nice_offset, hog_limit;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / (rq->nr_running + 1);
 
 	p->fair_key = key;
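
[ A sketch of the updated key computation in user-space C. The message
  does not spell out why the two details changed, so the following is an
  assumption: narrowing nice_offset to 32 bits lets a 32-bit build use a
  native division instead of a 64-bit one (which in-kernel code normally
  has to route through helpers such as do_div()), and dividing by
  nr_running + 1 keeps the divisor non-zero even when nr_running is 0.
  The function name and the signed nr_running type are this sketch's own. ]

```c
#include <assert.h>
#include <stdint.h>

/* Key computation following the updated patch (hypothetical helper). */
static int64_t fair_key_v2(int64_t fair_clock, int64_t wait_runtime,
			   int32_t nice_offset, long nr_running)
{
	int64_t key = fair_clock - wait_runtime;

	assert(nr_running >= 0);
	if (nice_offset)
		/* 32-bit dividend; the divisor is at least 1. */
		key += nice_offset / (nr_running + 1);
	return key;
}
```

[ With nr_running = 0 the offset is applied in full; with nr_running = 3
  it is divided by 4. ]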
 


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 19:35                                   ` William Lee Irwin III
@ 2007-04-15 19:57                                     ` Ingo Molnar
  2007-04-15 23:54                                       ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote:
> > so Linus was right: this was caused by scheduler starvation. I can 
> > see one immediate problem already: the 'nice offset' is not divided 
> > by nr_running as it should. The patch below should fix this but i 
> > have yet to test it accurately, this change might as well render 
> > nice levels unacceptably ineffective under high loads.
> 
> I've been suggesting testing CPU bandwidth allocation as influenced by 
> nice numbers for a while now for a reason.

Oh I was very much testing "CPU bandwidth allocation as influenced by 
nice numbers" - it's one of the basic things i do when modifying the 
scheduler. An automated tool, while nice (all automation is nice) 
wouldnt necessarily show such bugs though, because here too it needed 
thousands of running tasks to trigger in practice. Any volunteers? ;)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:05   ` Ingo Molnar
@ 2007-04-15 20:05     ` Matt Mackall
  2007-04-15 20:48       ` Ingo Molnar
  2007-04-16  5:16     ` Con Kolivas
  1 sibling, 1 reply; 712+ messages in thread
From: Matt Mackall @ 2007-04-15 20:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote:
> so the rejection was on these grounds, and i still very much stand by 
> that position here and today: i didnt want to see the Linux scheduler 
> landscape balkanized and i saw no technological reasons for the 
> complication that external modularization brings.

But "balkanization" is a good thing. "Monoculture" is a bad thing.

Look at what happened with I/O scheduling. Opening things up to some
new ideas by making it possible to select your I/O scheduler took us
from 10 years of stagnation to healthy, competitive development, which
gave us a substantially better I/O scheduler.

Look at what's happening right now with TCP congestion algorithms.
We've had decades of tweaking Reno slightly now turned into a vibrant
research area with lots of radical alternatives. A winner will
eventually emerge and it will probably look quite a bit different than
Reno.

Similar things have gone on since the beginning with filesystems on
Linux. Being able to easily compare filesystems head to head has been
immensely valuable in improving our 'core' Linux filesystems.

And what we've had up to now is a scheduler monoculture. Until Andrew
put RSDL in -mm, if people wanted to experiment with other schedulers,
they had to go well off the beaten path to do it. So all the people
who've been hopelessly frustrated with the mainline scheduler go off to
the -ck ghetto, or worse, stick with 2.4.

Whether your motivations have been protectionist or merely
shortsighted, you've stomped pretty heavily on alternative scheduler
development by completely rejecting the whole plugsched concept. If
we'd opened up mainline to a variety of schedulers _3 years ago_, we'd
probably have gotten to where we are today much sooner.

Hopefully, the next time Rik suggests pluggable page replacement
algorithms, folks will actually seriously consider it.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 20:05     ` Matt Mackall
@ 2007-04-15 20:48       ` Ingo Molnar
  2007-04-15 21:31         ` Matt Mackall
  2007-04-15 23:39         ` William Lee Irwin III
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-15 20:48 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner


* Matt Mackall <mpm@selenic.com> wrote:

> Look at what happened with I/O scheduling. Opening things up to some 
> new ideas by making it possible to select your I/O scheduler took us 
> from 10 years of stagnation to healthy, competitive development, which 
> gave us a substantially better I/O scheduler.

actually, 2-3 years ago we already had IO schedulers, and my opinion 
against plugsched back then (also shared by Nick and Linus) was very 
much considering them. There are at least 4 reasons why I/O schedulers 
are different from CPU schedulers:

1) CPUs are a non-persistent resource shared by _all_ tasks and 
   workloads in the system. Disks are _persistent_ resources very much 
   attached to specific workloads. (If tasks had to be 'persistent' to
   the CPU they were started on we'd have much different scheduling
   technology, and there would be much less complexity.) More analogous 
   to CPU schedulers would perhaps be VM/MM schedulers, and those tend 
   to be hard to modularize in a technologically sane way too. (and 
   unlike disks there's no good generic way to attach VM/MM schedulers 
   to particular workloads.) So it's apples to oranges.

   in practice it comes down to having one good scheduler that runs all 
   workloads on a system reasonably well. And given that a very large 
   portion of systems run mixed workloads, the demand for one good 
   scheduler is pretty high. Whereas i can run with mixed IO schedulers 
   just fine.

2) plugsched did not allow on the fly selection of schedulers, nor did
   it allow a per CPU selection of schedulers. IO schedulers you can 
   change per disk, on the fly, making them much more useful in
   practice. Also, IO schedulers (while definitely not being slow!) are 
   a lot less performance sensitive than CPU schedulers.

3) I/O schedulers are pretty damn clean code, and plugsched, at least
   the last version i saw of it, didnt come even close.

4) the good thing that happened to I/O, after years of stagnation, isnt
   I/O schedulers. The good thing that happened to I/O is called Jens
   Axboe. If you care about the I/O subsystem then print that name out 
   and hang it on the wall. That and only that is what mattered.

all in all, while there are definitely uses (embedded would like to have 
a smaller/different scheduler, etc.), the technical case for 
modularization for the sake of selectability is a lot weaker for CPU 
schedulers than it is for I/O schedulers.

nor was the non-modularity of some piece of code ever an impediment to 
competition. May i remind you of the pretty competitive SLAB allocator 
landscape, resulting in things like the SLOB allocator, written by 
yourself? ;-)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 20:48       ` Ingo Molnar
@ 2007-04-15 21:31         ` Matt Mackall
  2007-04-16  3:03           ` Nick Piggin
  2007-04-16 15:45           ` William Lee Irwin III
  2007-04-15 23:39         ` William Lee Irwin III
  1 sibling, 2 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-15 21:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
> 
> * Matt Mackall <mpm@selenic.com> wrote:
> 
> > Look at what happened with I/O scheduling. Opening things up to some 
> > new ideas by making it possible to select your I/O scheduler took us 
> > from 10 years of stagnation to healthy, competitive development, which 
> > gave us a substantially better I/O scheduler.
> 
> actually, 2-3 years ago we already had IO schedulers, and my opinion 
> against plugsched back then (also shared by Nick and Linus) was very 
> much considering them. There are at least 4 reasons why I/O schedulers 
> are different from CPU schedulers:

...

> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
>    the last version i saw of it, didnt come even close.

That's irrelevant. Plugsched was an attempt to get alternative
schedulers exposure in mainline. I know, because I remember
encouraging Bill to pursue it. Not only did you veto plugsched (which
may have been a perfectly reasonable thing to do), but you also vetoed
the whole concept of multiple schedulers in the tree too. "We don't
want to balkanize the scheduling landscape".

And that latter part is what I'm claiming has set us back for years.
It's not a technical argument but a strategic one. And it's just not a
good strategy.
 
> 4) the good thing that happened to I/O, after years of stagnation isnt
>    I/O schedulers. The good thing that happened to I/O is called Jens
>    Axboe. If you care about the I/O subsystem then print that name out 
>    and hang it on the wall. That and only that is what mattered.

Disagree. Things didn't actually get interesting until Nick showed up
with AS and got it in-tree to demonstrate the huge amount of room we
had for improvement. It took several iterations of AS and CFQ (with a
couple complete rewrites) before CFQ began to look like the winner.
The resulting time-sliced CFQ was fairly heavily influenced by the
ideas in AS.

Similarly, things in scheduler land had been pretty damn boring until
Con finally got Andrew to take one of his schedulers for a spin.

> nor was the non-modularity of some piece of code ever an impediment to 
> competition. May i remind you of the pretty competitive SLAB allocator 
> landscape, resulting in things like the SLOB allocator, written by 
> yourself? ;-)

Thankfully no one came out and said "we don't want to balkanize the
allocator landscape" when I submitted it or I probably would have just
dropped it, rather than painfully dragging it along out of tree for
years. I'm not nearly the glutton for punishment that Con is. :-P

-- 
Mathematics is the supreme nostalgia of our time.


* Re: Kaffeine problem with CFS
  2007-04-15 16:55           ` Christoph Pfister
@ 2007-04-15 22:14             ` S.Çağlar Onur
  2007-04-18  8:27             ` Ingo Molnar
       [not found]             ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
  2 siblings, 0 replies; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-15 22:14 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret,
	Jurgen Kofler


On Sunday, 15 April 2007, Christoph Pfister wrote:
> Could you try xine-ui or gxine? Because I suspect rather xine-lib for
> freezing issues. In any way I think a gdb backtrace would be much
> nicer - but if you can't reproduce the freeze issue with other xine
> based players and want to run kaffeine in gdb, you need to execute
> "gdb --args kaffeine --nofork".

I just tested xine-ui and i can easily reproduce the exact same problem 
with it as well, so you are right: it seems to be a xine-lib problem 
triggered by the CFS changes.

> > > thanks. This does have the appearance of a userspace race condition of
> > > some sorts. Can you trigger this hang with the patch below applied to
> > > the vanilla tree as well? (with no CFS patch applied)
> >
> > oops, please use the patch below instead.

Tomorrow i'll test that patch and also try to get a backtrace.

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (10 preceding siblings ...)
  2007-04-15 12:29 ` Esben Nielsen
@ 2007-04-15 22:49 ` Ismail Dönmez
  2007-04-15 23:23   ` Arjan van de Ven
  2007-04-16 11:58   ` Ingo Molnar
  2007-04-16 22:00 ` Andi Kleen
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 712+ messages in thread
From: Ismail Dönmez @ 2007-04-15 22:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


Hi,
On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

Tested this on top of Linus' GIT tree, but the system gets very unresponsive 
during high disk I/O with ext3 as the filesystem; even writing a 300 MB file 
to a USB disk (an iPod, actually) has the same effect.

Regards,
ismail



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 19:00                   ` Jonathan Lundell
@ 2007-04-15 22:52                     ` Con Kolivas
  2007-04-16  2:28                       ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-15 22:52 UTC (permalink / raw)
  To: Jonathan Lundell
  Cc: Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau,
	hui Bill Huey, Ingo Molnar, ck list, Peter Williams,
	linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner

On Monday 16 April 2007 05:00, Jonathan Lundell wrote:
> On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> > It's a really good thing, and it means that if somebody shows that
> > your
> > code is flawed in some way (by, for example, making a patch that
> > people
> > claim gets better behaviour or numbers), any *good* programmer that
> > actually cares about his code will obviously suddenly be very
> > motivated to
> > out-do the out-doer!
>
> "No one who cannot rejoice in the discovery of his own mistakes
> deserves to be called a scholar."

Lovely comment. I realise this is not truly directed at me but clearly in the 
context it has been said people will assume it is directed my way, so while 
we're all spinning lkml quality rhetoric, let me have a right of reply.

One thing I have never tried to do was to ignore bug reports. I'm forever 
joking that I keep pulling code out of my arse to improve what I've done. 
RSDL/SD was no exception; heck it had 40 iterations. The reason I could not 
reply to bug report A with "Oh that is problem B so I'll fix it with code C" 
was, as I've said many many times over, health related. I did indeed try to 
fix many of them without spending hours replying to sometimes unpleasant 
emails. If health wasn't an issue there might have been 1000 iterations of 
SD.

There was only ever _one_ thing that I was absolutely steadfast on as a 
concept that I refused to fix that people might claim was "a mistake I did 
not rejoice in to be a scholar". That was that the _correct_ behaviour for a 
scheduler is to be fair such that proportional slowdown with load is (using 
that awful pun) a feature, not a bug. Now there are people who will still 
disagree violently with me on that. SD attempted to be a fairness first 
virtual-deadline design. If I failed on that front, then so be it (and at 
least one person certainly has said in lovely warm fuzzy friendly 
communication that I'm a global failure on all fronts with SD). But let me 
point out now that Ingo's shiny new scheduler is a fairness-first 
virtual-deadline design which will have proportional slowdown with load. So 
it will have a very similar feature. I dare anyone to claim that proportional 
slowdown with load is a bug, because I will no longer feel like I'm standing 
alone with a BFG9000 trying to defend my standpoint. Others can take up the 
post at last.

-- 
-ck


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 22:38   ` Davide Libenzi
  2007-04-14 23:26     ` Davide Libenzi
  2007-04-15  4:01     ` William Lee Irwin III
@ 2007-04-15 23:09     ` Pavel Pisa
  2007-04-16  5:47       ` Davide Libenzi
  2 siblings, 1 reply; 712+ messages in thread
From: Pavel Pisa @ 2007-04-15 23:09 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sunday 15 April 2007 00:38, Davide Libenzi wrote:
> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (2 power is better) slots (where tasks are
> queued - in my case they were some sort of timers), and it has a current
> base index (Ib), a current base time (Tb) and a time granularity (Tg). It
> also has a bitmap with bits telling you which slots contains queued tasks.
> An item (task) that has to be scheduled at time T, will be queued in the
> slot:
>
> S = Ib + min((T - Tb) / Tg, Ns - 1);
>
> Items with T longer than Ns*Tg will be scheduled in the relative last slot
> (choosing a proper Ns and Tg can minimize this).
> Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
> suit your needs.
> This is a simple bench between time-ring (TR) and CFS queueing:
>
> http://www.xmailserver.org/smart-queue.c
>
> In my box (Dual Opteron 252):
>
> davide@alien:~$ ./smart-queue -n 8
> CFS = 142.21 cycles/loop
> TR  = 72.33 cycles/loop
> davide@alien:~$ ./smart-queue -n 16
> CFS = 188.74 cycles/loop
> TR  = 83.79 cycles/loop
> davide@alien:~$ ./smart-queue -n 32
> CFS = 221.36 cycles/loop
> TR  = 75.93 cycles/loop
> davide@alien:~$ ./smart-queue -n 64
> CFS = 242.89 cycles/loop
> TR  = 81.29 cycles/loop
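
[ A minimal sketch of the time ring described in the quoted text, not the
  code from smart-queue.c. The names Ns, Ib, Tb and Tg follow the
  description; the struct layout, the fixed Ns = 64 (so the bitmap fits in
  one 64-bit word) and the wrap-around of the slot index are assumptions
  of this sketch. The per-slot task lists are left out. ]

```c
#include <assert.h>
#include <stdint.h>

#define NS 64	/* Ns: number of slots, a power of two */

struct time_ring {
	unsigned base_idx;	/* Ib: slot that corresponds to Tb */
	uint64_t base_time;	/* Tb: current base time */
	uint64_t gran;		/* Tg: time covered by one slot */
	uint64_t bitmap;	/* bit s set: slot s holds queued items */
};

/* S = Ib + min((T - Tb) / Tg, Ns - 1), wrapped into the ring. O(1). */
static unsigned tr_slot(const struct time_ring *tr, uint64_t t)
{
	assert(tr->gran > 0 && t >= tr->base_time);
	uint64_t off = (t - tr->base_time) / tr->gran;

	if (off > NS - 1)
		off = NS - 1;	/* far-future items share the last slot */
	return (tr->base_idx + (unsigned)off) & (NS - 1);
}

/* Record that a slot now holds at least one item. */
static void tr_mark(struct time_ring *tr, unsigned slot)
{
	assert(slot < NS);
	tr->bitmap |= UINT64_C(1) << slot;
}

/* Scan from Ib for the earliest non-empty slot, or -1 if none. O(Ns). */
static int tr_first(const struct time_ring *tr)
{
	for (unsigned i = 0; i < NS; i++) {
		unsigned s = (tr->base_idx + i) & (NS - 1);

		if (tr->bitmap & (UINT64_C(1) << s))
			return (int)s;
	}
	return -1;
}
```

[ The clamp to Ns - 1 is where the priority granularity is lost: every
  item further out than Ns*Tg lands in the same slot. ]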

Hello all,

I cannot resist reporting results for the GAVL tree algorithm here as
another competitor in the race. I believe it is a better solution for
large priority queues than an RB-tree or even a heap tree. Whether the
scheduler needs such scalability is, on the other hand, debatable.
The AVL heritage guarantees a lower height, which results in shorter
search times, and that could be profitable for other uses in the kernel.

The GAVL algorithm is based on the AVL tree, so it supports arbitrarily
fine ("infinite") priority granularity, which TR does not. It is
generalized to allow a tree that is not fully balanced, which makes it
possible to cut the first item without rebalancing. In the worst case
this degrades the tree by one more level than a non-degraded AVL tree
would have, which is still considerably better than the RB-tree maximum.
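
[ A small sketch of the height comparison made above, using the textbook
  worst-case bounds rather than anything measured on GAVL itself. Heights
  are counted in nodes on the longest root-to-leaf path. The AVL bound
  comes from the minimum node count recurrence N(h) = N(h-1) + N(h-2) + 1,
  the red-black figure is the usual upper bound 2*floor(log2(n + 1)), and
  the "one more level" for GAVL is taken from the claim in this message.
  All function names are this sketch's own. ]

```c
#include <assert.h>

/* Largest height an AVL tree with n nodes can have. */
static unsigned avl_max_height(unsigned long n)
{
	unsigned long lo = 0, hi = 1;	/* N(h) and N(h + 1) */
	unsigned h = 0;

	assert(n <= 1000000000UL);	/* keeps the recurrence from overflowing */
	while (hi <= n) {
		unsigned long next = hi + lo + 1;

		lo = hi;
		hi = next;
		h++;
	}
	return h;
}

/* GAVL worst case as described above: one level more than AVL. */
static unsigned gavl_max_height(unsigned long n)
{
	return avl_max_height(n) + 1;
}

/* Upper bound on the height of a red-black tree with n nodes. */
static unsigned rb_height_bound(unsigned long n)
{
	unsigned log2floor = 0;

	for (unsigned long v = n + 1; v > 1; v >>= 1)
		log2floor++;
	return 2 * log2floor;
}
```

[ For n = 100000 this gives 23 levels for AVL, 24 for GAVL and a bound of
  32 for a red-black tree. ]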

http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c

The description behind the code is here:

http://cmp.felk.cvut.cz/~pisa/ulan/gavl.pdf

The code is part of the much broader uLUt library:

http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf
http://sourceforge.net/project/showfiles.php?group_id=118937&package_id=130840

I have included all the required GAVL code directly in smart-queue-v-gavl.c
to make it easy to test.

These tests were run on my somewhat dated computer, a 600 MHz Duron.
The CFS and GAVL tests are run twice to suppress the influence of run order.

./smart-queue-v-gavl -n 1 -l 2000000
gavl_cfs = 55.66 cycles/loop
CFS = 88.33 cycles/loop
TR  = 141.78 cycles/loop
CFS = 90.45 cycles/loop
gavl_cfs = 55.38 cycles/loop

./smart-queue-v-gavl -n 2 -l 2000000
gavl_cfs = 82.85 cycles/loop
CFS = 104.18 cycles/loop
TR  = 145.21 cycles/loop
CFS = 102.74 cycles/loop
gavl_cfs = 82.05 cycles/loop

./smart-queue-v-gavl -n 4 -l 2000000
gavl_cfs = 137.45 cycles/loop
CFS = 156.47 cycles/loop
TR  = 142.00 cycles/loop
CFS = 152.65 cycles/loop
gavl_cfs = 139.38 cycles/loop

./smart-queue-v-gavl -n 10 -l 2000000
gavl_cfs = 229.22 cycles/loop           (WORSE)
CFS = 206.26 cycles/loop
TR  = 140.81 cycles/loop
CFS = 208.29 cycles/loop
gavl_cfs = 223.62 cycles/loop           (WORSE)

./smart-queue-v-gavl -n 100 -l 2000000
gavl_cfs = 257.66 cycles/loop
CFS = 329.68 cycles/loop
TR  = 142.20 cycles/loop
CFS = 319.34 cycles/loop
gavl_cfs = 260.02 cycles/loop

./smart-queue-v-gavl -n 1000 -l 2000000
gavl_cfs = 258.41 cycles/loop
CFS = 393.04 cycles/loop
TR  = 134.76 cycles/loop
CFS = 392.20 cycles/loop
gavl_cfs = 260.93 cycles/loop

./smart-queue-v-gavl -n 10000 -l 2000000
gavl_cfs = 259.45 cycles/loop
CFS = 605.89 cycles/loop
TR  = 196.69 cycles/loop
CFS = 622.60 cycles/loop
gavl_cfs = 262.72 cycles/loop

./smart-queue-v-gavl -n 100000 -l 2000000
gavl_cfs = 258.21 cycles/loop
CFS = 845.62 cycles/loop
TR  = 315.37 cycles/loop
CFS = 860.21 cycles/loop
gavl_cfs = 258.94 cycles/loop

The GAVL code has not been tuned with any "likely"/"unlikely"
constructs. It even carries some extra overhead from its generic
design that is not necessary for this use: it permanently keeps a
pointer to the last element, ensures that insertion order is
preserved for equal key values, etc. But it still shows much better
scalability than the RB-tree code used in the kernel. On the other
hand, it does not encode the color/height in one of the pointers and
requires an additional field for the height.

It may be that the difference is due to some bug in my testing; if so,
I would be interested in a correction. The test case is probably
oversimplified. I have already run more varied tests against the GAVL
code in the past to compare it with different tree and queue
implementations, and I have not found a case of real performance
degradation. On the other hand, there are cases with small item counts
where GAVL is sometimes a little worse than others (an array-based
heap tree, for example).

The GAVL code itself is used in several open-source and commercial
projects, and we have noticed no problems since one small fix
at the time of the first release in 2004.

Best wishes

                Pavel Pisa
        e-mail: pisa@cmp.felk.cvut.cz
        www:    http://cmp.felk.cvut.cz/~pisa
        work:   http://www.pikron.com


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 22:49 ` Ismail Dönmez
@ 2007-04-15 23:23   ` Arjan van de Ven
  2007-04-15 23:33     ` Ismail Dönmez
  2007-04-16 11:58   ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Arjan van de Ven @ 2007-04-15 23:23 UTC (permalink / raw)
  To: Ismail Dönmez
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner

On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote:
> Hi,
> On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > [CFS]
> >
> > i'm pleased to announce the first release of the "Modular Scheduler Core
> > and Completely Fair Scheduler [CFS]" patchset:
> >
> >    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 
> Tested this on top of Linus' GIT tree but the system gets very unresponsive 
> during high disk i/o using ext3 as filesystem but even writing a 300mb file 
> to a usb disk (iPod actually) has the same effect.

just to make sure; this exact same workload but with the stock scheduler
does not have this effect?

if so, then it could well be that the scheduler is too fair for its own
good (being really fair inevitably ends up not batching as much as one
should, and batching is needed to get any kind of decent performance out
of disks nowadays)


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 23:23   ` Arjan van de Ven
@ 2007-04-15 23:33     ` Ismail Dönmez
  0 siblings, 0 replies; 712+ messages in thread
From: Ismail Dönmez @ 2007-04-15 23:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner

On Monday 16 April 2007 02:23:08 Arjan van de Ven wrote:
> On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote:
> > Hi,
> >
> > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote:
> > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> > > [CFS]
> > >
> > > i'm pleased to announce the first release of the "Modular Scheduler
> > > Core and Completely Fair Scheduler [CFS]" patchset:
> > >
> > >    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> >
> > Tested this on top of Linus' GIT tree but the system gets very
> > unresponsive during high disk i/o using ext3 as filesystem but even
> > writing a 300mb file to a usb disk (iPod actually) has the same effect.
>
> just to make sure; this exact same workload but with the stock scheduler
> does not have this effect?
>
> if so, then it could well be that the scheduler is too fair for its own
> good (being really fair inevitably ends up not batching as much as one
> should, and batching is needed to get any kind of decent performance out
> of disks nowadays)

Tried with make install in kdepim (which made the system sluggish with CFS)
and with the stock scheduler the system is just fine (using CFQ).

Regards,
ismail

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 20:48       ` Ingo Molnar
  2007-04-15 21:31         ` Matt Mackall
@ 2007-04-15 23:39         ` William Lee Irwin III
  2007-04-16  1:06           ` Peter Williams
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15 23:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matt Mackall, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
> 2) plugsched did not allow on the fly selection of schedulers, nor did
>    it allow a per CPU selection of schedulers. IO schedulers you can 
>    change per disk, on the fly, making them much more useful in
>    practice. Also, IO schedulers (while definitely not being slow!) are 
>    a lot less performance sensitive than CPU schedulers.

One of the reasons I never posted my own code is that it never met its
own design goals, which absolutely included switching on the fly. I
think Peter Williams may have done something about that. It was my hope
to be able to do insmod sched_foo.ko until it became clear that the
effort it was intended to assist wasn't going to get even the limited
hardware access required, at which point I largely stopped working on
it.


On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
>    the last version i saw of it, didnt come even close.

I'm not sure what happened there. It wasn't a big enough patch to take
hits in this area due to getting overwhelmed by the programming burden
like some other efforts of mine. Maybe things started getting ugly once
on-the-fly switching entered the picture. My guess is that Peter Williams
will have to chime in here, since things have diverged enough from my
one-time contribution 4 years ago.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 19:57                                     ` Ingo Molnar
@ 2007-04-15 23:54                                       ` William Lee Irwin III
  2007-04-16 11:24                                         ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-15 23:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> I've been suggesting testing CPU bandwidth allocation as influenced by 
>> nice numbers for a while now for a reason.

On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote:
> Oh I was very much testing "CPU bandwidth allocation as influenced by 
> nice numbers" - it's one of the basic things i do when modifying the 
> scheduler. An automated tool, while nice (all automation is nice) 
> wouldnt necessarily show such bugs though, because here too it needed 
> thousands of running tasks to trigger in practice. Any volunteers? ;)

If worst comes to worst I might actually get around to doing it myself.
Any more detailed descriptions of the test for a rainy day?


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 18:00                 ` Mike Galbraith
@ 2007-04-16  0:18                   ` Gene Heskett
  0 siblings, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-16  0:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Arjan van de Ven, Thomas Gleixner

On Sunday 15 April 2007, Mike Galbraith wrote:
>On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote:
>> Chuckle, possibly but then I'm not anything even remotely close to an
>> expert here Con, just reporting what I get.  And I just rebooted to
>> 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and
>> profanity as the case may call for.
>
>Erm, that patch is embarrassingly buggy, so profanity should dominate.
>
>	-Mike

Chuckle, ROTFLMAO even.

I didn't run it that long as I immediately rebuilt and rebooted when I found 
I'd used the wrong patch, and in fact had tested that one and found it 
sub-optimal before I'd built and run Con's -0.40 version.  As for bugs of the 
type that make it to the screen or logs, I didn't see any.  OTOH, my eyesight 
is slowly going downhill, now 20/25.  It was 20/10 30 years ago.  Now that's 
reason for profanity...

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Unix weanies are as bad at this as anyone.
             -- Larry Wall in <199702111730.JAA28598@wall.org>

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 23:39         ` William Lee Irwin III
@ 2007-04-16  1:06           ` Peter Williams
  2007-04-16  3:04             ` William Lee Irwin III
  2007-04-16 17:22             ` Chris Friesen
  0 siblings, 2 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-16  1:06 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>> 2) plugsched did not allow on the fly selection of schedulers, nor did
>>    it allow a per CPU selection of schedulers. IO schedulers you can 
>>    change per disk, on the fly, making them much more useful in
>>    practice. Also, IO schedulers (while definitely not being slow!) are 
>>    a lot less performance sensitive than CPU schedulers.
> 
> One of the reasons I never posted my own code is that it never met its
> own design goals, which absolutely included switching on the fly. I
> think Peter Williams may have done something about that.

I didn't but some students did.

In a previous life, I did implement a runtime configurable CPU 
scheduling mechanism (implemented on Tru64, Solaris and Linux) that 
allowed schedulers to be loaded as modules at run time.  This was 
released commercially on Tru64 and Solaris.  So I know that it can be done.

I have thought about doing something similar for the SPA schedulers 
which differ in only small ways from each other but lack motivation.

> It was my hope
> to be able to do insmod sched_foo.ko until it became clear that the
> effort it was intended to assist wasn't going to get even the limited
> hardware access required, at which point I largely stopped working on
> it.
> 
> 
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
>>    the last version i saw of it, didnt come even close.
> 
> I'm not sure what happened there. It wasn't a big enough patch to take
> hits in this area due to getting overwhelmed by the programming burden
> like some other efforts of mine. Maybe things started getting ugly once
> on-the-fly switching entered the picture. My guess is that Peter Williams
> will have to chime in here, since things have diverged enough from my
> one-time contribution 4 years ago.

From my POV, the current version of plugsched is considerably simpler 
than it was when I took the code over from Con as I put considerable 
effort into minimizing code overlap in the various schedulers.

I also put considerable effort into minimizing any changes to the load 
balancing code (something Ingo seems to think is a deficiency) and the 
result is that plugsched allows "intra run queue" scheduling to be 
easily modified WITHOUT affecting load balancing.  To my mind scheduling 
and load balancing are orthogonal and keeping them that way simplifies 
things.

As Ingo correctly points out, plugsched does not allow different 
schedulers to be used per CPU but it would not be difficult to modify it 
so that they could.  Although I've considered doing this over the years 
I decided not to as it would just increase the complexity and the amount 
of work required to keep the patch set going.  About six months ago I 
decided to reduce the amount of work I was doing on plugsched (as it was 
obviously never going to be accepted) and now only publish patches 
against the vanilla kernel's major releases (and the only reason that I 
kept doing that is that the download figures indicated that about 80 
users were interested in the experiment).

Peter
PS I no longer read LKML (due to time constraints) and would appreciate 
it if I could be CC'd on any e-mails suggesting scheduler changes.
PPS I'm just happy to see that Ingo has finally accepted that the 
vanilla scheduler was badly in need of fixing and don't really care who 
fixes it.
PPPS Different schedulers for different aims (i.e. server or 
workstation) do make a difference.  E.g. the spa_svr scheduler in plugsched 
does about 1% better on kernbench than the next best scheduler in the bunch.
PPPPS Con, fairness isn't always best as humans aren't very altruistic 
and we need to give unfair preference to interactive tasks in order to 
stop the users flinging their PCs out the window.  But the current 
scheduler doesn't do this very well and is also not very good at 
fairness so needs to change.  But the changes need to address 
interactive response and fairness not just fairness.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 22:52                     ` Con Kolivas
@ 2007-04-16  2:28                       ` Nick Piggin
  2007-04-16  3:15                         ` Con Kolivas
       [not found]                         ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
  0 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-16  2:28 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
	Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Mon, Apr 16, 2007 at 08:52:33AM +1000, Con Kolivas wrote:
> On Monday 16 April 2007 05:00, Jonathan Lundell wrote:
> > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote:
> > > It's a really good thing, and it means that if somebody shows that
> > > your
> > > code is flawed in some way (by, for example, making a patch that
> > > people
> > > claim gets better behaviour or numbers), any *good* programmer that
> > > actually cares about his code will obviously suddenly be very
> > > motivated to
> > > out-do the out-doer!
> >
> > "No one who cannot rejoice in the discovery of his own mistakes
> > deserves to be called a scholar."
> 
> Lovely comment. I realise this is not truly directed at me but clearly in the 
> context it has been said people will assume it is directed my way, so while 
> we're all spinning lkml quality rhetoric, let me have a right of reply.
> 
> One thing I have never tried to do was to ignore bug reports. I'm forever 
> joking that I keep pulling code out of my arse to improve what I've done. 
> RSDL/SD was no exception; heck it had 40 iterations. The reason I could not 
> reply to bug report A with "Oh that is problem B so I'll fix it with code C" 
> was, as I've said many many times over, health related. I did indeed try to 
> fix many of them without spending hours replying to sometimes unpleasant 
> emails. If health wasn't an issue there might have been 1000 iterations of 
> SD.

Well what matters is the code and development. I don't think Ingo's
scheduler is the final word, although I worry that Linus might jump the
gun and merge something "just to give it a test", which we then get
stuck with :P

I don't know how anybody can think Ingo's new scheduler is anything but
a good thing (so long as it has to compete before being merged). And
that's coming from someone who wants *their* scheduler to get merged...
I think mine can compete ;) and if it can't, then I'd rather be using
the scheduler that beats it.


> There was only ever _one_ thing that I was absolutely steadfast on as a 
> concept that I refused to fix that people might claim was "a mistake I did 
> not rejoice in to be a scholar". That was that the _correct_ behaviour for a 
> scheduler is to be fair such that proportional slowdown with load is (using 
> that awful pun) a feature, not a bug.

If something is using more than a fair share of CPU time, over some macro
period, in order to be interactive, then definitely it should get throttled.
I've always maintained (since starting scheduler work) that the 2.6 scheduler
is horrible because it allows these cases where some things can get more CPU
time just by how they behave.

Glad people are starting to come around on that point.


So, on to something productive, we have 3 candidates for a new scheduler so
far. How do we decide which way to go? (and yes, I still think switchable
schedulers is wrong and a copout) This is one area where it is virtually
impossible to discount any decent design on correctness/performance/etc.
and even testing in -mm isn't really enough.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 21:31         ` Matt Mackall
@ 2007-04-16  3:03           ` Nick Piggin
  2007-04-16 14:28             ` Matt Mackall
  2007-04-16 15:45           ` William Lee Irwin III
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-16  3:03 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote:
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>  
> > 4) the good thing that happened to I/O, after years of stagnation isnt
> >    I/O schedulers. The good thing that happened to I/O is called Jens
> >    Axboe. If you care about the I/O subsystem then print that name out 
> >    and hang it on the wall. That and only that is what mattered.
> 
> Disagree. Things didn't actually get interesting until Nick showed up
> with AS and got it in-tree to demonstrate the huge amount of room we
> had for improvement. It took several iterations of AS and CFQ (with a
> couple complete rewrites) before CFQ began to look like the winner.
> The resulting time-sliced CFQ was fairly heavily influenced by the
> ideas in AS.

Well to be fair, Jens had just implemented deadline, which got me
interested ;)

Actually, I would still like to be able to deprecate deadline for
AS, because AS has a tunable that you can switch to turn off read
anticipation and revert to deadline behaviour (or very close to).

It would have been nice if CFQ were then a layer on top of AS that
implemented priorities (or vice versa). And then AS could be
deprecated and we'd be back to 1 primary scheduler.

Well CFQ seems to be going in the right direction with that, however
some large users still find AS faster for some reason...

Anyway, moral of the story is that I think it would have been nice
if we hadn't proliferated IO schedulers, however in practice it
isn't easy to just layer features on top of each other, and also
keeping deadline helped a lot to be able to debug and examine
performance regressions and actually get code upstream. And this
was true even when it was globally boot-time switchable only.

I'd prefer if we kept a single CPU scheduler in mainline, because I
think that simplifies analysis and focuses testing. I think we can
have one that is good enough for everyone. But if the only other
option for progress is that Linus or Andrew just pull one out of a
hat, then I would rather merge all of them. Yes I think Con's
scheduler should get a fair go, ditto for Ingo's, mine, and anyone
else's.


> > nor was the non-modularity of some piece of code ever an impediment to 
> > competition. May i remind you of the pretty competitive SLAB allocator 
> > landscape, resulting in things like the SLOB allocator, written by 
> > yourself? ;-)
> 
> Thankfully no one came out and said "we don't want to balkanize the
> allocator landscape" when I submitted it or I probably would have just
> dropped it, rather than painfully dragging it along out of tree for
> years. I'm not nearly the glutton for punishment that Con is. :-P

I don't think this is a fault of the people or the code involved.
We just didn't have much collective drive to replace the scheduler,
and even less an idea of how to decide between any two of them.

I've kept nicksched around since 2003 or so and no hard feelings ;)


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  1:06           ` Peter Williams
@ 2007-04-16  3:04             ` William Lee Irwin III
  2007-04-16  5:09               ` Peter Williams
  2007-04-16 17:22             ` Chris Friesen
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-16  3:04 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
>> One of the reasons I never posted my own code is that it never met its
>> own design goals, which absolutely included switching on the fly. I
>> think Peter Williams may have done something about that.
>> It was my hope
>> to be able to do insmod sched_foo.ko until it became clear that the
>> effort it was intended to assist wasn't going to get even the limited
>> hardware access required, at which point I largely stopped working on
>> it.

On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> I didn't but some students did.
> In a previous life, I did implement a runtime configurable CPU 
> scheduling mechanism (implemented on Tru64, Solaris and Linux) that 
> allowed schedulers to be loaded as modules at run time.  This was 
> released commercially on Tru64 and Solaris.  So I know that it can be done.
> I have thought about doing something similar for the SPA schedulers 
> which differ in only small ways from each other but lack motivation.

Driver models for scheduling are not so far out. AFAICS it's largely a
tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
switching out intra-queue policies vs. switching out whole-system
policies, SMP handling and all. Whether this involves load balancing
depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
scheduler module, for instance, would not have a load balancer at all,
as it has only one global runqueue. There are other sorts of policies
wanting significant changes to SMP handling vs. the stock load
balancing.
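
A minimal sketch of the "switch out intra-queue policies" end of that
tug-of-war, assuming per-CPU runqueues. Every name below (policy_ops,
runqueue, schedule_next and so on) is hypothetical and is not taken from
plugsched or from the CFS patch; load balancing is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8

struct task { int id; int prio; };	/* lower prio value runs first */

struct runqueue;

/* Hypothetical ops table: what an intra-queue policy must provide. */
struct policy_ops {
	const char *name;
	struct task *(*pick_next)(struct runqueue *rq);
};

/* One queue per CPU; the policy pointer may differ per queue. */
struct runqueue {
	struct task *tasks[MAX_TASKS];
	size_t nr;
	const struct policy_ops *policy;
};

/* Append a task; returns -1 if the queue is full. */
static int rq_enqueue(struct runqueue *rq, struct task *t)
{
	if (rq->nr == MAX_TASKS)
		return -1;
	rq->tasks[rq->nr++] = t;
	return 0;
}

/* Remove and return the task at index i, keeping arrival order. */
static struct task *rq_take(struct runqueue *rq, size_t i)
{
	struct task *t = rq->tasks[i];

	for (; i + 1 < rq->nr; i++)
		rq->tasks[i] = rq->tasks[i + 1];
	rq->nr--;
	return t;
}

/* Policy 1: oldest task first. */
static struct task *fifo_pick(struct runqueue *rq)
{
	return rq->nr ? rq_take(rq, 0) : NULL;
}

/* Policy 2: lowest prio value first, ties broken by arrival order. */
static struct task *prio_pick(struct runqueue *rq)
{
	size_t best = 0;

	if (!rq->nr)
		return NULL;
	for (size_t i = 1; i < rq->nr; i++)
		if (rq->tasks[i]->prio < rq->tasks[best]->prio)
			best = i;
	return rq_take(rq, best);
}

static const struct policy_ops fifo_policy = { "fifo", fifo_pick };
static const struct policy_ops prio_policy = { "prio", prio_pick };

/* Core code: knows nothing about the policy behind the pointer. */
static struct task *schedule_next(struct runqueue *rq)
{
	return rq->policy->pick_next(rq);
}
```

Swapping rq->policy between two calls to schedule_next() changes the
pick order without touching the queue contents. A whole-system policy
(a single global runqueue, say) would not fit behind this interface,
which is the design tension described above.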


William Lee Irwin III wrote:
>> I'm not sure what happened there. It wasn't a big enough patch to take
>> hits in this area due to getting overwhelmed by the programming burden
>> like some other efforts of mine. Maybe things started getting ugly once
>> on-the-fly switching entered the picture. My guess is that Peter Williams
>> will have to chime in here, since things have diverged enough from my
>> one-time contribution 4 years ago.

On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> From my POV, the current version of plugsched is considerably simpler 
> than it was when I took the code over from Con as I put considerable 
> effort into minimizing code overlap in the various schedulers.
> I also put considerable effort into minimizing any changes to the load 
> balancing code (something Ingo seems to think is a deficiency) and the 
> result is that plugsched allows "intra run queue" scheduling to be 
> easily modified WITHOUT affecting load balancing.  To my mind scheduling 
> and load balancing are orthogonal and keeping them that way simplifies 
> things.

ISTR rearranging things for Con in such a fashion that it no longer
worked out of the box (though that wasn't the intention; restructuring it
to be more suited to his purposes was) and that's what he worked off of
afterward. I don't remember very well what changed there as I clearly
invested less effort there than the prior versions. Now that I think of
it, that may have been where the sample policy demonstrating scheduling
classes was lost.


On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> As Ingo correctly points out, plugsched does not allow different 
> schedulers to be used per CPU but it would not be difficult to modify it 
> so that they could.  Although I've considered doing this over the years 
> I decided not to as it would just increase the complexity and the amount 
> of work required to keep the patch set going.  About six months ago I 
> decided to reduce the amount of work I was doing on plugsched (as it was 
> obviously never going to be accepted) and now only publish patches 
> against the vanilla kernel's major releases (and the only reason that I 
> kept doing that is that the download figures indicated that about 80 
> users were interested in the experiment).

That's a rather different goal from what I was going on about with it,
so it's all diverged quite a bit. Where I had a significant need for
mucking with the entire concept of how SMP was handled, this is rather
different. At this point I'm questioning the relevance of my own work,
though it was already relatively marginal as it started life as an
attempt at a sort of debug patch to help gang scheduling (which is in
itself a rather marginally relevant feature to most users) code along.


On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
> PS I no longer read LKML (due to time constraints) and would appreciate 
> it if I could be CC'd on any e-mails suggesting scheduler changes.
> PPS I'm just happy to see that Ingo has finally accepted that the 
> vanilla scheduler was badly in need of fixing and don't really care who 
> fixes it.
> PPPS Different schedulers for different aims (i.e. server or 
> workstation) do make a difference.  E.g. the spa_svr scheduler in plugsched 
> does about 1% better on kernbench than the next best scheduler in the bunch.
> PPPPS Con, fairness isn't always best as humans aren't very altruistic 
> and we need to give unfair preference to interactive tasks in order to 
> stop the users flinging their PCs out the window.  But the current 
> scheduler doesn't do this very well and is also not very good at 
> fairness so needs to change.  But the changes need to address 
> interactive response and fairness not just fairness.

Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are
better ones. I'd not bother citing kernel compile results.

In any event, I'm not sure what to say about different schedulers for
different aims. My intentions with plugsched were not centered around
production usage or intra-queue policy. I'm relatively indifferent to
the notion of having pluggable CPU schedulers, intra-queue or otherwise,
in mainline. I don't see any particular harm in it, but neither am I
particularly motivated to have it in. I had a rather strong sense of
instrumentality about it, and since it became useless to me (at a
conceptual level; the implementation was never finished to the point of
dynamic loading of scheduler modules) for assisting development on
large systems via reboot avoidance by dint of it becoming clear that
access to such was never going to happen, I've stopped looking at it.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  2:28                       ` Nick Piggin
@ 2007-04-16  3:15                         ` Con Kolivas
  2007-04-16  3:34                           ` Nick Piggin
       [not found]                         ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
  1 sibling, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-16  3:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
	Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Monday 16 April 2007 12:28, Nick Piggin wrote:
> So, on to something productive, we have 3 candidates for a new scheduler so
> far. How do we decide which way to go? (and yes, I still think switchable
> schedulers is wrong and a copout) This is one area where it is virtually
> impossible to discount any decent design on correctness/performance/etc.
> and even testing in -mm isn't really enough.

We're in agreement! YAY!

Actually this is simpler than that. I'm taking SD out of the picture. It has 
served its purpose of proving that we need to seriously address all the 
scheduling issues and did more than a half decent job at it. Unfortunately I 
also cannot sit around supporting it forever by myself. My own life is more 
important, so consider SD not even running the race any more.

I'm off to continue maintaining permanent-out-of-tree leisurely code at my own 
pace. What's more is, I think I'll just stick to staircase Gen I version blah 
and shelve SD and try to have fond memories of SD as an intellectual 
prompting exercise only.

-- 
-ck

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  3:15                         ` Con Kolivas
@ 2007-04-16  3:34                           ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-16  3:34 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg,
	Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Mon, Apr 16, 2007 at 01:15:27PM +1000, Con Kolivas wrote:
> On Monday 16 April 2007 12:28, Nick Piggin wrote:
> > So, on to something productive, we have 3 candidates for a new scheduler so
> > far. How do we decide which way to go? (and yes, I still think switchable
> > schedulers is wrong and a copout) This is one area where it is virtually
> > impossible to discount any decent design on correctness/performance/etc.
> > and even testing in -mm isn't really enough.
> 
> We're in agreement! YAY!
> 
> Actually this is simpler than that. I'm taking SD out of the picture. It has 
> served its purpose of proving that we need to seriously address all the 
> scheduling issues and did more than a half decent job at it. Unfortunately I 
> also cannot sit around supporting it forever by myself. My own life is more 
> important, so consider SD not even running the race any more.
> 
> I'm off to continue maintaining permanent-out-of-tree leisurely code at my own 
> pace. What's more is, I think I'll just stick to staircase Gen I version blah 
> and shelve SD and try to have fond memories of SD as an intellectual 
> prompting exercise only.

Well I would hope that _if_ we decide to switch schedulers, then you
get a chance to field something (and I hope you will decide to and have
time to), and I hope we don't rush into the decision.

We've had the current scheduler for so many years now that it is much
more important to make sure we take the time to do the right thing
rather than absolutely have to merge a new scheduler right now ;)


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  3:04             ` William Lee Irwin III
@ 2007-04-16  5:09               ` Peter Williams
  2007-04-16 11:04                 ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-16  5:09 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> One of the reasons I never posted my own code is that it never met its
>>> own design goals, which absolutely included switching on the fly. I
>>> think Peter Williams may have done something about that.
>>> It was my hope
>>> to be able to do insmod sched_foo.ko until it became clear that the
>>> effort it was intended to assist wasn't going to get even the limited
>>> hardware access required, at which point I largely stopped working on
>>> it.
> 
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> I didn't but some students did.
>> In a previous life, I did implement a runtime configurable CPU 
>> scheduling mechanism (implemented on Tru64, Solaris and Linux) that 
>> allowed schedulers to be loaded as modules at run time.  This was 
>> released commercially on Tru64 and Solaris.  So I know that it can be done.
>> I have thought about doing something similar for the SPA schedulers 
>> which differ in only small ways from each other but lack motivation.
> 
> Driver models for scheduling are not so far out. AFAICS it's largely a
> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
> switching out intra-queue policies vs. switching out whole-system
> policies, SMP handling and all. Whether this involves load balancing
> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
> scheduler module, for instance, would not have a load balancer at all,
> as it has only one global runqueue. There are other sorts of policies
> wanting significant changes to SMP handling vs. the stock load
> balancing.

Well a single run queue removes the need for load balancing but has 
scalability issues on large systems.  Personally, I think something in 
between would be the best solution i.e. multiple run queues but more 
than one CPU per run queue.  I think that this would be a particularly 
good solution to the problems introduced by hyper threading and multi 
core systems and also NUMA systems.  E.g. if all CPUs in a hyper thread 
package are using the one queue then the case where one CPU is trying to 
run a high priority task and the other a low priority task (i.e. the 
cases that the sleeping dependent mechanism tried to address) is less 
likely to occur.
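
As a rough illustration of the in-between layout described above (several 
run queues, each shared by more than one CPU), here is a minimal C sketch.  
The names rq_map, rq_count and rq_for_cpu are hypothetical and not taken 
from any kernel tree, and the sketch assumes that the sibling CPUs of one 
package are numbered consecutively.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of how CPUs are grouped onto run queues.
 * Assumption: siblings of one package have consecutive CPU numbers. */
struct rq_map {
	size_t nr_cpus;        /* total CPUs in the system */
	size_t cpus_per_queue; /* CPUs sharing one run queue */
};

/* Number of run queues needed, rounding up for a partial last group. */
size_t rq_count(const struct rq_map *m)
{
	assert(m->cpus_per_queue > 0);
	return (m->nr_cpus + m->cpus_per_queue - 1) / m->cpus_per_queue;
}

/* Index of the run queue that serves the given CPU. */
size_t rq_for_cpu(const struct rq_map *m, size_t cpu)
{
	assert(m->cpus_per_queue > 0);
	assert(cpu < m->nr_cpus);
	return cpu / m->cpus_per_queue;
}
```

With 8 CPUs and cpus_per_queue = 2, CPUs 0 and 1 share queue 0 and CPU 7 
lands on queue 3.  Setting cpus_per_queue to nr_cpus gives the single 
global queue case and setting it to 1 gives per-CPU queues, so the two 
extremes are just special cases of the same mapping.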

By the way, I think that it's a very bad idea for the scheduling 
mechanism and the load balancing mechanism to be coupled.  The anomalies 
that will be experienced and the attempts to make ad hoc fixes for them 
will lead to complexity spiralling out of control.

> 
> 
> William Lee Irwin III wrote:
>>> I'm not sure what happened there. It wasn't a big enough patch to take
>>> hits in this area due to getting overwhelmed by the programming burden
>>> like some other efforts of mine. Maybe things started getting ugly once
>>> on-the-fly switching entered the picture. My guess is that Peter Williams
>>> will have to chime in here, since things have diverged enough from my
>>> one-time contribution 4 years ago.
> 
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> From my POV, the current version of plugsched is considerably simpler 
>> than it was when I took the code over from Con as I put considerable 
>> effort into minimizing code overlap in the various schedulers.
>> I also put considerable effort into minimizing any changes to the load 
>> balancing code (something Ingo seems to think is a deficiency) and the 
>> result is that plugsched allows "intra run queue" scheduling to be 
>> easily modified WITHOUT affecting load balancing.  To my mind scheduling 
>> and load balancing are orthogonal and keeping them that way simplifies 
>> things.
> 
> ISTR rearranging things for Con in such a fashion that it no longer
> worked out of the box (though that wasn't the intention; restructuring it
> to be more suited to his purposes was) and that's what he worked off of
> afterward. I don't remember very well what changed there as I clearly
> invested less effort there than the prior versions. Now that I think of
> it, that may have been where the sample policy demonstrating scheduling
> classes was lost.

I can't comment here as (as far as I can recall) I never saw your code 
and only became involved when Con posted his version of cpusched.

> 
> 
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> As Ingo correctly points out, plugsched does not allow different 
>> schedulers to be used per CPU but it would not be difficult to modify it 
>> so that they could.  Although I've considered doing this over the years 
>> I decided not to as it would just increase the complexity and the amount 
>> of work required to keep the patch set going.  About six months ago I 
>> decided to reduce the amount of work I was doing on plugsched (as it was 
>> obviously never going to be accepted) and now only publish patches 
>> against the vanilla kernel's major releases (and the only reason that I 
>> kept doing that is that the download figures indicated that about 80 
>> users were interested in the experiment).
> 
> That's a rather different goal from what I was going on about with it,
> so it's all diverged quite a bit.

Yes, pragmatic considerations dictated a change of tack.

> Where I had a significant need for
> mucking with the entire concept of how SMP was handled, this is rather
> different.

Yes, I went with the idea of intra run queue scheduling being orthogonal 
to load balancing for two reasons:

1. I think that coupling them is a bad idea from the complexity POV, and
2. it's enough of a battle fighting for modifications to one bit of the 
code without trying to do it to two simultaneously.

> At this point I'm questioning the relevance of my own work,
> though it was already relatively marginal as it started life as an
> attempt at a sort of debug patch to help gang scheduling (which is in
> itself a rather marginally relevant feature to most users) code along.

The main commercial plug in scheduler used with the run time loadable 
module scheduler that I mentioned earlier did gang scheduling (at the 
insistence of the Tru64 kernel folks).  As this scheduler was a 
hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" 
("unfairly" really, i.e. according to an allocation policy) among higher 
level entities such as users, groups and applications as well as 
processes; it was fairly easy to make it a gang scheduler by modifying 
it to give all of a process's threads the same priority based on the 
process's CPU usage rather than different priorities based on the 
threads' usage rates.  In fact, it would have been possible to select 
between gang and non gang on a per process basis if that was considered 
desirable.
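
To make the gang versus non gang distinction above concrete, here is a 
minimal C sketch.  It is not the commercial scheduler's code: the usage 
units, the usage to priority mapping and the names prio_from_usage and 
assign_priorities are all assumptions made up for illustration.  The only 
point it shows is that gang mode derives every thread's priority from the 
process's total usage, while non gang mode uses each thread's own usage.

```c
#include <assert.h>
#include <stddef.h>

#define WORST_PRIO 39 /* illustrative bound; larger number = lower priority */

/* Hypothetical mapping from CPU usage (arbitrary units) to a priority:
 * the more CPU has been used, the worse the priority becomes. */
int prio_from_usage(unsigned long usage)
{
	unsigned long p = usage / 100;

	return p > WORST_PRIO ? WORST_PRIO : (int)p;
}

/* Fill prio[] for the n threads of one process.
 * gang != 0: every thread gets the priority of the process's total usage.
 * gang == 0: every thread gets a priority from its own usage. */
void assign_priorities(const unsigned long *thread_usage, int *prio,
		       size_t n, int gang)
{
	unsigned long total = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		total += thread_usage[i];
	for (i = 0; i < n; i++)
		prio[i] = prio_from_usage(gang ? total : thread_usage[i]);
}
```

Because the mode is just an argument here, selecting gang or non gang 
behaviour per process amounts to storing that flag with the process.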

The fact that threads and processes are distinct entities on Tru64 and 
Solaris made this easier to do on them than on Linux.

My experience with this scheduler leads me to believe that to achieve 
gang scheduling and fairness, etc. you need (usage) statistics based 
schedulers.

> 
> 
> On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote:
>> PS I no longer read LKML (due to time constraints) and would appreciate 
>> it if I could be CC'd on any e-mails suggesting scheduler changes.
>> PPS I'm just happy to see that Ingo has finally accepted that the 
>> vanilla scheduler was badly in need of fixing and don't really care who 
>> fixes it.
>> PPS Different schedulers for different aims (i.e. server or work 
>> station) do make a difference.  E.g. the spa_svr scheduler in plugsched 
>> does about 1% better on kernbench than the next best scheduler in the bunch.
>> PPPS Con, fairness isn't always best as humans aren't very altruistic 
>> and we need to give unfair preference to interactive tasks in order to 
>> stop the users flinging their PCs out the window.  But the current 
>> scheduler doesn't do this very well and is also not very good at 
>> fairness so needs to change.  But the changes need to address 
>> interactive response and fairness not just fairness.
> 
> Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are
> better ones. I'd not bother citing kernel compile results.

spa_svr actually does its best work when the system isn't fully loaded 
as the type of improvement it strives to achieve (minimizing on queue 
wait time) hasn't got much room to manoeuvre when the system is fully 
loaded.  Therefore, the fact that it's 1% better even in these 
circumstances is a good result and also indicates that the overhead for 
keeping the scheduling statistics it uses for its decision making is 
well spent.  Especially, when you consider that the total available room 
for improvement on this benchmark is less than 3%.

To elaborate, the motivation for this scheduler was acquired from the 
observation of scheduling statistics (in particular, on queue wait time) 
on systems running at about 30% to 50% load.  Theoretically, at these 
load levels there should be no such waiting but the statistics show that 
there is considerable waiting (sometimes as high as 30% to 50%).  I put 
this down to "lack of serendipity" e.g.  everyone sleeping at the same 
time and then trying to run at the same time would be complete lack of 
serendipity.  On the other hand, if everyone is synced then there would 
be total serendipity.
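
The effect can be reproduced with a toy model.  The C sketch below is only 
an illustration under simplifying assumptions (one CPU, discrete ticks, 
identical periodic tasks, lowest-numbered runnable task runs first) and 
the name total_wait is made up; it is not how spa_svr gathers its 
statistics.  It counts the ticks that tasks spend runnable but not 
running.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16

/* Simulate n periodic tasks on one CPU for 'horizon' ticks.
 * Task i wakes at offset[i], offset[i] + period, ... and needs 'burst'
 * ticks of CPU per wake-up.  Returns the total number of ticks that
 * tasks spent runnable but not running (on queue wait time). */
unsigned long total_wait(const unsigned *offset, size_t n,
			 unsigned period, unsigned burst,
			 unsigned long horizon)
{
	unsigned remaining[MAX_TASKS] = { 0 };
	unsigned long wait = 0, t;
	size_t i;

	assert(n <= MAX_TASKS);
	assert(period > 0);
	for (t = 0; t < horizon; t++) {
		int cpu_busy = 0;

		/* Wake-ups: a new burst of work arrives for the task. */
		for (i = 0; i < n; i++)
			if (t >= offset[i] && (t - offset[i]) % period == 0)
				remaining[i] += burst;

		/* The first runnable task gets the CPU for this tick,
		 * every other runnable task waits on the queue. */
		for (i = 0; i < n; i++) {
			if (remaining[i] == 0)
				continue;
			if (!cpu_busy) {
				remaining[i]--;
				cpu_busy = 1;
			} else {
				wait++;
			}
		}
	}
	return wait;
}
```

Two tasks that each need 1 tick out of every 4 load the CPU to 50%.  If 
both wake on the same tick (offsets 0 and 0) the function returns 10 
waiting ticks over a 40 tick run, whereas with offsets 0 and 2 it returns 
0, even though the load is identical in both cases.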

Obviously, from the POV of a client, time the server task spends waiting 
on the queue adds to the response time for any request that has been 
made so reduction of this time on a server is a good thing(tm).  Equally 
obviously, trying to achieve this synchronization by asking the tasks to 
cooperate with each other is not a feasible solution and some external 
influence needs to be exerted and this is what spa_svr does -- it nudges 
the scheduling order of the tasks in a way that makes them become well 
synced.

Unfortunately, this is not a good scheduler for an interactive system as 
it minimizes the response times for ALL tasks (and the system as a 
whole) and this can result in increased response time for some 
interactive tasks (clunkiness) which annoys interactive users.  When you 
start fiddling with this scheduler to bring back "interactive 
unfairness" you kill a lot of its superior low overall wait time 
performance.

So this is why I think "horses for courses" schedulers are worthwhile.


> 
> In any event, I'm not sure what to say about different schedulers for
> different aims. My intentions with plugsched were not centered around
> production usage or intra-queue policy. I'm relatively indifferent to
> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
> in mainline. I don't see any particular harm in it, but neither am I
> particularly motivated to have it in.

If you look at the struct sched_spa_child in the file 
include/linux/sched_spa.h you'll see that the interface for switching 
between the various SPA schedulers is quite small and making them 
runtime switchable would be easy (I haven't done this in cpusched as I 
wanted to keep the same interface for switching schedulers for all 
schedulers: i.e. all run time switchable or none run time switchable; as 
the main aim of plugsched had become a mechanism for evaluating 
different intra queue scheduling designs.)

> I had a rather strong sense of
> instrumentality about it, and since it became useless to me (at a
> conceptual level; the implementation was never finished to the point of
> dynamic loading of scheduler modules) for assisting development on
> large systems via reboot avoidance by dint of it becoming clear that
> access to such was never going to happen, I've stopped looking at it.

I'll probably stop looking at this problem as well at least for the time 
being until all this new code has settled.

Peter
PS As I no longer read LKML, I haven't yet seen Ingo's or Con's or 
Nick's new schedulers so am unable to comment on their technical 
merits with respect to my comments above.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:05   ` Ingo Molnar
  2007-04-15 20:05     ` Matt Mackall
@ 2007-04-16  5:16     ` Con Kolivas
  2007-04-16  5:48       ` Gene Heskett
  1 sibling, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-16  5:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Monday 16 April 2007 01:05, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > 2. Since then I've been thinking/working on a cpu scheduler design
> > that takes all the guesswork out of scheduling and gives very
> > predictable, as fair as possible, cpu distribution and latency while
> > preserving as solid interactivity as possible within those confines.
>
> yeah. I think you were right on target with this call.

Yay thank goodness :) It's time to fix the damn cpu scheduler once and for 
all. Everyone uses this; it's no minor driver or $bigsmp or $bigram or 
$small_embedded_RT_hardware feature.

> I've applied the 
> sched.c change attached at the bottom of this mail to the CFS patch, if
> you dont mind. (or feel free to suggest some other text instead.)

>   *  2003-09-03	Interactivity tuning by Con Kolivas.
>   *  2004-04-02	Scheduler domains code by Nick Piggin
> + *  2007-04-15	Con Kolivas was dead right: fairness matters! :)

LOL that's awful. I'd prefer something meaningful like "Work begun on 
replacing all interactivity tuning with a fair virtual-deadline design by Con 
Kolivas".

While you're at it, it's worth getting rid of a few slightly pointless name 
changes too. Don't rename SCHED_NORMAL yet again, and don't call all your 
things sched_fair blah_fair __blah_fair and so on. It means that anything 
else is by proxy going to be considered unfair. Leave SCHED_NORMAL as is, 
replace the use of the word _fair with _cfs. I don't really care how many 
copyright notices you put into our already noisy bootup but it's redundant 
since there is no choice; we all get the same cpu scheduler.

> > 1. I tried in vain some time ago to push a working extensible
> > pluggable cpu scheduler framework (based on wli's work) for the linux
> > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he
> > didn't like it) as being absolutely the wrong approach and that we
> > should never do that. [...]
>
> i partially replied to that point to Will already, and i'd like to make
> it clear again: yes, i rejected plugsched 2-3 years ago (which already
> drifted away from wli's original codebase) and i would still reject it
> today.

No that was just me being flabbergasted by what appeared to be you posting 
your own plugsched. Note nowhere in the 40 iterations of rsdl->sd did I 
ask/suggest for plugsched. I said in my first announcement my aim was to 
create a scheduling policy robust enough for all situations rather than 
fantastic a lot of the time and awful sometimes. There are plenty of people 
ready to throw out arguments for plugsched now and I don't have the energy to 
continue that fight (I never did really).

But my question still stands about this comment:

>   case, all of SD's logic could be added via a kernel/sched_sd.c module
>   as well, if Con is interested in such an approach. ]

What exactly would be the purpose of such a module that governs nothing in 
particular? Since there'll be no pluggable scheduler by your admission it has 
no control over SCHED_NORMAL, and would require another scheduling policy for 
it to govern which there is no express way to use at the moment and people 
tend to just use the default without great effort. 

> First and foremost, please dont take such rejections too personally - i
> had my own share of rejections (and in fact, as i mentioned it in a
> previous mail, i had a fair number of complete project throwaways:
> 4g:4g, in-kernel Tux, irqrate and many others). I know that they can
> hurt and can demoralize, but if i dont like something it's my job to
> tell that.

Hmm? No that's not what this is about. Remember dynticks which was not 
originally my code but I tried to bring it up to mainline standard which I 
fought with for months? You came along with yet another rewrite from scratch 
and the flaws in the design I was working with were obvious so I instantly 
bowed down to that and never touched my code again. I didn't ask for credit 
back then, but obviously brought the requirement for a no idle tick 
implementation to the table.

> My view about plugsched: first please take a look at the latest
> plugsched code:
>
>   http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch
>
>   26 files changed, 8951 insertions(+), 1495 deletions(-)
>
> As an experiment i've removed all the add-on schedulers (both the core
> and the include files, only kept the vanilla one) from the plugsched
> patch (and the makefile and kconfig complications, etc), to see the
> 'infrastructure cost', and it still gave:
>
>   12 files changed, 1933 insertions(+), 1479 deletions(-)

I do not see extra code per-se as being a bad thing. I've heard said a few 
times before "ever notice how when the correct solution is done it is a lot 
more code than the quick hack that ultimately fails?". Insert long winded 
discussion of perfect is the enemy of good here, _but_ I'm not arguing 
perfect versus good, I'm talking about solid code versus quick fix. Again, 
none of this comment is directed specifically at this implementation of 
plugsched, its code quality or intent, but using "extra code is bad" as an 
argument is not enough.

> By your logic Mike should in fact be quite upset about this: if the 
> new code works out and proves to be useful then it obsoletes a whole lot
> of code of him!

> > [...] However at one stage I virtually begged for support with my
> > attempts and help with the code. Dmitry Adamushko is the only person
> > who actually helped me with the code in the interim, while others
> > poked sticks at it. Sure the sticks helped at times but the sticks
> > always seemed to have their ends kerosene doused and flaming for
> > reasons I still don't get. No other help was forthcoming.


> Hey, i told this to you as recently as 1 month ago as well:
>
>    http://lkml.org/lkml/2007/3/8/54
>
>    "cool! I like this even more than i liked your original staircase
>     scheduler from 2 years ago :)"

Email has an awful knack of disguising intent so I took that on face value 
that you did like the idea :). 

Above when I said "no other help was forthcoming" all I was hoping for was 
really simple obvious bugfixes to help me along while I was laid up in bed 
such as "I like what you're doing but oh your use of memset here is bogus, 
here is a one line patch". I wasn't specifically expecting you to fix my 
code; you've got truckloads of things you need to do. 

It just reminds me that the concept of "release early, release often" doesn't 
actually work in the kernel. What is far more obvious is "release code only 
when it's so close to perfect that no one can argue against it" since most of 
the work is done by one person, otherwise someone will come out with a 
counterpatch that is _complete_ earlier but in all possibility not as good, 
it's just ready sooner. *NOTE* In no way am I saying your code is not as good 
as mine; I would have to say exactly the opposite is true pretty much always 
(<sarcasm>conversely then I doubt if I dropped you in my work environment 
you'd do as good a job as I do</sarcasm>). At one stage wli (again at my 
request) put together a quick hack to check for non-preemptible regions 
within the kernel. From that quick hack you adopted it and turned it into 
that beautiful latency tracer that is the cornerstone of the -rt tree 
testing. However, there are many instances I've seen good evolving code in 
the linux kernel be trumped by not-as-good but already-working alternatives 
written from scratch with no reference to the original work. This is the NIH 
(not invented here) mechanism I see happening that is worth objecting to.

What you may find most amusing is the very first iterations of RSDL looked 
_nothing_ like the mainline scheduler. There were all sorts of different 
structures, mechanisms, one priority array, plans to remove scheduler_tick 
entirely and so on. Most of those were never made for public consumption. I 
spent about half a dozen iterations of RSDL removing all of that and making 
it as close to the mainline design as possible, thus minimising the size of 
the patch, and to make it readily readable for most people familiar with the 
scheduler policy code in sched.c (all 5 of them). I should have just said 
bugger it and started everything from scratch with little to no reference to 
the original scheduler but found myself obliged to try to do things the 
minimal code patch size readable difference thingy that was valued in linux 
kernel development. I think the radically different approach would have been 
better in the long run. Trying to play ball I ruined it.

Either way I've decided for myself, my family, my career and my sanity I'm 
abandoning SD. I will shelve SD and try to have fond memories of SD as an 
intellectual prompting exercise only.

> 	Ingo

-- 
-ck


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:39             ` Ingo Molnar
  2007-04-15 15:47               ` William Lee Irwin III
@ 2007-04-16  5:27               ` Peter Williams
  2007-04-16  6:23                 ` Peter Williams
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-16  5:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> * Willy Tarreau <w@1wt.eu> wrote:
> 
>> Ingo could have publicly spoken with them about his ideas of killing 
>> the O(1) scheduler and replacing it with an rbtree-based one, [...]
> 
> yes, that's precisely what i did, via a patchset :)
> 
> [ I can even tell you when it all started: i was thinking about Mike's
>   throttling patches while watching Manchester United beat the crap out
>   of AS Roma (7 to 1 end result), Tuesday evening. I started coding it
>   Wednesday morning and sent the patch Friday evening. I very much
>   believe in low-latency when it comes to development too ;) ]
> 
> (if this had been done via a committee then today we'd probably still be 
> trying to find a suitable timeslot for the initial conference call where 
> we'd discuss the election of a chair who would be tasked with writing up 
> an initial document of feature requests, on which we'd take a vote, 
> possibly this year already, because the matter is really urgent you know 
> ;-)
> 
>> [...] and using part of Bill's work to speed up development.
> 
> ok, let me make this absolutely clear: i didnt use any bit of plugsched 
> - in fact the most difficult bits of the modularization were for areas of 
> sched.c that plugsched never even touched AFAIK. (the load-balancer for 
> example.)

This sounds like your new scheduler intends to increase the coupling 
between scheduling and load balancing.  I think that this would be a 
mistake and lead (down the track) to spiralling complexity as you make 
changes to the code to address the corner conditions that it will create.

> 
> Plugsched simply does something else: i modularized scheduling policies 
> in essence that have to cooperate with each other, while plugsched 
> modularized complete schedulers which are compile-time or boot-time 
> selected, with no runtime cooperation between them. (one has to be 
> selected at a time)

You can't really have more than one scheduler operating in the same 
priority range on the same CPU as they will be fighting each other 
trying to achieve their separate and not necessarily compatible (in fact 
highly likely to be incompatible) aims.  Multiple schedulers on the same 
CPU have to have a pecking order just like SCHED_OTHER and real time 
policies.  It wouldn't be hard to prove that mixing SCHED_RR and SCHED_FIFO 
is a problem waiting to happen if ever someone tried to use them both on a 
highly real time system.
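
The pecking order idea can be sketched in a few lines of C.  This is 
neither plugsched nor CFS code: the type sched_class_sketch and the 
function pick_class are hypothetical names, and the sketch only assumes 
that each class can report how many runnable tasks it has.  The classes 
are kept in an array ordered from highest to lowest, and a lower class is 
consulted only when every class above it has nothing to run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-class bookkeeping, for illustration only. */
struct sched_class_sketch {
	const char *name;
	int nr_runnable; /* tasks of this class that are ready to run */
};

/* classes[] is ordered from highest to lowest in the pecking order.
 * Returns the first class that has a runnable task, or NULL if the
 * CPU would go idle. */
const struct sched_class_sketch *
pick_class(const struct sched_class_sketch *classes, size_t n)
{
	size_t i;

	assert(classes != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (classes[i].nr_runnable > 0)
			return &classes[i];
	return NULL;
}
```

With a real time class placed ahead of a SCHED_OTHER style class, the 
second is picked only while the first has no runnable tasks, which is the 
strict ordering that keeps two policies from fighting over the same 
priority range on one CPU.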

> 
> (and i have no trouble at all with crediting Will's work either: a few 
> years ago i used Will's PID rework concepts for an NPTL related speedup 
> and Will is very much credited for it in today's kernel/pid.c and he 
> continued to contribute to it later on.)
> 
> (the tree walking bits of sched_fair.c were in fact derived from 
> kernel/hrtimer.c, the rbtree code written by Thomas and me :-)
> 
> 	Ingo

Are your new patches available somewhere for easy download or do I have 
to try to dig them out of the mailing list archive?  Or could you mail 
them to me separately?  I'm keen to see how your new scheduler proposal 
works.

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 16:25       ` Arjan van de Ven
@ 2007-04-16  5:36         ` Bill Huey
  2007-04-16  6:17           ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Bill Huey @ 2007-04-16  5:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list,
	Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Thomas Gleixner, Bill Huey (hui)

On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> Now this doesn't mean that people shouldn't be nice to each other, not
> cooperate or steal credits, but I don't get the impression that that is
> happening here. Ingo is taking part in the discussion with a counter
> proposal for discussion *on the mailing list*. What more do you want??

Con should have been CCed from the first moment this was put into motion
to limit the perception of exclusion. That was mistake number one and a
big-time failure to understand this dynamic. After all, it was Con's idea.
Why the hell he was excluded from Ingo's development process is baffling to
me and him (most likely).

He put a lot of effort into SDL and his experiences with scheduling
should still be seriously considered in this development process even if
he doesn't write a single line of code from this moment on.

What should have happened is that our very busy associate at RH by the
name of Ingo Molnar should have leveraged more of Con's and Bill's work
and used them as a proxy for his own ideas. They would have loved to have
contributed more and our very busy Ingo Molnar would have gotten a lot
of his work and ideas implemented without him even opening a single
source file for editing. They would have happily done this work for
Ingo. Ingo could have been used for something else more important like
making KVM less of a freaking ugly hack and we all would have benefitted
from this.

He could have been working on SystemTap so that you stop losing accounts
to Sun and Solaris 10's DTrace. He could have been working with Riel to
fix your butt ugly page scanning problem causing horrible contention via
the Clock/Pro algorithm, etc... He could have been fixing the ugly futex
rwsem mapping problem that's killing -rt and anything that uses Posix
threads. He could have created a userspace thread control block (TCB)
with Mr. Drepper so that we can turn off preemption in userspace
(userspace per CPU local storage) and implement a very quick non-kernel
crossing implementation of priority ceilings (userspace check for priority
and flags at preempt_schedule() in the TCB) so that our -rt Posix API
doesn't suck donkey shit... Need I say more ?

As programmers like Ingo get spread more thinly, he needs super smart
folks like Bill Irwin and Con to help him out, and he needs to learn to
resist NIH'ing other folks' stuff out of some weird fear. When this happens,
folks like Ingo must learn to "facilitate" development in addition to
implementing it with those kinds of folks.

It takes time and practice to entrust folks to do things for him.
Ingo is the best means of taking new Linux kernel ideas and communicating
them to Linus. His value goes beyond just code; he is often the
biggest hammer we have in the Linux community to get stuff into the
kernel. "Facilitation" of others is something that solo programmers must
learn as groups like the Linux kernel get larger and larger every year.

Understand ? Are we in embarrassing agreement here ?

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 23:09     ` Pavel Pisa
@ 2007-04-16  5:47       ` Davide Libenzi
  2007-04-17  0:37         ` Pavel Pisa
  0 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-16  5:47 UTC (permalink / raw)
  To: Pavel Pisa
  Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Mon, 16 Apr 2007, Pavel Pisa wrote:

> I cannot help but report results with the GAVL
> tree algorithm here as another competitor in the race.
> I believe that it is a better solution for large priority
> queues than the RB-tree and even heap trees. On the other hand,
> it is debatable whether the scheduler needs such scalability.
> The AVL heritage guarantees a lower height,
> which results in shorter search times, which could
> be profitable for other uses in the kernel.
> 
> The GAVL algorithm is AVL tree based, so it does not suffer from
> the "infinite" priority granularity problem as TR does. It can be
> used in a generalized case where the tree is not fully balanced.
> This allows cutting the first item without rebalancing.
> This degrades the tree by at most one more level
> (than a non-degraded AVL tree gives), which is still
> considerably better than the RB-tree's maximum.
> 
> http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c

Here are the results on my Opteron 252:

Testing N=1
gavl_cfs = 187.20 cycles/loop
CFS = 194.16 cycles/loop
TR  = 314.87 cycles/loop
CFS = 194.15 cycles/loop
gavl_cfs = 187.15 cycles/loop

Testing N=2
gavl_cfs = 268.94 cycles/loop
CFS = 305.53 cycles/loop
TR  = 313.78 cycles/loop
CFS = 289.58 cycles/loop
gavl_cfs = 266.02 cycles/loop

Testing N=4
gavl_cfs = 452.13 cycles/loop
CFS = 518.81 cycles/loop
TR  = 311.54 cycles/loop
CFS = 516.23 cycles/loop
gavl_cfs = 450.73 cycles/loop

Testing N=8
gavl_cfs = 609.29 cycles/loop
CFS = 644.65 cycles/loop
TR  = 308.11 cycles/loop
CFS = 667.01 cycles/loop
gavl_cfs = 592.89 cycles/loop

Testing N=16
gavl_cfs = 686.30 cycles/loop
CFS = 807.41 cycles/loop
TR  = 317.20 cycles/loop
CFS = 810.24 cycles/loop
gavl_cfs = 688.42 cycles/loop

Testing N=32
gavl_cfs = 756.57 cycles/loop
CFS = 852.14 cycles/loop
TR  = 301.22 cycles/loop
CFS = 876.12 cycles/loop
gavl_cfs = 758.46 cycles/loop

Testing N=64
gavl_cfs = 831.97 cycles/loop
CFS = 997.16 cycles/loop
TR  = 304.74 cycles/loop
CFS = 1003.26 cycles/loop
gavl_cfs = 832.83 cycles/loop

Testing N=128
gavl_cfs = 897.33 cycles/loop
CFS = 1030.36 cycles/loop
TR  = 295.65 cycles/loop
CFS = 1035.29 cycles/loop
gavl_cfs = 892.51 cycles/loop

Testing N=256
gavl_cfs = 963.17 cycles/loop
CFS = 1146.04 cycles/loop
TR  = 295.35 cycles/loop
CFS = 1162.04 cycles/loop
gavl_cfs = 966.31 cycles/loop

Testing N=512
gavl_cfs = 1029.82 cycles/loop
CFS = 1218.34 cycles/loop
TR  = 288.78 cycles/loop
CFS = 1257.97 cycles/loop
gavl_cfs = 1029.83 cycles/loop

Testing N=1024
gavl_cfs = 1091.76 cycles/loop
CFS = 1318.47 cycles/loop
TR  = 287.74 cycles/loop
CFS = 1311.72 cycles/loop
gavl_cfs = 1093.29 cycles/loop

Testing N=2048
gavl_cfs = 1153.03 cycles/loop
CFS = 1398.84 cycles/loop
TR  = 286.75 cycles/loop
CFS = 1438.68 cycles/loop
gavl_cfs = 1149.97 cycles/loop


There seems to be some difference from your numbers. This is with:

gcc version 4.1.2

and -O2. But then an Opteron can behave quite differently than a Duron on 
a bench like this ;)



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  5:16     ` Con Kolivas
@ 2007-04-16  5:48       ` Gene Heskett
  0 siblings, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-16  5:48 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Peter Williams, linux-kernel, Linus Torvalds,
	Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Monday 16 April 2007, Con Kolivas wrote:

And I snipped, Sorry fellas.

Con's original submission was to me, quite an improvement.  But I have to say 
it, and no denigration of your efforts is intended Con, but you did 'pull the 
trigger' and get this thing rolling by scratching the itch & drawing 
attention to an ugly lack of user interactivity that had crept into the 2.6 
family.  So from me to Con, a tip of the hat, and a deep bow in your 
direction, thank you.  Now, you have done what you aimed to do, so please get 
well.

I've now been through most of an amanda session using Ingo's "CFS" and I have 
to say that it is another improvement over your 0.40 that is just as 
obvious as your first patch was against the stock scheduler.  No other 
scheduler yet has allowed the full utilization of the cpu, and maintained 
user interactivity as well as this one has,  my cpu is running about 5 
degrees F hotter just from this effect alone.  gzip, if the rest of the 
system is in between tasks, is consistently showing around 95%, but let 
anything else stick up its hand, like procmail etc, and gzip now dutifully 
steps aside, dropping into the 40% range until procmail and spamd are done, 
at which point there is no rest for the wicked and the cpu never gets a 
chance to cool.

There was, just now, a pause of about 2 seconds, while amanda moved a tarball 
from the holding disk area on /dev/hda to the vtapes disk on /dev/hdd, so 
that would have been an I/O bound situation.

This one Ingo, even without any other patches and I think I did see one go by 
in this thread which I didn't apply, is a definite keeper.  Sweet even.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
A word to the wise is enough.
		-- Miguel de Cervantes


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  5:36         ` Bill Huey
@ 2007-04-16  6:17           ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-16  6:17 UTC (permalink / raw)
  To: Bill Huey
  Cc: Arjan van de Ven, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Peter Williams, linux-kernel, Linus Torvalds,
	Andrew Morton, Thomas Gleixner

On Sun, Apr 15, 2007 at 10:36:29PM -0700, Bill Huey wrote:
> On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote:
> > Now this doesn't mean that people shouldn't be nice to each other, not
> > cooperate or steal credits, but I don't get the impression that that is
> > happening here. Ingo is taking part in the discussion with a counter
> > proposal for discussion *on the mailing list*. What more do you want??
> 
> Con should have been CCed from the first moment this was put into motion
> to limit the perception of exclusion. That was mistake number one and big
> time failure to understand this dynamic. After all, it was Con's idea. Why
> the hell he was excluded from Ingo's development process is baffling to
> me and him (most likely).

Ingo's scheduler is completely different to any I've seen proposed
for Linux. And after he did an initial implementation, he did post
it to everyone.

Maybe something he said offended someone, but the process followed
is exactly how Linux kernel development works (ie. if you think you
can do better, then write the code). Sometimes you can give suggestions,
but other times if you come up with a different idea then it is
better just to do it yourself.

Con's code is still out there. If it is better than Ingo's then it
should win out. Nobody has a monopoly on schedulers or ideas or
posting patches.


> He put a lot of effort into SDL and his experiences with scheduling
> should still be seriously considered in this development process even if
> he doesn't write a single line of code from this moment on.
> 
> What should have happened is that our very busy associate at RH by the
> name of Ingo Molnar should have leveraged more of Con's and Bill's work
> and use them as a proxy for his own ideas. They would have loved to have
> contributed more and our very busy Ingo Molnar would have gotten a lot
> of his work and ideas implemented without him even opening a single
> source file for editing. They would have happily done this work for
> Ingo. Ingo could have been used for something else more important like
> making KVM less of a freaking ugly hack and we all would have benefitted
> from this.
> 
> He could have been working on SystemTap so that you stop losing accounts
> to Sun and Solaris 10's Dtrace. He could have been working with Riel to
> fix your butt ugly page scanning problem causing horrible contention via
> the Clock/Pro algorithm, etc... He could have been fixing the ugly futex
> rwsem mapping problem that's killing -rt and anything that uses Posix
> threads. He could have created a userspace thread control block (TCB)
> with Mr. Drepper so that we can turn off preemption in userspace
> (userspace per CPU local storage) and implement a very quick non-kernel
> crossing implementation of priority ceilings (userspace check for priority
> and flags at preempt_schedule() in the TCB) so that our -rt Posix API
> doesn't suck donkey shit... Need I say more ?

Well that's some pretty strong criticism of Linux and of someone who
does a great deal to improve things... Let's stick to the topic of
schedulers in this thread and try keeping it constructive.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  5:27               ` Peter Williams
@ 2007-04-16  6:23                 ` Peter Williams
  2007-04-16  6:40                   ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-16  6:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:
> 
> Are your new patches available somewhere for easy download or do I have 
> to try to dig them out of the mailing list archive?  Or could you mail 
> them to me separately?  I'm keen to see how your new scheduler proposal 
> works.

Forget about this.  I found the patch.

After a quick look, I like a lot of what I see especially the removal of 
the dual arrays in the run queue.

Some minor suggestions:

1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to 
initialize the task structure in init_task.h?
2. the on_rq field in the task structure is unnecessary as many years of 
experience with ingosched in plugsched indicates that 
!list_empty(&(p)->run_list) does the job provided list_del_init() is used 
when dequeueing and there is no noticeable overhead incurred so there's 
no gain by caching the result.  Also it removes the possibility of 
errors creeping in due to the value of on_rq being inconsistent with the 
task's actual state.
3. having modular load balancing is a good idea but it should be 
decoupled from the scheduler and provided as a separate interface.  This 
would enable different schedulers to use the same load balancer if they 
desired.
4. why rename SCHED_OTHER to SCHED_FAIR?  SCHED_OTHER's supposed to be 
fair(ish) anyway.
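
Suggestion 2 can be illustrated with a small userspace sketch. The list
helpers below are re-implemented so the example compiles on its own; they
only mimic the semantics of the kernel's list_empty() and list_del_init().
struct task, task_init(), enqueue_task(), dequeue_task() and task_on_rq()
are hypothetical names chosen for illustration and are not taken from the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* A node that points at itself is on no list. */
static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink the entry and re-initialize it so list_empty(entry) is true again. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical task carrying only the field needed here. */
struct task {
	struct list_head run_list;
};

static void task_init(struct task *p)
{
	INIT_LIST_HEAD(&p->run_list);
}

static void enqueue_task(struct task *p, struct list_head *rq)
{
	assert(list_empty(&p->run_list));	/* must not already be queued */
	list_add_tail(&p->run_list, rq);
}

static void dequeue_task(struct task *p)
{
	list_del_init(&p->run_list);
}

/* The queued state is derived from the list linkage, not cached in a flag. */
static bool task_on_rq(const struct task *p)
{
	return !list_empty(&p->run_list);
}
```

Because the queued state is read straight from the linkage, there is no
separate flag that could drift out of sync with it, provided every dequeue
goes through list_del_init().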

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
       [not found]                         ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
@ 2007-04-16  6:27                           ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-16  6:27 UTC (permalink / raw)
  To: Matthew Hawkins; +Cc: linux-kernel, ck list

On Mon, Apr 16, 2007 at 03:57:54PM +1000, Matthew Hawkins wrote:
> On 4/16/07, Nick Piggin <npiggin@suse.de> wrote:
> >
> >So, on to something productive, we have 3 candidates for a new scheduler
> >so
> >far. How do we decide which way to go? (and yes, I still think switchable
> >schedulers is wrong and a copout)
> 
> 
> I'm with you on that one.  It sounds good as a concept however there's
> various kernel structures etc that simply cannot be altered at runtime,
> which throws away the only advantage I can see of plugsched - a test/debug
> framework.
> 
> I think the best way is for those working on this stuff to keep producing
> their separate patches against mainline and people being encouraged to
> test.  THEN
> (and here comes the fun part) subsystem maintainers have to be prepared to
> accept code that is not their own or that of their IRC buddies.  I'm
> noticing this disturbing trend that Linux kernel development is becoming
> more and more like BSD where only the elite few ever get anywhere.  Con
> Kolivas, having a medical not CS degree, bruises the egos of those with CS
> degrees when he comes up with fairly clean, working, and widely-tested
> implementations of things like the staircase scheduler, R(SD)L, SCHED_ISO,
> swap prefetch, etc. when they can't.  We should be encouraging guys like

The thing is, it is really hard for anybody to change anything in page
reclaim or CPU scheduler. A few people saying a change is good for them
doesn't really mean anything because of the huge amount of diversity in
usages.

I've had my own CPU scheduler for 4 years and I and a few others think
it is better than mainline. I've tried to make many many VM changes
that haven't gone in.

Add to that, I don't actually know or care what sort of education most
kernel hackers have. I do know at least one of the more brilliant ones
does not have a CS degree, and I was able to get quite a few things in
before I had a degree (eg. rewrote IO scheduler and multiprocessor
CPU scheduler).


> It's all about the patches, baby

I don't know what would give anyone the idea that it isn't... patches
and numbers.

Nick


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  6:23                 ` Peter Williams
@ 2007-04-16  6:40                   ` Peter Williams
  2007-04-16  7:32                     ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-16  6:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:
> Peter Williams wrote:
>>
>> Are your new patches available somewhere for easy download or do I 
>> have to try to dig them out of the mailing list archive?  Or could you 
>> mail them to me separately?  I'm keen to see how your new scheduler 
>> proposal works.
> 
> Forget about this.  I found the patch.
> 
> After a quick look, I like a lot of what I see especially the removal of 
> the dual arrays in the run queue.
> 
> Some minor suggestions:
> 
> 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to 
> initialize the task structure in init_task.h?
> 2. the on_rq field in the task structure is unnecessary as many years of 
> experience with ingosched in plugsched indicates that 
> !list_empty(&(p)->run_list) does the job provided list_del_init() is used 
> when dequeueing and there is no noticeable overhead incurred so there's 
> no gain by caching the result.  Also it removes the possibility of 
> errors creeping in due to the value of on_rq being inconsistent with the 
> task's actual state.
> 3. having modular load balancing is a good idea but it should be 
> decoupled from the scheduler and provided as a separate interface.  This 
> would enable different schedulers to use the same load balancer if they 
> desired.
> 4. why rename SCHED_OTHER to SCHED_FAIR?  SCHED_OTHER's supposed to be 
> fair(ish) anyway.

One more quick comment.  The claim that there is no concept of time 
slice in the new scheduler is only true in the sense of the rather 
arcane implementation of time slices extant in the O(1) scheduler.  Your 
new parameter sched_granularity_ns is equivalent to the concept of time 
slice in most other kernels that I've peeked inside and computing 
literature in general (going back over several decades e.g. the magic 
garden).

Welcome to the mainstream,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 13:04   ` Ingo Molnar
@ 2007-04-16  7:16     ` Esben Nielsen
  0 siblings, 0 replies; 712+ messages in thread
From: Esben Nielsen @ 2007-04-16  7:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Esben Nielsen, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Sun, 15 Apr 2007, Ingo Molnar wrote:

>
> * Esben Nielsen <nielsen.esben@googlemail.com> wrote:
>
>> I took a brief look at it. Have you tested priority inheritance?
>
> yeah, you are right, it's broken at the moment, i'll fix it. But the
> good news is that i think PI could become cleaner via scheduling
> classes.
>
>> As far as I can see rt_mutex_setprio doesn't have much effect on
>> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task
>> change scheduler class when boosted in rt_mutex_setprio().
>
> i think via scheduling classes we dont have to do the p->policy and
> p->prio based gymnastics anymore, we can just have a clean look at
> p->sched_class and stack the original scheduling class into
> p->real_sched_class. It would probably also make sense to 'privatize'
> p->prio into the scheduling class. That way PI would be a pure property
> of sched_rt, and the PI scheduler would be driven purely by
> p->rt_priority, not by p->prio. That way all the normal_prio() kind of
> complications and interactions with SCHED_OTHER/SCHED_FAIR would be
> eliminated as well. What do you think?
>

Now, I have not read your patch in detail, but I agree it would be nice to 
have it more "OO" and to remove cross references between schedulers. But 
first one should consider whether PI between SCHED_FAIR tasks is useful 
at all. Does PI among dynamic priorities make sense? I think it
does: on heavily loaded systems where a nice 19 task might not get the CPU 
for a very long time, a nice -20 task can be priority inverted for a very 
long time.
But I see no need to take the dynamic part of the effective priorities 
into account. The current/old solution of mapping the static nice values 
into a global priority index which can incorporate the two scheduler 
classes is probably good enough - it just has to be "switched on" again 
:-)

But what about other scheduler classes which some people want to add in
the future? What about having a "cleaner design"?

My thought was to generalize the concept of 'priority' to be an
object (a struct prio) to be interpreted with help from a scheduler class 
instead of a globally interpreted integer.

int compare_prio(struct prio *a, struct prio *b)
{
	/* Different classes: the class-level priority decides. */
	if (a->sched_class->class_prio < b->sched_class->class_prio)
		return -1;

	if (a->sched_class->class_prio > b->sched_class->class_prio)
		return +1;

	/* Same class: let the class compare its own priorities. */
	return a->sched_class->compare_prio(a, b);
}

Problem 1: Performance.

Problem 2: Operations on a plist with these generalized priorities are not 
bounded because the number of different priorities is not bounded.

Problem 2 could be solved by using a combined plist (for rt priorities) 
and rbtree (for fair priorities) - making operations logarithmic just as 
the fair scheduler itself. But that would take more memory for every 
rtmutex.

I conclude that is too complicated and go on to the obvious idea:
Use a global priority index where each scheduler class gets its own 
range (rt: 0-99, fair 100-139 :-). Let the scheduler class have a 
function returning it instead of reading it directly from task_struct such
that new scheduler classes can return their own numbers.
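
A minimal sketch of that last idea follows. All names in it are
hypothetical (struct sched_class_sketch, global_prio(), task_global_prio(),
the toy struct task); none of it is from the CFS patch. It assumes the
usual convention that a lower index means a higher priority, and that
nice -20..19 maps onto 100..139.

```c
#include <assert.h>

#define MAX_RT_PRIO 100		/* rt range: below this; fair range: 100..139 */

struct task;

/* Hypothetical class hook: each class reports its tasks' global index. */
struct sched_class_sketch {
	int (*global_prio)(const struct task *p);
};

struct task {
	const struct sched_class_sketch *sched_class;
	int rt_priority;	/* 1..99, larger is more urgent (rt tasks) */
	int nice;		/* -20..19 (fair tasks) */
};

/* rt class: invert rt_priority so the most urgent task gets index 0. */
static int rt_global_prio(const struct task *p)
{
	assert(p->rt_priority >= 1 && p->rt_priority <= 99);
	return MAX_RT_PRIO - 1 - p->rt_priority;
}

/* fair class: static nice value only, no dynamic component. */
static int fair_global_prio(const struct task *p)
{
	assert(p->nice >= -20 && p->nice <= 19);
	return MAX_RT_PRIO + p->nice + 20;
}

static const struct sched_class_sketch rt_class = { rt_global_prio };
static const struct sched_class_sketch fair_class = { fair_global_prio };

/* PI code asks the class instead of reading a field in the task. */
static int task_global_prio(const struct task *p)
{
	return p->sched_class->global_prio(p);
}

/* Negative if a has higher priority than b, positive if lower. */
static int compare_prio(const struct task *a, const struct task *b)
{
	return task_global_prio(a) - task_global_prio(b);
}
```

A new scheduler class would only have to supply its own global_prio()
returning numbers in whatever range it is assigned; the comparison stays a
plain integer subtraction, which addresses both the performance and the
boundedness concerns above.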

Esben


> 	Ingo
>


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  6:40                   ` Peter Williams
@ 2007-04-16  7:32                     ` Ingo Molnar
  2007-04-16  8:54                       ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-16  7:32 UTC (permalink / raw)
  To: Peter Williams
  Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> One more quick comment.  The claim that there is no concept of time 
> slice in the new scheduler is only true in the sense of the rather 
> arcane implementation of time slices extant in the O(1) scheduler.

yeah. AFAIK most other mainstream OSs also still often use similarly 
'arcane' concepts (i'm here ignoring literature, you can find everything 
and its opposite suggested in literature) so i felt the need to point 
out the difference ;) After all Linux is about doing a better mainstream 
OS, it is not about beating the OS literature at lunacy ;-)

The precise statement would be: "there's no concept of giving out a 
time-slice to a task and sticking to it unless a higher-prio task comes 
along, nor is there a concept of having a low-res granularity 
->time_slice thing. There is accurate accounting of how much CPU time a 
task used up, and there is a granularity setting that together gives the 
current task a fairness advantage of a given amount of nanoseconds - 
which has similar [but not equivalent] effects to traditional timeslices 
that most mainstream OSs use".

> Your new parameter sched_granularity_ns is equivalent to the concept 
> of time slice in most other kernels that I've peeked inside and 
> computing literature in general (going back over several decades e.g. 
> the magic garden).

note that you can set it to 0 and the box still functions - so 
sched_granularity_ns, while useful for performance/bandwidth workloads, 
isnt truly inherent to the design.
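
The role of the granularity can be sketched roughly as follows. This is a
toy model built only from the description above (accurate per-task CPU
accounting plus a nanosecond "fairness advantage" for the running task); it
is not the code in the patch, and struct task_acct and should_preempt() are
hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task accounting: CPU time consumed, in nanoseconds. */
struct task_acct {
	uint64_t cpu_used_ns;
};

/*
 * Decide whether the waiting task that has received the least CPU time
 * should replace the current task. The current task keeps the CPU until
 * it is ahead by more than granularity_ns.
 */
static bool should_preempt(const struct task_acct *curr,
			   const struct task_acct *least_served,
			   uint64_t granularity_ns)
{
	/* Current task is not ahead at all: nothing to correct. */
	if (curr->cpu_used_ns <= least_served->cpu_used_ns)
		return false;

	return curr->cpu_used_ns - least_served->cpu_used_ns > granularity_ns;
}
```

With granularity_ns set to 0 the model still works: the current task is
simply replaced as soon as it is ahead of anyone by any amount, at the cost
of more frequent switching.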

So in the announcement i just opted for a short sentence: "there's no 
concept of timeslices", albeit like most short sentences it's not a 
technically 100% accurate statement - but still it conveyed the intended 
information more effectively to the interested lkml reader than the 
longer version could ever have =B-)

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  7:32                     ` Ingo Molnar
@ 2007-04-16  8:54                       ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-16  8:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>> One more quick comment.  The claim that there is no concept of time 
>> slice in the new scheduler is only true in the sense of the rather 
>> arcane implementation of time slices extant in the O(1) scheduler.
> 
> yeah. AFAIK most other mainstream OSs also still often use similarly 
> 'arcane' concepts (i'm here ignoring literature, you can find everything 
> and its opposite suggested in literature) so i felt the need to point 
> out the difference ;) After all Linux is about doing a better mainstream 
> OS, it is not about beating the OS literature at lunacy ;-)
> 
> The precise statement would be: "there's no concept of giving out a 
> time-slice to a task and sticking to it unless a higher-prio task comes 
> along,

I would have said "no concept of using time slices to implement nice" 
which always seemed strange to me.

If it really does what you just said then a (malicious or otherwise) CPU 
intensive task that never sleeps, once it got the CPU, would completely 
hog the CPU.

> nor is there a concept of having a low-res granularity 
> ->time_slice thing. There is accurate accounting of how much CPU time a 
> task used up, and there is a granularity setting that together gives the 
> current task a fairness advantage of a given amount of nanoseconds - 
> which has similar [but not equivalent] effects to traditional timeslices 
> that most mainstream OSs use".

Most traditional OSes have more or less fixed time slices and do the 
scheduling by fiddling the dynamic priority.

Using total CPU used will also come to grief when used for long running 
tasks.  Eventually, even very low bandwidth tasks will accumulate enough 
total CPU to look busy.  The CPU bandwidth the task is using is what 
needs to be controlled.

Or have I not looked closely enough at what sched_granularity_ns does? 
Is it really a control for the decay rate of a CPU usage bandwidth metric?

> 
>> Your new parameter sched_granularity_ns is equivalent to the concept 
>> of time slice in most other kernels that I've peeked inside and 
>> computing literature in general (going back over several decades e.g. 
>> the magic garden).
> 
> note that you can set it to 0 and the box still functions - so 
> sched_granularity_ns, while useful for performance/bandwidth workloads, 
> isnt truly inherent to the design.

Just like my SPA schedulers.  But if you set it to zero you'll get a 
fairly high context switch rate with associated overhead, won't you?

> 
> So in the announcement i just opted for a short sentence: "there's no 
> concept of timeslices", albeit like most short sentences it's not a 
> technically 100% accurate statement - but still it conveyed the intended 
> information more effectively to the interested lkml reader than the 
> longer version could ever have =B-)

I hope that I implied that I was being picky :-) (I meant to -- imply I 
was being picky, that is).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  9:06       ` Ingo Molnar
@ 2007-04-16 10:00         ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-16 10:00 UTC (permalink / raw)
  To: Bill Huey
  Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Peter Williams, Arjan van de Ven, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> guys, please calm down. Judging by the number of contributions to 
> sched.c the main folks who are not 'observers' here and who thus have 
> an unalienable right to be involved in a nasty flamewar about 
> scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else 
> is just a happy bystander, ok? ;-)

just to make sure: this is a short (and incomplete) list of contributors 
related to scheduler interactivity code. The full list of contributors 
to sched.c includes many other people as well: Peter, Suresh, Christoph, 
Kenneth and many others. Even the git logs, which only span 2 years out 
of 15, already list 79 individual contributors to kernel/sched.c.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  5:09               ` Peter Williams
@ 2007-04-16 11:04                 ` William Lee Irwin III
  2007-04-16 12:55                   ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-16 11:04 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
>> Driver models for scheduling are not so far out. AFAICS it's largely a
>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
>> switching out intra-queue policies vs. switching out whole-system
>> policies, SMP handling and all. Whether this involves load balancing
>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
>> scheduler module, for instance, would not have a load balancer at all,
>> as it has only one global runqueue. There are other sorts of policies
>> wanting significant changes to SMP handling vs. the stock load
>> balancing.

On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Well a single run queue removes the need for load balancing but has 
> scalability issues on large systems.  Personally, I think something in 
> between would be the best solution i.e. multiple run queues but more 
> than one CPU per run queue.  I think that this would be a particularly 
> good solution to the problems introduced by hyper threading and multi 
> core systems and also NUMA systems.  E.g. if all CPUs in a hyper thread 
> package are using the one queue then the case where one CPU is trying to 
> run a high priority task and the other a low priority task (i.e. the 
> cases that the sleeping dependent mechanism tried to address) is less 
> likely to occur.

This wasn't meant to sing the praises of the 2.4.x scheduler; it was
rather meant to point out that the 2.4.x scheduler, among others, is
unimplementable within the framework if it assumes per-cpu runqueues.
More plausibly useful single-queue schedulers would likely use a vastly
different policy and attempt to carry out all queue manipulations via
lockless operations.


On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> By the way, I think that it's a very bad idea for the scheduling 
> mechanism and the load balancing mechanism to be coupled.  The anomalies 
> that will be experienced and the attempts to make ad hoc fixes for them 
> will lead to complexity spiralling out of control.

This is clearly unavoidable in the case of gang scheduling. There is
simply no other way to schedule N tasks which must all be run
simultaneously when they run at all on N cpus of the system without
such coupling and furthermore at an extremely intimate level,
particularly when multiple competing gangs must be scheduled in such
a fashion.


William Lee Irwin III wrote:
>> Where I had a significant need for
>> mucking with the entire concept of how SMP was handled, this is rather
>> different.

On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Yes, I went with the idea of intra run queue scheduling being orthogonal 
> to load balancing for two reasons:
> 1. I think that coupling them is a bad idea from the complexity POV, and
> 2. it's enough of a battle fighting for modifications to one bit of the 
> code without trying to do it to two simultaneously.

As nice as that sounds, such a code structure would've precluded the
entire raison d'etre of the patch, i.e. gang scheduling.


William Lee Irwin III wrote:
>> At this point I'm questioning the relevance of my own work,
>> though it was already relatively marginal as it started life as an
>> attempt at a sort of debug patch to help gang scheduling (which is in
>> itself a rather marginally relevant feature to most users) code along.

On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> The main commercial plug in scheduler used with the run time loadable 
> module scheduler that I mentioned earlier did gang scheduling (at the 
> insistence of the Tru64 kernel folks).  As this scheduler was a 
> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" 
> ("unfairly" really, according to an allocation policy) among higher 
> level entities such as users, groups and applications as well as 
> processes; it was fairly easy to make it a gang scheduler by modifying 
> it to give all of a process's threads the same priority based on the 
> process's CPU usage rather than different priorities based on the 
> threads' usage rates.  In fact, it would have been possible to select 
> between gang and non gang on a per process basis if that was considered 
> desirable.
> The fact that threads and processes are distinct entities on Tru64 and 
> Solaris made this easier to do on them than on Linux.
> My experience with this scheduler leads me to believe that to achieve 
> gang scheduling and fairness, etc. you need (usage) statistics based 
> schedulers.

This does not appear to make sense unless it's based on an incorrect
use of the term "gang scheduling." I'm referring to a gang as a set of
tasks (typically restricted to threads of the same process) which must
all be considered runnable or unrunnable simultaneously, and are for
the sake of performance required to all actually be run simultaneously.
This means a gang of N threads, when run, must run on N processors at
once. A time and a set of processors must be chosen for any time
interval where the gang is running. This interacts with load balancing
by needing to choose the cpus to run the gang on, and also arranging
for a set of cpus available for the gang to use to exist by means of
migrating tasks off the chosen cpus.
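
The interaction with load balancing can be shown with a toy placement
routine. It is not code from any scheduler discussed in this thread:
gang_pick_cpus(), MAX_CPUS and the nr_running array are hypothetical, and
"least loaded cpus first" is only one assumed policy for choosing where a
gang runs.

```c
#include <stddef.h>

#define MAX_CPUS 64

/*
 * Choose gang_size distinct cpus for a gang, preferring the least loaded.
 * nr_running[i] is the number of runnable tasks currently on cpu i.
 * The chosen cpu indices are written to chosen[0..gang_size-1].
 * Returns the number of tasks that must be migrated off the chosen cpus
 * before the gang can run, or -1 if the gang cannot fit on the system.
 */
static int gang_pick_cpus(const int *nr_running, size_t nr_cpus,
			  size_t gang_size, size_t *chosen)
{
	unsigned char taken[MAX_CPUS] = { 0 };
	int to_migrate = 0;

	if (nr_cpus > MAX_CPUS || gang_size > nr_cpus)
		return -1;

	for (size_t k = 0; k < gang_size; k++) {
		size_t best = nr_cpus;	/* sentinel: nothing picked yet */

		for (size_t i = 0; i < nr_cpus; i++) {
			if (taken[i])
				continue;
			if (best == nr_cpus || nr_running[i] < nr_running[best])
				best = i;
		}

		taken[best] = 1;
		chosen[k] = best;
		to_migrate += nr_running[best];	/* these tasks must move */
	}

	return to_migrate;
}
```

Even in this simplified form the placement decision and the migration
decision come out of the same loop, which is the coupling being argued for
here.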


William Lee Irwin III wrote:
>> Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are
>> better ones. I'd not bother citing kernel compile results.

On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> spa_svr actually does its best work when the system isn't fully loaded 
> as the type of improvement it strives to achieve (minimizing on queue 
> wait time) hasn't got much room to manoeuvre when the system is fully 
> loaded.  Therefore, the fact that it's 1% better even in these 
> circumstances is a good result and also indicates that the overhead for 
> keeping the scheduling statistics it uses for its decision making is 
> well spent.  Especially, when you consider that the total available room 
> for improvement on this benchmark is less than 3%.

None of these benchmarks require the system to be fully loaded. They
are, on the other hand, vastly more realistic simulated workloads than
kernel compiles, and furthermore are actually developed as benchmarks,
with in some cases even measurements of variance, iteration to
convergence, and similar such things that make them actually scientific.


On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> To elaborate, the motivation for this scheduler was acquired from the 
> observation of scheduling statistics (in particular, on queue wait time) 
> on systems running at about 30% to 50% load.  Theoretically, at these 
> load levels there should be no such waiting but the statistics show that 
> there is considerable waiting (sometimes as high as 30% to 50%).  I put 
> this down to "lack of serendipity" e.g.  everyone sleeping at the same 
> time and then trying to run at the same time would be complete lack of 
> serendipity.  On the other hand, if everyone is synced then there would 
> be total serendipity.
> Obviously, from the POV of a client, time the server task spends waiting 
> on the queue adds to the response time for any request that has been 
> made so reduction of this time on a server is a good thing(tm).  Equally 
> obviously, trying to achieve this synchronization by asking the tasks to 
> cooperate with each other is not a feasible solution and some external 
> influence needs to be exerted and this is what spa_svr does -- it nudges 
> the scheduling order of the tasks in a way that makes them become well 
> synced.

This all sounds like a relatively good idea. So it's good for workloads
that favor throughput over latency or are otherwise not particularly
interactive. No big deal, just use it where it makes sense.
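
The "lack of serendipity" effect quoted above can be reproduced with a toy
model: one cpu, FIFO order, and a handful of tasks that each need a short
burst once per period. The function name and the numbers below are
assumptions chosen for illustration; this is not how spa_svr itself works,
only a sketch of why queue wait can be large at 40% load.

```python
def total_queue_wait(wake_times, run_time):
    """Total time tasks spend runnable but not running on one cpu (FIFO)."""
    cpu_free_at = 0
    waited = 0
    for wake in sorted(wake_times):
        start = max(wake, cpu_free_at)   # wait if the cpu is still busy
        waited += start - wake
        cpu_free_at = start + run_time
    return waited


PERIOD = 100    # ms between wakeups of each task
RUN_TIME = 10   # ms of cpu needed per wakeup
TASKS = 4       # 4 * 10 / 100 = 40% load

# Worst case: every task wakes at the same instant.
in_phase = [0] * TASKS
# Best case: wakeups are spread evenly across the period.
interleaved = [i * (PERIOD // TASKS) for i in range(TASKS)]

print(total_queue_wait(in_phase, RUN_TIME))      # 0 + 10 + 20 + 30 = 60
print(total_queue_wait(interleaved, RUN_TIME))   # 0
```

With identical load, the in-phase case spends 60 ms waiting for 40 ms of
work, while the interleaved case never waits at all.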


On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> Unfortunately, this is not a good scheduler for an interactive system as 
> it minimizes the response times for ALL tasks (and the system as a 
> whole) and this can result in increased response time for some 
> interactive tasks (clunkiness) which annoys interactive users.  When you 
> start fiddling with this scheduler to bring back "interactive 
> unfairness" you kill a lot of its superior low overall wait time 
> performance.
> So this is why I think "horses for courses" schedulers are worthwhile.

I have no particular objection to using an appropriate scheduler for the
system's workload. I also have little or no preference as to how that's
accomplished overall. But I really think that if we want to push
pluggable scheduling it should load schedulers as kernel modules on the
fly and so on versus pure /proc/ tunables and a compiled-in set of
alternatives.


William Lee Irwin III wrote:
>> In any event, I'm not sure what to say about different schedulers for
>> different aims. My intentions with plugsched were not centered around
>> production usage or intra-queue policy. I'm relatively indifferent to
>> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
>> in mainline. I don't see any particular harm in it, but neither am I
>> particularly motivated to have it in.

On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
> If you look at the struct sched_spa_child in the file 
> include/linux/sched_spa.h you'll see that the interface for switching 
> between the various SPA schedulers is quite small and making them 
> runtime switchable would be easy (I haven't done this in cpusched as I 
> wanted to keep the same interface for switching schedulers for all 
> schedulers: i.e. all run time switchable or none run time switchable; as 
> the main aim of plugsched had become a mechanism for evaluating 
> different intra queue scheduling designs.)

I remember actually looking at this, and I would almost characterize
the differences between the SPA schedulers as a tunable parameter. I
have a different concept of what pluggability means from how the SPA
schedulers were switched, but no particular objection to the method
given the commonalities between them.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 23:54                                       ` William Lee Irwin III
@ 2007-04-16 11:24                                         ` Ingo Molnar
  2007-04-16 13:46                                           ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-16 11:24 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote:
> > Oh I was very much testing "CPU bandwidth allocation as influenced by 
> > nice numbers" - it's one of the basic things i do when modifying the 
> > scheduler. An automated tool, while nice (all automation is nice) 
> > wouldn't necessarily show such bugs though, because here too it needed 
> > thousands of running tasks to trigger in practice. Any volunteers? ;)
> 
> Worse comes to worse I might actually get around to doing it myself. 
> Any more detailed descriptions of the test for a rainy day?

the main complication here is that the handling of nice levels is still 
typically a 2nd or 3rd degree design factor when writing schedulers. The 
reason isn't carelessness; the reason is simply that users typically only 
care about a single nice level: the one that all tasks run under by 
default.

Also, often there's just one or two good ways to attack the problem 
within a given scheduler approach and the quality of nice levels often 
suffers under other, more important design factors like performance.

This means that for example for the vanilla scheduler the distribution 
of CPU power depends on HZ, on the number of tasks and on the scheduling 
pattern. The distribution of CPU power amongst nice levels is basically 
a function of _everything_. That makes any automated test pretty 
challenging. Both with SD and with CFS there's a good chance to actually 
formalize the meaning of nice levels, but i'd not go as far as to 
mandate any particular behavior by rigidly saying "pass this automated 
tool, else ...", other than "make nice levels reasonable". All the other 
more formal CPU resource limitation techniques are then a matter of 
CKRM-alike patches, which offer much more fine-grained mechanisms than 
pure nice levels anyway.

so to answer your question: it's pretty much freely defined. Make up 
your mind about it and figure out the ways how people use nice levels 
and think about which aspects of that experience are worth testing for 
intelligently.

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 22:49 ` Ismail Dönmez
  2007-04-15 23:23   ` Arjan van de Ven
@ 2007-04-16 11:58   ` Ingo Molnar
  2007-04-16 12:02     ` Ismail Dönmez
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-16 11:58 UTC (permalink / raw)
  To: Ismail Dönmez
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Ismail Dönmez <ismail@pardus.org.tr> wrote:

> Tested this on top of Linus' GIT tree but the system gets very 
> unresponsive during high disk i/o using ext3 as filesystem but even 
> writing a 300mb file to a usb disk (iPod actually) has the same 
> effect.

hm. Is this an SMP system+kernel by any chance?

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 11:58   ` Ingo Molnar
@ 2007-04-16 12:02     ` Ismail Dönmez
  0 siblings, 0 replies; 712+ messages in thread
From: Ismail Dönmez @ 2007-04-16 12:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Monday 16 April 2007 14:58:54 Ingo Molnar wrote:
> * Ismail Dönmez <ismail@pardus.org.tr> wrote:
> > Tested this on top of Linus' GIT tree but the system gets very
> > unresponsive during high disk i/o using ext3 as filesystem but even
> > writing a 300mb file to a usb disk (iPod actually) has the same
> > effect.
>
> hm. Is this an SMP system+kernel by any chance?

Nope, the kernel and the system are both UP.

Regards,
ismail


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 11:04                 ` William Lee Irwin III
@ 2007-04-16 12:55                   ` Peter Williams
  2007-04-16 23:10                     ` Michael K. Edwards
       [not found]                     ` <20070416135915.GK8915@holomorphy.com>
  0 siblings, 2 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-16 12:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Driver models for scheduling are not so far out. AFAICS it's largely a
>>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and
>>> switching out intra-queue policies vs. switching out whole-system
>>> policies, SMP handling and all. Whether this involves load balancing
>>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x
>>> scheduler module, for instance, would not have a load balancer at all,
>>> as it has only one global runqueue. There are other sorts of policies
>>> wanting significant changes to SMP handling vs. the stock load
>>> balancing.
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Well a single run queue removes the need for load balancing but has 
>> scalability issues on large systems.  Personally, I think something in 
>> between would be the best solution i.e. multiple run queues but more 
>> than one CPU per run queue.  I think that this would be a particularly 
>> good solution to the problems introduced by hyper threading and multi 
>> core systems and also NUMA systems.  E.g. if all CPUs in a hyper thread 
>> package are using the one queue then the case where one CPU is trying to 
>> run a high priority task and the other a low priority task (i.e. the 
>> cases that the sleeping dependent mechanism tried to address) is less 
>> likely to occur.
> 
> This wasn't meant to sing the praises of the 2.4.x scheduler; it was
> rather meant to point out that the 2.4.x scheduler, among others, is
> unimplementable within the framework if it assumes per-cpu runqueues.
> More plausibly useful single-queue schedulers would likely use a vastly
> different policy and attempt to carry out all queue manipulations via
> lockless operations.
> 
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> By the way, I think that it's a very bad idea for the scheduling 
>> mechanism and the load balancing mechanism to be coupled.  The anomalies 
>> that will be experienced and the attempts to make ad hoc fixes for them 
>> will lead to complexity spiralling out of control.
> 
> This is clearly unavoidable in the case of gang scheduling. There is
> simply no other way to schedule N tasks which must all be run
> simultaneously when they run at all on N cpus of the system without
> such coupling and furthermore at an extremely intimate level,
> particularly when multiple competing gangs must be scheduled in such
> a fashion.

I can't see the logic here or why you would want to do such a thing.  It 
certainly doesn't coincide with what I interpret "gang scheduling" to mean.

> 
> 
> William Lee Irwin III wrote:
>>> Where I had a significant need for
>>> mucking with the entire concept of how SMP was handled, this is rather
>>> different.
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Yes, I went with the idea of intra run queue scheduling being orthogonal 
>> to load balancing for two reasons:
>> 1. I think that coupling them is a bad idea from the complexity POV, and
>> 2. it's enough of a battle fighting for modifications to one bit of the 
>> code without trying to do it to two simultaneously.
> 
> As nice as that sounds, such a code structure would've precluded the
> entire raison d'etre of the patch, i.e. gang scheduling.

Not for what I understand "gang scheduling" to mean.  As I understand it 
the constraints of gang scheduling are nowhere near as strict as you 
seem to think they are.  And for what it's worth I don't think that what 
you think it means is in any sense a reasonable target.

> 
> 
> William Lee Irwin III wrote:
>>> At this point I'm questioning the relevance of my own work,
>>> though it was already relatively marginal as it started life as an
>>> attempt at a sort of debug patch to help gang scheduling (which is in
>>> itself a rather marginally relevant feature to most users) code along.
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> The main commercial plug in scheduler used with the run time loadable 
>> module scheduler that I mentioned earlier did gang scheduling (at the 
>> insistence of the Tru64 kernel folks).  As this scheduler was a 
>> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" 
>> (or really "unfairly", according to an allocation policy) among higher 
>> level entities such as users, groups and applications as well as 
>> processes; it was fairly easy to make it a gang scheduler by modifying 
>> it to give all of a process's threads the same priority based on the 
>> process's CPU usage rather than different priorities based on the 
>> threads' usage rates.  In fact, it would have been possible to select 
>> between gang and non gang on a per process basis if that was considered 
>> desirable.
>> The fact that threads and processes are distinct entities on Tru64 and 
>> Solaris made this easier to do on them than on Linux.
>> My experience with this scheduler leads me to believe that to achieve 
>> gang scheduling and fairness, etc. you need (usage) statistics based 
>> schedulers.
> 
> This does not appear to make sense unless it's based on an incorrect
> use of the term "gang scheduling."

It's become obvious that we mean different things.

> I'm referring to a gang as a set of
> tasks (typically restricted to threads of the same process) which must
> all be considered runnable or unrunnable simultaneously, and are for
> the sake of performance required to all actually be run simultaneously.
> This means a gang of N threads, when run, must run on N processors at
> once. A time and a set of processors must be chosen for any time
> interval where the gang is running. This interacts with load balancing
> by needing to choose the cpus to run the gang on, and by needing to
> free those cpus for the gang, which means migrating other tasks off
> the chosen cpus.

Sounds like a job for the load balancer NOT the scheduler.

Also I can't see you meeting such strict constraints without making the 
tasks all SCHED_FIFO.

> 
> 
> William Lee Irwin III wrote:
>>> Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are
>>> better ones. I'd not bother citing kernel compile results.
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> spa_svr actually does its best work when the system isn't fully loaded 
>> as the type of improvement it strives to achieve (minimizing on queue 
>> wait time) hasn't got much room to manoeuvre when the system is fully 
>> loaded.  Therefore, the fact that it's 1% better even in these 
>> circumstances is a good result and also indicates that the overhead for 
>> keeping the scheduling statistics it uses for its decision making is 
>> well spent.  Especially, when you consider that the total available room 
>> for improvement on this benchmark is less than 3%.
> 
> None of these benchmarks require the system to be fully loaded. They
> are, on the other hand, vastly more realistic simulated workloads than
> kernel compiles, and furthermore are actually developed as benchmarks,
> with in some cases even measurements of variance, iteration to
> convergence, and similar such things that make them actually scientific.
> 
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> To elaborate, the motivation for this scheduler was acquired from the 
>> observation of scheduling statistics (in particular, on queue wait time) 
>> on systems running at about 30% to 50% load.  Theoretically, at these 
>> load levels there should be no such waiting but the statistics show that 
>> there is considerable waiting (sometimes as high as 30% to 50%).  I put 
>> this down to "lack of serendipity" e.g.  everyone sleeping at the same 
>> time and then trying to run at the same time would be complete lack of 
>> serendipity.  On the other hand, if everyone is synced then there would 
>> be total serendipity.
>> Obviously, from the POV of a client, time the server task spends waiting 
>> on the queue adds to the response time for any request that has been 
>> made so reduction of this time on a server is a good thing(tm).  Equally 
>> obviously, trying to achieve this synchronization by asking the tasks to 
>> cooperate with each other is not a feasible solution and some external 
>> influence needs to be exerted and this is what spa_svr does -- it nudges 
>> the scheduling order of the tasks in a way that makes them become well 
>> synced.
> 
> This all sounds like a relatively good idea. So it's good for workloads
> that favor throughput over latency or are otherwise not particularly
> interactive. No big deal, just use it where it makes sense.
> 
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> Unfortunately, this is not a good scheduler for an interactive system as 
>> it minimizes the response times for ALL tasks (and the system as a 
>> whole) and this can result in increased response time for some 
>> interactive tasks (clunkiness) which annoys interactive users.  When you 
>> start fiddling with this scheduler to bring back "interactive 
>> unfairness" you kill a lot of its superior low overall wait time 
>> performance.
>> So this is why I think "horses for courses" schedulers are worthwhile.
> 
> I have no particular objection to using an appropriate scheduler for the
> system's workload. I also have little or no preference as to how that's
> accomplished overall. But I really think that if we want to push
> pluggable scheduling it should load schedulers as kernel modules on the
> fly and so on versus pure /proc/ tunables and a compiled-in set of
> alternatives.
> 
> 
> William Lee Irwin III wrote:
>>> In any event, I'm not sure what to say about different schedulers for
>>> different aims. My intentions with plugsched were not centered around
>>> production usage or intra-queue policy. I'm relatively indifferent to
>>> the notion of having pluggable CPU schedulers, intra-queue or otherwise,
>>> in mainline. I don't see any particular harm in it, but neither am I
>>> particularly motivated to have it in.
> 
> On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote:
>> If you look at the struct sched_spa_child in the file 
>> include/linux/sched_spa.h you'll see that the interface for switching 
>> between the various SPA schedulers is quite small and making them 
>> runtime switchable would be easy (I haven't done this in cpusched as I 
>> wanted to keep the same interface for switching schedulers for all 
>> schedulers: i.e. all run time switchable or none run time switchable; as 
>> the main aim of plugsched had become a mechanism for evaluating 
>> different intra queue scheduling designs.)
> 
> I remember actually looking at this, and I would almost characterize
> the differences between the SPA schedulers as a tunable parameter. I
> have a different concept of what pluggability means from how the SPA
> schedulers were switched, but no particular objection to the method
> given the commonalities between them.

Yes, that's the way I look at them (in fact, in Zaphod that's exactly 
what they were -- i.e. Zaphod could be made to behave like various SPA 
schedulers by fiddling its run time parameters).  They illustrate (to my 
mind) that once you get rid of the O(1) scheduler and replace it with a 
simple mechanism such as SPA (where there's a small number of points 
where the scheduling discipline gets to do its thing rather than being 
interspersed willy nilly throughout the rest of the code) adding run 
time switchable "horses for courses" scheduler disciplines becomes 
simple.  I think that the simplifications in Ingo's new scheduler (whose 
scheduling classes now look a lot like Solaris's and its predecessor 
OSes' scheduler classes) may make it possible to have switchable 
scheduling disciplines within a scheduling class.

I think that something similar (i.e. switchability) could be done for 
load balancing so that different load balancers could be used when 
required.  By keeping this load balancing functionality orthogonal to 
the intra run queue scheduling disciplines you increase the number of 
options available.

As I see it, if the scheduling discipline in use does its job properly 
within a run queue and the load balancer does its job of keeping the 
weighted load/demand on each run queue roughly equal (except where it 
has to do otherwise for your version of "gang scheduling") then the 
overall outcome will meet expectations.  Note that I talk of run queues 
not CPUs as I think a shift to multiple CPUs per run queue may be a good 
idea.
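
The division of labour described above can be sketched as a balancer that
only looks at the weighted load of each run queue and knows nothing about
the discipline used inside it. This is a simplified sketch under stated
assumptions: `balance` is a hypothetical name, each inner list stands for
one run queue (which may serve several cpus), task weights are positive
numbers, and all queues serve the same number of cpus.

```python
def balance(queues):
    """Even out weighted load across run queues; returns migrations made.

    queues: list of lists of positive task weights, mutated in place.
    Repeatedly moves the lightest task from the most loaded queue to the
    least loaded one, but only while that move narrows the gap between
    them, so the loop always terminates.
    """
    moves = 0
    while True:
        loads = [sum(q) for q in queues]
        src = max(range(len(queues)), key=loads.__getitem__)
        dst = min(range(len(queues)), key=loads.__getitem__)
        gap = loads[src] - loads[dst]
        if not queues[src]:
            return moves
        weight = min(queues[src])
        if weight >= gap:
            # Moving even the lightest task would not improve things.
            return moves
        queues[src].remove(weight)
        queues[dst].append(weight)
        moves += 1
```

For example, starting from `[[3, 3, 3, 3], [1], [2]]` the loads go from
12/1/2 to 6/4/5 in two migrations and then stop, since moving another
weight-3 task would not narrow the remaining gap of 2.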

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 11:24                                         ` Ingo Molnar
@ 2007-04-16 13:46                                           ` William Lee Irwin III
  0 siblings, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-16 13:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Worse comes to worse I might actually get around to doing it myself. 
>> Any more detailed descriptions of the test for a rainy day?

On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> the main complication here is that the handling of nice levels is still 
> typically a 2nd or 3rd degree design factor when writing schedulers. The 
> reason isn't carelessness; the reason is simply that users typically only 
> care about a single nice level: the one that all tasks run under by 
> default.

I'm a bit unconvinced here. Support for prioritization is a major
scheduler feature IMHO.


On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> Also, often there's just one or two good ways to attack the problem 
> within a given scheduler approach and the quality of nice levels often 
> suffers under other, more important design factors like performance.
> This means that for example for the vanilla scheduler the distribution 
> of CPU power depends on HZ, on the number of tasks and on the scheduling 
> pattern. The distribution of CPU power amongst nice levels is basically 
> a function of _everything_. That makes any automated test pretty 
> challenging. Both with SD and with CFS there's a good chance to actually 
> formalize the meaning of nice levels, but i'd not go as far as to 
> mandate any particular behavior by rigidly saying "pass this automated 
> tool, else ...", other than "make nice levels reasonable". All the other 
> more formal CPU resource limitation techniques are then a matter of 
> CKRM-alike patches, which offer much more fine-grained mechanisms than 
> pure nice levels anyway.

Some of the issues with respect to the number of tasks and scheduling
patterns can be made part of the test; one can furthermore insist that
the system be quiescent in a variety of ways. I'm not convinced that
formalization of nice levels is a bad idea. They're the standard UNIX
prioritization facility, and it should work with some definite value
of "work."

Even supposing one doesn't care to bolt down the semantics of nice
levels, there should at least be some awareness of what those semantics
are and when and how they're changing. So in that respect a test for
CPU bandwidth distribution according to nice level remains valuable
even supposing that the semantics aren't required to be rigidly fixed.
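
One way to get that awareness without fixing the semantics is to record the
observed CPU shares per nice level on a known-good kernel and flag drift on
later ones. The sketch below covers only the comparison step; the function
names, the input format (cpu seconds consumed per nice level during a run)
and the 2% tolerance are assumptions made for illustration.

```python
def cpu_shares(cpu_seconds):
    """Convert {nice: cpu seconds consumed} into {nice: fraction of total}."""
    total = sum(cpu_seconds.values())
    if total <= 0:
        raise ValueError("no cpu time recorded")
    return {nice: t / total for nice, t in cpu_seconds.items()}


def drifted_levels(baseline, observed, tolerance=0.02):
    """Return the nice levels whose share moved by more than `tolerance`.

    baseline, observed: {nice: cpu seconds} from two runs of the same
    workload, e.g. on two kernel versions. The tolerance is absolute,
    so 0.02 means two percentage points of total cpu time.
    """
    base = cpu_shares(baseline)
    obs = cpu_shares(observed)
    return sorted(n for n in base if abs(obs.get(n, 0.0) - base[n]) > tolerance)
```

A run where nice 0 keeps 60% but nice 5 drops from 30% to 20% and nice 10
rises from 10% to 20% would report levels 5 and 10, without the test itself
stating what the "right" shares are.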

As far as CKRM goes, I'm wild about it. I wish things would get in
shape to be merged (if they're not already) and merged ASAP on that
front. I think with so much agreement in concept we can work with
changing out implementations as-needed with it sitting in mainline once
the user API/ABI is decided upon, and I think it already is.

I'm not entirely convinced CKRM answers this, though. If the scheduler
can't support nice levels, how is it supposed to support prioritization
or CPU bandwidth allocation according to CKRM configurations? I'm
relatively certain schedulers must be able to support prioritization
with deterministic CPU bandwidth as essential functionality. This is,
of course, not to say my certainty about things sets the standards for
what testcases are considered meaningful and valid.


On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote:
> so to answer your question: it's pretty much freely defined. Make up 
> your mind about it and figure out the ways how people use nice levels 
> and think about which aspects of that experience are worth testing for 
> intelligently.

Looking for usage cases is a good idea; I'll do that before coding any
testcase for nice semantics.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  3:03           ` Nick Piggin
@ 2007-04-16 14:28             ` Matt Mackall
  2007-04-17  3:31               ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Matt Mackall @ 2007-04-16 14:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> I'd prefer if we kept a single CPU scheduler in mainline, because I
> think that simplifies analysis and focuses testing.

I think you'll find something like 80-90% of the testing will be done
on the default choice, even if other choices exist. So you really
won't have much of a problem here.

But when the only choice for other schedulers is to go out-of-tree,
then only 1% of the people will try it out and those people are
guaranteed to be the ones who saw scheduling problems in mainline.
So the alternative won't end up getting any testing on many of the
workloads that work fine in mainstream so their feedback won't tell
you very much at all.

-- 
Mathematics is the supreme nostalgia of our time.

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 21:31         ` Matt Mackall
  2007-04-16  3:03           ` Nick Piggin
@ 2007-04-16 15:45           ` William Lee Irwin III
  1 sibling, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-16 15:45 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote:
> That's irrelevant. Plugsched was an attempt to get alternative
> schedulers exposure in mainline. I know, because I remember
> encouraging Bill to pursue it. Not only did you veto plugsched (which
> may have been a perfectly reasonable thing to do), but you also vetoed
> the whole concept of multiple schedulers in the tree too. "We don't
> want to balkanize the scheduling landscape".
> And that latter part is what I'm claiming has set us back for years.
> It's not a technical argument but a strategic one. And it's just not a
> good strategy.
[... excellent post trimmed...]

These are some rather powerful arguments. I think I'll actually start
looking at plugsched again.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:26             ` William Lee Irwin III
@ 2007-04-16 15:55               ` Chris Friesen
  2007-04-16 16:13                 ` William Lee Irwin III
                                   ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Chris Friesen @ 2007-04-16 15:55 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar,
	Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:

> The sorts of explicit decisions I'd like to be made for these are:
> (1) In a mixture of tasks with varying nice numbers, a given nice number
> 	corresponds to some share of CPU bandwidth. Implementations
> 	should not have the freedom to change this arbitrarily according
> 	to some intention.

The first question that comes to my mind is whether nice levels should 
be linear or not.  I would lean towards nonlinear as it allows a wider 
range (although of course at the expense of precision).  Maybe something 
like "each nice level gives X times the cpu of the previous"?  I think a 
value of X somewhere between 1.15 and 1.25 might be reasonable.
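
To make the nonlinear proposal concrete, here is a small worked sketch. It
assumes that "previous" means the next nicer level, so each step toward
nice 19 divides a task's weight by X; the function names and the default
X = 1.25 are illustrative choices, not an existing interface.

```python
def nice_weight(nice, ratio=1.25):
    """Relative cpu weight of a task; nice 0 has weight 1.0."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in the range -20..19")
    return ratio ** (-nice)


def shares(nice_levels, ratio=1.25):
    """Fraction of the cpu each runnable task gets, in input order."""
    weights = [nice_weight(n, ratio) for n in nice_levels]
    total = sum(weights)
    return [w / total for w in weights]


# Two busy tasks, one at nice 0 and one at nice 1:
# weights 1.0 and 0.8, so roughly 55.6% versus 44.4% of the cpu.
print(shares([0, 1]))
```

Over the full range of 39 steps from nice -20 to nice 19, X = 1.15 gives a
ratio of roughly 230:1 between the extremes and X = 1.25 gives roughly
6000:1, which is the wider range (and coarser precision) mentioned above.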

What about also having something that looks at latency, and how latency 
changes with niceness?

What about specifying the timeframe over which the cpu bandwidth is 
measured?  I currently have a system where the application designers 
would like it to be totally fair over a period of 1 second.  As you can 
imagine, mainline doesn't do very well in this case.

Chris




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 15:55               ` Chris Friesen
@ 2007-04-16 16:13                 ` William Lee Irwin III
  2007-04-17  0:04                 ` Peter Williams
  2007-04-17 13:07                 ` James Bruce
  2 siblings, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-16 16:13 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar,
	Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>>	corresponds to some share of CPU bandwidth. Implementations
>>	should not have the freedom to change this arbitrarily according
>>	to some intention.

On Mon, Apr 16, 2007 at 09:55:14AM -0600, Chris Friesen wrote:
> The first question that comes to my mind is whether nice levels should 
> be linear or not.  I would lean towards nonlinear as it allows a wider 
> range (although of course at the expense of precision).  Maybe something 
> like "each nice level gives X times the cpu of the previous"?  I think a 
> value of X somewhere between 1.15 and 1.25 might be reasonable.
> What about also having something that looks at latency, and how latency 
> changes with niceness?
> What about specifying the timeframe over which the cpu bandwidth is 
> measured?  I currently have a system where the application designers 
> would like it to be totally fair over a period of 1 second.  As you can 
> imagine, mainline doesn't do very well in this case.

It's unclear how latency enters the picture, as the semantics of nice
levels relevant to latency are essentially priority preemption, which is
not particularly easy to get wrong. I suppose tests to ensure priority
preemption occurs properly are in order.

I don't really have a preference regarding specific semantics for nice
numbers, just that they should be deterministic and specified somewhere.
It's not really for us to decide what those semantics are as it's more
of a userspace ABI/API issue.

The timeframe is also relevant, but I suspect it's more of a performance
metric than a strict requirement.
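
For illustration, the geometric rule Chris suggests in the quoted text
("each nice level gives X times the cpu of the previous") could be pinned
down along the following lines. This is only a sketch under stated
assumptions: the function names nice_weight() and cpu_share() are
hypothetical, the nice range is taken as -20..19, and nothing here is
taken from any posted scheduler patch.

```c
#include <assert.h>

/*
 * Hypothetical: weight of a task relative to a nice == 0 task when each
 * nice level gets `x` times the CPU of the next, less favoured level.
 */
static double nice_weight(int nice, double x)
{
	double w = 1.0;
	int i;

	assert(nice >= -20 && nice <= 19);
	assert(x > 1.0);

	for (i = 0; i < nice; i++)	/* positive nice: less CPU */
		w /= x;
	for (i = 0; i > nice; i--)	/* negative nice: more CPU */
		w *= x;
	return w;
}

/* Share of CPU bandwidth that task `which` gets among n runnable tasks. */
static double cpu_share(const int *nice, int n, int which, double x)
{
	double total = 0.0;
	int i;

	for (i = 0; i < n; i++)
		total += nice_weight(nice[i], x);
	return nice_weight(nice[which], x) / total;
}
```

With a fixed X, the share of any task in any mixture follows from the
nice values alone, which is the kind of deterministic, written-down
semantics being asked for. The timeframe over which the share is measured
is a separate question that this sketch does not address.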


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  1:06           ` Peter Williams
  2007-04-16  3:04             ` William Lee Irwin III
@ 2007-04-16 17:22             ` Chris Friesen
  2007-04-17  0:54               ` Peter Williams
  1 sibling, 1 reply; 712+ messages in thread
From: Chris Friesen @ 2007-04-16 17:22 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:

> To my mind scheduling 
> and load balancing are orthogonal and keeping them that way simplifies 
> things.

Scuse me if I jump in here, but doesn't the load balancer need some way 
to figure out a) when to run, and b) which tasks to pull and where to 
push them?

I suppose you could abstract this into a per-scheduler API, but to me at 
least these are the hard parts of the load balancer...

Chris

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 22:00 ` Andi Kleen
@ 2007-04-16 21:05   ` Ingo Molnar
  2007-04-16 21:21     ` Andi Kleen
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-16 21:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel


* Andi Kleen <andi@firstfloor.org> wrote:

> > i'm pleased to announce the first release of the "Modular Scheduler 
> > Core and Completely Fair Scheduler [CFS]" patchset:
> > 
> >    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 
> I would suggest dropping the tsc.c change. The "small errors" can be 
> really large on some systems and you can also see large backward 
> jumps.

actually, i designed the CFS code assuming a per-CPU TSC (with no global 
synchronization), not assuming any globally synchronized TSC. In fact i 
wrote it on such systems: a CoreDuo2 box that stops the TSC in C3 and 
whose cores have wildly different TSC values, and a dual-core Athlon64 
whose TSC drifts quickly. So i'll keep the sched_clock() change for now.

> BTW with all this CPU time measurement it would be really nice to 
> report it to the user too. It seems a bit bizarre that the scheduler 
> keeps track of ns, but top only knows jiffies with large sampling 
> errors.

yeah - i'll fix that too if someone doesn't beat me to it.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 21:05   ` Ingo Molnar
@ 2007-04-16 21:21     ` Andi Kleen
  0 siblings, 0 replies; 712+ messages in thread
From: Andi Kleen @ 2007-04-16 21:21 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, linux-kernel

> actually, i designed the CFS code assuming a per-CPU TSC (with no global 
> synchronization), not assuming any globally sync TSC. In fact i wrote it 

That already worked in the old scheduler (just in a hackish way)

> on such systems: a CoreDuo2 box that stops the TSC in C3 and the 
> different cores have wildly different TSC values and a dual-core 
> Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() 
> change for now.

The problem is not TSC synchronization between CPUs, but a TSC whose
frequency varies on a single CPU, as on the A64.

The old implementation can lose really badly on that because it mixes
measurements at different frequencies together without individual scaling.

The error gets worse the longer the system runs.
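
The "individual scaling" mentioned above can be illustrated with a small
sketch: each interval's cycle count is converted to nanoseconds using the
frequency that was in effect during that interval, and only the scaled
result is accumulated. The struct and function names are hypothetical
and this is not the code from either sched_clock() implementation under
discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU clock state; all names are illustrative only. */
struct scaled_clock {
	uint64_t last_cycles;	/* counter value at the last update */
	uint64_t khz;		/* frequency in effect since the last update */
	uint64_t ns;		/* accumulated time in nanoseconds */
};

/*
 * Fold the cycles elapsed since the last update into ->ns, scaled by the
 * frequency that was in effect while they were counted.
 * cycles / kHz gives milliseconds, so multiply by 10^6 to get ns.
 * The multiplication can overflow for very long intervals; a real
 * implementation would have to bound the interval or use wider math.
 */
static void clock_update(struct scaled_clock *c, uint64_t now_cycles)
{
	uint64_t delta = now_cycles - c->last_cycles;

	assert(c->khz > 0);
	c->ns += delta * 1000000ULL / c->khz;
	c->last_cycles = now_cycles;
}

/* Call on every frequency change, before the new frequency is recorded. */
static void clock_set_khz(struct scaled_clock *c, uint64_t now_cycles,
			  uint64_t new_khz)
{
	clock_update(c, now_cycles);
	c->khz = new_khz;
}
```

For example, 2*10^9 cycles at 2 GHz followed by 10^9 cycles at 1 GHz add
up to 2 seconds here, whereas converting all 3*10^9 cycles at the initial
2 GHz would report 1.5 seconds. Without per-interval scaling that kind of
discrepancy accumulates with every frequency change.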

>> BTW with all this CPU time measurement it would be really nice to
>> report it to the user too. It seems a bit bizarre that the scheduler
>> keeps track of ns, but top only knows jiffies with large sampling
>> errors.

> yeah - i'll fix that too if someone doesn't beat me to it.

I've been pondering for some time if doubling the NMI watchdog as a 
ring 0 counter for this is worth it. So far I'm still undecided
(and it's moot now since it's disabled by default :/)

-Andi


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (11 preceding siblings ...)
  2007-04-15 22:49 ` Ismail Dönmez
@ 2007-04-16 22:00 ` Andi Kleen
  2007-04-16 21:05   ` Ingo Molnar
  2007-04-17  7:56 ` Andy Whitcroft
  2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
  14 siblings, 1 reply; 712+ messages in thread
From: Andi Kleen @ 2007-04-16 22:00 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Ingo Molnar <mingo@elte.hu> writes:

> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

I would suggest dropping the tsc.c change. The "small errors" can be
really large on some systems and you can also see large backward jumps.
I have a proper (but complicated) solution pending
in ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/sched-clock-share

BTW with all this CPU time measurement it would be really nice to
report it to the user too. It seems a bit bizarre that the scheduler
keeps track of ns, but top only knows jiffies with large sampling errors.

-Andi

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 12:55                   ` Peter Williams
@ 2007-04-16 23:10                     ` Michael K. Edwards
  2007-04-17  3:55                       ` Nick Piggin
       [not found]                     ` <20070416135915.GK8915@holomorphy.com>
  1 sibling, 1 reply; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-16 23:10 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> Note that I talk of run queues
> not CPUs as I think a shift to multiple CPUs per run queue may be a good
> idea.

This observation of Peter's is the best thing to come out of this
whole foofaraw.  Looking at what's happening in CPU-land, I think it's
going to be necessary, within a couple of years, to replace the whole
idea of "CPU scheduling" with "run queue scheduling" across a complex,
possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
point in churning the mainline scheduler through a design that isn't
significantly more flexible than any of those now under discussion.

For instance, there are architectures where several "CPUs"
(instruction stream decoders feeding execution pipelines) share parts
of a cache hierarchy ("chip-level multitasking").  On these machines,
you may want to co-schedule a "real" processing task on one pipeline
with a "cache warming" task on the other pipeline -- but only for
tasks whose memory access patterns have been sufficiently analyzed to
write the "cache warming" task code.  Some other tasks may want to
idle the second pipeline so they can use the full cache-to-RAM
bandwidth.  Yet other tasks may be genuinely CPU-intensive (or I/O
bound but so context-heavy that it's not worth yielding the CPU during
quick I/Os), and hence perfectly happy to run concurrently with an
unrelated task on the other pipeline.

There are other architectures where several "hardware threads" fight
over parts of a cache hierarchy (sometimes bizarrely described as
"sharing" the cache, kind of the way most two-year-olds "share" toys).
 On these machines, one instruction pipeline can't help the other
along cache-wise, but it sure can hurt.  A scheduler designed, tested,
and tuned principally on one of these architectures (hint:
"hyperthreading") will probably leave a lot of performance on the
floor on processors in the former category.

In the not-so-distant future, we're likely to see architectures with
dynamically reconfigurable interconnect between instruction issue
units and execution resources.  (This is already quite feasible on,
say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with
as many Nios II cores as fit on the chip.)  Restoring task context may
involve not just MMU swaps and FPU instructions (with state-dependent
hidden costs) but processor reconfiguration.  Achieving "fairness"
according to any standard that a platform integrator cares about (let
alone an end user) will require a fairly detailed model of the hidden
costs associated with different sorts of task switch.

So if you are interested in schedulers for some reason other than a
paycheck, let the distros worry about 5% improvements on x86[_64].
Get hold of some different "hardware" -- say:
  - a Xilinx ML410 if you've got $3K to blow and want to explore
reconfigurable processors;
  - a SunFire T2000 if you've got $11K and want to mess with a CMT
system that's actually shipping;
  - a QEMU-simulated massively SMP x86 if you're poor but clever
enough to implement funky cross-core cache effects yourself; or
  - a cycle-accurate simulator from Gaisler or Virtio if you want a
real research project.
Then go explore some more interesting regions of parameter space and
see what the demands on mainline Linux will look like in a few years.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 15:55               ` Chris Friesen
  2007-04-16 16:13                 ` William Lee Irwin III
@ 2007-04-17  0:04                 ` Peter Williams
  2007-04-17 13:07                 ` James Bruce
  2 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-17  0:04 UTC (permalink / raw)
  To: Chris Friesen
  Cc: William Lee Irwin III, Willy Tarreau, Pekka Enberg,
	hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

Chris Friesen wrote:
> William Lee Irwin III wrote:
> 
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>>     corresponds to some share of CPU bandwidth. Implementations
>>     should not have the freedom to change this arbitrarily according
>>     to some intention.
> 
> The first question that comes to my mind is whether nice levels should 
> be linear or not.

No. That squishes one end of the table too much.  It needs to be 
(approximately) piecewise linear around nice == 0.  Here's the mapping I 
use in my entitlement based schedulers:

#define NICE_TO_LP(nice) \
	(((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))

It has the (good) feature that a nice == 19 task has 1/20th the 
entitlement of a nice == 0 task and a nice == -20 task has 21 times the 
entitlement of a nice == 0 task.  It's not strictly linear for negative 
nice values but is very cheap to calculate and quite easy to invert if 
necessary.
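
The two ratios quoted above can be checked directly from the mapping. The
sketch below restates the macro with its argument fully parenthesised and
adds a hypothetical helper, entitlement_vs_nice0(), that exists only for
this illustration and is not part of any posted scheduler.

```c
#include <assert.h>

/* The mapping from the message above, with the argument parenthesised. */
#define NICE_TO_LP(nice) \
	(((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))

/* Hypothetical: a fraction num/den, to stay in integer arithmetic. */
struct ratio {
	int num;
	int den;
};

/* Entitlement of a task at `nice` relative to a nice == 0 task. */
static struct ratio entitlement_vs_nice0(int nice)
{
	struct ratio r;

	assert(nice >= -20 && nice <= 19);
	r.num = NICE_TO_LP(nice);
	r.den = NICE_TO_LP(0);
	return r;
}
```

NICE_TO_LP(0) is 20, NICE_TO_LP(19) is 1 and NICE_TO_LP(-20) is
20 + 400 = 420, so the ratios relative to nice == 0 are 1/20 and 420/20 = 21,
matching the figures given. The positive half is linear and the negative
half is quadratic.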

>  I would lean towards nonlinear as it allows a wider 
> range (although of course at the expense of precision).  Maybe something 
> like "each nice level gives X times the cpu of the previous"?  I think a 
> value of X somewhere between 1.15 and 1.25 might be reasonable.
> 
> What about also having something that looks at latency, and how latency 
> changes with niceness?
> 
> What about specifying the timeframe over which the cpu bandwidth is 
> measured?  I currently have a system where the application designers 
> would like it to be totally fair over a period of 1 second.

Have you tried the spa_ebs scheduler?  The half life is no longer a run 
time configurable parameter (as making it highly adjustable results in 
less efficient code) but it could be adjusted to be approximately 
equivalent to 0.5 seconds by changing some constants in the code.

>  As you can 
> imagine, mainline doesn't do very well in this case.

You should look back through the plugsched patches where many of these 
ideas have been experimented with.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15  6:43   ` Mike Galbraith
  2007-04-15  8:36     ` Bill Huey
@ 2007-04-17  0:06     ` Peter Williams
  2007-04-17  2:29       ` Mike Galbraith
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  0:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner

Mike Galbraith wrote:
> On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote:
>> On Saturday 14 April 2007 06:21, Ingo Molnar wrote:
>>> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
>>> [CFS]
>>>
>>> i'm pleased to announce the first release of the "Modular Scheduler Core
>>> and Completely Fair Scheduler [CFS]" patchset:
>>>
>>>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>>>
>>> This project is a complete rewrite of the Linux task scheduler. My goal
>>> is to address various feature requests and to fix deficiencies in the
>>> vanilla scheduler that were suggested/found in the past few years, both
>>> for desktop scheduling and for server scheduling workloads.
>> The casual observer will be completely confused by what on earth has happened 
>> here so let me try to demystify things for them.
> 
> [...]
> 
> Demystify what?   The casual observer need only read either your attempt
> at writing a scheduler, or my attempts at fixing the one we have, to see
> that it was high time for someone with the necessary skills to step in.

Make that "someone with the necessary clout".

> Now progress can happen, which was _not_ happening before.
> 

This is true.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16  5:47       ` Davide Libenzi
@ 2007-04-17  0:37         ` Pavel Pisa
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Pisa @ 2007-04-17  0:37 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Monday 16 April 2007 07:47, Davide Libenzi wrote:
> On Mon, 16 Apr 2007, Pavel Pisa wrote:
> > I cannot help myself to not report results with GAVL
> > tree algorithm there as an another race competitor.
> > I believe, that it is better solution for large priority
> > queues than RB-tree and even heap trees. It could be
> > disputable if the scheduler needs such scalability on
> > the other hand. The AVL heritage guarantees lower height
> > which results in shorter search times which could
> > be profitable for other uses in kernel.
> >
> > GAVL algorithm is AVL tree based, so it does not suffer from
> > "infinite" priorities granularity there as TR does. It allows
> > use for generalized case where tree is not fully balanced.
> > This allows to cut the first item without rebalancing.
> > This leads to the degradation of the tree by one more level
> > (than non degraded AVL gives) in maximum, which is still
> > considerably better than RB-trees maximum.
> >
> > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c
>
> Here are the results on my Opteron 252:
>
> Testing N=1
> gavl_cfs = 187.20 cycles/loop
> CFS = 194.16 cycles/loop
> TR  = 314.87 cycles/loop
> CFS = 194.15 cycles/loop
> gavl_cfs = 187.15 cycles/loop
>
> Testing N=2
> gavl_cfs = 268.94 cycles/loop
> CFS = 305.53 cycles/loop
> TR  = 313.78 cycles/loop
> CFS = 289.58 cycles/loop
> gavl_cfs = 266.02 cycles/loop
>
> Testing N=4
> gavl_cfs = 452.13 cycles/loop
> CFS = 518.81 cycles/loop
> TR  = 311.54 cycles/loop
> CFS = 516.23 cycles/loop
> gavl_cfs = 450.73 cycles/loop
>
> Testing N=8
> gavl_cfs = 609.29 cycles/loop
> CFS = 644.65 cycles/loop
> TR  = 308.11 cycles/loop
> CFS = 667.01 cycles/loop
> gavl_cfs = 592.89 cycles/loop
>
> Testing N=16
> gavl_cfs = 686.30 cycles/loop
> CFS = 807.41 cycles/loop
> TR  = 317.20 cycles/loop
> CFS = 810.24 cycles/loop
> gavl_cfs = 688.42 cycles/loop
>
> Testing N=32
> gavl_cfs = 756.57 cycles/loop
> CFS = 852.14 cycles/loop
> TR  = 301.22 cycles/loop
> CFS = 876.12 cycles/loop
> gavl_cfs = 758.46 cycles/loop
>
> Testing N=64
> gavl_cfs = 831.97 cycles/loop
> CFS = 997.16 cycles/loop
> TR  = 304.74 cycles/loop
> CFS = 1003.26 cycles/loop
> gavl_cfs = 832.83 cycles/loop
>
> Testing N=128
> gavl_cfs = 897.33 cycles/loop
> CFS = 1030.36 cycles/loop
> TR  = 295.65 cycles/loop
> CFS = 1035.29 cycles/loop
> gavl_cfs = 892.51 cycles/loop
>
> Testing N=256
> gavl_cfs = 963.17 cycles/loop
> CFS = 1146.04 cycles/loop
> TR  = 295.35 cycles/loop
> CFS = 1162.04 cycles/loop
> gavl_cfs = 966.31 cycles/loop
>
> Testing N=512
> gavl_cfs = 1029.82 cycles/loop
> CFS = 1218.34 cycles/loop
> TR  = 288.78 cycles/loop
> CFS = 1257.97 cycles/loop
> gavl_cfs = 1029.83 cycles/loop
>
> Testing N=1024
> gavl_cfs = 1091.76 cycles/loop
> CFS = 1318.47 cycles/loop
> TR  = 287.74 cycles/loop
> CFS = 1311.72 cycles/loop
> gavl_cfs = 1093.29 cycles/loop
>
> Testing N=2048
> gavl_cfs = 1153.03 cycles/loop
> CFS = 1398.84 cycles/loop
> TR  = 286.75 cycles/loop
> CFS = 1438.68 cycles/loop
> gavl_cfs = 1149.97 cycles/loop
>
>
> There seems to be some difference from your numbers. This is with:
>
> gcc version 4.1.2
>
> and -O2. But then an Opteron can behave quite differently from a Duron on
> a bench like this ;)

Thanks for testing. Your numbers are more correct than those in my
first report. My numbers seemed over-optimistic even to me; in fact
I was surprised that the difference was so large. It turns out that I
had tested a bad version of the code, without the GAVL_FAFTER option
set. The code pushed to the web page was the correct one. I did not
get to look into this until now because I had a busy day preparing
some Linux-based labs at the university.

Without the GAVL_FAFTER option, the insert operation fails if an item
with the same key is already in the tree (an intended feature of the
code), and as a result not all items were inserted in the test.
GAVL_FAFTER means find/insert after all items with the same key value.
The default behavior is to operate on unique keys in the tree and
reject duplicates.
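
The two insert policies described above can be sketched on a plain sorted
linked list. This is an assumption-laden illustration, not the GAVL code:
the flag INSERT_AFTER_EQUAL and the function sorted_insert() are
hypothetical stand-ins for the behavior described for GAVL_FAFTER, and a
list is used instead of an AVL tree to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

struct item {
	int key;
	struct item *next;
};

/* Hypothetical flag modelling the behaviour described for GAVL_FAFTER. */
#define INSERT_AFTER_EQUAL 1

/*
 * Insert into a list kept sorted by key.  Returns 0 on success, or -1 if
 * the key already exists and duplicates were not requested.
 */
static int sorted_insert(struct item **head, struct item *new_item, int flags)
{
	struct item **pos = head;

	/* Skip everything with a smaller key. */
	while (*pos && (*pos)->key < new_item->key)
		pos = &(*pos)->next;

	if (*pos && (*pos)->key == new_item->key) {
		if (!(flags & INSERT_AFTER_EQUAL))
			return -1;	/* default: unique keys only */
		/* Walk past all items that share the key. */
		while (*pos && (*pos)->key == new_item->key)
			pos = &(*pos)->next;
	}

	new_item->next = *pos;
	*pos = new_item;
	return 0;
}
```

In a benchmark that enqueues many tasks with equal keys, the default
policy silently leaves the structure smaller than intended, which would
make the measured cycles per loop look better than they should.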

My results are even worse for GAVL than yours.
It would be possible to tweak the code and optimize it further
(likely/unlikely, not keeping the last pointer, etc.) for this
particular usage. I may try this exercise, but I do not expect the
result after tuning to be so much better that it would outweigh the
redesign work. I still see some advantages in AVL, but it has its own
drawbacks: it needs a separate height field, and deletion from the
middle is a little slower.

So excuse me for the disturbance. I was only curious how the
GAVL code would behave in comparison with the other algorithms,
and I did not keep my premature enthusiasm in check.

Best wishes

             Pavel Pisa 


./smart-queue-v-gavl -n 4
gavl_cfs = 279.02 cycles/loop
CFS = 200.87 cycles/loop
TR  = 229.55 cycles/loop
CFS = 201.23 cycles/loop
gavl_cfs = 276.08 cycles/loop

./smart-queue-v-gavl -n 8
gavl_cfs = 310.92 cycles/loop
CFS = 288.45 cycles/loop
TR  = 192.46 cycles/loop
CFS = 284.94 cycles/loop
gavl_cfs = 357.02 cycles/loop

./smart-queue-v-gavl -n 16
gavl_cfs = 350.45 cycles/loop
CFS = 354.01 cycles/loop
TR  = 189.79 cycles/loop
CFS = 320.08 cycles/loop
gavl_cfs = 387.43 cycles/loop

./smart-queue-v-gavl -n 32
gavl_cfs = 419.23 cycles/loop
CFS = 406.88 cycles/loop
TR  = 198.10 cycles/loop
CFS = 398.15 cycles/loop
gavl_cfs = 412.57 cycles/loop

./smart-queue-v-gavl -n 64
gavl_cfs = 442.81 cycles/loop
CFS = 429.62 cycles/loop
TR  = 235.40 cycles/loop
CFS = 389.54 cycles/loop
gavl_cfs = 433.56 cycles/loop

./smart-queue-v-gavl -n 128
gavl_cfs = 358.20 cycles/loop
CFS = 605.49 cycles/loop
TR  = 236.01 cycles/loop
CFS = 458.50 cycles/loop
gavl_cfs = 455.05 cycles/loop

./smart-queue-v-gavl -n 256
gavl_cfs = 529.72 cycles/loop
CFS = 530.98 cycles/loop
TR  = 193.75 cycles/loop
CFS = 533.75 cycles/loop
gavl_cfs = 471.47 cycles/loop

./smart-queue-v-gavl -n 512
gavl_cfs = 525.80 cycles/loop
CFS = 550.63 cycles/loop
TR  = 188.71 cycles/loop
CFS = 549.81 cycles/loop
gavl_cfs = 494.73 cycles/loop

./smart-queue-v-gavl -n 1024
gavl_cfs = 544.91 cycles/loop
CFS = 561.68 cycles/loop
TR  = 230.97 cycles/loop
CFS = 522.68 cycles/loop
gavl_cfs = 542.40 cycles/loop

./smart-queue-v-gavl -n 2048
gavl_cfs = 567.46 cycles/loop
CFS = 581.85 cycles/loop
TR  = 229.69 cycles/loop
CFS = 585.41 cycles/loop
gavl_cfs = 563.22 cycles/loop

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 17:22             ` Chris Friesen
@ 2007-04-17  0:54               ` Peter Williams
  2007-04-17 15:52                 ` Chris Friesen
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  0:54 UTC (permalink / raw)
  To: Chris Friesen
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Chris Friesen wrote:
> Peter Williams wrote:
> 
>> To my mind scheduling and load balancing are orthogonal and keeping 
>> them that way simplifies things.
> 
> Scuse me if I jump in here, but doesn't the load balancer need some way 
> to figure out a) when to run, and b) which tasks to pull and where to 
> push them?

Yes but both of these are independent of the scheduler discipline in force.

> 
> I suppose you could abstract this into a per-scheduler API, but to me at 
> least these are the hard parts of the load balancer...

Load balancing needs to be based on the static priorities (i.e. nice or 
real time priority) of the runnable tasks, not the dynamic priorities. 
If the load balancer manages to keep the weighted (according to static 
priority) load, and the distribution of priorities within the loads, 
roughly equal across the CPUs, and the scheduler does a good job of 
ensuring fairness, interactive responsiveness etc. for the tasks within 
a CPU, then the result will be good system performance within the 
constraints set by the sysadmin's use of real time priorities and nice.

The smpnice modifications to the load balancer were meant to give it the 
appropriate behaviour and what we need to fix now is the intra CPU 
scheduling.

Even if the load balancer isn't yet perfect, perfecting it can be done 
separately from fixing the scheduler, preferably with as little 
interdependency as possible.  Probably the only contribution to load 
balancing that the scheduler really needs to make is calculating 
the average weighted load on each of the CPUs (or run queues if there's 
more than one CPU per runqueue).
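
As a rough illustration of a load figure that depends only on static
priority, the sketch below sums a per-task weight over a run queue and
picks the busiest queue. Everything in it is an assumption made for the
example: the names static_weight(), weighted_load() and busiest_queue()
are hypothetical, the weight reuses the nice-to-entitlement mapping posted
earlier in this thread, real time priorities are ignored, and the load is
instantaneous rather than averaged over time. It is not the smpnice code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical weight derived from the static priority (nice) only. */
static unsigned int static_weight(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return (nice >= 0) ? (unsigned int)(20 - nice)
			   : (unsigned int)(20 + nice * nice);
}

/* Weighted load of one run queue: sum of its runnable tasks' weights. */
static unsigned long weighted_load(const int *nice, size_t nr_running)
{
	unsigned long load = 0;
	size_t i;

	for (i = 0; i < nr_running; i++)
		load += static_weight(nice[i]);
	return load;
}

/* Index of the run queue with the highest weighted load. */
static size_t busiest_queue(const unsigned long *load, size_t nr_queues)
{
	size_t i, busiest = 0;

	assert(nr_queues > 0);
	for (i = 1; i < nr_queues; i++)
		if (load[i] > load[busiest])
			busiest = i;
	return busiest;
}
```

Because nothing here looks at dynamic priorities or at how the per-queue
scheduler orders its tasks, the same balancing decision would be reached
whichever scheduling discipline is in force within a CPU.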

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  0:06     ` Peter Williams
@ 2007-04-17  2:29       ` Mike Galbraith
  2007-04-17  3:40         ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-17  2:29 UTC (permalink / raw)
  To: Peter Williams
  Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven,
	Thomas Gleixner

On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> >
> > Demystify what?   The casual observer need only read either your attempt
> > at writing a scheduler, or my attempts at fixing the one we have, to see
> > that it was high time for someone with the necessary skills to step in.
> 
> Make that "someone with the necessary clout".

No, I was brutally honest to both of us, but quite correct.

> > Now progress can happen, which was _not_ happening before.
> > 
> 
> This is true.

Yup, and progress _is_ happening now, quite rapidly.

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 14:28             ` Matt Mackall
@ 2007-04-17  3:31               ` Nick Piggin
  2007-04-17 17:35                 ` Matt Mackall
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  3:31 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
> On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> > I'd prefer if we kept a single CPU scheduler in mainline, because I
> > think that simplifies analysis and focuses testing.
> 
> I think you'll find something like 80-90% of the testing will be done
> on the default choice, even if other choices exist. So you really
> won't have much of a problem here.
> 
> But when the only choice for other schedulers is to go out-of-tree,
> then only 1% of the people will try it out and those people are
> guaranteed to be the ones who saw scheduling problems in mainline.
> So the alternative won't end up getting any testing on many of the
> workloads that work fine in mainstream so their feedback won't tell
> you very much at all.

Yeah I concede that perhaps it is the only way to get things going
any further. But how do we decide if and when the current scheduler
should be demoted from default, and which should replace it?

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  2:29       ` Mike Galbraith
@ 2007-04-17  3:40         ` Nick Piggin
  2007-04-17  4:01           ` Mike Galbraith
  2007-04-17  4:17           ` Peter Williams
  0 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  3:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> > Mike Galbraith wrote:
> > >
> > > Demystify what?   The casual observer need only read either your attempt
> > > at writing a scheduler, or my attempts at fixing the one we have, to see
> > > that it was high time for someone with the necessary skills to step in.
> > 
> > Make that "someone with the necessary clout".
> 
> No, I was brutally honest to both of us, but quite correct.
> 
> > > Now progress can happen, which was _not_ happening before.
> > > 
> > 
> > This is true.
> 
> Yup, and progress _is_ happening now, quite rapidly.

Progress as in progress on Ingo's scheduler. I still don't know how we'd
decide when to replace the mainline scheduler or with what.

I don't think we can say Ingo's is better than the alternatives, can we?
If there is some kind of bakeoff, then I'd like one of Con's designs to
be involved, and mine, and Peter's...

Maybe the progress is that more key people are becoming open to the idea
of changing the scheduler.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely  FairScheduler [CFS]
  2007-04-17  4:01           ` Mike Galbraith
@ 2007-04-17  3:43             ` David Lang
  2007-04-17  4:14             ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
  2007-04-20 20:36             ` Bill Davidsen
  2 siblings, 0 replies; 712+ messages in thread
From: David Lang @ 2007-04-17  3:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list,
	Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, Mike Galbraith wrote:

> Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely
>     FairScheduler [CFS]
> 
> On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>
>>> Yup, and progress _is_ happening now, quite rapidly.
>>
>> Progress as in progress on Ingo's scheduler. I still don't know how we'd
>> decide when to replace the mainline scheduler or with what.
>>
>> I don't think we can say Ingo's is better than the alternatives, can we?
>
> No, that would require massive performance testing of all alternatives.
>
>> If there is some kind of bakeoff, then I'd like one of Con's designs to
>> be involved, and mine, and Peter's...
>
> The trouble with a bakeoff is that it's pretty darn hard to get people
> to test in the first place, and then comes weighting the subjective and
> hard performance numbers.  If they're close in numbers, do you go with
> the one which starts the least flamewars or what?

it's especially hard if the people doing the testing need to find the latest 
patch and apply it.

even having a compile-time option to switch between them at least means that the 
testers can have confidence that the various patches haven't bitrotted.

boot time options would be even better, but I understand from previous 
discussions I've watched that this is performance critical enough that the 
overhead of this would throw off the results.

David Lang

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 23:10                     ` Michael K. Edwards
@ 2007-04-17  3:55                       ` Nick Piggin
  2007-04-17  4:25                         ` Peter Williams
  2007-04-17  8:24                         ` William Lee Irwin III
  0 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  3:55 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Peter Williams, William Lee Irwin III, Ingo Molnar, Matt Mackall,
	Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> >Note that I talk of run queues
> >not CPUs as I think a shift to multiple CPUs per run queue may be a good
> >idea.
> 
> This observation of Peter's is the best thing to come out of this
> whole foofaraw.  Looking at what's happening in CPU-land, I think it's
> going to be necessary, within a couple of years, to replace the whole
> idea of "CPU scheduling" with "run queue scheduling" across a complex,
> possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
> point in churning the mainline scheduler through a design that isn't
> significantly more flexible than any of those now under discussion.

Why? If you do that, then your load balancer just becomes less flexible
because it is harder to have tasks run on one or the other.

You can have single-runqueue-per-domain behaviour (or close to) just by
relaxing all restrictions on idle load balancing within that domain. It
is harder to go the other way and place any per-cpu affinity or
restrictions with multiple cpus on a single runqueue.


> For instance, there are architectures where several "CPUs"
> (instruction stream decoders feeding execution pipelines) share parts
> of a cache hierarchy ("chip-level multitasking").  On these machines,
> you may want to co-schedule a "real" processing task on one pipeline
> with a "cache warming" task on the other pipeline -- but only for
> tasks whose memory access patterns have been sufficiently analyzed to
> write the "cache warming" task code.  Some other tasks may want to
> idle the second pipeline so they can use the full cache-to-RAM
> bandwidth.  Yet other tasks may be genuinely CPU-intensive (or I/O
> bound but so context-heavy that it's not worth yielding the CPU during
> quick I/Os), and hence perfectly happy to run concurrently with an
> unrelated task on the other pipeline.

We can do all that now with load balancing, affinities or by shutting
down threads dynamically.


> There are other architectures where several "hardware threads" fight
> over parts of a cache hierarchy (sometimes bizarrely described as
> "sharing" the cache, kind of the way most two-year-olds "share" toys).
> On these machines, one instruction pipeline can't help the other
> along cache-wise, but it sure can hurt.  A scheduler designed, tested,
> and tuned principally on one of these architectures (hint:
> "hyperthreading") will probably leave a lot of performance on the
> floor on processors in the former category.
> 
> In the not-so-distant future, we're likely to see architectures with
> dynamically reconfigurable interconnect between instruction issue
> units and execution resources.  (This is already quite feasible on,
> say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with
> as many Nios II cores as fit on the chip.)  Restoring task context may
> involve not just MMU swaps and FPU instructions (with state-dependent
> hidden costs) but processor reconfiguration.  Achieving "fairness"
> according to any standard that a platform integrator cares about (let
> alone an end user) will require a fairly detailed model of the hidden
> costs associated with different sorts of task switch.
> 
> So if you are interested in schedulers for some reason other than a
> paycheck, let the distros worry about 5% improvements on x86[_64].
> Get hold of some different "hardware" -- say:
>  - a Xilinx ML410 if you've got $3K to blow and want to explore
> reconfigurable processors;
>  - a SunFire T2000 if you've got $11K and want to mess with a CMT
> system that's actually shipping;
>  - a QEMU-simulated massively SMP x86 if you're poor but clever
> enough to implement funky cross-core cache effects yourself; or
>  - a cycle-accurate simulator from Gaisler or Virtio if you want a
> real research project.
> Then go explore some more interesting regions of parameter space and
> see what the demands on mainline Linux will look like in a few years.

There are no doubt improvements to be made, but they are generally
intended to be able to be done within the sched-domains framework. I
am not aware of a particular need that would be impossible to do using
that topology hierarchy and per-CPU runqueues, and there are added
complications involved with multiple CPUs per runqueue.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  3:40         ` Nick Piggin
@ 2007-04-17  4:01           ` Mike Galbraith
  2007-04-17  3:43             ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
                               ` (2 more replies)
  2007-04-17  4:17           ` Peter Williams
  1 sibling, 3 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-17  4:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
 
> > Yup, and progress _is_ happening now, quite rapidly.
> 
> Progress as in progress on Ingo's scheduler. I still don't know how we'd
> decide when to replace the mainline scheduler or with what.
> 
> I don't think we can say Ingo's is better than the alternatives, can we?

No, that would require massive performance testing of all alternatives.

> If there is some kind of bakeoff, then I'd like one of Con's designs to
> be involved, and mine, and Peter's...

The trouble with a bakeoff is that it's pretty darn hard to get people
to test in the first place, and then comes weighting the subjective and
hard performance numbers.  If they're close in numbers, do you go with
the one which starts the least flamewars or what?

> Maybe the progress is that more key people are becoming open to the idea
> of changing the scheduler.

Could be.  All was quiet for quite a while, but when RSDL showed up, it
aroused enough interest to show that scheduling woes are on folks' radar.

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:01           ` Mike Galbraith
  2007-04-17  3:43             ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
@ 2007-04-17  4:14             ` Nick Piggin
  2007-04-17  6:26               ` Peter Williams
  2007-04-17  9:51               ` Ingo Molnar
  2007-04-20 20:36             ` Bill Davidsen
  2 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  4:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 06:01:29AM +0200, Mike Galbraith wrote:
> On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
> > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>  
> > > Yup, and progress _is_ happening now, quite rapidly.
> > 
> > Progress as in progress on Ingo's scheduler. I still don't know how we'd
> > decide when to replace the mainline scheduler or with what.
> > 
> > I don't think we can say Ingo's is better than the alternatives, can we?
> 
> No, that would require massive performance testing of all alternatives.
> 
> > If there is some kind of bakeoff, then I'd like one of Con's designs to
> > be involved, and mine, and Peter's...
> 
> The trouble with a bakeoff is that it's pretty darn hard to get people
> to test in the first place, and then comes weighting the subjective and
> hard performance numbers.  If they're close in numbers, do you go with
> the one which starts the least flamewars or what?

I don't know how you'd do it. I know you wouldn't count people telling you
how good they are (getting people to tell you how bad they are, and whether
others do better in a given situation might be slightly more viable).

But we have to choose somehow. I'd hope that is going to be based solely
on the results and technical properties of the code, so... if we were to
somehow determine that the results are exactly the same, we'd go for the
simpler one, wouldn't we?


> > Maybe the progress is that more key people are becoming open to the idea
> > of changing the scheduler.
> 
> Could be.  All was quiet for quite a while, but when RSDL showed up, it
> aroused enough interest to show that scheduling woes are on folks' radar.

Well I know people have had woes with the scheduler for ever (I guess that
isn't going to change :P). I think people generally lost a bit of interest
in trying to improve the situation because of the upstream problem.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  3:40         ` Nick Piggin
  2007-04-17  4:01           ` Mike Galbraith
@ 2007-04-17  4:17           ` Peter Williams
  2007-04-17  4:29             ` Nick Piggin
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  4:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
>>> Mike Galbraith wrote:
>>>> Demystify what?   The casual observer need only read either your attempt
>>>> at writing a scheduler, or my attempts at fixing the one we have, to see
>>>> that it was high time for someone with the necessary skills to step in.
>>> Make that "someone with the necessary clout".
>> No, I was brutally honest to both of us, but quite correct.
>>
>>>> Now progress can happen, which was _not_ happening before.
>>>>
>>> This is true.
>> Yup, and progress _is_ happening now, quite rapidly.
> 
> Progress as in progress on Ingo's scheduler. I still don't know how we'd
> decide when to replace the mainline scheduler or with what.
> 
> I don't think we can say Ingo's is better than the alternatives, can we?
> If there is some kind of bakeoff, then I'd like one of Con's designs to
> be involved, and mine, and Peter's...

I myself was thinking of this as the chance for a much needed 
simplification of the scheduling code and if this can be done with the 
result being "reasonable" it then gives us the basis on which to propose 
improvements based on the ideas of others such as you mention.

As the size of the cpusched indicates, trying to evaluate alternative 
proposals based on the current O(1) scheduler is fraught.  Hopefully, 
this initiative can fix this problem.  Then we just need Ingo to listen 
to suggestions and he's showing signs of being willing to do this :-)

> 
> Maybe the progress is that more key people are becoming open to the idea
> of changing the scheduler.

That too.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  3:55                       ` Nick Piggin
@ 2007-04-17  4:25                         ` Peter Williams
  2007-04-17  4:34                           ` Nick Piggin
  2007-04-17  8:24                         ` William Lee Irwin III
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  4:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
	Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>> Note that I talk of run queues
>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good
>>> idea.
>> This observation of Peter's is the best thing to come out of this
>> whole foofaraw.  Looking at what's happening in CPU-land, I think it's
>> going to be necessary, within a couple of years, to replace the whole
>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>> possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
>> point in churning the mainline scheduler through a design that isn't
>> significantly more flexible than any of those now under discussion.
> 
> Why? If you do that, then your load balancer just becomes less flexible
> because it is harder to have tasks run on one or the other.
> 
> You can have single-runqueue-per-domain behaviour (or close to) just by
> relaxing all restrictions on idle load balancing within that domain. It
> is harder to go the other way and place any per-cpu affinity or
> restrictions with multiple cpus on a single runqueue.

Allowing N (where N can be one or greater) CPUs per run queue actually 
increases flexibility as you can still set N to 1 to get the current 
behaviour.
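
To make the "N CPUs per run queue" idea concrete, here is a minimal
sketch. It is not taken from any of the schedulers under discussion:
the helper name cpu_to_rq() and the assumption that consecutive CPU
numbers share a queue are purely illustrative.

```c
#include <assert.h>

/*
 * Hypothetical helper: map a CPU number to the index of the run queue
 * that serves it, assuming consecutive CPUs are grouped together.
 * With cpus_per_rq == 1 every CPU gets its own queue, which matches
 * the current per-CPU arrangement.
 */
static int cpu_to_rq(int cpu, int cpus_per_rq)
{
    assert(cpu >= 0);
    assert(cpus_per_rq >= 1);
    return cpu / cpus_per_rq;
}
```

With cpus_per_rq set to 2, the two logical CPUs of a single
hyper-threaded chip (CPUs 0 and 1) both map to queue 0, so there is
nothing to balance between them.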

One advantage of allowing multiple CPUs per run queue would be at the 
smaller end of the system scale i.e. a PC with a single hyper threading 
chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
if both CPUs used the one runqueue and all the nasty side effects that 
come with hyper threading would be minimized at the same time.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:17           ` Peter Williams
@ 2007-04-17  4:29             ` Nick Piggin
  2007-04-17  5:53               ` Willy Tarreau
                                 ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  4:29 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
> >>On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
> >>>Mike Galbraith wrote:
> >>>>Demystify what?   The casual observer need only read either your attempt
> >>>>at writing a scheduler, or my attempts at fixing the one we have, to see
> >>>>that it was high time for someone with the necessary skills to step in.
> >>>Make that "someone with the necessary clout".
> >>No, I was brutally honest to both of us, but quite correct.
> >>
> >>>>Now progress can happen, which was _not_ happening before.
> >>>>
> >>>This is true.
> >>Yup, and progress _is_ happening now, quite rapidly.
> >
> >Progress as in progress on Ingo's scheduler. I still don't know how we'd
> >decide when to replace the mainline scheduler or with what.
> >
> >I don't think we can say Ingo's is better than the alternatives, can we?
> >If there is some kind of bakeoff, then I'd like one of Con's designs to
> >be involved, and mine, and Peter's...
> 
> I myself was thinking of this as the chance for a much needed 
> simplification of the scheduling code and if this can be done with the 
> result being "reasonable" it then gives us the basis on which to propose 
> improvements based on the ideas of others such as you mention.
> 
> As the size of the cpusched indicates, trying to evaluate alternative 
> proposals based on the current O(1) scheduler is fraught.  Hopefully, 

I don't know why. The problem is that you can't really evaluate good
proposals by looking at the code (you can say that one is bad, ie. the
current one, which has a huge amount of temporal complexity and is
explicitly unfair), but it is pretty hard to say one behaves well.

And my scheduler for example cuts down the amount of policy code and
code size significantly. I haven't looked at Con's ones for a while,
but I believe they are also much more straightforward than mainline...

For example, let's say all else is equal between them, then why would
we go with the O(logN) implementation rather than the O(1)?



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:25                         ` Peter Williams
@ 2007-04-17  4:34                           ` Nick Piggin
  2007-04-17  6:03                             ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  4:34 UTC (permalink / raw)
  To: Peter Williams
  Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
	Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
> >>On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> >>>Note that I talk of run queues
> >>>not CPUs as I think a shift to multiple CPUs per run queue may be a good
> >>>idea.
> >>This observation of Peter's is the best thing to come out of this
> >>whole foofaraw.  Looking at what's happening in CPU-land, I think it's
> >>going to be necessary, within a couple of years, to replace the whole
> >>idea of "CPU scheduling" with "run queue scheduling" across a complex,
> >>possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
> >>point in churning the mainline scheduler through a design that isn't
> >>significantly more flexible than any of those now under discussion.
> >
> >Why? If you do that, then your load balancer just becomes less flexible
> >because it is harder to have tasks run on one or the other.
> >
> >You can have single-runqueue-per-domain behaviour (or close to) just by
> >relaxing all restrictions on idle load balancing within that domain. It
> >is harder to go the other way and place any per-cpu affinity or
> >restrictions with multiple cpus on a single runqueue.
> 
> Allowing N (where N can be one or greater) CPUs per run queue actually 
> increases flexibility as you can still set N to 1 to get the current 
> behaviour.

But you add extra code for that on top of what we have, and are also
prevented from making per-cpu assumptions.

And you can get N CPUs per runqueue behaviour by having them in a domain
with no restrictions on idle balancing. So where does your increased
flexibility come from?

> One advantage of allowing multiple CPUs per run queue would be at the 
> smaller end of the system scale i.e. a PC with a single hyper threading 
> chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
> if both CPUs used the one runqueue and all the nasty side effects that 
> come with hyper threading would be minimized at the same time.

I don't know about that -- the current load balancer already minimises
the nasty multi threading effects. SMT is very important for IBM's chips
for example, and they've never had any problem with that side of it
since it was introduced and bugs ironed out (at least, none that I've
heard).




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:29             ` Nick Piggin
@ 2007-04-17  5:53               ` Willy Tarreau
  2007-04-17  6:10                 ` Nick Piggin
  2007-04-17  6:09               ` William Lee Irwin III
  2007-04-17  6:23               ` Peter Williams
  2 siblings, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-17  5:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

Hi Nick,

On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
(...)
> And my scheduler for example cuts down the amount of policy code and
> code size significantly. I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
> 
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?

Of course, if this is the case, the question will be raised. But as a
general rule, I don't see much potential in O(1) to finely tune scheduling
according to several criteria. In O(logN), you can adjust scheduling in
realtime at a very low cost. Better processing of varying priorities or
fork() comes to mind.

Regards,
Willy



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:34                           ` Nick Piggin
@ 2007-04-17  6:03                             ` Peter Williams
  2007-04-17  6:14                               ` William Lee Irwin III
                                                 ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-17  6:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
	Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>>>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>>>> Note that I talk of run queues
>>>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good
>>>>> idea.
>>>> This observation of Peter's is the best thing to come out of this
>>>> whole foofaraw.  Looking at what's happening in CPU-land, I think it's
>>>> going to be necessary, within a couple of years, to replace the whole
>>>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>>>> possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
>>>> point in churning the mainline scheduler through a design that isn't
>>>> significantly more flexible than any of those now under discussion.
>>> Why? If you do that, then your load balancer just becomes less flexible
>>> because it is harder to have tasks run on one or the other.
>>>
>>> You can have single-runqueue-per-domain behaviour (or close to) just by
>>> relaxing all restrictions on idle load balancing within that domain. It
>>> is harder to go the other way and place any per-cpu affinity or
>>> restrictions with multiple cpus on a single runqueue.
>> Allowing N (where N can be one or greater) CPUs per run queue actually 
>> increases flexibility as you can still set N to 1 to get the current 
>> behaviour.
> 
> But you add extra code for that on top of what we have, and are also
> prevented from making per-cpu assumptions.
> 
> And you can get N CPUs per runqueue behaviour by having them in a domain
> with no restrictions on idle balancing. So where does your increased
> flexibility come from?
> 
>> One advantage of allowing multiple CPUs per run queue would be at the 
>> smaller end of the system scale i.e. a PC with a single hyper threading 
>> chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
>> if both CPUs used the one runqueue and all the nasty side effects that 
>> come with hyper threading would be minimized at the same time.
> 
> I don't know about that -- the current load balancer already minimises
> the nasty multi threading effects. SMT is very important for IBM's chips
> for example, and they've never had any problem with that side of it
> since it was introduced and bugs ironed out (at least, none that I've
> heard).
> 

There's a lot of ugly code in the load balancer that is only there to 
overcome the side effects of SMT and dual core.  A lot of it was put 
there by Intel employees trying to make load balancing more friendly to 
their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
better way of achieving that end.  I may (of course) be wrong but I 
think that the idea deserves more consideration than you're willing to 
give it.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:29             ` Nick Piggin
  2007-04-17  5:53               ` Willy Tarreau
@ 2007-04-17  6:09               ` William Lee Irwin III
  2007-04-17  6:15                 ` Nick Piggin
  2007-04-17  6:23               ` Peter Williams
  2 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  6:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
>> I myself was thinking of this as the chance for a much needed 
>> simplification of the scheduling code and if this can be done with the 
>> result being "reasonable" it then gives us the basis on which to propose 
>> improvements based on the ideas of others such as you mention.
>> As the size of the cpusched indicates, trying to evaluate alternative 
>> proposals based on the current O(1) scheduler is fraught.  Hopefully, 

On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> I don't know why. The problem is that you can't really evaluate good
> proposals by looking at the code (you can say that one is bad, ie. the
> current one, which has a huge amount of temporal complexity and is
> explicitly unfair), but it is pretty hard to say one behaves well.
> And my scheduler for example cuts down the amount of policy code and
> code size significantly. I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?

All things are not equal; they all have different properties. I like
getting rid of the queue-swapping artifacts as ebs and cfs have done,
as the artifacts introduced there are nasty IMNSHO. I'm queueing up
a demonstration of epoch expiry scheduling artifacts as a testcase,
albeit one with no pass/fail semantics for its results, just detecting
scheduler properties.

That said, inequality/inequivalence is not a superiority/inferiority
ranking per se. What needs to come out of these discussions is a set
of standards which a candidate for mainline must pass to be considered
correct and a set of performance metrics by which to rank them. Video
game framerates and some sort of way to automate window wiggle tests
sound like good ideas, but automating such is beyond my userspace
programming abilities. An organization able to devote manpower to
devising such testcases will likely have to get involved for them to
happen, I suspect.

On a random note, limitations on kernel address space make O(lg(n))
effectively O(1), albeit with large upper bounds on the worst case
and an expected case much faster than the worst case.
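
As a rough illustration of that bound (a sketch only; the function
name is made up, and a perfectly balanced binary tree is assumed
rather than any particular kernel data structure), the number of
levels a lookup has to descend grows very slowly with the task count:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of levels in a perfectly balanced
 * binary tree holding n nodes, i.e. floor(log2(n)) + 1.  This is the
 * worst-case number of nodes visited by one lookup.
 */
static unsigned int balanced_tree_levels(uint64_t n)
{
    unsigned int levels = 0;

    assert(n > 0);  /* an empty tree has nothing to search */
    while (n > 0) {
        levels++;
        n >>= 1;
    }
    return levels;
}
```

So if n can never exceed 2^32 - 1, no lookup visits more than 32
levels, and a million tasks need only 20.  A red-black tree is not
perfectly balanced and may be up to roughly twice as deep, which is
still a fixed bound.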


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  5:53               ` Willy Tarreau
@ 2007-04-17  6:10                 ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  6:10 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote:
> Hi Nick,
> 
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> (...)
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> > 
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
> 
> Of course, if this is the case, the question will be raised. But as a
> general rule, I don't see much potential in O(1) to finely tune scheduling
> according to several criteria.

What do you mean? By what criteria?

> In O(logN), you can adjust scheduling in
> realtime at a very low cost. Better processing of varying priorities or
> fork() comes to mind.

The main problem as I see it is choosing which task to run next and
how much time to run it for. And given that there are typically far fewer
than 58 (the number of priorities in nicksched) runnable tasks for a
desktop system, I don't find it at all constraining to quantize my "next
runnable" criteria onto that size of key.

Even if you do expect a huge number of runnable tasks, you would hope
for fewer interactive ones toward the higher end of the priority scale.
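
To show what quantizing onto a small fixed key space buys, here is a
minimal sketch of an O(1) "pick next" over 58 priority levels.  It is
not nicksched code: the struct and function names are hypothetical,
only per-level counts are kept instead of real task lists, and
__builtin_ctzll is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PRIO 58  /* number of priority levels, as cited above */

/* Hypothetical priority array: one bit and one counter per level. */
struct prio_array {
    uint64_t bitmap;           /* bit p set <=> level p is non-empty */
    unsigned int nr[NR_PRIO];  /* runnable tasks queued at each level */
};

/* Queue one task at the given level (0 is the highest priority). */
static void prio_enqueue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < NR_PRIO);
    a->nr[prio]++;
    a->bitmap |= (uint64_t)1 << prio;
}

/* Remove one task from the given level; clear its bit when empty. */
static void prio_dequeue(struct prio_array *a, int prio)
{
    assert(prio >= 0 && prio < NR_PRIO);
    assert(a->nr[prio] > 0);
    if (--a->nr[prio] == 0)
        a->bitmap &= ~((uint64_t)1 << prio);
}

/*
 * Return the highest-priority non-empty level, or -1 if nothing is
 * runnable.  The cost is one find-first-set on a 64-bit word, so it
 * does not depend on how many tasks are queued.
 */
static int prio_pick_next(const struct prio_array *a)
{
    if (a->bitmap == 0)
        return -1;
    return __builtin_ctzll(a->bitmap);
}
```

The trade-off is the one being argued here: the lookup is constant
time, but every task has to be squeezed onto one of 58 keys, whereas
a sorted tree keeps a finer-grained ordering at O(log N) per
operation.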

Handwaving or even detailed design descriptions are simply not the best
way to decide on a new scheduler, is all I'm saying.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:03                             ` Peter Williams
@ 2007-04-17  6:14                               ` William Lee Irwin III
  2007-04-17  6:23                               ` Nick Piggin
  2007-04-17  9:36                               ` Ingo Molnar
  2 siblings, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  6:14 UTC (permalink / raw)
  To: Peter Williams
  Cc: Nick Piggin, Michael K. Edwards, Ingo Molnar, Matt Mackall,
	Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly to 
> their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
> better way of achieving that end.  I may (of course) be wrong but I 
> think that the idea deserves more consideration than you're willing to 
> give it.

This may be a good one to ask Ingo about, as he did significant
performance work on per-core runqueues for SMT. While I did write
per-node runqueue code for NUMA at some point in the past, I did no
tuning or other performance work on it, only functionality.

I've actually dealt with kernels using elder versions of Ingo's code
for per-core runqueues on SMT, but was never called upon to examine
that particular code for either performance or stability, so I'm
largely ignorant of what the perceived outcome of it was.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:09               ` William Lee Irwin III
@ 2007-04-17  6:15                 ` Nick Piggin
  2007-04-17  6:26                   ` William Lee Irwin III
  2007-04-17  6:50                   ` Davide Libenzi
  0 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  6:15 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
> >> I myself was thinking of this as the chance for a much needed 
> >> simplification of the scheduling code and if this can be done with the 
> >> result being "reasonable" it then gives us the basis on which to propose 
> >> improvements based on the ideas of others such as you mention.
> >> As the size of the cpusched indicates, trying to evaluate alternative 
> >> proposals based on the current O(1) scheduler is fraught.  Hopefully, 
> 
> On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote:
> > I don't know why. The problem is that you can't really evaluate good
> > proposals by looking at the code (you can say that one is bad, ie. the
> > current one, which has a huge amount of temporal complexity and is
> > explicitly unfair), but it is pretty hard to say one behaves well.
> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly. I haven't looked at Con's ones for a while,
> > but I believe they are also much more straightforward than mainline...
> > For example, let's say all else is equal between them, then why would
> > we go with the O(logN) implementation rather than the O(1)?
> 
> All things are not equal; they all have different properties. I like

Exactly. So we have to explore those properties and evaluate performance
(in all meanings of the word). That's only logical.


> On a random note, limitations on kernel address space make O(lg(n))
> effectively O(1), albeit with large upper bounds on the worst case
> and an expected case much faster than the worst case.

Yeah. O(n!) is also O(1) if you can put an upper bound on n ;)
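
To put rough numbers on the "bounded n" remark, here is a minimal sketch.
It assumes a balanced binary tree keyed per task and uses the standard
red-black tree height bound of 2*log2(n+1); the function names and any
task cap you feed in are illustrative assumptions, not figures from this
thread.

```c
#include <stdint.h>

/* Smallest d such that 2^d >= n, i.e. ceil(log2(n)) for n >= 1. */
static unsigned int ceil_log2(uint64_t n)
{
	unsigned int d = 0;

	while (d < 64 && ((uint64_t)1 << d) < n)
		d++;
	return d;
}

/*
 * Upper bound on the height of a red-black tree holding n nodes:
 * height <= 2 * log2(n + 1).  The caller supplies n; any cap on the
 * number of tasks is an assumption of the caller, not of this code.
 */
static unsigned int rbtree_height_bound(uint64_t n)
{
	return 2 * ceil_log2(n + 1);
}
```

With an assumed cap of 2^22 tasks the bound comes out at 46 levels: a
constant, but a much larger one than the expected depth for the handful
of runnable tasks a typical runqueue holds.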



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:03                             ` Peter Williams
  2007-04-17  6:14                               ` William Lee Irwin III
@ 2007-04-17  6:23                               ` Nick Piggin
  2007-04-17  9:36                               ` Ingo Molnar
  2 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  6:23 UTC (permalink / raw)
  To: Peter Williams
  Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar,
	Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >
> >But you add extra code for that on top of what we have, and are also
> >prevented from making per-cpu assumptions.
> >
> >And you can get N CPUs per runqueue behaviour by having them in a domain
> >with no restrictions on idle balancing. So where does your increased
> >flexibility come from?
> >
> >>One advantage of allowing multiple CPUs per run queue would be at the 
> >>smaller end of the system scale i.e. a PC with a single hyper threading 
> >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all 
> >>if both CPUs used the one runqueue and all the nasty side effects that 
> >>come with hyper threading would be minimized at the same time.
> >
> >I don't know about that -- the current load balancer already minimises
> >the nasty multi threading effects. SMT is very important for IBM's chips
> >for example, and they've never had any problem with that side of it
> >since it was introduced and bugs ironed out (at least, none that I've
> >heard).
> >
> 
> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly to 

I agree that some of that has exploded in complexity. I have some
thoughts about better approaches for some of those things, but
I've basically been stuck working on VM problems for a while.


> their systems.  What I'm suggesting is that an N CPUs per runqueue is a 
> better way of achieving that end.  I may (of course) be wrong but I 
> think that the idea deserves more consideration than you're willing to 
> give it.

Put it this way: it is trivial to group the load balancing stats
of N CPUs with their own runqueues. Just put them under a domain
and take the sum. The domain essentially takes on the same function
as a single queue with N CPUs under it. Anything _further_ you can
do with individual runqueues (like naturally adding an affinity
pressure ranging from nothing to absolute) are things that you
don't trivially get with a 1:N approach. AFAIKS.
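
A minimal sketch of the "put them under a domain and take the sum" idea
follows. The struct and function names are hypothetical and are not the
kernel's struct rq or struct sched_domain; the point is only that the
aggregate view falls out of per-CPU statistics with a simple loop.

```c
#include <stddef.h>

/* Hypothetical per-CPU runqueue statistics (not the kernel's struct rq). */
struct cpu_rq_stats {
	unsigned long nr_running;	/* runnable tasks on this CPU */
	unsigned long load;		/* weighted load on this CPU */
};

/* Hypothetical domain: a group of CPUs that is balanced as one unit. */
struct domain {
	const struct cpu_rq_stats *cpus;
	size_t nr_cpus;
};

/* Sum the per-CPU statistics to get the view a shared queue would give. */
static struct cpu_rq_stats domain_total(const struct domain *d)
{
	struct cpu_rq_stats sum = { 0, 0 };
	size_t i;

	for (i = 0; i < d->nr_cpus; i++) {
		sum.nr_running += d->cpus[i].nr_running;
		sum.load += d->cpus[i].load;
	}
	return sum;
}
```

The per-CPU entries remain available after summing, which is where any
further per-runqueue policy (such as affinity pressure) would hook in.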

So I will definitely give any idea consideration, but I just need to
be shown where the benefit comes from.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:29             ` Nick Piggin
  2007-04-17  5:53               ` Willy Tarreau
  2007-04-17  6:09               ` William Lee Irwin III
@ 2007-04-17  6:23               ` Peter Williams
  2007-04-17  6:44                 ` Nick Piggin
  2007-04-17  8:44                 ` Ingo Molnar
  2 siblings, 2 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-17  6:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>>>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote:
>>>>> Mike Galbraith wrote:
>>>>>> Demystify what?   The casual observer need only read either your attempt
>>>>>> at writing a scheduler, or my attempts at fixing the one we have, to see
>>>>>> that it was high time for someone with the necessary skills to step in.
>>>>> Make that "someone with the necessary clout".
>>>> No, I was brutally honest to both of us, but quite correct.
>>>>
>>>>>> Now progress can happen, which was _not_ happening before.
>>>>>>
>>>>> This is true.
>>>> Yup, and progress _is_ happening now, quite rapidly.
>>> Progress as in progress on Ingo's scheduler. I still don't know how we'd
>>> decide when to replace the mainline scheduler or with what.
>>>
>>> I don't think we can say Ingo's is better than the alternatives, can we?
>>> If there is some kind of bakeoff, then I'd like one of Con's designs to
>>> be involved, and mine, and Peter's...
>> I myself was thinking of this as the chance for a much needed 
>> simplification of the scheduling code and if this can be done with the 
>> result being "reasonable" it then gives us the basis on which to propose 
>> improvements based on the ideas of others such as you mention.
>>
>> As the size of the cpusched indicates, trying to evaluate alternative 
>> proposals based on the current O(1) scheduler is fraught.  Hopefully, 
> 
> I don't know why. The problem is that you can't really evaluate good
> proposals by looking at the code (you can say that one is bad, ie. the
> current one, which has a huge amount of temporal complexity and is
> explicitly unfair), but it is pretty hard to say one behaves well.

I meant that it's indicative of the amount of work that you have to do 
to implement a new scheduling discipline for evaluation.

> 
> And my scheduler for example cuts down the amount of policy code and
> code size significantly.

Yours is one of the smaller patches mainly because you perpetuate (or 
you did in the last one I looked at) the (horrible to my eyes) dual 
array (active/expired) mechanism.  That this idea was bad should have 
been apparent to all as soon as the decision was made to excuse some 
tasks from being moved from the active array to the expired array.  This 
essentially meant that there would be circumstances where extreme 
unfairness (to the extent of starvation in some cases) could occur -- the 
very things that the mechanism was originally designed to prevent (as far 
as I can gather).  Right about then in the development of the O(1) scheduler 
alternative solutions should have been sought.
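
For readers unfamiliar with the mechanism, here is a toy model of the
dual array scheme and of the "excusing" described above. It is a sketch
under stated assumptions: the arrays only count tasks, all names are
hypothetical, and the real scheduler's priority bitmaps and interactivity
heuristics are left out entirely.

```c
/* Toy model: each array just counts its queued tasks. */
struct toy_array {
	unsigned int nr_tasks;
};

struct toy_rq {
	struct toy_array arrays[2];
	struct toy_array *active;
	struct toy_array *expired;
};

static void toy_rq_init(struct toy_rq *rq, unsigned int nr_tasks)
{
	rq->arrays[0].nr_tasks = nr_tasks;
	rq->arrays[1].nr_tasks = 0;
	rq->active = &rq->arrays[0];
	rq->expired = &rq->arrays[1];
}

/*
 * The running task has used up its timeslice (the active array must be
 * non-empty).  Normally it moves to the expired array, and once the
 * active array drains the two arrays swap roles.  An "excused" task is
 * requeued on the active array instead, so the swap is postponed.
 * Returns 1 if the arrays were swapped, 0 otherwise.
 */
static int toy_slice_expired(struct toy_rq *rq, int excused)
{
	struct toy_array *tmp;

	if (excused)
		return 0;	/* stays active; expired tasks keep waiting */

	rq->active->nr_tasks--;
	rq->expired->nr_tasks++;
	if (rq->active->nr_tasks != 0)
		return 0;

	tmp = rq->active;
	rq->active = rq->expired;
	rq->expired = tmp;
	return 1;
}
```

In this model a task that is excused every time keeps the arrays from
ever swapping, so anything already on the expired array never runs again.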

Another hint that it was a bad idea was the need to transfer time slices 
between children and parents during fork() and exit().

This disregard for the dual array mechanism has prevented me from 
looking at the rest of your scheduler in any great detail so I can't 
comment on any other ideas that may be in there.

> I haven't looked at Con's ones for a while,
> but I believe they are also much more straightforward than mainline...

I like Con's scheduler (partly because it uses a single array) but 
mainly because it's nice and simple.  However, his earlier schedulers 
were prone to starvation (admittedly, only if you went out of your way 
to make it happen) and I tried to convince him to use the anti 
starvation mechanism in my SPA schedulers but was unsuccessful.  I 
haven't looked at his latest scheduler that sparked all this furore so 
can't comment on it.

> 
> For example, let's say all else is equal between them, then why would
> we go with the O(logN) implementation rather than the O(1)?

In the highly unlikely event that you can't separate them on technical 
grounds, Occam's razor recommends choosing the simplest solution. :-)

To digress, my main concern is that load balancing is being lumped in 
with this new change.  It's becoming "accept this big lump of new code 
or nothing".  I'd rather see a good fix to the intra runqueue/CPU 
scheduler problem implemented first and then if there really are any 
outstanding problems with the load balancer attack them later.  Them all 
being mixed up together gives me a nasty deja vu of impending disaster.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:14             ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
@ 2007-04-17  6:26               ` Peter Williams
  2007-04-17  9:51               ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-17  6:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Nick Piggin wrote:
> Well I know people have had woes with the scheduler for ever (I guess that
> isn't going to change :P). I think people generally lost a bit of interest
> in trying to improve the situation because of the upstream problem.

Yes.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:15                 ` Nick Piggin
@ 2007-04-17  6:26                   ` William Lee Irwin III
  2007-04-17  7:01                     ` Nick Piggin
  2007-04-17  6:50                   ` Davide Libenzi
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  6:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
>> All things are not equal; they all have different properties. I like

On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.

Any chance you'd be willing to put down a few thoughts on what sorts
of standards you'd like to set for both correctness (i.e. the bare
minimum a scheduler implementation must do to be considered valid
beyond not oopsing) and performance metrics (i.e. things that produce
numbers for each scheduler you can compare to say "this scheduler is
better than this other scheduler at this")?


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:23               ` Peter Williams
@ 2007-04-17  6:44                 ` Nick Piggin
  2007-04-17  7:48                   ` Peter Williams
  2007-04-17  8:44                 ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  6:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >And my scheduler for example cuts down the amount of policy code and
> >code size significantly.
> 
> Yours is one of the smaller patches mainly because you perpetuate (or 
> you did in the last one I looked at) the (horrible to my eyes) dual 
> array (active/expired) mechanism.

Actually, I wasn't comparing with other out of tree schedulers (but it
is good to know mine is among the smaller ones). I was comparing with
the mainline scheduler, which also has the dual arrays.


>  That this idea was bad should have 
> been apparent to all as soon as the decision was made to excuse some 
> tasks from being moved from the active array to the expired array.  This 

My patch doesn't implement any such excusing.


> essentially meant that there would be circumstances where extreme 
> unfairness (to the extent of starvation in some cases) could occur -- the 
> very things that the mechanism was originally designed to prevent (as far 
> as I can gather).  Right about then in the development of the O(1) scheduler 
> alternative solutions should have been sought.

Fairness has always been my first priority, and I consider it a bug
if it is possible for any process to get more CPU time than a CPU hog
over the long term. Or over another task doing the same thing, for
that matter.


> Another hint that it was a bad idea was the need to transfer time slices 
> between children and parents during fork() and exit().

I don't see how that has anything to do with dual arrays. If you put
a new child at the back of the queue, then your various interactive
shell commands that typically do a lot of dependent forking get slowed
right down behind your compile job. If you give a new child its own
timeslice irrespective of the parent, then you have things like 'make'
(which doesn't use a lot of CPU time) spawning off lots of high
priority children.
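
One common way to do "something" here is to split the parent's remaining
timeslice with the child, so that fork() hands out no new CPU time. The
sketch below illustrates that idea only; struct toy_task and the function
name are hypothetical, and this is not claimed to be what any of the
schedulers discussed in this thread do.

```c
/* Hypothetical task with only the field this sketch needs. */
struct toy_task {
	unsigned int time_slice;	/* remaining ticks in the current slice */
};

/*
 * Split the parent's remaining slice between parent and child.  The
 * child gets the rounded-up half, the parent keeps the rounded-down
 * half, so the total number of ticks is unchanged by the fork.
 */
static void fork_split_timeslice(struct toy_task *parent,
				 struct toy_task *child)
{
	child->time_slice = (parent->time_slice + 1) >> 1;
	parent->time_slice >>= 1;
}
```

Because the two halves always add up to the original slice, a task that
forks repeatedly (such as 'make') cannot multiply its share of the CPU.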

You need to do _something_ (Ingo's does). I don't see why this would
be tied with a dual array. FWIW, mine doesn't do anything on exit()
like most others, but it may need more tuning in this area.


> This disregard for the dual array mechanism has prevented me from 
> looking at the rest of your scheduler in any great detail so I can't 
> comment on any other ideas that may be in there.

Well I wasn't really asking you to review it. As I said, everyone
has their own idea of what a good design does, and review can't really
distinguish between the better of two reasonable designs.

A fair evaluation of the alternatives seems like a good idea though.
Nobody is actually against this, are they?


> >I haven't looked at Con's ones for a while,
> >but I believe they are also much more straightforward than mainline...
> 
> I like Con's scheduler (partly because it uses a single array) but 
> mainly because it's nice and simple.  However, his earlier schedulers 
> were prone to starvation (admittedly, only if you went out of your way 
> to make it happen) and I tried to convince him to use the anti 
> starvation mechanism in my SPA schedulers but was unsuccessful.  I 
> haven't looked at his latest scheduler that sparked all this furore so 
> can't comment on it.

I agree starvation or unfairness is unacceptable for a new scheduler.


> >For example, let's say all else is equal between them, then why would
> >we go with the O(logN) implementation rather than the O(1)?
> 
> In the highly unlikely event that you can't separate them on technical 
> grounds, Occam's razor recommends choosing the simplest solution. :-)

O(logN) vs O(1) is technical grounds.

But yeah, see my earlier comment: simplicity would be a factor too.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:15                 ` Nick Piggin
  2007-04-17  6:26                   ` William Lee Irwin III
@ 2007-04-17  6:50                   ` Davide Libenzi
  2007-04-17  7:09                     ` William Lee Irwin III
  2007-04-17  7:11                     ` Nick Piggin
  1 sibling, 2 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-17  6:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, Nick Piggin wrote:

> > All things are not equal; they all have different properties. I like
> 
> Exactly. So we have to explore those properties and evaluate performance
> (in all meanings of the word). That's only logical.

I had a quick look at Ingo's code yesterday. Ingo is always smart to 
prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
And even this code does that pretty nicely. The deadline design looks 
good, although I think the final "key" calculation code will end up quite 
different from what it looks now.
I would suggest to thoroughly test all your alternatives before deciding. 
Some code and design may look very good and small at the beginning, but 
when you start patching it to cover all the dark spots, you effectively 
end up with another thing (in both design and code footprint).
About O(1), I never thought it was a must (besides a good marketing 
material), and O(log(N)) *may* be just fine (to be verified, of course).



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:26                   ` William Lee Irwin III
@ 2007-04-17  7:01                     ` Nick Piggin
  2007-04-17  8:23                       ` William Lee Irwin III
  2007-04-17 21:39                       ` Matt Mackall
  0 siblings, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:01 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> >> All things are not equal; they all have different properties. I like
> 
> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
> 
> Any chance you'd be willing to put down a few thoughts on what sorts
> of standards you'd like to set for both correctness (i.e. the bare
> minimum a scheduler implementation must do to be considered valid
> beyond not oopsing) and performance metrics (i.e. things that produce
> numbers for each scheduler you can compare to say "this scheduler is
> better than this other scheduler at this")?

Yeah I guess that's the hard part :)

For correctness, I guess fairness is an easy one. I think that unfairness
is basically a bug and that it would be very unfortunate to merge something
unfair. But this is just within the context of a single runqueue... for
better or worse, we allow some unfairness in multiprocessors for performance
reasons of course.

Latency. Given N tasks in the system, an arbitrary task should get
onto the CPU in a bounded amount of time (excluding events like freak
IRQ holdoffs and such, obviously -- ie. just considering the context
of the scheduler's state machine).
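
As a worked example of such a bound, take the simplest possible case: a
strict round-robin over N runnable tasks with a fixed timeslice. That
policy is an assumption made for illustration, not one proposed in this
thread, and the function name is hypothetical.

```c
/*
 * Worst-case wait before a runnable task gets the CPU under strict
 * round-robin: every other task runs one full slice first.
 */
static unsigned long rr_worst_case_wait_ms(unsigned long nr_tasks,
					   unsigned long slice_ms)
{
	if (nr_tasks == 0)
		return 0;
	return (nr_tasks - 1) * slice_ms;
}
```

With 5 tasks and a 100 ms slice the bound is 400 ms. A real scheduler
would have a different formula, but the requirement is the same: the
bound must be a function of N and tunables only, never unbounded.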

I wouldn't like to see a significant drop in any micro or macro
benchmarks or even worse real workloads, but I could accept some if it
means having a fair scheduler by default.

Now it isn't actually too hard to achieve the above, I think. The hard bit
is trying to compare interactivity. Ideally, we'd be able to get scripted
dumps of login sessions, and measure scheduling latencies of key processes
(sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
problems with any scheduler, and we could compare all of them against it.
Wishful thinking!


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:50                   ` Davide Libenzi
@ 2007-04-17  7:09                     ` William Lee Irwin III
  2007-04-17  7:22                       ` Peter Williams
                                         ` (3 more replies)
  2007-04-17  7:11                     ` Nick Piggin
  1 sibling, 4 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  7:09 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline design looks 
> good, although I think the final "key" calculation code will end up quite 
> different from what it looks now.

The additive nice_offset breaks nice levels. A multiplicative priority
weighting of a different, nonnegative metric of cpu utilization from
what's now used is required for nice levels to work. I've been trying
to point this out politely by strongly suggesting testing whether nice
levels work.
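
The difference between additive and multiplicative treatment can be seen
in a toy model. This is one reading of the argument, sketched under
stated assumptions: two always-runnable tasks, a scheduler that always
runs the task with the smaller key for one tick, and hypothetical names
throughout. It does not reproduce the key calculation in the patch.

```c
#include <stdint.h>

struct toy_task {
	uint64_t runtime;	/* ticks of CPU received so far */
	uint64_t offset;	/* additive penalty (toy stand-in for a nice offset) */
	uint64_t weight;	/* multiplicative weight, larger means more CPU */
};

enum key_mode { KEY_ADDITIVE, KEY_MULTIPLICATIVE };

static uint64_t toy_key(const struct toy_task *t, enum key_mode mode)
{
	if (mode == KEY_ADDITIVE)
		return t->runtime + t->offset;
	/* Scale by 1024 to keep precision in the integer division. */
	return t->runtime * 1024 / t->weight;
}

/*
 * Run two always-runnable tasks for the given number of ticks, each
 * tick going to the task with the smaller key (ties go to task a).
 */
static void toy_run(struct toy_task *a, struct toy_task *b,
		    enum key_mode mode, uint64_t ticks)
{
	uint64_t i;

	for (i = 0; i < ticks; i++) {
		if (toy_key(a, mode) <= toy_key(b, mode))
			a->runtime++;
		else
			b->runtime++;
	}
}
```

In the additive mode the two runtimes can never drift apart by much more
than the offset, so over a long run both tasks converge on an equal
share no matter how large the offset is. In the multiplicative mode the
runtimes settle at the ratio of the weights, which is the behaviour nice
levels are expected to have.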


On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> I would suggest to thoroughly test all your alternatives before deciding. 
> Some code and design may look very good and small at the beginning, but 
> when you start patching it to cover all the dark spots, you effectively 
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing 
> material), and O(log(N)) *may* be just fine (to be verified, of course).

The trouble with thorough testing right now is that no one agrees on
what the tests should be and a number of the testcases are not in great
shape. An agreed-upon set of testcases for basic correctness should be
devised and the implementations of those testcases need to be
maintainable code and the tests set up for automated testing and
changing their parameters without recompiling via command-line options.

Once there's a standard regression test suite for correctness, one
needs to be devised for performance, including interactive performance.
The primary difficulty I see along these lines is finding a way to
automate tests of graphics and input device response performance. Others,
like how deterministically priorities are respected over progressively
smaller time intervals and noninteractive workload performance are
nowhere near as difficult to arrange and in many cases already exist.
Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:50                   ` Davide Libenzi
  2007-04-17  7:09                     ` William Lee Irwin III
@ 2007-04-17  7:11                     ` Nick Piggin
  2007-04-17  7:21                       ` Davide Libenzi
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:11 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, Nick Piggin wrote:
> 
> > > All things are not equal; they all have different properties. I like
> > 
> > Exactly. So we have to explore those properties and evaluate performance
> > (in all meanings of the word). That's only logical.
> 
> I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
> And even this code does that pretty nicely. The deadline design looks 
> good, although I think the final "key" calculation code will end up quite 
> different from what it looks now.
> I would suggest to thoroughly test all your alternatives before deciding. 
> Some code and design may look very good and small at the beginning, but 
> when you start patching it to cover all the dark spots, you effectively 
> end up with another thing (in both design and code footprint).
> About O(1), I never thought it was a must (besides a good marketing 
> material), and O(log(N)) *may* be just fine (to be verified, of course).

To be clear, I'm not saying O(logN) itself is a big problem. Type

  plot [10:100] x with lines, log(x) with lines, 1 with lines

into gnuplot. I was just trying to point out that we need to evaluate
things. Considering how long we've had this scheduler with its known
deficiencies, let's pick a new one wisely.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:11                     ` Nick Piggin
@ 2007-04-17  7:21                       ` Davide Libenzi
  0 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-17  7:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, Nick Piggin wrote:

> To be clear, I'm not saying O(logN) itself is a big problem. Type
> 
>   plot [10:100] x with lines, log(x) with lines, 1 with lines

Haha, Nick, I know what a log() looks like :)
The Time Ring I posted as an example (which is nothing other than a 
ring-based bucket sort) keeps O(1) if you can concede some timer 
clustering.
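
For illustration, a ring-based bucket sort of this kind can be sketched
as below. This is not the Time Ring code referred to above; the ring
size, the granularity and every name are assumptions chosen to show why
insertion is O(1) and where the clustering comes from.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS	64	/* assumed ring size */
#define RING_GRAIN	4	/* assumed ticks per slot: the "clustering" */

struct ring_entry {
	uint64_t deadline;
	struct ring_entry *next;
};

struct time_ring {
	struct ring_entry *slot[RING_SLOTS];
	uint64_t now;		/* current time in ticks */
};

static size_t ring_index(uint64_t t)
{
	return (size_t)((t / RING_GRAIN) % RING_SLOTS);
}

/*
 * O(1) insert: push onto the list of the slot covering the deadline.
 * Returns 0 on success, -1 if the deadline is in the past or further
 * ahead than the ring can represent.
 */
static int ring_insert(struct time_ring *r, struct ring_entry *e)
{
	size_t idx;

	if (e->deadline < r->now ||
	    e->deadline / RING_GRAIN - r->now / RING_GRAIN >= RING_SLOTS)
		return -1;

	idx = ring_index(e->deadline);
	e->next = r->slot[idx];
	r->slot[idx] = e;
	return 0;
}

/*
 * Pop an entry from the earliest non-empty slot.  At most RING_SLOTS
 * probes are made, a constant that does not depend on how many entries
 * are queued.  Entries sharing a slot come out in no particular
 * deadline order.
 */
static struct ring_entry *ring_pop_earliest(struct time_ring *r)
{
	size_t base = ring_index(r->now);
	size_t i;

	for (i = 0; i < RING_SLOTS; i++) {
		size_t idx = (base + i) % RING_SLOTS;
		struct ring_entry *e = r->slot[idx];

		if (e) {
			r->slot[idx] = e->next;
			return e;
		}
	}
	return NULL;
}
```

Two deadlines that fall within the same RING_GRAIN window land in the
same slot and are treated as equal, which is the price paid for the
constant-time operations.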


- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:09                     ` William Lee Irwin III
@ 2007-04-17  7:22                       ` Peter Williams
  2007-04-17  7:23                       ` Nick Piggin
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-17  7:22 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Davide Libenzi, Nick Piggin, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
>> I had a quick look at Ingo's code yesterday. Ingo is always smart to 
>> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;)
>> And even this code does that pretty nicely. The deadline design looks 
>> good, although I think the final "key" calculation code will end up quite 
>> different from what it looks now.
> 
> The additive nice_offset breaks nice levels. A multiplicative priority
> weighting of a different, nonnegative metric of cpu utilization from
> what's now used is required for nice levels to work. I've been trying
> to point this out politely by strongly suggesting testing whether nice
> levels work.
> 
> 
> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
>> I would suggest to thoroughly test all your alternatives before deciding. 
>> Some code and design may look very good and small at the beginning, but 
>> when you start patching it to cover all the dark spots, you effectively 
>> end up with another thing (in both design and code footprint).
>> About O(1), I never thought it was a must (besides a good marketing 
>> material), and O(log(N)) *may* be just fine (to be verified, of course).
> 
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
> 
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

At this point, I'd like to direct everyone's attention to the simloads package:

<http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz>

which contains a set of programs designed to be used in the construction 
of CPU scheduler tests.  Of particular use is the aspin program which 
can be used to launch tasks with specified sleep/wake characteristics.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:09                     ` William Lee Irwin III
  2007-04-17  7:22                       ` Peter Williams
@ 2007-04-17  7:23                       ` Nick Piggin
  2007-04-17  7:27                       ` Davide Libenzi
  2007-04-17  7:33                       ` Ingo Molnar
  3 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote:
> 
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
> 
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

Definitely. It would be really good if we could have interactivity
regression tests too (see my earlier wishful email). The problem
with a lot of the scripted interactivity tests I see is that they
don't really capture the complexities of the interactions within,
say, an interactive X session. Others just go straight for trying
to exploit the design by making lots of high priority processes
runnable at once. This just provides an unrealistic decoy and you
end up trying to tune for the wrong thing.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:09                     ` William Lee Irwin III
  2007-04-17  7:22                       ` Peter Williams
  2007-04-17  7:23                       ` Nick Piggin
@ 2007-04-17  7:27                       ` Davide Libenzi
  2007-04-17  7:33                         ` Nick Piggin
  2007-04-17  7:33                       ` Ingo Molnar
  3 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-17  7:27 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, William Lee Irwin III wrote:

> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I would suggest to thoroughly test all your alternatives before deciding. 
> > Some code and design may look very good and small at the beginning, but 
> > when you start patching it to cover all the dark spots, you effectively 
> > end up with another thing (in both design and code footprint).
> > About O(1), I never thought it was a must (besides a good marketing 
> > material), and O(log(N)) *may* be just fine (to be verified, of course).
> 
> The trouble with thorough testing right now is that no one agrees on
> what the tests should be and a number of the testcases are not in great
> shape. An agreed-upon set of testcases for basic correctness should be
> devised and the implementations of those testcases need to be
> maintainable code and the tests set up for automated testing and
> changing their parameters without recompiling via command-line options.
> 
> Once there's a standard regression test suite for correctness, one
> needs to be devised for performance, including interactive performance.
> The primary difficulty I see along these lines is finding a way to
> automate tests of graphics and input device response performance. Others,
> like how deterministically priorities are respected over progressively
> smaller time intervals and noninteractive workload performance are
> nowhere near as difficult to arrange and in many cases already exist.
> Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.

What I meant was, that the rules (requirements and associated test cases) 
for this new Scheduler Amazing Race should be set forward, and not kept a 
moving target to fit&follow one or the other implementation.


- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:09                     ` William Lee Irwin III
                                         ` (2 preceding siblings ...)
  2007-04-17  7:27                       ` Davide Libenzi
@ 2007-04-17  7:33                       ` Ingo Molnar
  2007-04-17  7:40                         ` Nick Piggin
  2007-04-17  9:05                         ` William Lee Irwin III
  3 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  7:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> > prepare a main dish (feature) with a nice sider (code cleanup) to 
> > Linus ;) And even this code does that pretty nicely. The deadline 
> > designs looks good, although I think the final "key" calculation 
> > code will end up quite different from what it looks now.
> 
> The additive nice_offset breaks nice levels. A multiplicative priority 
> weighting of a different, nonnegative metric of cpu utilization from 
> what's now used is required for nice levels to work. I've been trying 
> to point this out politely by strongly suggesting testing whether nice 
> levels work.
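
The additive-versus-multiplicative distinction in the quoted paragraph
can be seen in a toy model. This is a sketch only: the key functions,
OFFSET and WEIGHT values below are invented for illustration and are
not taken from the CFS patch or from any real key calculation.

```python
def simulate(key, ticks=10_000):
    """Two always-runnable tasks; each tick goes to the smallest key."""
    runtime = [0, 0]
    for _ in range(ticks):
        nxt = min((0, 1), key=lambda t: key(t, runtime[t]))
        runtime[nxt] += 1
    return runtime


OFFSET = [0, 100]   # hypothetical additive penalty for the "nicer" task 1
WEIGHT = [4, 1]     # hypothetical multiplicative weights (task 0 favoured)

# Additive: key = runtime + constant. The favoured task gets a one-off
# head start of about OFFSET ticks, after which both tasks alternate.
additive = simulate(lambda t, rt: rt + OFFSET[t])

# Multiplicative: key = runtime / weight. The favoured task keeps
# receiving a share proportional to its weight for as long as it runs.
multiplicative = simulate(lambda t, rt: rt / WEIGHT[t])

if __name__ == "__main__":
    print("additive      :", additive)
    print("multiplicative:", multiplicative)
```

In this model the additive offset only buys a bounded lead, so the
long-run CPU split tends towards 50/50, while the multiplicative
weighting holds a split close to the 4:1 weight ratio. Whether that
matters for the testcases users care about is a separate question.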

granted, CFS's nice code is still incomplete, but you err quite 
significantly with this extreme statement that they are "broken".

nice levels certainly work to a fair degree even in the current code and 
much of the focus is elsewhere - just try it. (In fact i claim that 
CFS's nice levels often work _better_ than the mainline scheduler's nice 
level support, for the testcases that matter to users.)

The precise behavior of nice levels, as i pointed it out in previous 
mails, is largely 'uninteresting' and it has changed multiple times in 
the past 10 years.

What matters to users is mainly: whether X reniced to -10 does get 
enough CPU time and whether stuff reniced to +19 doesn't take away too 
much CPU time from the rest of the system. _How_ a Linux scheduler 
achieves this is an internal matter and certainly CFS does it in a hacky 
way at the moment.

All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
could come up with is just a fancy academic technicality that has no 
real significance to any of the testers who are trying CFS right now.

Sure we prefer final solutions that are clean and make sense (because 
such things are the easiest to maintain long-term), and often such final 
solutions are quite close to academic concepts, and i think Davide 
correctly observed this by saying that "the final key calculation code 
will end up quite different from what it looks now", but your 
extreme-end claim of 'breakage' for something that is just plain 
incomplete is not really a fair characterisation at this point.

Anyone who thinks that there exist only two kinds of code: 100% correct 
and 100% incorrect with no shades of grey in between is in reality a sort 
of an extremist: whom, depending on mood and affection, we could call 
either a 'coding purist' or a 'coding taliban' ;-)

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:27                       ` Davide Libenzi
@ 2007-04-17  7:33                         ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:33 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, William Lee Irwin III wrote:
> 
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I would suggest to thoroughly test all your alternatives before deciding. 
> > > Some code and design may look very good and small at the beginning, but 
> > > when you start patching it to cover all the dark spots, you effectively 
> > > end up with another thing (in both design and code footprint).
> > > About O(1), I never thought it was a must (besides a good marketing 
> > > material), and O(log(N)) *may* be just fine (to be verified, of course).
> > 
> > The trouble with thorough testing right now is that no one agrees on
> > what the tests should be and a number of the testcases are not in great
> > shape. An agreed-upon set of testcases for basic correctness should be
> > devised and the implementations of those testcases need to be
> > maintainable code and the tests set up for automated testing and
> > changing their parameters without recompiling via command-line options.
> > 
> > Once there's a standard regression test suite for correctness, one
> > needs to be devised for performance, including interactive performance.
> > The primary difficulty I see along these lines is finding a way to
> > automate tests of graphics and input device response performance. Others,
> > like how deterministically priorities are respected over progressively
> > smaller time intervals and noninteractive workload performance are
> > nowhere near as difficult to arrange and in many cases already exist.
> > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al.
> 
> What I meant was, that the rules (requirements and associated test cases) 
> for this new Scheduler Amazing Race should be set forward, and not kept a 
> moving target to fit&follow one or the other implementation.

Exactly. Well I don't mind if it is a moving target as such, just as
long as the decisions are rational (no "blah is more important
because I say so").

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:33                       ` Ingo Molnar
@ 2007-04-17  7:40                         ` Nick Piggin
  2007-04-17  7:58                           ` Ingo Molnar
  2007-04-17  9:05                         ` William Lee Irwin III
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, Davide Libenzi, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> 
> * William Lee Irwin III <wli@holomorphy.com> wrote:
> 
> > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote:
> > > I had a quick look at Ingo's code yesterday. Ingo is always smart to 
> > > prepare a main dish (feature) with a nice sider (code cleanup) to 
> > > Linus ;) And even this code does that pretty nicely. The deadline 
> > > designs looks good, although I think the final "key" calculation 
> > > code will end up quite different from what it looks now.
> > 
> > The additive nice_offset breaks nice levels. A multiplicative priority 
> > weighting of a different, nonnegative metric of cpu utilization from 
> > what's now used is required for nice levels to work. I've been trying 
> > to point this out politely by strongly suggesting testing whether nice 
> > levels work.
> 
> granted, CFS's nice code is still incomplete, but you err quite 
> significantly with this extreme statement that they are "broken".
> 
> nice levels certainly work to a fair degree even in the current code and 
> much of the focus is elsewhere - just try it. (In fact i claim that 
> CFS's nice levels often work _better_ than the mainline scheduler's nice 
> level support, for the testcases that matter to users.)
> 
> The precise behavior of nice levels, as i pointed it out in previous 
> mails, is largely 'uninteresting' and it has changed multiple times in 
> the past 10 years.
> 
> What matters to users is mainly: whether X reniced to -10 does get 
> enough CPU time and whether stuff reniced to +19 doesnt take away too 
> much CPU time from the rest of the system.

I agree there.


> _How_ a Linux scheduler 
> achieves this is an internal matter and certainly CFS does it in a hacky 
> way at the moment.
> 
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
> could come up with is just a fancy academic technicality that has no 
> real significance to any of the testers who are trying CFS right now.
> 
> Sure we prefer final solutions that are clean and make sense (because 
> such things are the easiest to maintain long-term), and often such final 
> solutions are quite close to academic concepts, and i think Davide 
> correctly observed this by saying that "the final key calculation code 
> will end up quite different from what it looks now", but your 
> extreme-end claim of 'breakage' for something that is just plain 
> incomplete is not really a fair characterisation at this point.
> 
> Anyone who thinks that there exists only two kinds of code: 100% correct 
> and 100% incorrect with no shades of grey inbetween is in reality a sort 
> of an extremist: whom, depending on mood and affection, we could call 
> either a 'coding purist' or a 'coding taliban' ;-)

Only if you are an extremist-naming extremist with no shades of grey.
Others, like myself, also include 'coding al-qaeda' and 'coding john
howard' in that scale.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:44                 ` Nick Piggin
@ 2007-04-17  7:48                   ` Peter Williams
  2007-04-17  7:56                     ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  7:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>> And my scheduler for example cuts down the amount of policy code and
>>> code size significantly.
>> Yours is one of the smaller patches mainly because you perpetuate (or 
>> you did in the last one I looked at) the (horrible to my eyes) dual 
>> array (active/expired) mechanism.
> 
> Actually, I wasn't comparing with other out of tree schedulers (but it
> is good to know mine is among the smaller ones). I was comparing with
> the mainline scheduler, which also has the dual arrays.
> 
> 
>>  That this idea was bad should have 
>> been apparent to all as soon as the decision was made to excuse some 
>> tasks from being moved from the active array to the expired array.  This 
> 
> My patch doesn't implement any such excusing.
> 
> 
>> essentially meant that there would be circumstances where extreme 
>> unfairness (to the extent of starvation in some cases) -- the very 
>> things that the mechanism was originally designed to ensure (as far as I 
>> can gather).  Right about then in the development of the O(1) scheduler 
>> alternative solutions should have been sought.
> 
> Fairness has always been my first priority, and I consider it a bug
> if it is possible for any process to get more CPU time than a CPU hog
> over the long term. Or over another task doing the same thing, for
> that matter.
> 
> 
>> Other hints that it was a bad idea was the need to transfer time slices 
>> between children and parents during fork() and exit().
> 
> I don't see how that has anything to do with dual arrays.

It's totally to do with the dual arrays.  The only real purpose of the 
time slice in O(1) (regardless of what its perceived purpose was) was to 
control the switching between the arrays.

> If you put
> a new child at the back of the queue, then your various interactive
> shell commands that typically do a lot of dependant forking get slowed
> right down behind your compile job. If you give a new child its own
> timeslice irrespective of the parent, then you have things like 'make'
> (which doesn't use a lot of CPU time) spawning off lots of high
> priority children.

This is an artefact of trying to control nice using time slices while 
using them for controlling array switching and whatever else they were 
being used for.  Priority (static and dynamic) is the best way to 
implement nice.

> 
> You need to do _something_ (Ingo's does). I don't see why this would
> be tied with a dual array. FWIW, mine doesn't do anything on exit()
> like most others, but it may need more tuning in this area.
> 
> 
>> This disregard for the dual array mechanism has prevented me from 
>> looking at the rest of your scheduler in any great detail so I can't 
>> comment on any other ideas that may be in there.
> 
> Well I wasn't really asking you to review it. As I said, everyone
> has their own idea of what a good design does, and review can't really
> distinguish between the better of two reasonable designs.
> 
> A fair evaluation of the alternatives seems like a good idea though.
> Nobody is actually against this, are they?

No.  It would be nice if the basic ideas that each scheduler tries to 
implement could be extracted and explained though.  This could lead to a 
melding of ideas that leads to something quite good.

> 
> 
>>> I haven't looked at Con's ones for a while,
>>> but I believe they are also much more straightforward than mainline...
>> I like Con's scheduler (partly because it uses a single array) but 
>> mainly because it's nice and simple.  However, his earlier schedulers 
>> were prone to starvation (admittedly, only if you went out of your way 
>> to make it happen) and I tried to convince him to use the anti 
>> starvation mechanism in my SPA schedulers but was unsuccessful.  I 
>> haven't looked at his latest scheduler that sparked all this furore so 
>> can't comment on it.
> 
> I agree starvation or unfairness is unacceptable for a new scheduler.
> 
> 
>>> For example, let's say all else is equal between them, then why would
>>> we go with the O(logN) implementation rather than the O(1)?
>> In the highly unlikely event that you can't separate them on technical 
>> grounds, Occam's razor recommends choosing the simplest solution. :-)
> 
> O(logN) vs O(1) is technical grounds.

In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
greater than O(logN)'s k factor multiplied by logMaxN.
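
Spelled out as arithmetic, the condition looks like this. A sketch
only: the function name is made up, the cost units are arbitrary, and
taking the logarithm to base 2 (as for a balanced binary tree) is an
assumption, since the base is not stated above.

```python
import math


def o1_wins(k_o1, k_log, max_n):
    """True if a constant-time cost k_o1 is no worse than an
    O(log N) cost of k_log * log2(N) at the largest expected N."""
    return k_o1 <= k_log * math.log2(max_n)


if __name__ == "__main__":
    # With at most 1024 runnable tasks, log2(1024) = 10, so the O(1)
    # design may cost up to 10x the per-level cost and still break even.
    print(o1_wins(k_o1=10.0, k_log=1.0, max_n=1024))
    print(o1_wins(k_o1=11.0, k_log=1.0, max_n=1024))
```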

> 
> But yeah, see my earlier comment: simplicity would be a factor too.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:48                   ` Peter Williams
@ 2007-04-17  7:56                     ` Nick Piggin
  2007-04-17 13:16                       ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  7:56 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >>Other hints that it was a bad idea was the need to transfer time slices 
> >>between children and parents during fork() and exit().
> >
> >I don't see how that has anything to do with dual arrays.
> 
> It's totally to do with the dual arrays.  The only real purpose of the 
> time slice in O(1) (regardless of what its perceived purpose was) was to 
> control the switching between the arrays.

The O(1) design is pretty convoluted in that regard. In my scheduler,
the only purpose of the arrays is to renew time slices.
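
For readers not familiar with the active/expired idea, here is a
deliberately simplified sketch of arrays used purely to renew time
slices. It ignores priority levels and any interactivity exceptions,
and it is not code from mainline or from any of the schedulers under
discussion; all names are made up.

```python
from collections import deque


def run_dual_array(tasks, slice_ticks, total_ticks):
    """Return the order in which tasks run, one entry per tick."""
    active, expired = deque(tasks), deque()
    remaining = {t: slice_ticks for t in tasks}
    trace = []
    for _ in range(total_ticks):
        if not active:
            # Array switch: everyone has used up a slice, start over.
            active, expired = expired, active
        task = active[0]
        trace.append(task)
        remaining[task] -= 1
        if remaining[task] == 0:
            # Slice exhausted: renew it and park the task on 'expired'.
            active.popleft()
            remaining[task] = slice_ticks
            expired.append(task)
    return trace


if __name__ == "__main__":
    print("".join(run_dual_array(["A", "B"], slice_ticks=2, total_ticks=8)))
```

No task can run a second slice before every other runnable task has
had its first, which is where the fairness property comes from.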

The fork/exit logic is added to make interactivity better. Ingo's
scheduler has similar logic.


> >If you put
> >a new child at the back of the queue, then your various interactive
> >shell commands that typically do a lot of dependant forking get slowed
> >right down behind your compile job. If you give a new child its own
> >timeslice irrespective of the parent, then you have things like 'make'
> >(which doesn't use a lot of CPU time) spawning off lots of high
> >priority children.
> 
> This is an artefact of trying to control nice using time slices while 
> using them for controlling array switching and whatever else they were 
> being used for.  Priority (static and dynamic) is the best way to 
> implement nice.

I don't like the timeslice based nice in mainline. It's too nasty
with latencies. nicksched is far better in that regard IMO.

But I don't know how you can assert a particular way is the best way
to do something.


> >You need to do _something_ (Ingo's does). I don't see why this would
> >be tied with a dual array. FWIW, mine doesn't do anything on exit()
> >like most others, but it may need more tuning in this area.
> >
> >
> >>This disregard for the dual array mechanism has prevented me from 
> >>looking at the rest of your scheduler in any great detail so I can't 
> >>comment on any other ideas that may be in there.
> >
> >Well I wasn't really asking you to review it. As I said, everyone
> >has their own idea of what a good design does, and review can't really
> >distinguish between the better of two reasonable designs.
> >
> >A fair evaluation of the alternatives seems like a good idea though.
> >Nobody is actually against this, are they?
> 
> No.  It would be nice if the basic ideas that each scheduler tries to 
> implement could be extracted and explained though.  This could lead to a 
> melding of ideas that leads to something quite good.
> 
> >
> >
> >>>I haven't looked at Con's ones for a while,
> >>>but I believe they are also much more straightforward than mainline...
> >>I like Con's scheduler (partly because it uses a single array) but 
> >>mainly because it's nice and simple.  However, his earlier schedulers 
> >>were prone to starvation (admittedly, only if you went out of your way 
> >>to make it happen) and I tried to convince him to use the anti 
> >>starvation mechanism in my SPA schedulers but was unsuccessful.  I 
> >>haven't looked at his latest scheduler that sparked all this furore so 
> >>can't comment on it.
> >
> >I agree starvation or unfairness is unacceptable for a new scheduler.
> >
> >
> >>>For example, let's say all else is equal between them, then why would
> >>>we go with the O(logN) implementation rather than the O(1)?
> >>In the highly unlikely event that you can't separate them on technical 
> >>grounds, Occam's razor recommends choosing the simplest solution. :-)
> >
> >O(logN) vs O(1) is technical grounds.
> 
> In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
> greater than O(logN)'s k factor multiplied by logMaxN.

Yes, or even significantly greater around typical large sizes of N.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (12 preceding siblings ...)
  2007-04-16 22:00 ` Andi Kleen
@ 2007-04-17  7:56 ` Andy Whitcroft
  2007-04-17  9:32   ` Nick Piggin
  2007-04-18 10:22   ` Ingo Molnar
  2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
  14 siblings, 2 replies; 712+ messages in thread
From: Andy Whitcroft @ 2007-04-17  7:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
> 
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
> 
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.
> 
> [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
>   new scheduler will be active by default and all tasks will default
>   to the new SCHED_FAIR interactive scheduling class. ]
> 
> Highlights are:
> 
>  - the introduction of Scheduling Classes: an extensible hierarchy of
>    scheduler modules. These modules encapsulate scheduling policy
>    details and are handled by the scheduler core without the core
>    code assuming about them too much.
> 
>  - sched_fair.c implements the 'CFS desktop scheduler': it is a
>    replacement for the vanilla scheduler's SCHED_OTHER interactivity
>    code.
> 
>    i'd like to give credit to Con Kolivas for the general approach here:
>    he has proven via RSDL/SD that 'fair scheduling' is possible and that
>    it results in better desktop scheduling. Kudos Con!
> 
>    The CFS patch uses a completely different approach and implementation
>    from RSDL/SD. My goal was to make CFS's interactivity quality exceed
>    that of RSDL/SD, which is a high standard to meet :-) Testing
>    feedback is welcome to decide this one way or another. [ and, in any
>    case, all of SD's logic could be added via a kernel/sched_sd.c module
>    as well, if Con is interested in such an approach. ]
> 
>    CFS's design is quite radical: it does not use runqueues, it uses a
>    time-ordered rbtree to build a 'timeline' of future task execution,
>    and thus has no 'array switch' artifacts (by which both the vanilla
>    scheduler and RSDL/SD are affected).
> 
>    CFS uses nanosecond granularity accounting and does not rely on any
>    jiffies or other HZ detail. Thus the CFS scheduler has no notion of
>    'timeslices' and has no heuristics whatsoever. There is only one
>    central tunable:
> 
>          /proc/sys/kernel/sched_granularity_ns
> 
>    which can be used to tune the scheduler from 'desktop' (low
>    latencies) to 'server' (good batching) workloads. It defaults to a
>    setting suitable for desktop workloads. SCHED_BATCH is handled by the
>    CFS scheduler module too.
> 
>    due to its design, the CFS scheduler is not prone to any of the
>    'attacks' that exist today against the heuristics of the stock
>    scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
>    work fine and do not impact interactivity and produce the expected
>    behavior.
> 
>    the CFS scheduler has a much stronger handling of nice levels and
>    SCHED_BATCH: both types of workloads should be isolated much more
>    agressively than under the vanilla scheduler.
> 
>    ( another rdetail: due to nanosec accounting and timeline sorting,
>      sched_yield() support is very simple under CFS, and in fact under
>      CFS sched_yield() behaves much better than under any other
>      scheduler i have tested so far. )
> 
>  - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
>    way than the vanilla scheduler does. It uses 100 runqueues (for all
>    100 RT priority levels, instead of 140 in the vanilla scheduler)
>    and it needs no expired array.
> 
>  - reworked/sanitized SMP load-balancing: the runqueue-walking
>    assumptions are gone from the load-balancing code now, and
>    iterators of the scheduling modules are used. The balancing code got
>    quite a bit simpler as a result.
> 
> the core scheduler got smaller by more than 700 lines:
> 
>  kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
>  1 file changed, 372 insertions(+), 1082 deletions(-)
> 
> and even adding all the scheduling modules, the total size impact is
> relatively small:
> 
>  18 files changed, 1454 insertions(+), 1133 deletions(-)
> 
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
> 
>    text    data     bss     dec     hex filename
>   23366    4001      24   27391    6aff kernel/sched.o.vanilla
>   24159    2705      56   26920    6928 kernel/sched.o.CFS
> 
> (this is mainly due to the benefit of getting rid of the expired array
> and its data structure overhead.)
> 
> thanks go to Thomas Gleixner and Arjan van de Ven for review of this
> patchset.
> 
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,

Pushed this through test.kernel.org and nothing new blew up.
Notably the kernbench figures are within expectations even on the bigger
NUMA systems, which are commonly badly affected by balancing problems in
the scheduler.

I see there is a second one out, I'll push that one through too.

-apw

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:40                         ` Nick Piggin
@ 2007-04-17  7:58                           ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  7:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Davide Libenzi, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner


* Nick Piggin <npiggin@suse.de> wrote:

> > Anyone who thinks that there exists only two kinds of code: 100% 
> > correct and 100% incorrect with no shades of grey inbetween is in 
> > reality a sort of an extremist: whom, depending on mood and 
> > affection, we could call either a 'coding purist' or a 'coding 
> > taliban' ;-)
> 
> Only if you are an extremist-naming extremist with no shades of grey. 
> Others, like myself, also include 'coding al-qaeda' and 'coding john 
> howard' in that scale.

heh ;) You, you ... nitpicking extremist! ;)

And beware that you just committed another act of extremism too:

> I agree there.

because you just went to the extreme position of saying that "i agree 
with this portion 100%", instead of saying "this seems to be 91.5% 
correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200".

and the nasty thing is, that in reality even shades of grey, if you 
print them out, are just a set of extreme black dots on an extreme white 
sheet of paper! ;)

[ so i guess we've got to consider the scope of extremism too: the 
  larger the scope, the more limiting and hence the more dangerous it
  is. ]

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
       [not found]                                 ` <20070417064109.GP8915@holomorphy.com>
@ 2007-04-17  8:00                                   ` Peter Williams
  2007-04-17 10:41                                     ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17  8:00 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote:
>> This doesn't make any sense to me.
>> For a start, exact simultaneous operation would be impossible to achieve 
>> except with highly specialized architecture such as the long departed 
>> transputer.  And secondly, I can't see why it's necessary.
> 
> We're not going to make any headway here, so we might as well drop the
> thread.

Yes, we were starting to go around in circles weren't we?

> 
> There are other things to talk about anyway, for instance I'm seeing
> interest in plugsched come about from elsewhere and am taking an
> interest in getting it into shape wrt. various design goals therefore.
> 
> Probably the largest issue of note is getting scheduler drivers
> loadable as kernel modules. Addressing the points Ingo made that can
> be addressed is also lined up for this effort.
> 
> Comments on which directions you'd like this to go in these respects
> would be appreciated, as I regard you as the current "project owner."

I'd do a scan through LKML from about 18 months ago looking for mention of 
a runtime configurable version of plugsched.  Some students at a 
university (in Germany, I think) posted some patches adding this feature 
to plugsched around about then.

I never added them to plugsched proper as I knew (from previous 
experience when the company I worked for posted patches with similar 
functionality) that Linus would like this idea less than he did the 
current plugsched mechanism.

Unfortunately, my own cache of the relevant e-mails got overwritten 
during a Fedora Core upgrade (I've since moved /var onto a separate 
drive to avoid a repetition) or I would dig them out and send them to 
you.  I'd provided them with copies of the company's patches to use as a 
guide to how to overcome the problems associated with changing 
schedulers on a running system (a few non trivial locking issues pop up).

Maybe if one of the students still reads LKML he will provide a pointer.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:01                     ` Nick Piggin
@ 2007-04-17  8:23                       ` William Lee Irwin III
  2007-04-17 22:23                         ` Davide Libenzi
  2007-04-17 21:39                       ` Matt Mackall
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  8:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
>> Any chance you'd be willing to put down a few thoughts on what sorts
>> of standards you'd like to set for both correctness (i.e. the bare
>> minimum a scheduler implementation must do to be considered valid
>> beyond not oopsing) and performance metrics (i.e. things that produce
>> numbers for each scheduler you can compare to say "this scheduler is
>> better than this other scheduler at this.").

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Yeah I guess that's the hard part :)
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.

Requiring that identical tasks be allocated equal shares of CPU
bandwidth is the easy part here. ringtest.c exercises another aspect
of fairness that is extremely important. Generalizing ringtest.c is
a good idea for fairness testing.

But another aspect of fairness is that "controlled unfairness" is also
intended to exist, in no small part by virtue of nice levels, but also
in the form of favoring tasks that are considered interactive somehow.
Testing various forms of controlled unfairness to ensure that they are
indeed controlled and otherwise have the semantics intended is IMHO the
more difficult aspect of fairness testing.


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).

ISTR Davide Libenzi having a scheduling latency test a number of years
ago. Resurrecting that and tuning it to the needs of this kind of
testing sounds relevant here. The test suite Peter Williams mentioned
would also help.


On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> I wouldn't like to see a significant drop in any micro or macro
> benchmarks or even worse real workloads, but I could accept some if it
> means having a fair scheduler by default.

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> Now it isn't actually too hard to achieve the above, I think. The hard bit
> is trying to compare interactivity. Ideally, we'd be able to get scripted
> dumps of login sessions, and measure scheduling latencies of key processes
> (sh/X/wm/xmms/firefox/etc).  People would send a dump if they were having
> problems with any scheduler, and we could compare all of them against it.
> Wishful thinking!

That's a pretty good idea. I'll queue up writing something of that form
as well.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  3:55                       ` Nick Piggin
  2007-04-17  4:25                         ` Peter Williams
@ 2007-04-17  8:24                         ` William Lee Irwin III
  1 sibling, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  8:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Michael K. Edwards, Peter Williams, Ingo Molnar, Matt Mackall,
	Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote:
>> This observation of Peter's is the best thing to come out of this
>> whole foofaraw.  Looking at what's happening in CPU-land, I think it's
>> going to be necessary, within a couple of years, to replace the whole
>> idea of "CPU scheduling" with "run queue scheduling" across a complex,
>> possibly dynamic mix of CPU-ish resources.  Ergo, there's not much
>> point in churning the mainline scheduler through a design that isn't
>> significantly more flexible than any of those now under discussion.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> Why? If you do that, then your load balancer just becomes less flexible
> because it is harder to have tasks run on one or the other.

On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote:
> You can have single-runqueue-per-domain behaviour (or close to) just by
> relaxing all restrictions on idle load balancing within that domain. It
> is harder to go the other way and place any per-cpu affinity or
> restrictions with multiple cpus on a single runqueue.

The big sticking point here is order-sensitivity. One can point to
stringent sched_yield() ordering but that's not so important in and of
itself. The more significant case is RT applications which are order-
sensitive. Per-cpu runqueues rather significantly disturb the ordering
requirements of applications that care about it.

In terms of a plugging framework, the per-cpu arrangement precludes or
makes extremely awkward scheduling policies that don't have per-cpu
runqueues, for instance, the 2.4.x policy. There is also the alternate
SMP scalability strategy of a lockless scheduler with a single global
queue, which is more performance-oriented.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:23               ` Peter Williams
  2007-04-17  6:44                 ` Nick Piggin
@ 2007-04-17  8:44                 ` Ingo Molnar
  2007-04-19  2:20                   ` Peter Williams
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> > And my scheduler for example cuts down the amount of policy code and 
> > code size significantly.
> 
> Yours is one of the smaller patches mainly because you perpetuate (or 
> you did in the last one I looked at) the (horrible to my eyes) dual 
> array (active/expired) mechanism.  That this idea was bad should have 
> been apparent to all as soon as the decision was made to excuse some 
> tasks from being moved from the active array to the expired array.  
> This essentially meant that there would be circumstances where extreme 
> unfairness (to the extent of starvation in some cases) would occur -- the 
> very things that the mechanism was originally designed to prevent (as far 
> as I can gather).  Right about then in the development of the O(1) 
> scheduler alternative solutions should have been sought.

in hindsight i'd agree. But back then we were clearly not ready for 
fine-grained accurate statistics + trees (cpus are a lot faster at more 
complex arithmetics today, plus people still believed that low-res can 
be done well enough), and taking out any of these two concepts from CFS 
would result in a similarly complex runqueue implementation. Also, the 
array switch was just thought to be another piece of 'if the 
heuristics go wrong, we fall back to an array switch' logic, right in 
line with the other heuristics. And you have to accept it, mainline's 
ability to auto-renice make -j jobs (and other CPU hogs) was quite a 
plus for developers, so it had (and probably still has) quite some 
inertia.

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:33                       ` Ingo Molnar
  2007-04-17  7:40                         ` Nick Piggin
@ 2007-04-17  9:05                         ` William Lee Irwin III
  2007-04-17  9:24                           ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  9:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> The additive nice_offset breaks nice levels. A multiplicative priority 
>> weighting of a different, nonnegative metric of cpu utilization from 
>> what's now used is required for nice levels to work. I've been trying 
>> to point this out politely by strongly suggesting testing whether nice 
>> levels work.

On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> granted, CFS's nice code is still incomplete, but you err quite 
> significantly with this extreme statement that they are "broken".

I used the word relatively loosely. Nothing extreme is going on.
Maybe the phrasing exaggerated the force of the opinion. I'm sorry
about having misspoken so.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> nice levels certainly work to a fair degree even in the current code and 
> much of the focus is elsewhere - just try it. (In fact i claim that 
> CFS's nice levels often work _better_ than the mainline scheduler's nice 
> level support, for the testcases that matter to users.)

Al Boldi's testcase appears to reveal some issues. I'm plotting a
testcase of my own if I can ever get past responding to email.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> The precise behavior of nice levels, as i pointed it out in previous 
> mails, is largely 'uninteresting' and it has changed multiple times in 
> the past 10 years.

I expect that whether a scheduler can handle such prioritization has a
rather strong predictive quality regarding whether it can handle, say,
CKRM controls. I remain convinced that there should be some target
behavior and that some attempt should be made to achieve it. I don't
think any particular behavior is best, just that the behavior should
be well-defined.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> What matters to users is mainly: whether X reniced to -10 does get 
> enough CPU time and whether stuff reniced to +19 doesnt take away too 
> much CPU time from the rest of the system. _How_ a Linux scheduler 
> achieves this is an internal matter and certainly CFS does it in a hacky 
> way at the moment.

It's not so far out. Basically just changing the key calculation in a
relatively simple manner should get things into relatively good shape.
It can, of course, be done other ways (I did it a rather different way
in vdls, though that method is not likely to be considered desirable).

I can't really write a testcase for such loose semantics, so the above
description is useless to me. These squishy sorts of definitions of
semantics are also uninformative to users, who, I would argue, do have
some interest in what nice levels mean. There have been at least a small
number of concerns about the strength of nice levels, and it would
reveal issues surrounding that area earlier if there were an objective
one could test to see if it were achieved.

It's furthermore a user-visible change in system call semantics we
should be more careful about changing out from beneath users.

So I see a lot of good reasons to pin down nice numbers. Incompleteness
is not a particularly mortal sin, but the proliferation of competing
schedulers is creating a need for standards, and that's what I'm really
on about.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> All the rest, 'CPU bandwidth utilization' or whatever abstract metric we 
> could come up with is just a fancy academic technicality that has no 
> real significance to any of the testers who are trying CFS right now.

I could say "percent cpu" if it sounds less like formal jargon, which
"CPU bandwidth utilization" isn't really.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Sure we prefer final solutions that are clean and make sense (because 
> such things are the easiest to maintain long-term), and often such final 
> solutions are quite close to academic concepts, and i think Davide 
> correctly observed this by saying that "the final key calculation code 
> will end up quite different from what it looks now", but your 
> extreme-end claim of 'breakage' for something that is just plain 
> incomplete is not really a fair characterisation at this point.

It wasn't meant to be quite as strong a statement as it came out.
Sorry about that.


On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote:
> Anyone who thinks that there exists only two kinds of code: 100% correct 
> and 100% incorrect with no shades of grey inbetween is in reality a sort 
> of an extremist: whom, depending on mood and affection, we could call 
> either a 'coding purist' or a 'coding taliban' ;-)

I've made no such claims. Also rest assured that the tone of the
critique is not hostile, and wasn't meant to sound that way.

Also, given the general comments it appears clear that some statistical
metric of deviation from the intended behavior furthermore qualified by
timescale is necessary, so this appears to be headed toward a sort of
performance metric as opposed to a pass/fail test anyway. However, to
even measure this at all, some statement of intention is required. I'd
prefer that there be a Linux-standard semantics for nice so results are
more directly comparable and so that users also get similar nice
behavior from the scheduler as it varies over time and possibly
implementations if users should care to switch them out with some
scheduler patch or other.
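
One way to picture a "metric of deviation from the intended behavior
qualified by timescale" is sketched below. This is only an illustration,
not something posted in the thread: the function name, the per-tick trace
format and the 50/50 intended shares are all assumptions made up for the
example.

```python
def worst_share_deviation(trace, intended, window):
    """Largest |observed share - intended share| over any window of ticks.

    trace    -- list of task ids, one entry per scheduler tick (hypothetical format)
    intended -- dict mapping task id to its intended CPU share (0.0 .. 1.0)
    window   -- timescale, in ticks, over which shares are measured
    """
    worst = 0.0
    counts = {task: 0 for task in intended}
    for i, task in enumerate(trace):
        counts[task] += 1
        if i >= window:
            # Drop the tick that just slid out of the window.
            counts[trace[i - window]] -= 1
        if i >= window - 1:
            for t, share in intended.items():
                worst = max(worst, abs(counts[t] / window - share))
    return worst


# Two made-up traces with the same long-run 50/50 split.
alternating = ["a", "b"] * 50
bursty = ["a"] * 50 + ["b"] * 50
intended = {"a": 0.5, "b": 0.5}

for name, trace in (("alternating", alternating), ("bursty", bursty)):
    for window in (2, 10, 100):
        print(name, window, worst_share_deviation(trace, intended, window))
```

Both traces look perfectly fair over 100 ticks, but only the alternating
one does over 10 ticks, which is why the timescale has to be part of any
such statement of intention.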


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:05                         ` William Lee Irwin III
@ 2007-04-17  9:24                           ` Ingo Molnar
  2007-04-17  9:57                             ` William Lee Irwin III
  2007-04-17 22:08                             ` Matt Mackall
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  9:24 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> [...] Also rest assured that the tone of the critique is not hostile, 
> and wasn't meant to sound that way.

ok :) (And i guess i was too touchy - sorry about coming out swinging.)

> Also, given the general comments it appears clear that some 
> statistical metric of deviation from the intended behavior furthermore 
> qualified by timescale is necessary, so this appears to be headed 
> toward a sort of performance metric as opposed to a pass/fail test 
> anyway. However, to even measure this at all, some statement of 
> intention is required. I'd prefer that there be a Linux-standard 
> semantics for nice so results are more directly comparable and so that 
> users also get similar nice behavior from the scheduler as it varies 
> over time and possibly implementations if users should care to switch 
> them out with some scheduler patch or other.

yeah. If you could come up with a sane definition that also translates 
into low overhead on the algorithm side that would be great! The only 
good generic definition i could come up with (nice levels are isolated 
buckets with a constant maximum relative percentage of CPU time 
available to every active bucket) resulted in having a per-nice-level 
array of rbtree roots, which did not look worth the hassle at first 
sight :-)

until now the main approach for nice levels in Linux was always: 
"implement your main scheduling logic for nice 0 and then look for some 
low-overhead method that can be glued to it that does something that 
behaves like nice levels". Feel free to turn that around into a more 
natural approach, but the algorithm should remain fairly simple i think.

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:56 ` Andy Whitcroft
@ 2007-04-17  9:32   ` Nick Piggin
  2007-04-17  9:59     ` Ingo Molnar
  2007-04-18 10:22   ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-17  9:32 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote:
> > 
> > as usual, any sort of feedback, bugreports, fixes and suggestions are
> > more than welcome,
> 
> Pushed this through the test.kernel.org and nothing new blew up.
> Notably the kernbench figures are within expectations even on the bigger
> numa systems, commonly badly affected by balancing problems in the
> scheduler.
> 
> I see there is a second one out, I'll push that one through too.

Well I just sent some feedback on cfs-v2, but realised it went off-list,
so I'll resend here because others may find it interesting too. Sorry
about jamming it in here, but it is relevant to performance...

Anyway, roughly in the context of good cfs-v2 interactivity, I wrote:

Well I'm not too surprised. I am disappointed that it uses such small
timeslices (or whatever they are called) as the default.

Using small timeslices is actually a pretty easy way to ensure everything
stays smooth even under load, but is bad for efficiency. Sure you can say
you'll have desktop and server tunings, but... With nicksched I'm testing
a default timeslice of *300ms* even on the desktop, whereas Ingo's seems
to be effectively 3ms :P So if you compare default tunings, it isn't
exactly fair!

Kbuild times on a 2x Xeon:

2.6.21-rc7
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

2.6.21-rc7-cfs-v2
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU

2.6.21-rc7-nicksched
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU
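
For readers comparing the three sets of figures above, the elapsed times
can be reduced to a mean per kernel and a relative difference against
2.6.21-rc7. The sketch below is an editorial illustration only: the
elapsed values are copied from the runs listed above, while the helper
names and the choice of the mean as the summary are assumptions.

```python
# Elapsed wall-clock times ("m:ss.ss") copied from the kbuild runs above.
RUNS = {
    "2.6.21-rc7": ["2:17.82", "2:17.84", "2:17.83", "2:17.88", "2:17.90", "2:17.95"],
    "2.6.21-rc7-cfs-v2": ["2:23.64", "2:23.70", "2:23.76", "2:23.76", "2:23.67", "2:23.72"],
    "2.6.21-rc7-nicksched": ["2:17.91", "2:17.66", "2:17.85", "2:17.77", "2:17.81", "2:17.78"],
}


def to_seconds(stamp):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def mean_elapsed(stamps):
    """Mean elapsed time in seconds over a list of runs."""
    return sum(to_seconds(s) for s in stamps) / len(stamps)


baseline = mean_elapsed(RUNS["2.6.21-rc7"])
for kernel, stamps in RUNS.items():
    mean = mean_elapsed(stamps)
    delta = 100.0 * (mean - baseline) / baseline
    print(f"{kernel:22s} {mean:7.2f}s  {delta:+.2f}%")
```

On these figures cfs-v2 comes out roughly 4% slower in elapsed time than
2.6.21-rc7, while nicksched stays within about 0.1% of it.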


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:03                             ` Peter Williams
  2007-04-17  6:14                               ` William Lee Irwin III
  2007-04-17  6:23                               ` Nick Piggin
@ 2007-04-17  9:36                               ` Ingo Molnar
  2 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  9:36 UTC (permalink / raw)
  To: Peter Williams
  Cc: Nick Piggin, Michael K. Edwards, William Lee Irwin III,
	Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> There's a lot of ugly code in the load balancer that is only there to 
> overcome the side effects of SMT and dual core.  A lot of it was put 
> there by Intel employees trying to make load balancing more friendly 
> to their systems.  What I'm suggesting is that an N CPUs per runqueue 
> is a better way of achieving that end.  I may (of course) be wrong but 
> I think that the idea deserves more consideration than you're willing 
> to give it.

i actually implemented that some time ago and i'm afraid it was ugly as 
hell and pretty fragile. Load-balancing gets simpler, but task picking 
gets a lot uglier.

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:14             ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
  2007-04-17  6:26               ` Peter Williams
@ 2007-04-17  9:51               ` Ingo Molnar
  2007-04-17 13:44                 ` Peter Williams
  2007-04-20 20:47                 ` Bill Davidsen
  1 sibling, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  9:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Peter Williams, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner


* Nick Piggin <npiggin@suse.de> wrote:

> > > Maybe the progress is that more key people are becoming open to 
> > > the idea of changing the scheduler.
> > 
> > Could be.  All was quiet for quite a while, but when RSDL showed up, 
> > it aroused enough interest to show that scheduling woes are on folks' 
> > radar.
> 
> Well I know people have had woes with the scheduler for ever (I guess 
> that isn't going to change :P). [...]

yes, that part isnt going to change, because the CPU is a _scarce 
resource_ that is perhaps the most frequently overcommitted physical 
computer resource in existence, and because the kernel does not (yet) 
track eye movements of humans to figure out which tasks are more 
important to them. So critical human constraints are unknown to the 
scheduler and thus complaints will always come.

The upstream scheduler thought it had enough information: the sleep 
average. So now the attempt is to go back and _simplify_ the scheduler 
and remove that information, and concentrate on getting fairness 
precisely right. The magic thing about 'fairness' is that it's a pretty 
good default policy if we decide that we simply have not enough 
information to make an intelligent choice.

( Lets be cautious though: the jury is still out whether people actually 
  like this more than the current approach. While CFS feedback looks 
  promising after a whopping 3 days of it being released [ ;-) ], the 
  test coverage of all 'fairness centric' schedulers, even considering 
  years of availability is less than 1% i'm afraid, and that < 1% was 
  mostly self-selecting. )

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:24                           ` Ingo Molnar
@ 2007-04-17  9:57                             ` William Lee Irwin III
  2007-04-17 10:01                               ` Ingo Molnar
  2007-04-17 11:31                               ` William Lee Irwin III
  2007-04-17 22:08                             ` Matt Mackall
  1 sibling, 2 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17  9:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> Also, given the general comments it appears clear that some 
>> statistical metric of deviation from the intended behavior furthermore 
>> qualified by timescale is necessary, so this appears to be headed 
>> toward a sort of performance metric as opposed to a pass/fail test 
[...]

On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> yeah. If you could come up with a sane definition that also translates 
> into low overhead on the algorithm side that would be great! The only 
> good generic definition i could come up with (nice levels are isolated 
> buckets with a constant maximum relative percentage of CPU time 
> available to every active bucket) resulted in having a per-nice-level 
> array of rbtree roots, which did not look worth the hassle at first 
> sight :-)

Interesting! That's what vdls did, except its fundamental data structure
was more like a circular buffer data structure (resembling Davide
Libenzi's timer ring in concept, but with all the details different).
I'm not entirely sure how that would've turned out performancewise if
I'd done any tuning at all. I was mostly interested in doing something
like what I heard Bob Mullens did in 1976 for basic pedagogical value
about schedulers to prepare for writing patches for gang scheduling as
opposed to creating a viable replacement for the mainline scheduler.

I'm relatively certain a different key calculation will suffice, but
it may disturb other desired semantics since they really need to be
nonnegative for multiplying by a scaling factor corresponding to its
nice number to work properly. Well, as the cfs code now stands, it
would correspond to negative keys. Dividing positive keys by the nice
scaling factor is my first thought of how to extend the method to the
current key semantics. Or such are my thoughts on the subject.

I expect that all that's needed is to fiddle with those numbers a bit.
There's quite some capacity for expression there given the precision.
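
The difference between an additive offset and a multiplicative weighting
can be seen in a toy model. The sketch below is not CFS code and not the
vdls method; it assumes a scheduler that always runs the task with the
smallest key for one tick, and the weights and offsets are hypothetical
values picked for the illustration.

```python
def simulate(key_fn, params, ticks):
    """Run the task with the smallest key for one tick, `ticks` times.

    key_fn -- maps (cpu ticks used so far, per-task parameter) to a sort key
    params -- one parameter per task (a weight or an offset, hypothetical)
    Returns the number of ticks each task received.
    """
    used = [0] * len(params)
    for _ in range(ticks):
        # Ties go to the lowest-numbered task.
        nxt = min(range(len(params)), key=lambda t: key_fn(used[t], params[t]))
        used[nxt] += 1
    return used


def scaled_key(used, weight):
    # Multiplicative: nonnegative usage divided by a per-nice scaling factor.
    return used / weight


def offset_key(used, offset):
    # Additive: usage plus a fixed per-nice offset.
    return used + offset


TICKS = 7000
scaled = simulate(scaled_key, [1, 2, 4], TICKS)    # shares approach 1/7, 2/7, 4/7
shifted = simulate(offset_key, [20, 10, 0], TICKS)  # shares approach 1/3 each

print("scaled keys :", [round(u / TICKS, 3) for u in scaled])
print("offset keys :", [round(u / TICKS, 3) for u in shifted])
```

In this model the offset only buys a fixed head start that washes out over
time, whereas the scaled key keeps the CPU shares proportional to the
weights for as long as the tasks run.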


On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> until now the main approach for nice levels in Linux was always: 
> "implement your main scheduling logic for nice 0 and then look for some 
> low-overhead method that can be glued to it that does something that 
> behaves like nice levels". Feel free to turn that around into a more 
> natural approach, but the algorithm should remain fairly simple i think.

Part of my insistence was because it seemed to be relatively close to a
one-liner, though I'm not entirely sure what particular computation to
use to handle the signedness of the keys. I guess I could pick some
particular nice semantics myself and then sweep the extant schedulers
to use them after getting a testcase hammered out.


-- wli

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:32   ` Nick Piggin
@ 2007-04-17  9:59     ` Ingo Molnar
  2007-04-17 11:11       ` Nick Piggin
  2007-04-18  8:55       ` Nick Piggin
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17  9:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan


* Nick Piggin <npiggin@suse.de> wrote:

> 2.6.21-rc7-cfs-v2
> 534.80user 30.92system 2:23.64elapsed 393%CPU
> 534.75user 31.01system 2:23.70elapsed 393%CPU
> 534.66user 31.07system 2:23.76elapsed 393%CPU
> 534.56user 30.91system 2:23.76elapsed 393%CPU
> 534.66user 31.07system 2:23.67elapsed 393%CPU
> 535.43user 30.62system 2:23.72elapsed 393%CPU

Thanks for testing this! Could you please try this also with:

   echo 100000000 > /proc/sys/kernel/sched_granularity

on the same system, so that we can get a complete set of numbers? Just 
to make sure that lowering the preemption frequency indeed has the 
expected result of moving kernbench numbers back to mainline levels. (if 
not then that would indicate some CFS buglet)

could you maybe even try a more extreme setting of:

   echo 500000000 > /proc/sys/kernel/sched_granularity

for kicks? This would allow us to see how much kernbench we lose due to 
preemption granularity. Thanks!

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:57                             ` William Lee Irwin III
@ 2007-04-17 10:01                               ` Ingo Molnar
  2007-04-17 11:31                               ` William Lee Irwin III
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-17 10:01 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>
> > until now the main approach for nice levels in Linux was always: 
> > "implement your main scheduling logic for nice 0 and then look for 
> > some low-overhead method that can be glued to it that does something 
> > that behaves like nice levels". Feel free to turn that around into a 
> > more natural approach, but the algorithm should remain fairly simple 
> > i think.
> 
> Part of my insistence was because it seemed to be relatively close to 
> a one-liner, though I'm not entirely sure what particular computation 
> to use to handle the signedness of the keys. I guess I could pick some 
> particular nice semantics myself and then sweep the extant schedulers 
> to use them after getting a testcase hammered out.

i'd love to have a oneliner solution :-)

wrt. signedness: note that in v2 i have made rq_running signed, and most 
calculations (especially those related to nice) are signed values. (On 
64-bit systems this all isnt a big issue - most of the arithmetic 
gymnastics in CFS are done to keep deltas within 32 bits, so that 
divisions and multiplications are sane.)

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  8:00                                   ` Peter Williams
@ 2007-04-17 10:41                                     ` William Lee Irwin III
  2007-04-17 13:48                                       ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 10:41 UTC (permalink / raw)
  To: Peter Williams; +Cc: Linux Kernel Mailing List

William Lee Irwin III wrote:
>> Comments on which directions you'd like this to go in these respects
>> would be appreciated, as I regard you as the current "project owner."

On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I'd do a scan through LKML from about 18 months ago looking for mention of 
> a runtime configurable version of plugsched.  Some students at a 
> university (in Germany, I think) posted some patches adding this feature 
> to plugsched around about then.

Excellent. I'll go hunting for that.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> I never added them to plugsched proper as I knew (from previous 
> experience when the company I worked for posted patches with similar 
> functionality) that Linus would like this idea less than he did the 
> current plugsched mechanism.

Odd how the requirements ended up including that. Fickleness abounds.
If only we knew up-front what the end would be.


On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
> Unfortunately, my own cache of the relevant e-mails got overwritten 
> during a Fedora Core upgrade (I've since moved /var onto a separate 
> drive to avoid a repetition) or I would dig them out and send them to 
> you.  I'd provided them with copies of the company's patches to use as a 
> guide to how to overcome the problems associated with changing 
> schedulers on a running system (a few non trivial locking issues pop up).
> Maybe if one of the students still reads LKML he will provide a pointer.

I was tempted to restart from scratch given Ingo's comments, but I
reconsidered and I'll be working with your code (and the German
students' as well). If everything has to change, so be it, but it'll
still be a derived work. It would be ignoring precedent and failing to
properly attribute if I did otherwise.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:59     ` Ingo Molnar
@ 2007-04-17 11:11       ` Nick Piggin
  2007-04-18  8:55       ` Nick Piggin
  1 sibling, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-17 11:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > 2.6.21-rc7-cfs-v2
> > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > 535.43user 30.62system 2:23.72elapsed 393%CPU
> 
> Thanks for testing this! Could you please try this also with:
> 
>    echo 100000000 > /proc/sys/kernel/sched_granularity
> 
> on the same system, so that we can get a complete set of numbers? Just 
> to make sure that lowering the preemption frequency indeed has the 
> expected result of moving kernbench numbers back to mainline levels. (if 
> not then that would indicate some CFS buglet)
> 
> could you maybe even try a more extreme setting of:
> 
>    echo 500000000 > /proc/sys/kernel/sched_granularity
> 
> for kicks? This would allow us to see how much kernbench we lose due to 
> preemption granularity. Thanks!

Yeah but I just powered down the test-box, so I'll have to get onto
that tomorrow.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:57                             ` William Lee Irwin III
  2007-04-17 10:01                               ` Ingo Molnar
@ 2007-04-17 11:31                               ` William Lee Irwin III
  1 sibling, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 11:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 02:57:49AM -0700, William Lee Irwin III wrote:
> Interesting! That's what vdls did, except its fundamental data structure
> was more like a circular buffer data structure (resembling Davide
> Libenzi's timer ring in concept, but with all the details different).
> I'm not entirely sure how that would've turned out performancewise if
> I'd done any tuning at all. I was mostly interested in doing something
> like what I heard Bob Mullens did in 1976 for basic pedagogical value
> about schedulers to prepare for writing patches for gang scheduling as
> opposed to creating a viable replacement for the mainline scheduler.

Con helped me dredge up the vdls bits, so here is the last version I had
before I got tired of toying with the idea. It's not all that clean,
with a fair amount of debug code floating around and a number of
idiocies (it seems there was a plot to use a heap somewhere I forgot
about entirely, never mind other cruft), but I thought I should at least
say something more provable than "there was a patch I never posted."
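
For readers skimming the patch, the circular-buffer idea can be
sketched roughly as follows. The patch declares
find_first_circular_bit(unsigned long *, int, int) in queue.h and calls
it with a bitmap, a base index and a queue count, but its definition is
not part of this excerpt, so the semantics below (first set bit at or
after the base, wrapping around) are an assumption. The names
circular_first_bit() and bitmap_set() are hypothetical stand-ins, not
the patch's own functions.

```c
#include <limits.h>

#define BITS_PER_WORD	((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical helper: mark one slot of the ring as non-empty. */
static void bitmap_set(unsigned long *bitmap, int bit)
{
	bitmap[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Hypothetical helper: test whether a slot of the ring is non-empty. */
static int bitmap_test(const unsigned long *bitmap, int bit)
{
	return (bitmap[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}

/*
 * Assumed semantics: scan nbits slots starting at base and wrapping
 * past the end, returning the index of the first set bit. Returns
 * nbits if no bit is set. A simple linear scan; a real implementation
 * would likely work a word at a time.
 */
static int circular_first_bit(const unsigned long *bitmap, int base, int nbits)
{
	int k;

	for (k = 0; k < nbits; ++k) {
		int idx = (base + k) % nbits;

		if (bitmap_test(bitmap, idx))
			return idx;
	}
	return nbits;
}
```

With slots 3 and 70 occupied in a 100-slot ring, a search from base 10
finds slot 70, while a search from base 71 wraps around and finds
slot 3. Under this reading, enqueueing at (base + prio) % nbits places
a task prio slots "ahead" of the current position in the ring.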

Enjoy!


-- wli

diff -prauN linux-2.6.0-test11/fs/proc/array.c sched-2.6.0-test11-5/fs/proc/array.c
--- linux-2.6.0-test11/fs/proc/array.c	2003-11-26 12:44:26.000000000 -0800
+++ sched-2.6.0-test11-5/fs/proc/array.c	2003-12-17 07:37:11.000000000 -0800
@@ -162,7 +162,7 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1000000000/1024),
+		0UL, /* was ->sleep_avg */
 	       	p->tgid,
 		p->pid, p->pid ? p->real_parent->pid : 0,
 		p->pid && p->ptrace ? p->parent->pid : 0,
@@ -345,7 +345,7 @@ int proc_pid_stat(struct task_struct *ta
 	read_unlock(&tasklist_lock);
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d %d\n",
 		task->pid,
 		task->comm,
 		state,
@@ -390,8 +390,8 @@ int proc_pid_stat(struct task_struct *ta
 		task->cnswap,
 		task->exit_signal,
 		task_cpu(task),
-		task->rt_priority,
-		task->policy);
+		task_prio(task),
+		task_sched_policy(task));
 	if(mm)
 		mmput(mm);
 	return res;
diff -prauN linux-2.6.0-test11/include/asm-i386/thread_info.h sched-2.6.0-test11-5/include/asm-i386/thread_info.h
--- linux-2.6.0-test11/include/asm-i386/thread_info.h	2003-11-26 12:43:06.000000000 -0800
+++ sched-2.6.0-test11-5/include/asm-i386/thread_info.h	2003-12-17 04:55:22.000000000 -0800
@@ -114,6 +114,8 @@ static inline struct thread_info *curren
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_IRET		5	/* return with iret */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17
+#define TIF_PREEMPT		18
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
diff -prauN linux-2.6.0-test11/include/linux/binomial.h sched-2.6.0-test11-5/include/linux/binomial.h
--- linux-2.6.0-test11/include/linux/binomial.h	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/binomial.h	2003-12-20 15:53:33.000000000 -0800
@@ -0,0 +1,16 @@
+/*
+ * Simple binomial heaps.
+ */
+
+struct binomial {
+	unsigned priority, degree;
+	struct binomial *parent, *child, *sibling;
+};
+
+
+struct binomial *binomial_minimum(struct binomial **);
+void binomial_union(struct binomial **, struct binomial **, struct binomial **);
+void binomial_insert(struct binomial **, struct binomial *);
+struct binomial *binomial_extract_min(struct binomial **);
+void binomial_decrease(struct binomial **, struct binomial *, unsigned);
+void binomial_delete(struct binomial **, struct binomial *);
diff -prauN linux-2.6.0-test11/include/linux/init_task.h sched-2.6.0-test11-5/include/linux/init_task.h
--- linux-2.6.0-test11/include/linux/init_task.h	2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/init_task.h	2003-12-18 05:51:16.000000000 -0800
@@ -56,6 +56,12 @@
 	.siglock	= SPIN_LOCK_UNLOCKED, 		\
 }
 
+#define INIT_SCHED_INFO(info)					\
+{								\
+	.run_list	= LIST_HEAD_INIT((info).run_list),	\
+	.policy		= 1 /* SCHED_POLICY_TS */,		\
+}
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -67,14 +73,10 @@
 	.usage		= ATOMIC_INIT(2),				\
 	.flags		= 0,						\
 	.lock_depth	= -1,						\
-	.prio		= MAX_PRIO-20,					\
-	.static_prio	= MAX_PRIO-20,					\
-	.policy		= SCHED_NORMAL,					\
+	.sched_info	= INIT_SCHED_INFO(tsk.sched_info),		\
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
-	.run_list	= LIST_HEAD_INIT(tsk.run_list),			\
-	.time_slice	= HZ,						\
 	.tasks		= LIST_HEAD_INIT(tsk.tasks),			\
 	.ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children),		\
 	.ptrace_list	= LIST_HEAD_INIT(tsk.ptrace_list),		\
diff -prauN linux-2.6.0-test11/include/linux/sched.h sched-2.6.0-test11-5/include/linux/sched.h
--- linux-2.6.0-test11/include/linux/sched.h	2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/sched.h	2003-12-23 03:47:45.000000000 -0800
@@ -126,6 +126,8 @@ extern unsigned long nr_iowait(void);
 #define SCHED_NORMAL		0
 #define SCHED_FIFO		1
 #define SCHED_RR		2
+#define SCHED_BATCH		3
+#define SCHED_IDLE		4
 
 struct sched_param {
 	int sched_priority;
@@ -281,10 +283,14 @@ struct signal_struct {
 
 #define MAX_USER_RT_PRIO	100
 #define MAX_RT_PRIO		MAX_USER_RT_PRIO
-
-#define MAX_PRIO		(MAX_RT_PRIO + 40)
-
-#define rt_task(p)		((p)->prio < MAX_RT_PRIO)
+#define NICE_QLEN		128
+#define MIN_TS_PRIO		MAX_RT_PRIO
+#define MAX_TS_PRIO		(40*NICE_QLEN)
+#define MIN_BATCH_PRIO		(MAX_RT_PRIO + MAX_TS_PRIO)
+#define MAX_BATCH_PRIO		100
+#define MAX_PRIO		(MIN_BATCH_PRIO + MAX_BATCH_PRIO)
+#define USER_PRIO(prio)		((prio) - MAX_RT_PRIO)
+#define MAX_USER_PRIO		USER_PRIO(MAX_PRIO)
 
 /*
  * Some day this will be a full-fledged user tracking system..
@@ -330,6 +336,36 @@ struct k_itimer {
 struct io_context;			/* See blkdev.h */
 void exit_io_context(void);
 
+struct rt_data {
+	int prio, rt_policy;
+	unsigned long quantum, ticks; 
+};
+
+/* XXX: do %cpu estimation for ts wakeup levels */
+struct ts_data {
+	int nice;
+	unsigned long ticks, frac_cpu;
+	unsigned long sample_start, sample_ticks;
+};
+
+struct bt_data {
+	int prio;
+	unsigned long ticks;
+};
+
+union class_data {
+	struct rt_data rt;
+	struct ts_data ts;
+	struct bt_data bt;
+};
+
+struct sched_info {
+	int idx;			/* queue index, used by all classes */
+	unsigned long policy;		/* scheduling policy */
+	struct list_head run_list;	/* list links for priority queues */
+	union class_data cl_data;	/* class-specific data */
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -339,18 +375,9 @@ struct task_struct {
 
 	int lock_depth;		/* Lock depth */
 
-	int prio, static_prio;
-	struct list_head run_list;
-	prio_array_t *array;
-
-	unsigned long sleep_avg;
-	long interactive_credit;
-	unsigned long long timestamp;
-	int activated;
+	struct sched_info sched_info;
 
-	unsigned long policy;
 	cpumask_t cpus_allowed;
-	unsigned int time_slice, first_time_slice;
 
 	struct list_head tasks;
 	struct list_head ptrace_children;
@@ -391,7 +418,6 @@ struct task_struct {
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
 
-	unsigned long rt_priority;
 	unsigned long it_real_value, it_prof_value, it_virt_value;
 	unsigned long it_real_incr, it_prof_incr, it_virt_incr;
 	struct timer_list real_timer;
@@ -520,12 +546,14 @@ extern void node_nr_running_init(void);
 #define node_nr_running_init() {}
 #endif
 
-extern void set_user_nice(task_t *p, long nice);
-extern int task_prio(task_t *p);
-extern int task_nice(task_t *p);
-extern int task_curr(task_t *p);
-extern int idle_cpu(int cpu);
-
+void set_user_nice(task_t *task, long nice);
+int task_prio(task_t *task);
+int task_nice(task_t *task);
+int task_sched_policy(task_t *task);
+void set_task_sched_policy(task_t *task, int policy);
+int rt_task(task_t *task);
+int task_curr(task_t *task);
+int idle_cpu(int cpu);
 void yield(void);
 
 /*
@@ -844,6 +872,21 @@ static inline int need_resched(void)
 	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
 }
 
+static inline void set_task_queued(task_t *task)
+{
+	set_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline void clear_task_queued(task_t *task)
+{
+	clear_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline int task_queued(task_t *task)
+{
+	return test_tsk_thread_flag(task, TIF_QUEUED);
+}
+
 extern void __cond_resched(void);
 static inline void cond_resched(void)
 {
diff -prauN linux-2.6.0-test11/kernel/Makefile sched-2.6.0-test11-5/kernel/Makefile
--- linux-2.6.0-test11/kernel/Makefile	2003-11-26 12:43:24.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/Makefile	2003-12-17 03:30:08.000000000 -0800
@@ -6,7 +6,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o intermodule.o extable.o params.o posix-timers.o
+	    rcupdate.o intermodule.o extable.o params.o posix-timers.o sched/
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
diff -prauN linux-2.6.0-test11/kernel/exit.c sched-2.6.0-test11-5/kernel/exit.c
--- linux-2.6.0-test11/kernel/exit.c	2003-11-26 12:45:29.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/exit.c	2003-12-17 07:04:02.000000000 -0800
@@ -225,7 +225,7 @@ void reparent_to_init(void)
 	/* Set the exit signal to SIGCHLD so we signal init on exit */
 	current->exit_signal = SIGCHLD;
 
-	if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0))
+	if (task_nice(current) < 0)
 		set_user_nice(current, 0);
 	/* cpus_allowed? */
 	/* rt_priority? */
diff -prauN linux-2.6.0-test11/kernel/fork.c sched-2.6.0-test11-5/kernel/fork.c
--- linux-2.6.0-test11/kernel/fork.c	2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/fork.c	2003-12-23 06:22:59.000000000 -0800
@@ -836,6 +836,9 @@ struct task_struct *copy_process(unsigne
 	atomic_inc(&p->user->__count);
 	atomic_inc(&p->user->processes);
 
+	clear_tsk_thread_flag(p, TIF_SIGPENDING);
+	clear_tsk_thread_flag(p, TIF_QUEUED);
+
 	/*
 	 * If multiple threads are within copy_process(), then this check
 	 * triggers too late. This doesn't hurt, the check is only there
@@ -861,13 +864,21 @@ struct task_struct *copy_process(unsigne
 	p->state = TASK_UNINTERRUPTIBLE;
 
 	copy_flags(clone_flags, p);
-	if (clone_flags & CLONE_IDLETASK)
+	if (clone_flags & CLONE_IDLETASK) {
 		p->pid = 0;
-	else {
+		set_task_sched_policy(p, SCHED_IDLE);
+	} else {
+		if (task_sched_policy(p) == SCHED_IDLE) {
+			memset(&p->sched_info, 0, sizeof(struct sched_info));
+			set_task_sched_policy(p, SCHED_NORMAL);
+			set_user_nice(p, 0);
+		}
 		p->pid = alloc_pidmap();
 		if (p->pid == -1)
 			goto bad_fork_cleanup;
 	}
+	if (p->pid == 1)
+		BUG_ON(task_nice(p));
 	retval = -EFAULT;
 	if (clone_flags & CLONE_PARENT_SETTID)
 		if (put_user(p->pid, parent_tidptr))
@@ -875,8 +886,7 @@ struct task_struct *copy_process(unsigne
 
 	p->proc_dentry = NULL;
 
-	INIT_LIST_HEAD(&p->run_list);
-
+	INIT_LIST_HEAD(&p->sched_info.run_list);
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
 	INIT_LIST_HEAD(&p->posix_timers);
@@ -885,8 +895,6 @@ struct task_struct *copy_process(unsigne
 	spin_lock_init(&p->alloc_lock);
 	spin_lock_init(&p->switch_lock);
 	spin_lock_init(&p->proc_lock);
-
-	clear_tsk_thread_flag(p, TIF_SIGPENDING);
 	init_sigpending(&p->pending);
 
 	p->it_real_value = p->it_virt_value = p->it_prof_value = 0;
@@ -898,7 +906,6 @@ struct task_struct *copy_process(unsigne
 	p->tty_old_pgrp = 0;
 	p->utime = p->stime = 0;
 	p->cutime = p->cstime = 0;
-	p->array = NULL;
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
@@ -948,33 +955,6 @@ struct task_struct *copy_process(unsigne
 	p->pdeath_signal = 0;
 
 	/*
-	 * Share the timeslice between parent and child, thus the
-	 * total amount of pending timeslices in the system doesn't change,
-	 * resulting in more scheduling fairness.
-	 */
-	local_irq_disable();
-        p->time_slice = (current->time_slice + 1) >> 1;
-	/*
-	 * The remainder of the first timeslice might be recovered by
-	 * the parent if the child exits early enough.
-	 */
-	p->first_time_slice = 1;
-	current->time_slice >>= 1;
-	p->timestamp = sched_clock();
-	if (!current->time_slice) {
-		/*
-	 	 * This case is rare, it happens when the parent has only
-	 	 * a single jiffy left from its timeslice. Taking the
-		 * runqueue lock is not a problem.
-		 */
-		current->time_slice = 1;
-		preempt_disable();
-		scheduler_tick(0, 0);
-		local_irq_enable();
-		preempt_enable();
-	} else
-		local_irq_enable();
-	/*
 	 * Ok, add it to the run-queues and make it
 	 * visible to the rest of the system.
 	 *
diff -prauN linux-2.6.0-test11/kernel/sched/Makefile sched-2.6.0-test11-5/kernel/sched/Makefile
--- linux-2.6.0-test11/kernel/sched/Makefile	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/Makefile	2003-12-17 03:32:21.000000000 -0800
@@ -0,0 +1 @@
+obj-y = util.o ts.o idle.o rt.o batch.o
diff -prauN linux-2.6.0-test11/kernel/sched/batch.c sched-2.6.0-test11-5/kernel/sched/batch.c
--- linux-2.6.0-test11/kernel/sched/batch.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/batch.c	2003-12-19 21:32:49.000000000 -0800
@@ -0,0 +1,190 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+struct batch_queue {
+	int base, tasks;
+	task_t *curr;
+	unsigned long bitmap[BITS_TO_LONGS(MAX_BATCH_PRIO)];
+	struct list_head queue[MAX_BATCH_PRIO];
+};
+
+static int batch_quantum = 1024;
+static DEFINE_PER_CPU(struct batch_queue, batch_queues);
+
+static int batch_init(struct policy *policy, int cpu)
+{
+	int k;
+	struct batch_queue *queue = &per_cpu(batch_queues, cpu);
+
+	policy->queue = (struct queue *)queue;
+	for (k = 0; k < MAX_BATCH_PRIO; ++k)
+		INIT_LIST_HEAD(&queue->queue[k]);
+	return 0;
+}
+
+static int batch_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+
+	cpustat->nice += user_ticks;
+	cpustat->system += sys_ticks;
+
+	task->sched_info.cl_data.bt.ticks--;
+	if (!task->sched_info.cl_data.bt.ticks) {
+		int new_idx;
+
+		task->sched_info.cl_data.bt.ticks = batch_quantum;
+		new_idx = (task->sched_info.idx + task->sched_info.cl_data.bt.prio)
+				% MAX_BATCH_PRIO;
+		if (!test_bit(new_idx, queue->bitmap))
+			__set_bit(new_idx, queue->bitmap);
+		list_move_tail(&task->sched_info.run_list,
+				&queue->queue[new_idx]);
+		if (list_empty(&queue->queue[task->sched_info.idx]))
+			__clear_bit(task->sched_info.idx, queue->bitmap);
+		task->sched_info.idx = new_idx;
+		queue->base = find_first_circular_bit(queue->bitmap,
+							queue->base,
+							MAX_BATCH_PRIO);
+		set_need_resched();
+	}
+	return 0;
+}
+
+static void batch_yield(struct queue *__queue, task_t *task)
+{
+	int new_idx;
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+
+	new_idx = (queue->base + MAX_BATCH_PRIO - 1) % MAX_BATCH_PRIO;
+	if (!test_bit(new_idx, queue->bitmap))
+		__set_bit(new_idx, queue->bitmap);
+	list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]);
+	if (list_empty(&queue->queue[task->sched_info.idx]))
+		__clear_bit(task->sched_info.idx, queue->bitmap);
+	task->sched_info.idx = new_idx;
+	queue->base = find_first_circular_bit(queue->bitmap,
+						queue->base,
+						MAX_BATCH_PRIO);
+	set_need_resched();
+}
+
+static task_t *batch_curr(struct queue *__queue)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	return queue->curr;
+}
+
+static void batch_set_curr(struct queue *__queue, task_t *task)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	queue->curr = task;
+}
+
+static task_t *batch_best(struct queue *__queue)
+{
+	int idx;
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+
+	idx = find_first_circular_bit(queue->bitmap,
+					queue->base,
+					MAX_BATCH_PRIO);
+	BUG_ON(idx >= MAX_BATCH_PRIO);
+	BUG_ON(list_empty(&queue->queue[idx]));
+	return list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+}
+
+static void batch_enqueue(struct queue *__queue, task_t *task)
+{
+	int idx;
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+
+	idx = (queue->base + task->sched_info.cl_data.bt.prio) % MAX_BATCH_PRIO;
+	if (!test_bit(idx, queue->bitmap))
+		__set_bit(idx, queue->bitmap);
+	list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+	task->sched_info.idx = idx;
+	task->sched_info.cl_data.bt.ticks = batch_quantum;
+	queue->tasks++;
+	if (!queue->curr)
+		queue->curr = task;
+}
+
+static void batch_dequeue(struct queue *__queue, task_t *task)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	list_del(&task->sched_info.run_list);
+	if (list_empty(&queue->queue[task->sched_info.idx]))
+		__clear_bit(task->sched_info.idx, queue->bitmap);
+	queue->tasks--;
+	if (!queue->tasks)
+		queue->curr = NULL;
+	else if (task == queue->curr)
+		queue->curr = batch_best(__queue);
+}
+
+static int batch_preempt(struct queue *__queue, task_t *task)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	if (!queue->curr)
+		return 1;
+	else
+		return task->sched_info.cl_data.bt.prio
+				< queue->curr->sched_info.cl_data.bt.prio;
+}
+
+static int batch_tasks(struct queue *__queue)
+{
+	struct batch_queue *queue = (struct batch_queue *)__queue;
+	return queue->tasks;
+}
+
+static int batch_nice(struct queue *queue, task_t *task)
+{
+	return 20;
+}
+
+static int batch_prio(task_t *task)
+{
+	return USER_PRIO(task->sched_info.cl_data.bt.prio + MIN_BATCH_PRIO);
+}
+
+static void batch_setprio(task_t *task, int prio)
+{
+	BUG_ON(prio < 0);
+	BUG_ON(prio >= MAX_BATCH_PRIO);
+	task->sched_info.cl_data.bt.prio = prio;
+}
+
+struct queue_ops batch_ops = {
+	.init		= batch_init,
+	.fini		= nop_fini,
+	.tick		= batch_tick,
+	.yield		= batch_yield,
+	.curr		= batch_curr,
+	.set_curr	= batch_set_curr,
+	.tasks		= batch_tasks,
+	.best		= batch_best,
+	.enqueue	= batch_enqueue,
+	.dequeue	= batch_dequeue,
+	.start_wait	= queue_nop,
+	.stop_wait	= queue_nop,
+	.sleep		= queue_nop,
+	.wake		= queue_nop,
+	.preempt	= batch_preempt,
+	.nice		= batch_nice,
+	.renice		= nop_renice,
+	.prio		= batch_prio,
+	.setprio	= batch_setprio,
+	.timeslice	= nop_timeslice,
+	.set_timeslice	= nop_set_timeslice,
+};
+
+struct policy batch_policy = {
+	.ops	= &batch_ops,
+};
diff -prauN linux-2.6.0-test11/kernel/sched/idle.c sched-2.6.0-test11-5/kernel/sched/idle.c
--- linux-2.6.0-test11/kernel/sched/idle.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/idle.c	2003-12-19 17:31:39.000000000 -0800
@@ -0,0 +1,99 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+static DEFINE_PER_CPU(task_t *, idle_tasks) = NULL;
+
+static int idle_nice(struct queue *queue, task_t *task)
+{
+	return 20;
+}
+
+static int idle_tasks(struct queue *queue)
+{
+	task_t **idle = (task_t **)queue;
+	return !!(*idle);
+}
+
+static task_t *idle_task(struct queue *queue)
+{
+	return *((task_t **)queue);
+}
+
+static void idle_yield(struct queue *queue, task_t *task)
+{
+	set_need_resched();
+}
+
+static void idle_enqueue(struct queue *queue, task_t *task)
+{
+	task_t **idle = (task_t **)queue;
+	*idle = task;
+}
+
+static void idle_dequeue(struct queue *queue, task_t *task)
+{
+}
+
+static int idle_preempt(struct queue *queue, task_t *task)
+{
+	return 0;
+}
+
+static int idle_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	runqueue_t *rq = &per_cpu(runqueues, smp_processor_id());
+
+	if (atomic_read(&rq->nr_iowait) > 0)
+		cpustat->iowait += sys_ticks;
+	else
+		cpustat->idle += sys_ticks;
+	return 1;
+}
+
+static int idle_init(struct policy *policy, int cpu)
+{
+	policy->queue = (struct queue *)&per_cpu(idle_tasks, cpu);
+	return 0;
+}
+
+static int idle_prio(task_t *task)
+{
+	return MAX_USER_PRIO;
+}
+
+static void idle_setprio(task_t *task, int prio)
+{
+}
+
+static struct queue_ops idle_ops = {
+	.init		= idle_init,
+	.fini		= nop_fini,
+	.tick		= idle_tick,
+	.yield		= idle_yield,
+	.curr		= idle_task,
+	.set_curr	= queue_nop,
+	.tasks		= idle_tasks,
+	.best		= idle_task,
+	.enqueue	= idle_enqueue,
+	.dequeue	= idle_dequeue,
+	.start_wait	= queue_nop,
+	.stop_wait	= queue_nop,
+	.sleep		= queue_nop,
+	.wake		= queue_nop,
+	.preempt	= idle_preempt,
+	.nice		= idle_nice,
+	.renice		= nop_renice,
+	.prio		= idle_prio,
+	.setprio	= idle_setprio,
+	.timeslice	= nop_timeslice,
+	.set_timeslice	= nop_set_timeslice,
+};
+
+struct policy idle_policy = {
+	.ops	= &idle_ops,
+};
diff -prauN linux-2.6.0-test11/kernel/sched/queue.h sched-2.6.0-test11-5/kernel/sched/queue.h
--- linux-2.6.0-test11/kernel/sched/queue.h	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/queue.h	2003-12-23 03:58:02.000000000 -0800
@@ -0,0 +1,104 @@
+#define SCHED_POLICY_RT		0
+#define SCHED_POLICY_TS		1
+#define SCHED_POLICY_BATCH	2
+#define SCHED_POLICY_IDLE	3
+
+#define RT_POLICY_FIFO		0
+#define RT_POLICY_RR		1
+
+#define NODE_THRESHOLD		125
+
+struct queue;
+struct queue_ops;
+
+struct policy {
+	struct queue *queue;
+	struct queue_ops *ops;
+};
+
+extern struct policy rt_policy, ts_policy, batch_policy, idle_policy;
+
+struct runqueue {
+        spinlock_t lock;
+	int curr;
+	task_t *__curr;
+	unsigned long policy_bitmap;
+	struct policy *policies[BITS_PER_LONG];
+        unsigned long nr_running, nr_switches, nr_uninterruptible;
+        struct mm_struct *prev_mm;
+        int prev_cpu_load[NR_CPUS];
+#ifdef CONFIG_NUMA
+        atomic_t *node_nr_running;
+        int prev_node_load[MAX_NUMNODES];
+#endif
+        task_t *migration_thread;
+        struct list_head migration_queue;
+
+        atomic_t nr_iowait;
+};
+
+typedef struct runqueue runqueue_t;
+
+struct queue_ops {
+	int (*init)(struct policy *, int);
+	void (*fini)(struct policy *, int);
+	task_t *(*curr)(struct queue *);
+	void (*set_curr)(struct queue *, task_t *);
+	task_t *(*best)(struct queue *);
+	int (*tick)(struct queue *, task_t *, int, int);
+	int (*tasks)(struct queue *);
+	void (*enqueue)(struct queue *, task_t *);
+	void (*dequeue)(struct queue *, task_t *);
+	void (*start_wait)(struct queue *, task_t *);
+	void (*stop_wait)(struct queue *, task_t *);
+	void (*sleep)(struct queue *, task_t *);
+	void (*wake)(struct queue *, task_t *);
+	int (*preempt)(struct queue *, task_t *);
+	void (*yield)(struct queue *, task_t *);
+	int (*prio)(task_t *);
+	void (*setprio)(task_t *, int);
+	int (*nice)(struct queue *, task_t *);
+	void (*renice)(struct queue *, task_t *, int);
+	unsigned long (*timeslice)(struct queue *, task_t *);
+	void (*set_timeslice)(struct queue *, task_t *, unsigned long);
+};
+
+DECLARE_PER_CPU(runqueue_t, runqueues);
+
+int find_first_circular_bit(unsigned long *, int, int);
+void queue_nop(struct queue *, task_t *);
+void nop_renice(struct queue *, task_t *, int);
+void nop_fini(struct policy *, int);
+unsigned long nop_timeslice(struct queue *, task_t *);
+void nop_set_timeslice(struct queue *, task_t *, unsigned long);
+
+/* #define DEBUG_SCHED */
+
+#ifdef DEBUG_SCHED
+#define __check_task_policy(idx)					\
+do {									\
+	unsigned long __idx__ = (idx);					\
+	if (__idx__ > SCHED_POLICY_IDLE) {				\
+		printk("invalid policy 0x%lx\n", __idx__);		\
+		BUG();							\
+	}								\
+} while (0)
+
+#define check_task_policy(task)						\
+do {									\
+	__check_task_policy((task)->sched_info.policy);			\
+} while (0)
+
+#define check_policy(policy)						\
+do {									\
+	BUG_ON((policy) != &rt_policy &&				\
+		(policy) != &ts_policy &&				\
+		(policy) != &batch_policy &&				\
+		(policy) != &idle_policy);				\
+} while (0)
+
+#else /* !DEBUG_SCHED */
+#define __check_task_policy(idx)			do { } while (0)
+#define check_task_policy(task)				do { } while (0)
+#define check_policy(policy)				do { } while (0)
+#endif /* !DEBUG_SCHED */
diff -prauN linux-2.6.0-test11/kernel/sched/rt.c sched-2.6.0-test11-5/kernel/sched/rt.c
--- linux-2.6.0-test11/kernel/sched/rt.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/rt.c	2003-12-19 18:16:07.000000000 -0800
@@ -0,0 +1,208 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+#ifdef DEBUG_SCHED
+#define check_rt_policy(task)						\
+do {									\
+	BUG_ON((task)->sched_info.policy != SCHED_POLICY_RT);		\
+	BUG_ON((task)->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR	\
+			&&						\
+	      (task)->sched_info.cl_data.rt.rt_policy!=RT_POLICY_FIFO);	\
+	BUG_ON((task)->sched_info.cl_data.rt.prio < 0);			\
+	BUG_ON((task)->sched_info.cl_data.rt.prio >= MAX_RT_PRIO);	\
+} while (0)
+#else
+#define check_rt_policy(task)				do { } while (0)
+#endif
+
+struct rt_queue {
+	unsigned long bitmap[BITS_TO_LONGS(MAX_RT_PRIO)];
+	struct list_head queue[MAX_RT_PRIO];
+	task_t *curr;
+	int tasks;
+};
+
+static DEFINE_PER_CPU(struct rt_queue, rt_queues);
+
+static int rt_init(struct policy *policy, int cpu)
+{
+	int k;
+	struct rt_queue *queue = &per_cpu(rt_queues, cpu);
+
+	policy->queue = (struct queue *)queue;
+	for (k = 0; k < MAX_RT_PRIO; ++k)
+		INIT_LIST_HEAD(&queue->queue[k]);
+	return 0;
+}
+
+static void rt_yield(struct queue *__queue, task_t *task)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	check_rt_policy(task);
+	list_del(&task->sched_info.run_list);
+	if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio]))
+		set_need_resched();
+	list_add_tail(&task->sched_info.run_list,
+			&queue->queue[task->sched_info.cl_data.rt.prio]);
+	check_rt_policy(task);
+}
+
+static int rt_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks)
+{
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	check_rt_policy(task);
+	cpustat->user += user_ticks;
+	cpustat->system += sys_ticks;
+	if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) {
+		task->sched_info.cl_data.rt.ticks--;
+		if (!task->sched_info.cl_data.rt.ticks) {
+			task->sched_info.cl_data.rt.ticks =
+				task->sched_info.cl_data.rt.quantum;
+			rt_yield(queue, task);
+		}
+	}
+	check_rt_policy(task);
+	return 0;
+}
+
+static task_t *rt_curr(struct queue *__queue)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	task_t *task = queue->curr;
+	check_rt_policy(task);
+	return task;
+}
+
+static void rt_set_curr(struct queue *__queue, task_t *task)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	queue->curr = task;
+	check_rt_policy(task);
+}
+
+static task_t *rt_best(struct queue *__queue)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	task_t *task;
+	int idx;
+	idx = find_first_bit(queue->bitmap, MAX_RT_PRIO);
+	BUG_ON(idx >= MAX_RT_PRIO);
+	task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+	check_rt_policy(task);
+	return task;
+}
+
+static void rt_enqueue(struct queue *__queue, task_t *task)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	check_rt_policy(task);
+	if (!test_bit(task->sched_info.cl_data.rt.prio, queue->bitmap))
+		__set_bit(task->sched_info.cl_data.rt.prio, queue->bitmap);
+	list_add_tail(&task->sched_info.run_list,
+			&queue->queue[task->sched_info.cl_data.rt.prio]);
+	check_rt_policy(task);
+	queue->tasks++;
+	if (!queue->curr)
+		queue->curr = task;
+}
+
+static void rt_dequeue(struct queue *__queue, task_t *task)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	check_rt_policy(task);
+	list_del(&task->sched_info.run_list);
+	if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio]))
+		__clear_bit(task->sched_info.cl_data.rt.prio, queue->bitmap);
+	queue->tasks--;
+	check_rt_policy(task);
+	if (!queue->tasks)
+		queue->curr = NULL;
+	else if (task == queue->curr)
+		queue->curr = rt_best(__queue);
+}
+
+static int rt_preempt(struct queue *__queue, task_t *task)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	check_rt_policy(task);
+	if (!queue->curr)
+		return 1;
+	check_rt_policy(queue->curr);
+	return task->sched_info.cl_data.rt.prio
+			< queue->curr->sched_info.cl_data.rt.prio;
+}
+
+static int rt_tasks(struct queue *__queue)
+{
+	struct rt_queue *queue = (struct rt_queue *)__queue;
+	return queue->tasks;
+}
+
+static int rt_nice(struct queue *queue, task_t *task)
+{
+	check_rt_policy(task);
+	return -20;
+}
+
+static unsigned long rt_timeslice(struct queue *queue, task_t *task)
+{
+	check_rt_policy(task);
+	if (task->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR)
+		return 0;
+	else
+		return task->sched_info.cl_data.rt.quantum;
+}
+
+static void rt_set_timeslice(struct queue *queue, task_t *task, unsigned long n)
+{
+	check_rt_policy(task);
+	if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR)
+		task->sched_info.cl_data.rt.quantum = n;
+	check_rt_policy(task);
+}
+
+static void rt_setprio(task_t *task, int prio)
+{
+	check_rt_policy(task);
+	BUG_ON(prio < 0);
+	BUG_ON(prio >= MAX_RT_PRIO);
+	task->sched_info.cl_data.rt.prio = prio;
+}
+
+static int rt_prio(task_t *task)
+{
+	check_rt_policy(task);
+	return USER_PRIO(task->sched_info.cl_data.rt.prio);
+}
+
+static struct queue_ops rt_ops = {
+	.init		= rt_init,
+	.fini		= nop_fini,
+	.tick		= rt_tick,
+	.yield		= rt_yield,
+	.curr		= rt_curr,
+	.set_curr	= rt_set_curr,
+	.tasks		= rt_tasks,
+	.best		= rt_best,
+	.enqueue	= rt_enqueue,
+	.dequeue	= rt_dequeue,
+	.start_wait	= queue_nop,
+	.stop_wait	= queue_nop,
+	.sleep		= queue_nop,
+	.wake		= queue_nop,
+	.preempt	= rt_preempt,
+	.nice		= rt_nice,
+	.renice		= nop_renice,
+	.prio		= rt_prio,
+	.setprio	= rt_setprio,
+	.timeslice	= rt_timeslice,
+	.set_timeslice	= rt_set_timeslice,
+};
+
+struct policy rt_policy = {
+	.ops	= &rt_ops,
+};
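[ Annotation, not part of the patch: rt_enqueue(), rt_dequeue() and rt_best() above keep one bit per priority level, so the best runnable level can be found with one bitmap scan. The userspace sketch below illustrates only that bookkeeping. All toy_* names are hypothetical, per-level counters stand in for the list_head queues, and TOY_NPRIO = 100 is an assumed value rather than one taken from this patch. ]

```c
#include <limits.h>

#define TOY_NPRIO	100	/* assumed number of RT levels, not from the patch */
#define TOY_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS	((TOY_NPRIO + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

struct toy_rt_queue {
	unsigned long bitmap[TOY_WORDS];	/* bit set <=> level non-empty */
	int count[TOY_NPRIO];			/* stands in for the per-level lists */
	int tasks;
};

static void toy_enqueue(struct toy_rt_queue *q, int prio)
{
	q->bitmap[prio / TOY_WORD_BITS] |= 1UL << (prio % TOY_WORD_BITS);
	q->count[prio]++;
	q->tasks++;
}

static void toy_dequeue(struct toy_rt_queue *q, int prio)
{
	/* clear the bit only when the level drains, as rt_dequeue() does */
	if (--q->count[prio] == 0)
		q->bitmap[prio / TOY_WORD_BITS] &= ~(1UL << (prio % TOY_WORD_BITS));
	q->tasks--;
}

/* lowest set bit wins (lower number = higher priority); -1 if empty */
static int toy_best(const struct toy_rt_queue *q)
{
	int prio;

	for (prio = 0; prio < TOY_NPRIO; ++prio)
		if (q->bitmap[prio / TOY_WORD_BITS] & (1UL << (prio % TOY_WORD_BITS)))
			return prio;
	return -1;
}
```

[ With levels 50, 10 and 10 enqueued, toy_best() keeps reporting 10 until both level-10 entries are dequeued, then falls back to 50. The real code also tracks queue->curr, which this sketch leaves out. ]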
diff -prauN linux-2.6.0-test11/kernel/sched/ts.c sched-2.6.0-test11-5/kernel/sched/ts.c
--- linux-2.6.0-test11/kernel/sched/ts.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/ts.c	2003-12-23 08:24:55.000000000 -0800
@@ -0,0 +1,841 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/kernel_stat.h>
+#include <asm/page.h>
+#include "queue.h"
+
+#ifdef DEBUG_SCHED
+#define check_ts_policy(task)						\
+do {									\
+	BUG_ON((task)->sched_info.policy != SCHED_POLICY_TS);		\
+} while (0)
+
+#define check_nice(__queue__)						\
+({									\
+	int __k__, __count__ = 0;					\
+	if ((__queue__)->tasks < 0) {					\
+		printk("negative nice task count %d\n", 		\
+			(__queue__)->tasks);				\
+		BUG();							\
+	}								\
+	for (__k__ = 0; __k__ < NICE_QLEN; ++__k__) {			\
+		task_t *__task__;					\
+		if (list_empty(&(__queue__)->queue[__k__])) {		\
+			if (test_bit(__k__, (__queue__)->bitmap)) {	\
+				printk("wrong nice bit set\n");		\
+				BUG();					\
+			}						\
+		} else {						\
+			if (!test_bit(__k__, (__queue__)->bitmap)) {	\
+				printk("wrong nice bit clear\n");	\
+				BUG();					\
+			}						\
+		}							\
+		list_for_each_entry(__task__,				\
+					&(__queue__)->queue[__k__],	\
+					sched_info.run_list) {		\
+			check_ts_policy(__task__);			\
+			if (__task__->sched_info.idx != __k__) {	\
+				printk("nice index mismatch\n");	\
+				BUG();					\
+			}						\
+			++__count__;					\
+		}							\
+	}								\
+	if ((__queue__)->tasks != __count__) {				\
+		printk("wrong nice task count\n");			\
+		printk("expected %d, got %d\n",				\
+			(__queue__)->tasks,				\
+			__count__);					\
+		BUG();							\
+	}								\
+	__count__;							\
+})
+
+#define check_queue(__queue)						\
+do {									\
+	int __k, __count = 0;						\
+	if ((__queue)->tasks < 0) {					\
+		printk("negative queue task count %d\n", 		\
+			(__queue)->tasks);				\
+		BUG();							\
+	}								\
+	for (__k = 0; __k < 40; ++__k) {				\
+		struct nice_queue *__nice;				\
+		if (list_empty(&(__queue)->nices[__k])) {		\
+			if (test_bit(__k, (__queue)->bitmap)) {		\
+				printk("wrong queue bit set\n");	\
+				BUG();					\
+			}						\
+		} else {						\
+			if (!test_bit(__k, (__queue)->bitmap)) {	\
+				printk("wrong queue bit clear\n");	\
+				BUG();					\
+			}						\
+		}							\
+		list_for_each_entry(__nice,				\
+					&(__queue)->nices[__k],		\
+					list) {				\
+			__count += check_nice(__nice);			\
+			if (__nice->idx != __k) {			\
+				printk("queue index mismatch\n");	\
+				BUG();					\
+			}						\
+		}							\
+	}								\
+	if ((__queue)->tasks != __count) {				\
+		printk("wrong queue task count\n");			\
+		printk("expected %d, got %d\n",				\
+			(__queue)->tasks,				\
+			__count);					\
+		BUG();							\
+	}								\
+} while (0)
+
+#else /* !DEBUG_SCHED */
+#define check_ts_policy(task)			do { } while (0)
+#define check_nice(nice)			do { } while (0)
+#define check_queue(queue)			do { } while (0)
+#endif
+
+/*
+ * Hybrid deadline/multilevel scheduling. CPU utilization-dependent
+ * deadlines at wake. Queue rotation every 50ms or when
+ * demotions empty the highest level, setting demoted deadlines
+ * relative to the new highest level. Intra-level RR quantum at 10ms.
+ */
+struct nice_queue {
+	int idx, nice, base, tasks, level_quantum, expired;
+	unsigned long bitmap[BITS_TO_LONGS(NICE_QLEN)];
+	struct list_head list, queue[NICE_QLEN];
+	task_t *curr;
+};
+
+/*
+ * Deadline schedule nice levels with priority-dependent deadlines,
+ * default quantum of 100ms. Queue rotates at demotions emptying the
+ * highest level, setting the demoted deadline relative to the new
+ * highest level.
+ */
+struct ts_queue {
+	struct nice_queue nice_levels[40];
+	struct list_head nices[40];
+	int base, quantum, tasks;
+	unsigned long bitmap[BITS_TO_LONGS(40)];
+	struct nice_queue *curr;
+};
+
+/*
+ * Make these sysctl-tunable.
+ */
+static int nice_quantum = 100;
+static int rr_quantum = 10;
+static int level_quantum = 50;
+static int sample_interval = HZ;
+
+static DEFINE_PER_CPU(struct ts_queue, ts_queues);
+
+static task_t *nice_best(struct nice_queue *);
+static struct nice_queue *ts_best_nice(struct ts_queue *);
+
+static void nice_init(struct nice_queue *queue)
+{
+	int k;
+
+	INIT_LIST_HEAD(&queue->list);
+	for (k = 0; k < NICE_QLEN; ++k) {
+		INIT_LIST_HEAD(&queue->queue[k]);
+	}
+}
+
+static int ts_init(struct policy *policy, int cpu)
+{
+	int k;
+	struct ts_queue *queue = &per_cpu(ts_queues, cpu);
+
+	policy->queue = (struct queue *)queue;
+	queue->quantum = nice_quantum;
+
+	for (k = 0; k < 40; ++k) {
+		nice_init(&queue->nice_levels[k]);
+		queue->nice_levels[k].nice = k;
+		INIT_LIST_HEAD(&queue->nices[k]);
+	}
+	return 0;
+}
+
+static int task_deadline(task_t *task)
+{
+	u64 frac_cpu = task->sched_info.cl_data.ts.frac_cpu;
+	frac_cpu *= (u64)NICE_QLEN;
+	frac_cpu >>= 32;
+	return (int)min((u32)(NICE_QLEN - 1), (u32)frac_cpu);
+}
+
+static void nice_rotate_queue(struct nice_queue *queue)
+{
+	int idx, new_idx, deadline, idxdiff;
+	task_t *task = queue->curr;
+
+	check_nice(queue);
+
+	/* idxdiff == NICE_QLEN - 1 => deadline == 0, handled as idx == new_idx below */
+	idx = queue->curr->sched_info.idx;
+	idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN;
+	deadline = min(1 + task_deadline(task), NICE_QLEN - idxdiff - 1);
+	new_idx = (idx + deadline) % NICE_QLEN;
+#if 0
+	if (idx == new_idx) {
+		/*
+		 * buggy; it sets queue->base = idx because in this case
+		 * we have task_deadline(task) == 0
+		 */
+		new_idx = (idx - task_deadline(task) + NICE_QLEN) % NICE_QLEN;
+		if (queue->base != new_idx)
+			queue->base = new_idx;
+		return;
+	}
+	BUG_ON(!deadline);
+	BUG_ON(queue->base <= new_idx && new_idx <= idx);
+	BUG_ON(idx < queue->base && queue->base <= new_idx);
+	BUG_ON(new_idx <= idx && idx < queue->base);
+	if (0 && idx == new_idx) {
+		printk("FUCKUP: pid = %d, tdl = %d, dl = %d, idx = %d, "
+				"base = %d, diff = %d, fcpu = 0x%lx\n",
+			queue->curr->pid,
+			task_deadline(queue->curr),
+			deadline,
+			idx,
+			queue->base,
+			idxdiff,
+			task->sched_info.cl_data.ts.frac_cpu);
+		BUG();
+	}
+#else
+	/*
+	 * RR in the last deadline
+	 * special-cased so as not to trip BUG_ON()'s below
+	 */
+	if (idx == new_idx) {
+		/* if we got here these two things must hold */
+		BUG_ON(idxdiff != NICE_QLEN - 1);
+		BUG_ON(deadline);
+		list_move_tail(&task->sched_info.run_list, &queue->queue[idx]);
+		if (queue->expired) {
+			queue->level_quantum = level_quantum;
+			queue->expired = 0;
+		}
+		return;
+	}
+#endif
+	task->sched_info.idx = new_idx;
+	if (!test_bit(new_idx, queue->bitmap)) {
+		BUG_ON(!list_empty(&queue->queue[new_idx]));
+		__set_bit(new_idx, queue->bitmap);
+	}
+	list_move_tail(&task->sched_info.run_list,
+			&queue->queue[new_idx]);
+
+	/* expired until list drains */
+	if (!list_empty(&queue->queue[idx]))
+		queue->expired = 1;
+	else {
+		int k, w, m = NICE_QLEN % BITS_PER_LONG;
+		BUG_ON(!test_bit(idx, queue->bitmap));
+		__clear_bit(idx, queue->bitmap);
+
+		for (w = 0, k = 0; k < NICE_QLEN/BITS_PER_LONG; ++k)
+			w += hweight_long(queue->bitmap[k]);
+		if (NICE_QLEN % BITS_PER_LONG)
+			w += hweight_long(queue->bitmap[k] & ((1UL << m) - 1));
+		if (w > 1)
+			queue->base = (queue->base + 1) % NICE_QLEN;
+		queue->level_quantum = level_quantum;
+		queue->expired = 0;
+	}
+	check_nice(queue);
+}
+
+static void nice_tick(struct nice_queue *queue, task_t *task)
+{
+	int idx = task->sched_info.idx;
+	BUG_ON(!task_queued(task));
+	BUG_ON(task != queue->curr);
+	BUG_ON(!test_bit(idx, queue->bitmap));
+	BUG_ON(list_empty(&queue->queue[idx]));
+	check_ts_policy(task);
+	check_nice(queue);
+
+	if (task->sched_info.cl_data.ts.ticks)
+		task->sched_info.cl_data.ts.ticks--;
+
+	if (queue->level_quantum > level_quantum) {
+		WARN_ON(1);
+		queue->level_quantum = 1;
+	}
+
+	if (!queue->expired) {
+		if (queue->level_quantum)
+			queue->level_quantum--;
+	} else if (0 && queue->queue[idx].prev != &task->sched_info.run_list) {
+		int queued = 0, new_idx = (queue->base + 1) % NICE_QLEN;
+		task_t *curr, *sav;
+		task_t *victim = list_entry(queue->queue[idx].prev,
+						task_t,
+						sched_info.run_list);
+		victim->sched_info.idx = new_idx;
+		if (!test_bit(new_idx, queue->bitmap))
+			__set_bit(new_idx, queue->bitmap);
+#if 1
+		list_for_each_entry_safe(curr, sav, &queue->queue[new_idx], sched_info.run_list) {
+			if (victim->sched_info.cl_data.ts.frac_cpu
+				< curr->sched_info.cl_data.ts.frac_cpu) {
+				queued = 1;
+				list_move(&victim->sched_info.run_list,
+						curr->sched_info.run_list.prev);
+				break;
+			}
+		}
+		if (!queued)
+			list_move_tail(&victim->sched_info.run_list,
+					&queue->queue[new_idx]);
+#else
+		list_move(&victim->sched_info.run_list, &queue->queue[new_idx]);
+#endif
+		BUG_ON(list_empty(&queue->queue[idx]));
+	}
+
+	if (!queue->level_quantum && !queue->expired) {
+		check_nice(queue);
+		nice_rotate_queue(queue);
+		check_nice(queue);
+		set_need_resched();
+	} else if (!task->sched_info.cl_data.ts.ticks) {
+		int idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN;
+		check_nice(queue);
+		task->sched_info.cl_data.ts.ticks = rr_quantum;
+		BUG_ON(!test_bit(idx, queue->bitmap));
+		BUG_ON(list_empty(&queue->queue[idx]));
+		if (queue->expired)
+			nice_rotate_queue(queue);
+		else if (idxdiff == NICE_QLEN - 1)
+			list_move_tail(&task->sched_info.run_list,
+					&queue->queue[idx]);
+		else {
+			int new_idx = (idx + 1) % NICE_QLEN;
+			list_del(&task->sched_info.run_list);
+			if (list_empty(&queue->queue[idx])) {
+				BUG_ON(!test_bit(idx, queue->bitmap));
+				__clear_bit(idx, queue->bitmap);
+			}
+			if (!test_bit(new_idx, queue->bitmap)) {
+				BUG_ON(!list_empty(&queue->queue[new_idx]));
+				__set_bit(new_idx, queue->bitmap);
+			}
+			task->sched_info.idx = new_idx;
+			list_add(&task->sched_info.run_list,
+					&queue->queue[new_idx]);
+		}
+		check_nice(queue);
+		set_need_resched();
+	}
+	check_nice(queue);
+	check_ts_policy(task);
+}
+
+static void ts_rotate_queue(struct ts_queue *queue)
+{
+	int idx, new_idx, idxdiff, off, deadline;
+
+	queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+
+	/* idxdiff == 39 => deadline == 0, handled as idx == new_idx below */
+	check_queue(queue);
+	idx = queue->curr->idx;
+	idxdiff = (idx - queue->base + 40) % 40;
+	off = (int)(queue->curr - queue->nice_levels);
+	deadline = min(1 + off, 40 - idxdiff - 1);
+	new_idx = (idx + deadline) % 40;
+	if (idx == new_idx) {
+		new_idx = (idx - off + 40) % 40;
+		if (queue->base != new_idx)
+			queue->base = new_idx;
+		return;
+	}
+	BUG_ON(!deadline);
+	BUG_ON(queue->base <= new_idx && new_idx <= idx);
+	BUG_ON(idx < queue->base && queue->base <= new_idx);
+	BUG_ON(new_idx <= idx && idx < queue->base);
+	if (!test_bit(new_idx, queue->bitmap)) {
+		BUG_ON(!list_empty(&queue->nices[new_idx]));
+		__set_bit(new_idx, queue->bitmap);
+	}
+	list_move_tail(&queue->curr->list, &queue->nices[new_idx]);
+	queue->curr->idx = new_idx;
+
+	if (list_empty(&queue->nices[idx])) {
+		BUG_ON(!test_bit(idx, queue->bitmap));
+		__clear_bit(idx, queue->bitmap);
+		queue->base = (queue->base + 1) % 40;
+	}
+	check_queue(queue);
+}
+
+static int ts_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice = queue->curr;
+	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	int nice_idx = (int)(queue->curr - queue->nice_levels);
+	unsigned long sample_end, delta;
+
+	check_queue(queue);
+	check_ts_policy(task);
+	BUG_ON(!nice);
+	BUG_ON(nice_idx != task->sched_info.cl_data.ts.nice);
+	BUG_ON(!test_bit(nice->idx, queue->bitmap));
+	BUG_ON(list_empty(&queue->nices[nice->idx]));
+
+	sample_end = jiffies;
+	delta = sample_end - task->sched_info.cl_data.ts.sample_start;
+	if (delta)
+		task->sched_info.cl_data.ts.sample_ticks++;
+	else {
+		task->sched_info.cl_data.ts.sample_start = jiffies;
+		task->sched_info.cl_data.ts.sample_ticks = 1;
+	}
+
+	if (delta >= sample_interval) {
+		u64 frac_cpu;
+		frac_cpu = (u64)task->sched_info.cl_data.ts.sample_ticks << 32;
+		do_div(frac_cpu, delta);
+		frac_cpu = 2*frac_cpu + task->sched_info.cl_data.ts.frac_cpu;
+		do_div(frac_cpu, 3);
+		frac_cpu = min(frac_cpu, (1ULL << 32) - 1);
+		task->sched_info.cl_data.ts.frac_cpu = (unsigned long)frac_cpu;
+		task->sched_info.cl_data.ts.sample_start = sample_end;
+		task->sched_info.cl_data.ts.sample_ticks = 0;
+	}
+
+	cpustat->user += user_ticks;
+	cpustat->system += sys_ticks;
+	nice_tick(nice, task);
+	if (queue->quantum > nice_quantum) {
+		queue->quantum = 0;
+		WARN_ON(1);
+	} else if (queue->quantum)
+		queue->quantum--;
+	if (!queue->quantum) {
+		queue->quantum = nice_quantum;
+		ts_rotate_queue(queue);
+		set_need_resched();
+	}
+	check_queue(queue);
+	check_ts_policy(task);
+	return 0;
+}
+
+static void nice_yield(struct nice_queue *queue, task_t *task)
+{
+	int idx, new_idx = (queue->base + NICE_QLEN - 1) % NICE_QLEN;
+
+	check_nice(queue);
+	check_ts_policy(task);
+	if (!test_bit(new_idx, queue->bitmap)) {
+		BUG_ON(!list_empty(&queue->queue[new_idx]));
+		__set_bit(new_idx, queue->bitmap);
+	}
+	list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]);
+	idx = task->sched_info.idx;
+	task->sched_info.idx = new_idx;
+	set_need_resched();
+
+	if (list_empty(&queue->queue[idx])) {
+		BUG_ON(!test_bit(idx, queue->bitmap));
+		__clear_bit(idx, queue->bitmap);
+	}
+	queue->curr = nice_best(queue);
+#if 0
+	if (queue->curr->sched_info.idx != queue->base)
+		queue->base = queue->curr->sched_info.idx;
+#endif
+	check_nice(queue);
+	check_ts_policy(task);
+}
+
+/*
+ * This is somewhat problematic; nice_yield() only parks tasks on
+ * the end of their current nice levels.
+ */
+static void ts_yield(struct queue *__queue, task_t *task)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice = queue->curr;
+
+	check_queue(queue);
+	check_ts_policy(task);
+	nice_yield(nice, task);
+
+	/*
+	 * If there's no one to yield to, move the whole nice level.
+	 * If this is problematic, setting nice-dependent deadlines
+	 * on a single unified queue may be in order.
+	 */
+	if (nice->tasks == 1) {
+		int idx, new_idx = (queue->base + 40 - 1) % 40;
+		idx = nice->idx;
+		if (!test_bit(new_idx, queue->bitmap)) {
+			BUG_ON(!list_empty(&queue->nices[new_idx]));
+			__set_bit(new_idx, queue->bitmap);
+		}
+		list_move_tail(&nice->list, &queue->nices[new_idx]);
+		if (list_empty(&queue->nices[idx])) {
+			BUG_ON(!test_bit(idx, queue->bitmap));
+			__clear_bit(idx, queue->bitmap);
+		}
+		nice->idx = new_idx;
+		queue->base = find_first_circular_bit(queue->bitmap,
+							queue->base,
+							40);
+		BUG_ON(queue->base >= 40);
+		BUG_ON(!test_bit(queue->base, queue->bitmap));
+		queue->curr = ts_best_nice(queue);
+	}
+	check_queue(queue);
+	check_ts_policy(task);
+}
+
+static task_t *ts_curr(struct queue *__queue)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	task_t *task = queue->curr->curr;
+	check_queue(queue);
+	if (task)
+		check_ts_policy(task);
+	return task;
+}
+
+static void ts_set_curr(struct queue *__queue, task_t *task)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice;
+	check_queue(queue);
+	check_ts_policy(task);
+	nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+	queue->curr = nice;
+	nice->curr = task;
+	check_queue(queue);
+	check_ts_policy(task);
+}
+
+static task_t *nice_best(struct nice_queue *queue)
+{
+	task_t *task;
+	int idx = find_first_circular_bit(queue->bitmap,
+						queue->base,
+						NICE_QLEN);
+	check_nice(queue);
+	if (idx >= NICE_QLEN)
+		return NULL;
+	BUG_ON(list_empty(&queue->queue[idx]));
+	BUG_ON(!test_bit(idx, queue->bitmap));
+	task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list);
+	check_nice(queue);
+	check_ts_policy(task);
+	return task;
+}
+
+static struct nice_queue *ts_best_nice(struct ts_queue *queue)
+{
+	int idx = find_first_circular_bit(queue->bitmap, queue->base, 40);
+	check_queue(queue);
+	if (idx >= 40)
+		return NULL;
+	BUG_ON(list_empty(&queue->nices[idx]));
+	BUG_ON(!test_bit(idx, queue->bitmap));
+	return list_entry(queue->nices[idx].next, struct nice_queue, list);
+}
+
+static task_t *ts_best(struct queue *__queue)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice = ts_best_nice(queue);
+	return nice ? nice_best(nice) : NULL;
+}
+
+static void nice_enqueue(struct nice_queue *queue, task_t *task)
+{
+	task_t *curr, *sav;
+	int queued = 0, idx, deadline, base, idxdiff;
+	check_nice(queue);
+	check_ts_policy(task);
+
+	/* don't livelock when queue->expired */
+	deadline = min(!!queue->expired + task_deadline(task), NICE_QLEN - 1);
+	idx = (queue->base + deadline) % NICE_QLEN;
+
+	if (!test_bit(idx, queue->bitmap)) {
+		BUG_ON(!list_empty(&queue->queue[idx]));
+		__set_bit(idx, queue->bitmap);
+	}
+
+#if 1
+	/* keep nice level's queue sorted -- use binomial heaps here soon */
+	list_for_each_entry_safe(curr, sav, &queue->queue[idx], sched_info.run_list) {
+		if (task->sched_info.cl_data.ts.frac_cpu
+				>= curr->sched_info.cl_data.ts.frac_cpu) {
+			list_add(&task->sched_info.run_list,
+					curr->sched_info.run_list.prev);
+			queued = 1;
+			break;
+		}
+	}
+	if (!queued)
+		list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+#else
+	list_add_tail(&task->sched_info.run_list, &queue->queue[idx]);
+#endif
+	task->sched_info.idx = idx;
+	/* if (!task->sched_info.cl_data.ts.ticks) */
+		task->sched_info.cl_data.ts.ticks = rr_quantum;
+
+	if (queue->tasks)
+		BUG_ON(!queue->curr);
+	else {
+		BUG_ON(queue->curr);
+		queue->curr = task;
+	}
+	queue->tasks++;
+	check_nice(queue);
+	check_ts_policy(task);
+}
+
+static void ts_enqueue(struct queue *__queue, task_t *task)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice;
+
+	check_queue(queue);
+	check_ts_policy(task);
+	nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+	if (!nice->tasks) {
+		int idx = (queue->base + task->sched_info.cl_data.ts.nice) % 40;
+		if (!test_bit(idx, queue->bitmap)) {
+			BUG_ON(!list_empty(&queue->nices[idx]));
+			__set_bit(idx, queue->bitmap);
+		}
+		list_add_tail(&nice->list, &queue->nices[idx]);
+		nice->idx = idx;
+		if (!queue->curr)
+			queue->curr = nice;
+	}
+	nice_enqueue(nice, task);
+	queue->tasks++;
+	queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+	check_queue(queue);
+	check_ts_policy(task);
+}
+
+static void nice_dequeue(struct nice_queue *queue, task_t *task)
+{
+	check_nice(queue);
+	check_ts_policy(task);
+	list_del(&task->sched_info.run_list);
+	if (list_empty(&queue->queue[task->sched_info.idx])) {
+		BUG_ON(!test_bit(task->sched_info.idx, queue->bitmap));
+		__clear_bit(task->sched_info.idx, queue->bitmap);
+	}
+	queue->tasks--;
+	if (task == queue->curr) {
+		queue->curr = nice_best(queue);
+#if 0
+		if (queue->curr)
+			queue->base = queue->curr->sched_info.idx;
+#endif
+	}
+	check_nice(queue);
+	check_ts_policy(task);
+}
+
+static void ts_dequeue(struct queue *__queue, task_t *task)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice;
+
+	BUG_ON(!queue->tasks);
+	check_queue(queue);
+	check_ts_policy(task);
+	nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice];
+
+	nice_dequeue(nice, task);
+	queue->tasks--;
+	if (!nice->tasks) {
+		list_del_init(&nice->list);
+		if (list_empty(&queue->nices[nice->idx])) {
+			BUG_ON(!test_bit(nice->idx, queue->bitmap));
+			__clear_bit(nice->idx, queue->bitmap);
+		}
+		if (nice == queue->curr)
+			queue->curr = ts_best_nice(queue);
+	}
+	queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40);
+	if (queue->base >= 40)
+		queue->base = 0;
+	check_queue(queue);
+	check_ts_policy(task);
+}
+
+static int ts_tasks(struct queue *__queue)
+{
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	check_queue(queue);
+	return queue->tasks;
+}
+
+static int ts_nice(struct queue *__queue, task_t *task)
+{
+	int nice = task->sched_info.cl_data.ts.nice - 20;
+	check_ts_policy(task);
+	BUG_ON(nice < -20);
+	BUG_ON(nice >= 20);
+	return nice;
+}
+
+static void ts_renice(struct queue *queue, task_t *task, int nice)
+{
+	check_queue((struct ts_queue *)queue);
+	check_ts_policy(task);
+	BUG_ON(nice < -20);
+	BUG_ON(nice >= 20);
+	task->sched_info.cl_data.ts.nice = nice + 20;
+	check_queue((struct ts_queue *)queue);
+}
+
+static int nice_task_prio(struct nice_queue *nice, task_t *task)
+{
+	if (!task_queued(task))
+		return task_deadline(task);
+	else {
+		int prio = task->sched_info.idx - nice->base;
+		return prio < 0 ? prio + NICE_QLEN : prio;
+	}
+}
+
+static int ts_nice_prio(struct ts_queue *ts, struct nice_queue *nice)
+{
+	if (list_empty(&nice->list))
+		return (int)(nice - ts->nice_levels);
+	else {
+		int prio = nice->idx - ts->base;
+		return prio < 0 ? prio + 40 : prio;
+	}
+}
+
+/* 100% fake priority to report heuristics and the like */
+static int ts_prio(task_t *task)
+{
+	int policy_idx;
+	struct policy *policy;
+	struct ts_queue *ts;
+	struct nice_queue *nice;
+
+	policy_idx = task->sched_info.policy;
+	policy = per_cpu(runqueues, task_cpu(task)).policies[policy_idx];
+	ts = (struct ts_queue *)policy->queue;
+	nice = &ts->nice_levels[task->sched_info.cl_data.ts.nice];
+	return 40*ts_nice_prio(ts, nice) + nice_task_prio(nice, task);
+}
+
+static void ts_setprio(task_t *task, int prio)
+{
+}
+
+static void ts_start_wait(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_stop_wait(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_sleep(struct queue *__queue, task_t *task)
+{
+}
+
+static void ts_wake(struct queue *__queue, task_t *task)
+{
+}
+
+static int nice_preempt(struct nice_queue *queue, task_t *task)
+{
+	check_nice(queue);
+	check_ts_policy(task);
+	/* assume FB style preemption at wakeup */
+	if (!task_queued(task) || !queue->curr)
+		return 1;
+	else {
+		int delta_t, delta_q;
+		delta_t = (task->sched_info.idx - queue->base + NICE_QLEN)
+				% NICE_QLEN;
+		delta_q = (queue->curr->sched_info.idx - queue->base
+							+ NICE_QLEN)
+				% NICE_QLEN;
+		if (delta_t < delta_q)
+			return 1;
+		else if (task->sched_info.cl_data.ts.frac_cpu
+				< queue->curr->sched_info.cl_data.ts.frac_cpu)
+			return 1;
+		else
+			return 0;
+	}
+	check_nice(queue);
+}
+
+static int ts_preempt(struct queue *__queue, task_t *task)
+{
+	int curr_nice;
+	struct ts_queue *queue = (struct ts_queue *)__queue;
+	struct nice_queue *nice = queue->curr;
+
+	check_queue(queue);
+	check_ts_policy(task);
+	if (!queue->curr)
+		return 1;
+
+	curr_nice = (int)(nice - queue->nice_levels);
+
+	/* preempt when nice number is lower; on a tie, defer to nice_preempt() */
+	if (task->sched_info.cl_data.ts.nice != curr_nice)
+		return task->sched_info.cl_data.ts.nice < curr_nice;
+	else
+		return nice_preempt(nice, task);
+}
+
+static struct queue_ops ts_ops = {
+	.init		= ts_init,
+	.fini		= nop_fini,
+	.tick		= ts_tick,
+	.yield		= ts_yield,
+	.curr		= ts_curr,
+	.set_curr	= ts_set_curr,
+	.tasks		= ts_tasks,
+	.best		= ts_best,
+	.enqueue	= ts_enqueue,
+	.dequeue	= ts_dequeue,
+	.start_wait	= ts_start_wait,
+	.stop_wait	= ts_stop_wait,
+	.sleep		= ts_sleep,
+	.wake		= ts_wake,
+	.preempt	= ts_preempt,
+	.nice		= ts_nice,
+	.renice		= ts_renice,
+	.prio		= ts_prio,
+	.setprio	= ts_setprio,
+	.timeslice	= nop_timeslice,
+	.set_timeslice	= nop_set_timeslice,
+};
+
+struct policy ts_policy = {
+	.ops	= &ts_ops,
+};
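[ Annotation, not part of the patch: ts_tick() above folds each sample into frac_cpu, a 32-bit fixed-point CPU fraction, weighting the new sample 2:1 against the old value and clamping to 2^32 - 1. task_deadline() then scales that fraction to a slot in [0, NICE_QLEN - 1]. The userspace sketch below reproduces only the arithmetic. The function names are hypothetical, QLEN = 128 is an assumed stand-in for NICE_QLEN (whose value is not shown in this part of the patch), and plain 64-bit division replaces do_div(). ]

```c
#include <stdint.h>

#define QLEN 128	/* assumed stand-in for NICE_QLEN */

/*
 * Fold one sample into the running average: (2 * sample + old) / 3,
 * where sample = ticks / delta as a 32.32 fixed-point value.
 * delta must be nonzero.
 */
static uint32_t update_frac_cpu(uint32_t old, uint32_t ticks, uint32_t delta)
{
	uint64_t sample = ((uint64_t)ticks << 32) / delta;
	uint64_t avg = (2 * sample + old) / 3;

	return avg > UINT32_MAX ? UINT32_MAX : (uint32_t)avg;
}

/* Scale the fraction to a queue slot: heavier CPU users land further out. */
static int deadline_slot(uint32_t frac_cpu)
{
	uint64_t slot = ((uint64_t)frac_cpu * QLEN) >> 32;

	return slot > QLEN - 1 ? QLEN - 1 : (int)slot;
}
```

[ Under these assumptions a task that used half of its sample window maps to slot 64 once its average settles at 0x80000000, an idle task stays at slot 0, and a task at 100% is capped at slot 127. ]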
diff -prauN linux-2.6.0-test11/kernel/sched/util.c sched-2.6.0-test11-5/kernel/sched/util.c
--- linux-2.6.0-test11/kernel/sched/util.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched/util.c	2003-12-19 08:43:20.000000000 -0800
@@ -0,0 +1,37 @@
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <asm/page.h>
+#include "queue.h"
+
+int find_first_circular_bit(unsigned long *addr, int start, int end)
+{
+	int bit = find_next_bit(addr, end, start);
+	if (bit < end)
+		return bit;
+	bit = find_first_bit(addr, start);
+	if (bit < start)
+		return bit;
+	return end;
+}
+
+void queue_nop(struct queue *queue, task_t *task)
+{
+}
+
+void nop_renice(struct queue *queue, task_t *task, int nice)
+{
+}
+
+void nop_fini(struct policy *policy, int cpu)
+{
+}
+
+unsigned long nop_timeslice(struct queue *queue, task_t *task)
+{
+	return 0;
+}
+
+void nop_set_timeslice(struct queue *queue, task_t *task, unsigned long n)
+{
+}
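[ Annotation, not part of the patch: find_first_circular_bit() above searches [start, end) first, wraps around to [0, start), and returns end when no bit is set. The userspace sketch below shows the same search order with plain loops in place of find_next_bit() and find_first_bit(). The circ_* names are hypothetical and the bit test is a simplified stand-in for the kernel's test_bit(). ]

```c
#include <limits.h>

#define CIRC_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* simplified stand-in for test_bit() */
static int circ_bit_is_set(const unsigned long *map, int bit)
{
	return (int)((map[bit / CIRC_WORD_BITS] >> (bit % CIRC_WORD_BITS)) & 1UL);
}

/*
 * Scan [start, end) first, then wrap to [0, start).
 * Returns end when no bit is set, matching the patch's convention.
 */
static int circ_first_bit(const unsigned long *map, int start, int end)
{
	int k;

	for (k = start; k < end; ++k)
		if (circ_bit_is_set(map, k))
			return k;
	for (k = 0; k < start; ++k)
		if (circ_bit_is_set(map, k))
			return k;
	return end;
}
```

[ This is why callers such as ts_dequeue() compare the result against 40: a return value equal to end means the bitmap is empty, not that bit 40 was found. ]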
diff -prauN linux-2.6.0-test11/kernel/sched.c sched-2.6.0-test11-5/kernel/sched.c
--- linux-2.6.0-test11/kernel/sched.c	2003-11-26 12:45:17.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/sched.c	2003-12-21 06:06:32.000000000 -0800
@@ -15,6 +15,8 @@
  *		and per-CPU runqueues.  Cleanups and useful suggestions
  *		by Davide Libenzi, preemptible kernel bits by Robert Love.
  *  2003-09-03	Interactivity tuning by Con Kolivas.
+ *  2003-12-17	Total rewrite and generalized scheduler policies
+ *		by William Irwin.
  */
 
 #include <linux/mm.h>
@@ -38,6 +40,8 @@
 #include <linux/cpu.h>
 #include <linux/percpu.h>
 
+#include "sched/queue.h"
+
 #ifdef CONFIG_NUMA
 #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
 #else
@@ -45,181 +49,79 @@
 #endif
 
 /*
- * Convert user-nice values [ -20 ... 0 ... 19 ]
- * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
- * and back.
- */
-#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)
-#define PRIO_TO_NICE(prio)	((prio) - MAX_RT_PRIO - 20)
-#define TASK_NICE(p)		PRIO_TO_NICE((p)->static_prio)
-
-/*
- * 'User priority' is the nice value converted to something we
- * can work with better when scaling various scheduler parameters,
- * it's a [ 0 ... 39 ] range.
- */
-#define USER_PRIO(p)		((p)-MAX_RT_PRIO)
-#define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
-#define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE	(MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
-			(MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
-
-/*
- * Some helpers for converting nanosecond timing to jiffy resolution
- */
-#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
-
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
- */
-#define MIN_TIMESLICE		( 10 * HZ / 1000)
-#define MAX_TIMESLICE		(200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT	30
-#define CHILD_PENALTY		95
-#define PARENT_PENALTY		100
-#define EXIT_WEIGHT		3
-#define PRIO_BONUS_RATIO	25
-#define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA	2
-#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define NODE_THRESHOLD		125
-#define CREDIT_LIMIT		100
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- *  priority range a task can explore, a value of '1' means the
- *  task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
-	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
-		MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
-			num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
-	(v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
-	(SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
-		INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
-	((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define JUST_INTERACTIVE_SLEEP(p) \
-	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
-		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
-	((p)->interactive_credit > CREDIT_LIMIT)
-
-#define LOW_CREDIT(p) \
-	((p)->interactive_credit < -CREDIT_LIMIT)
-
-#define TASK_PREEMPTS_CURR(p, rq) \
-	((p)->prio < (rq)->curr->prio)
-
-/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
- */
-
-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
-	((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1)))
-
-static inline unsigned int task_timeslice(task_t *p)
-{
-	return BASE_TIMESLICE(p);
-}
-
-/*
- * These are the runqueue data structures:
- */
-
-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
-
-typedef struct runqueue runqueue_t;
-
-struct prio_array {
-	int nr_active;
-	unsigned long bitmap[BITMAP_SIZE];
-	struct list_head queue[MAX_PRIO];
-};
-
-/*
  * This is the main, per-CPU runqueue data structure.
  *
  * Locking rule: those places that want to lock multiple runqueues
  * (such as the load balancing or the thread migration code), lock
  * acquire operations must be ordered by ascending &runqueue.
  */
-struct runqueue {
-	spinlock_t lock;
-	unsigned long nr_running, nr_switches, expired_timestamp,
-			nr_uninterruptible;
-	task_t *curr, *idle;
-	struct mm_struct *prev_mm;
-	prio_array_t *active, *expired, arrays[2];
-	int prev_cpu_load[NR_CPUS];
-#ifdef CONFIG_NUMA
-	atomic_t *node_nr_running;
-	int prev_node_load[MAX_NUMNODES];
-#endif
-	task_t *migration_thread;
-	struct list_head migration_queue;
+DEFINE_PER_CPU(struct runqueue, runqueues);
 
-	atomic_t nr_iowait;
+struct policy *policies[] = {
+	&rt_policy,
+	&ts_policy,
+	&batch_policy,
+	&idle_policy,
+	NULL,
 };
 
-static DEFINE_PER_CPU(struct runqueue, runqueues);
-
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
-#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
+#define rq_curr(rq)		((rq)->__curr)
+#define cpu_curr(cpu)		rq_curr(cpu_rq(cpu))
+
+static inline struct policy *task_policy(task_t *task)
+{
+	unsigned long idx;
+	struct policy *policy;
+	idx = task->sched_info.policy;
+	__check_task_policy(idx);
+	policy = task_rq(task)->policies[idx];
+	check_policy(policy);
+	return policy;
+}
+
+static inline struct policy *rq_policy(runqueue_t *rq)
+{
+	unsigned long idx;
+	task_t *task;
+	struct policy *policy;
+
+	task = rq_curr(rq);
+	BUG_ON(!task);
+	BUG_ON((unsigned long)task < PAGE_OFFSET);
+	idx = task->sched_info.policy;
+	__check_task_policy(idx);
+	policy = rq->policies[idx];
+	check_policy(policy);
+	return policy;
+}
+
+static int __task_nice(task_t *task)
+{
+	struct policy *policy = task_policy(task);
+	return policy->ops->nice(policy->queue, task);
+}
+
+static inline void set_rq_curr(runqueue_t *rq, task_t *task)
+{
+	rq->curr = task->sched_info.policy;
+	__check_task_policy(rq->curr);
+	rq->__curr = task;
+}
+
+static inline int task_preempts_curr(task_t *task, runqueue_t *rq)
+{
+	check_task_policy(rq_curr(rq));
+	check_task_policy(task);
+	if (rq_curr(rq)->sched_info.policy != task->sched_info.policy)
+		return task->sched_info.policy < rq_curr(rq)->sched_info.policy;
+	else {
+		struct policy *policy = rq_policy(rq);
+		return policy->ops->preempt(policy->queue, task);
+	}
+}
 
 /*
  * Default context-switch locking:
@@ -227,7 +129,7 @@ static DEFINE_PER_CPU(struct runqueue, r
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(rq, next)	do { } while(0)
 # define finish_arch_switch(rq, next)	spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p)		((rq)->curr == (p))
+# define task_running(rq, p)		(rq_curr(rq) == (p))
 #endif
 
 #ifdef CONFIG_NUMA
@@ -320,53 +222,32 @@ static inline void rq_unlock(runqueue_t 
 }
 
 /*
- * Adding/removing a task to/from a priority array:
+ * Adding/removing a task to/from a policy's queue.
+ * We dare not BUG_ON() a wrong task_queued() as boot-time
+ * calls may trip it.
  */
-static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
+static inline void dequeue_task(task_t *task, runqueue_t *rq)
 {
-	array->nr_active--;
-	list_del(&p->run_list);
-	if (list_empty(array->queue + p->prio))
-		__clear_bit(p->prio, array->bitmap);
+	struct policy *policy = task_policy(task);
+	BUG_ON(!task_queued(task));
+	policy->ops->dequeue(policy->queue, task);
+	if (!policy->ops->tasks(policy->queue)) {
+		BUG_ON(!test_bit(task->sched_info.policy, &rq->policy_bitmap));
+		__clear_bit(task->sched_info.policy, &rq->policy_bitmap);
+	}
+	clear_task_queued(task);
 }
 
-static inline void enqueue_task(struct task_struct *p, prio_array_t *array)
+static inline void enqueue_task(task_t *task, runqueue_t *rq)
 {
-	list_add_tail(&p->run_list, array->queue + p->prio);
-	__set_bit(p->prio, array->bitmap);
-	array->nr_active++;
-	p->array = array;
-}
-
-/*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
- */
-static int effective_prio(task_t *p)
-{
-	int bonus, prio;
-
-	if (rt_task(p))
-		return p->prio;
-
-	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
-	prio = p->static_prio - bonus;
-	if (prio < MAX_RT_PRIO)
-		prio = MAX_RT_PRIO;
-	if (prio > MAX_PRIO-1)
-		prio = MAX_PRIO-1;
-	return prio;
+	struct policy *policy = task_policy(task);
+	BUG_ON(task_queued(task));
+	if (!policy->ops->tasks(policy->queue)) {
+		BUG_ON(test_bit(task->sched_info.policy, &rq->policy_bitmap));
+		__set_bit(task->sched_info.policy, &rq->policy_bitmap);
+	}
+	policy->ops->enqueue(policy->queue, task);
+	set_task_queued(task);
 }
 
 /*
@@ -374,134 +255,34 @@ static int effective_prio(task_t *p)
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task(p, rq->active);
+	enqueue_task(p, rq);
 	nr_running_inc(rq);
 }
 
-static void recalc_task_prio(task_t *p, unsigned long long now)
-{
-	unsigned long long __sleep_time = now - p->timestamp;
-	unsigned long sleep_time;
-
-	if (__sleep_time > NS_MAX_SLEEP_AVG)
-		sleep_time = NS_MAX_SLEEP_AVG;
-	else
-		sleep_time = (unsigned long)__sleep_time;
-
-	if (likely(sleep_time > 0)) {
-		/*
-		 * User tasks that sleep a long time are categorised as
-		 * idle and will get just interactive status to stay active &
-		 * prevent them suddenly becoming cpu hogs and starving
-		 * other processes.
-		 */
-		if (p->mm && p->activated != -1 &&
-			sleep_time > JUST_INTERACTIVE_SLEEP(p)){
-				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
-						AVG_TIMESLICE);
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-		} else {
-			/*
-			 * The lower the sleep avg a task has the more
-			 * rapidly it will rise with sleep time.
-			 */
-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
-			/*
-			 * Tasks with low interactive_credit are limited to
-			 * one timeslice worth of sleep avg bonus.
-			 */
-			if (LOW_CREDIT(p) &&
-				sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
-					sleep_time =
-						JIFFIES_TO_NS(task_timeslice(p));
-
-			/*
-			 * Non high_credit tasks waking from uninterruptible
-			 * sleep are limited in their sleep_avg rise as they
-			 * are likely to be cpu hogs waiting on I/O
-			 */
-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){
-				if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p))
-					sleep_time = 0;
-				else if (p->sleep_avg + sleep_time >=
-					JUST_INTERACTIVE_SLEEP(p)){
-						p->sleep_avg =
-							JUST_INTERACTIVE_SLEEP(p);
-						sleep_time = 0;
-					}
-			}
-
-			/*
-			 * This code gives a bonus to interactive tasks.
-			 *
-			 * The boost works by updating the 'average sleep time'
-			 * value here, based on ->timestamp. The more time a task
-			 * spends sleeping, the higher the average gets - and the
-			 * higher the priority boost gets as well.
-			 */
-			p->sleep_avg += sleep_time;
-
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG){
-				p->sleep_avg = NS_MAX_SLEEP_AVG;
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-			}
-		}
-	}
-
-	p->prio = effective_prio(p);
-}
-
 /*
  * activate_task - move a task to the runqueue and do priority recalculation
  *
  * Update all the scheduling statistics stuff. (sleep average
  * calculation, priority modifiers, etc.)
  */
-static inline void activate_task(task_t *p, runqueue_t *rq)
+static inline void activate_task(task_t *task, runqueue_t *rq)
 {
-	unsigned long long now = sched_clock();
-
-	recalc_task_prio(p, now);
-
-	/*
-	 * This checks to make sure it's not an uninterruptible task
-	 * that is now waking up.
-	 */
-	if (!p->activated){
-		/*
-		 * Tasks which were woken up by interrupts (ie. hw events)
-		 * are most likely of interactive nature. So we give them
-		 * the credit of extending their sleep time to the period
-		 * of time they spend on the runqueue, waiting for execution
-		 * on a CPU, first time around:
-		 */
-		if (in_interrupt())
-			p->activated = 2;
-		else
-		/*
-		 * Normal first-time wakeups get a credit too for on-runqueue
-		 * time, but it will be weighted down:
-		 */
-			p->activated = 1;
-		}
-	p->timestamp = now;
-
-	__activate_task(p, rq);
+	struct policy *policy = task_policy(task);
+	policy->ops->wake(policy->queue, task);
+	__activate_task(task, rq);
 }
 
 /*
  * deactivate_task - remove a task from the runqueue.
  */
-static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
+static inline void deactivate_task(task_t *task, runqueue_t *rq)
 {
+	struct policy *policy = task_policy(task);
 	nr_running_dec(rq);
-	if (p->state == TASK_UNINTERRUPTIBLE)
+	if (task->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
-	dequeue_task(p, p->array);
-	p->array = NULL;
+	policy->ops->sleep(policy->queue, task);
+	dequeue_task(task, rq);
 }
 
 /*
@@ -625,7 +406,7 @@ repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
 	if (old_state & state) {
-		if (!p->array) {
+		if (!task_queued(p)) {
 			/*
 			 * Fast-migrate the task if it's not running or runnable
 			 * currently. Do not violate hard affinity.
@@ -644,14 +425,13 @@ repeat_lock_task:
 				 * Tasks on involuntary sleep don't earn
 				 * sleep_avg beyond just interactive state.
 				 */
-				p->activated = -1;
 			}
 			if (sync)
 				__activate_task(p, rq);
 			else {
 				activate_task(p, rq);
-				if (TASK_PREEMPTS_CURR(p, rq))
-					resched_task(rq->curr);
+				if (task_preempts_curr(p, rq))
+					resched_task(rq_curr(rq));
 			}
 			success = 1;
 		}
@@ -679,68 +459,26 @@ int wake_up_state(task_t *p, unsigned in
  * This function will do some initial scheduler statistics housekeeping
  * that must be done for every newly created process.
  */
-void wake_up_forked_process(task_t * p)
+void wake_up_forked_process(task_t *task)
 {
 	unsigned long flags;
 	runqueue_t *rq = task_rq_lock(current, &flags);
 
-	p->state = TASK_RUNNING;
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive.
-	 */
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->interactive_credit = 0;
-
-	p->prio = effective_prio(p);
-	set_task_cpu(p, smp_processor_id());
-
-	if (unlikely(!current->array))
-		__activate_task(p, rq);
-	else {
-		p->prio = current->prio;
-		list_add_tail(&p->run_list, &current->run_list);
-		p->array = current->array;
-		p->array->nr_active++;
-		nr_running_inc(rq);
-	}
+	task->state = TASK_RUNNING;
+	set_task_cpu(task, smp_processor_id());
+	if (unlikely(!task_queued(current)))
+		__activate_task(task, rq);
+	else
+		activate_task(task, rq);
 	task_rq_unlock(rq, &flags);
 }
 
 /*
- * Potentially available exiting-child timeslices are
- * retrieved here - this way the parent does not get
- * penalized for creating too many threads.
- *
- * (this cannot be used to 'generate' timeslices
- * artificially, because any timeslice recovered here
- * was given away by the parent in the first place.)
+ * Policies that depend on trapping fork() and exit() may need to
+ * put a hook here.
  */
-void sched_exit(task_t * p)
+void sched_exit(task_t *task)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	if (p->first_time_slice) {
-		p->parent->time_slice += p->time_slice;
-		if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
-			p->parent->time_slice = MAX_TIMESLICE;
-	}
-	local_irq_restore(flags);
-	/*
-	 * If the child was a (relative-) CPU hog then decrease
-	 * the sleep_avg of the parent as well.
-	 */
-	if (p->sleep_avg < p->parent->sleep_avg)
-		p->parent->sleep_avg = p->parent->sleep_avg /
-		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
-		(EXIT_WEIGHT + 1);
 }
 
 /**
@@ -1128,18 +866,18 @@ out:
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
+static inline void pull_task(runqueue_t *src_rq, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
-	dequeue_task(p, src_array);
+	dequeue_task(p, src_rq);
 	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
 	nr_running_inc(this_rq);
-	enqueue_task(p, this_rq->active);
+	enqueue_task(p, this_rq);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
 	 */
-	if (TASK_PREEMPTS_CURR(p, this_rq))
+	if (task_preempts_curr(p, this_rq))
 		set_need_resched();
 }
 
@@ -1150,14 +888,14 @@ static inline void pull_task(runqueue_t 
  *	((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \
  *		cache_decay_ticks)) && !task_running(rq, p) && \
  *			cpu_isset(this_cpu, (p)->cpus_allowed))
+ *
+ * Since task_t no longer has a ->timestamp, the cache-hot check needs rework.
  */
 
 static inline int
 can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
 {
-	unsigned long delta = sched_clock() - tsk->timestamp;
-
-	if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
+	if (!idle)
 		return 0;
 	if (task_running(rq, tsk))
 		return 0;
@@ -1176,11 +914,8 @@ can_migrate_task(task_t *tsk, runqueue_t
  */
 static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
-	prio_array_t *array;
-	struct list_head *head, *curr;
-	task_t *tmp;
 
 	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
 	if (!busiest)
@@ -1192,37 +927,6 @@ static void load_balance(runqueue_t *thi
 	 */
 	imbalance /= 2;
 
-	/*
-	 * We first consider expired tasks. Those will likely not be
-	 * executed in the near future, and they are most likely to
-	 * be cache-cold, thus switching CPUs has the least effect
-	 * on them.
-	 */
-	if (busiest->expired->nr_active)
-		array = busiest->expired;
-	else
-		array = busiest->active;
-
-new_array:
-	/* Start searching at priority 0: */
-	idx = 0;
-skip_bitmap:
-	if (!idx)
-		idx = sched_find_first_bit(array->bitmap);
-	else
-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired) {
-			array = busiest->active;
-			goto new_array;
-		}
-		goto out_unlock;
-	}
-
-	head = array->queue + idx;
-	curr = head->prev;
-skip_queue:
-	tmp = list_entry(curr, task_t, run_list);
 
 	/*
 	 * We do not migrate tasks that are:
@@ -1231,21 +935,19 @@ skip_queue:
 	 * 3) are cache-hot on their current CPU.
 	 */
 
-	curr = curr->prev;
+	do {
+		struct policy *policy;
+		task_t *task;
+
+		policy = rq_migrate_policy(busiest);
+		if (!policy)
+			break;
+		task = policy->ops->migrate(policy->queue);
+		if (!task)
+			break;
+		pull_task(busiest, task, this_rq, this_cpu);
+	} while (!idle && --imbalance);
 
-	if (!can_migrate_task(tmp, busiest, this_cpu, idle)) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
-	pull_task(busiest, array, tmp, this_rq, this_cpu);
-	if (!idle && --imbalance) {
-		if (curr != head)
-			goto skip_queue;
-		idx++;
-		goto skip_bitmap;
-	}
 out_unlock:
 	spin_unlock(&busiest->lock);
 out:
@@ -1356,10 +1058,10 @@ EXPORT_PER_CPU_SYMBOL(kstat);
  */
 void scheduler_tick(int user_ticks, int sys_ticks)
 {
-	int cpu = smp_processor_id();
+	int idle, cpu = smp_processor_id();
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	struct policy *policy;
 	runqueue_t *rq = this_rq();
-	task_t *p = current;
 
 	if (rcu_pending(cpu))
 		rcu_check_callbacks(cpu, user_ticks);
@@ -1373,98 +1075,28 @@ void scheduler_tick(int user_ticks, int 
 		sys_ticks = 0;
 	}
 
-	if (p == rq->idle) {
-		if (atomic_read(&rq->nr_iowait) > 0)
-			cpustat->iowait += sys_ticks;
-		else
-			cpustat->idle += sys_ticks;
-		rebalance_tick(rq, 1);
-		return;
-	}
-	if (TASK_NICE(p) > 0)
-		cpustat->nice += user_ticks;
-	else
-		cpustat->user += user_ticks;
-	cpustat->system += sys_ticks;
-
-	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != rq->active) {
-		set_tsk_need_resched(p);
-		goto out;
-	}
 	spin_lock(&rq->lock);
-	/*
-	 * The task was running during this tick - update the
-	 * time slice counter. Note: we do not update a thread's
-	 * priority until it either goes to sleep or uses up its
-	 * timeslice. This makes it possible for interactive tasks
-	 * to use up their timeslices at their highest priority levels.
-	 */
-	if (unlikely(rt_task(p))) {
-		/*
-		 * RR tasks need a special form of timeslice management.
-		 * FIFO tasks have no timeslices.
-		 */
-		if ((p->policy == SCHED_RR) && !--p->time_slice) {
-			p->time_slice = task_timeslice(p);
-			p->first_time_slice = 0;
-			set_tsk_need_resched(p);
-
-			/* put it at the end of the queue: */
-			dequeue_task(p, rq->active);
-			enqueue_task(p, rq->active);
-		}
-		goto out_unlock;
-	}
-	if (!--p->time_slice) {
-		dequeue_task(p, rq->active);
-		set_tsk_need_resched(p);
-		p->prio = effective_prio(p);
-		p->time_slice = task_timeslice(p);
-		p->first_time_slice = 0;
-
-		if (!rq->expired_timestamp)
-			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			enqueue_task(p, rq->expired);
-		} else
-			enqueue_task(p, rq->active);
-	} else {
-		/*
-		 * Prevent a too long timeslice allowing a task to monopolize
-		 * the CPU. We do this by splitting up the timeslice into
-		 * smaller pieces.
-		 *
-		 * Note: this does not mean the task's timeslices expire or
-		 * get lost in any way, they just might be preempted by
-		 * another task of equal priority. (one with higher
-		 * priority would have preempted this task already.) We
-		 * requeue this task to the end of the list on this priority
-		 * level, which is in essence a round-robin of tasks with
-		 * equal priority.
-		 *
-		 * This only applies to tasks in the interactive
-		 * delta range with at least TIMESLICE_GRANULARITY to requeue.
-		 */
-		if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
-			p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
-			(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
-			(p->array == rq->active)) {
-
-			dequeue_task(p, rq->active);
-			set_tsk_need_resched(p);
-			p->prio = effective_prio(p);
-			enqueue_task(p, rq->active);
-		}
-	}
-out_unlock:
+	policy = rq_policy(rq);
+	idle = policy->ops->tick(policy->queue, current, user_ticks, sys_ticks);
 	spin_unlock(&rq->lock);
-out:
-	rebalance_tick(rq, 0);
+	rebalance_tick(rq, idle);
 }
 
 void scheduling_functions_start_here(void) { }
 
+static inline task_t *find_best_task(runqueue_t *rq)
+{
+	int idx;
+	struct policy *policy;
+
+	BUG_ON(!rq->policy_bitmap);
+	idx = __ffs(rq->policy_bitmap);
+	__check_task_policy(idx);
+	policy = rq->policies[idx];
+	check_policy(policy);
+	return policy->ops->best(policy->queue);
+}
+
 /*
  * schedule() is the main scheduler function.
  */
@@ -1472,11 +1104,7 @@ asmlinkage void schedule(void)
 {
 	task_t *prev, *next;
 	runqueue_t *rq;
-	prio_array_t *array;
-	struct list_head *queue;
-	unsigned long long now;
-	unsigned long run_time;
-	int idx;
+	struct policy *policy;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1494,22 +1122,9 @@ need_resched:
 	preempt_disable();
 	prev = current;
 	rq = this_rq();
+	policy = rq_policy(rq);
 
 	release_kernel_lock(prev);
-	now = sched_clock();
-	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
-		run_time = now - prev->timestamp;
-	else
-		run_time = NS_MAX_SLEEP_AVG;
-
-	/*
-	 * Tasks with interactive credits get charged less run_time
-	 * at high sleep_avg to delay them losing their interactive
-	 * status
-	 */
-	if (HIGH_CREDIT(prev))
-		run_time /= (CURRENT_BONUS(prev) ? : 1);
-
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1530,66 +1145,27 @@ need_resched:
 		prev->nvcsw++;
 		break;
 	case TASK_RUNNING:
+		policy->ops->start_wait(policy->queue, prev);
 		prev->nivcsw++;
 	}
+
 pick_next_task:
-	if (unlikely(!rq->nr_running)) {
 #ifdef CONFIG_SMP
+	if (unlikely(!rq->nr_running))
 		load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
-		if (rq->nr_running)
-			goto pick_next_task;
 #endif
-		next = rq->idle;
-		rq->expired_timestamp = 0;
-		goto switch_tasks;
-	}
-
-	array = rq->active;
-	if (unlikely(!array->nr_active)) {
-		/*
-		 * Switch the active and expired arrays.
-		 */
-		rq->active = rq->expired;
-		rq->expired = array;
-		array = rq->active;
-		rq->expired_timestamp = 0;
-	}
-
-	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
-	next = list_entry(queue->next, task_t, run_list);
-
-	if (next->activated > 0) {
-		unsigned long long delta = now - next->timestamp;
-
-		if (next->activated == 1)
-			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
-		array = next->array;
-		dequeue_task(next, array);
-		recalc_task_prio(next, next->timestamp + delta);
-		enqueue_task(next, array);
-	}
-	next->activated = 0;
-switch_tasks:
+	next = find_best_task(rq);
+	BUG_ON(!next);
 	prefetch(next);
 	clear_tsk_need_resched(prev);
 	RCU_qsctr(task_cpu(prev))++;
 
-	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0){
-		prev->sleep_avg = 0;
-		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
-			prev->interactive_credit--;
-	}
-	prev->timestamp = now;
-
 	if (likely(prev != next)) {
-		next->timestamp = now;
 		rq->nr_switches++;
-		rq->curr = next;
-
 		prepare_arch_switch(rq, next);
+		policy = task_policy(next);
+		policy->ops->set_curr(policy->queue, next);
+		set_rq_curr(rq, next);
 		prev = context_switch(rq, prev, next);
 		barrier();
 
@@ -1845,45 +1421,46 @@ void scheduling_functions_end_here(void)
 void set_user_nice(task_t *p, long nice)
 {
 	unsigned long flags;
-	prio_array_t *array;
 	runqueue_t *rq;
-	int old_prio, new_prio, delta;
+	struct policy *policy;
+	int delta, queued;
 
-	if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
+	if (nice < -20 || nice > 19)
 		return;
 	/*
 	 * We have to be careful, if called from sys_setpriority(),
 	 * the task might be in the middle of scheduling on another CPU.
 	 */
 	rq = task_rq_lock(p, &flags);
+	delta = nice - __task_nice(p);
+	if (!delta) {
+		if (p->pid == 0 || p->pid == 1)
+			printk(KERN_DEBUG "set_user_nice(): nice unchanged, nothing to do\n");
+		goto out_unlock;
+	}
+
+	policy = task_policy(p);
+
 	/*
 	 * The RT priorities are set via setscheduler(), but we still
 	 * allow the 'normal' nice value to be set - but as expected
 	 * it wont have any effect on scheduling until the task is
 	 * not SCHED_NORMAL:
 	 */
-	if (rt_task(p)) {
-		p->static_prio = NICE_TO_PRIO(nice);
-		goto out_unlock;
-	}
-	array = p->array;
-	if (array)
-		dequeue_task(p, array);
-
-	old_prio = p->prio;
-	new_prio = NICE_TO_PRIO(nice);
-	delta = new_prio - old_prio;
-	p->static_prio = NICE_TO_PRIO(nice);
-	p->prio += delta;
+	queued = task_queued(p);
+	if (queued)
+		dequeue_task(p, rq);
+
+	policy->ops->renice(policy->queue, p, nice);
 
-	if (array) {
-		enqueue_task(p, array);
+	if (queued) {
+		enqueue_task(p, rq);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
 		 */
 		if (delta < 0 || (delta > 0 && task_running(rq, p)))
-			resched_task(rq->curr);
+			resched_task(rq_curr(rq));
 	}
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -1919,7 +1496,7 @@ asmlinkage long sys_nice(int increment)
 	if (increment > 40)
 		increment = 40;
 
-	nice = PRIO_TO_NICE(current->static_prio) + increment;
+	nice = task_nice(current) + increment;
 	if (nice < -20)
 		nice = -20;
 	if (nice > 19)
@@ -1935,6 +1512,12 @@ asmlinkage long sys_nice(int increment)
 
 #endif
 
+static int __task_prio(task_t *task)
+{
+	struct policy *policy = task_policy(task);
+	return policy->ops->prio(task);
+}
+
 /**
  * task_prio - return the priority value of a given task.
  * @p: the task in question.
@@ -1943,29 +1526,111 @@ asmlinkage long sys_nice(int increment)
  * RT tasks are offset by -200. Normal tasks are centered
  * around 0, value goes from -16 to +15.
  */
-int task_prio(task_t *p)
+int task_prio(task_t *task)
 {
-	return p->prio - MAX_RT_PRIO;
+	int prio;
+	unsigned long flags;
+	runqueue_t *rq;
+
+	rq = task_rq_lock(task, &flags);
+	prio = __task_prio(task);
+	task_rq_unlock(rq, &flags);
+	return prio;
 }
 
 /**
  * task_nice - return the nice value of a given task.
  * @p: the task in question.
  */
-int task_nice(task_t *p)
+int task_nice(task_t *task)
 {
-	return TASK_NICE(p);
+	int nice;
+	unsigned long flags;
+	runqueue_t *rq;
+
+
+	rq = task_rq_lock(task, &flags);
+	nice = __task_nice(task);
+	task_rq_unlock(rq, &flags);
+	return nice;
 }
 
 EXPORT_SYMBOL(task_nice);
 
+int task_sched_policy(task_t *task)
+{
+	check_task_policy(task);
+	switch (task->sched_info.policy) {
+		case SCHED_POLICY_RT:
+			if (task->sched_info.cl_data.rt.rt_policy
+							== RT_POLICY_RR)
+				return SCHED_RR;
+			else
+				return SCHED_FIFO;
+		case SCHED_POLICY_TS:
+			return SCHED_NORMAL;
+		case SCHED_POLICY_BATCH:
+			return SCHED_BATCH;
+		case SCHED_POLICY_IDLE:
+			return SCHED_IDLE;
+		default:
+			BUG();
+			return -1;
+	}
+}
+EXPORT_SYMBOL(task_sched_policy);
+
+void set_task_sched_policy(task_t *task, int policy)
+{
+	check_task_policy(task);
+	BUG_ON(task_queued(task));
+	switch (policy) {
+		case SCHED_FIFO:
+			task->sched_info.policy = SCHED_POLICY_RT;
+			task->sched_info.cl_data.rt.rt_policy = RT_POLICY_FIFO;
+			break;
+		case SCHED_RR:
+			task->sched_info.policy = SCHED_POLICY_RT;
+			task->sched_info.cl_data.rt.rt_policy = RT_POLICY_RR;
+			break;
+		case SCHED_NORMAL:
+			task->sched_info.policy = SCHED_POLICY_TS;
+			break;
+		case SCHED_BATCH:
+			task->sched_info.policy = SCHED_POLICY_BATCH;
+			break;
+		case SCHED_IDLE:
+			task->sched_info.policy = SCHED_POLICY_IDLE;
+			break;
+		default:
+			BUG();
+			break;
+	}
+	check_task_policy(task);
+}
+EXPORT_SYMBOL(set_task_sched_policy);
+
+int rt_task(task_t *task)
+{
+	check_task_policy(task);
+	return !!(task->sched_info.policy == SCHED_POLICY_RT);
+}
+EXPORT_SYMBOL(rt_task);
+
 /**
  * idle_cpu - is a given cpu idle currently?
  * @cpu: the processor in question.
  */
 int idle_cpu(int cpu)
 {
-	return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+	int idle;
+	unsigned long flags;
+	runqueue_t *rq = cpu_rq(cpu);
+
+	spin_lock_irqsave(&rq->lock, flags);
+	idle = !!(rq->curr == SCHED_POLICY_IDLE);
+	spin_unlock_irqrestore(&rq->lock, flags);
+	return idle;
 }
 
 EXPORT_SYMBOL_GPL(idle_cpu);
@@ -1985,11 +1650,10 @@ static inline task_t *find_process_by_pi
 static int setscheduler(pid_t pid, int policy, struct sched_param __user *param)
 {
 	struct sched_param lp;
-	int retval = -EINVAL;
-	int oldprio;
-	prio_array_t *array;
+	int queued, retval = -EINVAL;
 	unsigned long flags;
 	runqueue_t *rq;
+	struct policy *rq_policy;
 	task_t *p;
 
 	if (!param || pid < 0)
@@ -2017,7 +1681,7 @@ static int setscheduler(pid_t pid, int p
 	rq = task_rq_lock(p, &flags);
 
 	if (policy < 0)
-		policy = p->policy;
+		policy = task_sched_policy(p);
 	else {
 		retval = -EINVAL;
 		if (policy != SCHED_FIFO && policy != SCHED_RR &&
@@ -2047,29 +1711,23 @@ static int setscheduler(pid_t pid, int p
 	if (retval)
 		goto out_unlock;
 
-	array = p->array;
-	if (array)
+	queued = task_queued(p);
+	if (queued)
 		deactivate_task(p, task_rq(p));
 	retval = 0;
-	p->policy = policy;
-	p->rt_priority = lp.sched_priority;
-	oldprio = p->prio;
-	if (policy != SCHED_NORMAL)
-		p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
-	else
-		p->prio = p->static_prio;
-	if (array) {
+	set_task_sched_policy(p, policy);
+	check_task_policy(p);
+	rq_policy = rq->policies[p->sched_info.policy];
+	check_policy(rq_policy);
+	rq_policy->ops->setprio(p, lp.sched_priority);
+	if (queued) {
 		__activate_task(p, task_rq(p));
 		/*
 		 * Reschedule if we are currently running on this runqueue and
 		 * our priority decreased, or if we are not currently running on
 		 * this runqueue and our priority is higher than the current's
 		 */
-		if (rq->curr == p) {
-			if (p->prio > oldprio)
-				resched_task(rq->curr);
-		} else if (p->prio < rq->curr->prio)
-			resched_task(rq->curr);
+		resched_task(rq_curr(rq));
 	}
 
 out_unlock:
@@ -2121,7 +1779,7 @@ asmlinkage long sys_sched_getscheduler(p
 	if (p) {
 		retval = security_task_getscheduler(p);
 		if (!retval)
-			retval = p->policy;
+			retval = task_sched_policy(p);
 	}
 	read_unlock(&tasklist_lock);
 
@@ -2153,7 +1811,7 @@ asmlinkage long sys_sched_getparam(pid_t
 	if (retval)
 		goto out_unlock;
 
-	lp.sched_priority = p->rt_priority;
+	lp.sched_priority = task_prio(p);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -2262,32 +1920,13 @@ out_unlock:
  */
 asmlinkage long sys_sched_yield(void)
 {
+	struct policy *policy;
 	runqueue_t *rq = this_rq_lock();
-	prio_array_t *array = current->array;
-
-	/*
-	 * We implement yielding by moving the task into the expired
-	 * queue.
-	 *
-	 * (special rule: RT tasks will just roundrobin in the active
-	 *  array.)
-	 */
-	if (likely(!rt_task(current))) {
-		dequeue_task(current, array);
-		enqueue_task(current, rq->expired);
-	} else {
-		list_del(&current->run_list);
-		list_add_tail(&current->run_list, array->queue + current->prio);
-	}
-	/*
-	 * Since we are going to call schedule() anyway, there's
-	 * no need to preempt:
-	 */
+	policy = rq_policy(rq);
+	policy->ops->yield(policy->queue, current);
 	_raw_spin_unlock(&rq->lock);
 	preempt_enable_no_resched();
-
 	schedule();
-
 	return 0;
 }
 
@@ -2387,6 +2026,19 @@ asmlinkage long sys_sched_get_priority_m
 	return ret;
 }
 
+static inline unsigned long task_timeslice(task_t *task)
+{
+	unsigned long flags, timeslice;
+	struct policy *policy;
+	runqueue_t *rq;
+
+	rq = task_rq_lock(task, &flags);
+	policy = task_policy(task);
+	timeslice = policy->ops->timeslice(policy->queue, task);
+	task_rq_unlock(rq, &flags);
+	return timeslice;
+}
+
 /**
  * sys_sched_rr_get_interval - return the default timeslice of a process.
  * @pid: pid of the process.
@@ -2414,8 +2066,7 @@ asmlinkage long sys_sched_rr_get_interva
 	if (retval)
 		goto out_unlock;
 
-	jiffies_to_timespec(p->policy & SCHED_FIFO ?
-				0 : task_timeslice(p), &t);
+	jiffies_to_timespec(task_timeslice(p), &t);
 	read_unlock(&tasklist_lock);
 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
 out_nounlock:
@@ -2523,17 +2174,22 @@ void show_state(void)
 void __init init_idle(task_t *idle, int cpu)
 {
 	runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle));
+	struct policy *policy;
 	unsigned long flags;
 
 	local_irq_save(flags);
 	double_rq_lock(idle_rq, rq);
-
-	idle_rq->curr = idle_rq->idle = idle;
+	policy = rq_policy(rq);
+	BUG_ON(policy != task_policy(idle));
+	printk("deactivating, have %d tasks\n",
+			policy->ops->tasks(policy->queue));
 	deactivate_task(idle, rq);
-	idle->array = NULL;
-	idle->prio = MAX_PRIO;
+	set_task_sched_policy(idle, SCHED_IDLE);
 	idle->state = TASK_RUNNING;
 	set_task_cpu(idle, cpu);
+	activate_task(idle, rq);
+	nr_running_dec(rq);
+	set_rq_curr(rq, idle);
 	double_rq_unlock(idle_rq, rq);
 	set_tsk_need_resched(idle);
 	local_irq_restore(flags);
@@ -2804,38 +2460,27 @@ __init static void init_kstat(void) {
 void __init sched_init(void)
 {
 	runqueue_t *rq;
-	int i, j, k;
+	int i, j;
 
 	/* Init the kstat counters */
 	init_kstat();
 	for (i = 0; i < NR_CPUS; i++) {
-		prio_array_t *array;
-
 		rq = cpu_rq(i);
-		rq->active = rq->arrays;
-		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
 		nr_running_init(rq);
-
-		for (j = 0; j < 2; j++) {
-			array = rq->arrays + j;
-			for (k = 0; k < MAX_PRIO; k++) {
-				INIT_LIST_HEAD(array->queue + k);
-				__clear_bit(k, array->bitmap);
-			}
-			// delimiter for bitsearch
-			__set_bit(MAX_PRIO, array->bitmap);
-		}
+		memcpy(rq->policies, policies, sizeof(policies));
+		for (j = 0; j < BITS_PER_LONG && rq->policies[j]; ++j)
+			rq->policies[j]->ops->init(rq->policies[j], i);
 	}
 	/*
 	 * We have to do a little magic to get the first
 	 * thread right in SMP mode.
 	 */
 	rq = this_rq();
-	rq->curr = current;
-	rq->idle = current;
+	set_task_sched_policy(current, SCHED_NORMAL);
+	set_rq_curr(rq, current);
 	set_task_cpu(current, smp_processor_id());
 	wake_up_forked_process(current);
 
diff -prauN linux-2.6.0-test11/lib/Makefile sched-2.6.0-test11-5/lib/Makefile
--- linux-2.6.0-test11/lib/Makefile	2003-11-26 12:42:55.000000000 -0800
+++ sched-2.6.0-test11-5/lib/Makefile	2003-12-20 15:09:16.000000000 -0800
@@ -5,7 +5,7 @@
 
 lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \
 	 bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
-	 kobject.o idr.o div64.o parser.o
+	 kobject.o idr.o div64.o parser.o binomial.o
 
 lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
diff -prauN linux-2.6.0-test11/lib/binomial.c sched-2.6.0-test11-5/lib/binomial.c
--- linux-2.6.0-test11/lib/binomial.c	1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/lib/binomial.c	2003-12-20 17:32:09.000000000 -0800
@@ -0,0 +1,138 @@
+#include <linux/kernel.h>
+#include <linux/binomial.h>
+
+struct binomial *binomial_minimum(struct binomial **heap)
+{
+	struct binomial *minimum, *tmp;
+
+	for (minimum = NULL, tmp = *heap; tmp; tmp = tmp->sibling) {
+		if (!minimum || minimum->priority > tmp->priority)
+			minimum = tmp;
+	}
+	return minimum;
+}
+
+static void binomial_link(struct binomial *left, struct binomial *right)
+{
+	left->parent  = right;
+	left->sibling = right->child;
+	right->child  = left;
+	right->degree++;
+}
+
+static void binomial_merge(struct binomial **both, struct binomial **left,
+						struct binomial **right)
+{
+	while (*left && *right) {
+		if ((*left)->degree < (*right)->degree) {
+			*both = *left;
+			left = &(*left)->sibling;
+		} else {
+			*both = *right;
+			right = &(*right)->sibling;
+		}
+		both = &(*both)->sibling;
+	}
+	/*
+	 * for more safety:
+	 * *left = *right = NULL;
+	 */
+}
+
+void binomial_union(struct binomial **both, struct binomial **left,
+						struct binomial **right)
+{
+	struct binomial *prev, *tmp, *next;
+
+	binomial_merge(both, left, right);
+	if (!(tmp = *both))
+		return;
+
+	for (prev = NULL, next = tmp->sibling; next; next = tmp->sibling) {
+		if ((next->sibling && next->sibling->degree == tmp->degree)
+					|| tmp->degree != next->degree) {
+			prev = tmp;
+			tmp  = next;
+		} else if (tmp->priority <= next->priority) {
+			tmp->sibling = next->sibling;
+			binomial_link(next, tmp);
+		} else {
+			if (!prev)
+				*both = next;
+			else
+				prev->sibling = next;
+			binomial_link(tmp, next);
+			tmp = next;
+		}
+	}
+}
+
+void binomial_insert(struct binomial **heap, struct binomial *element)
+{
+	element->parent  = NULL;
+	element->child   = NULL;
+	element->sibling = NULL;
+	element->degree  = 0;
+	binomial_union(heap, heap, &element);
+}
+
+static void binomial_reverse(struct binomial **in, struct binomial **out)
+{
+	while (*in) {
+		struct binomial *tmp = *in;
+		*in = (*in)->sibling;
+		tmp->sibling = *out;
+		*out = tmp;
+	}
+}
+
+struct binomial *binomial_extract_min(struct binomial **heap)
+{
+	struct binomial *tmp, *minimum, *last, *min_last, *new_heap;
+
+	minimum = last = min_last = new_heap = NULL;
+	for (tmp = *heap; tmp; last = tmp, tmp = tmp->sibling) {
+		if (!minimum || tmp->priority < minimum->priority) {
+			minimum  = tmp;
+			min_last = last;
+		}
+	}
+	if (min_last && minimum)
+		min_last->sibling = minimum->sibling;
+	else if (minimum)
+		*heap = minimum->sibling;
+	else
+		return NULL;
+	binomial_reverse(&minimum->child, &new_heap);
+	binomial_union(heap, heap, &new_heap);
+	return minimum;
+}
+
+void binomial_decrease(struct binomial **heap, struct binomial *element,
+							unsigned increment)
+{
+	struct binomial *tmp, *last = NULL;
+
+	element->priority -= min(element->priority, increment);
+	last = element;
+	tmp  = last->parent;
+	while (tmp && last->priority < tmp->priority) {
+		unsigned tmp_prio = tmp->priority;
+		tmp->priority = last->priority;
+		last->priority = tmp_prio;
+		last = tmp;
+		tmp  = tmp->parent;
+	}
+}
+
+void binomial_delete(struct binomial **heap, struct binomial *element)
+{
+	struct binomial *tmp, *last = element;
+	for (tmp = last->parent; tmp; last = tmp, tmp = tmp->parent) {
+		unsigned tmp_prio = tmp->priority;
+		tmp->priority = last->priority;
+		last->priority = tmp_prio;
+	}
+	binomial_reverse(&last->child, &tmp);
+	binomial_union(heap, heap, &tmp);
+}
diff -prauN linux-2.6.0-test11/mm/oom_kill.c sched-2.6.0-test11-5/mm/oom_kill.c
--- linux-2.6.0-test11/mm/oom_kill.c	2003-11-26 12:44:16.000000000 -0800
+++ sched-2.6.0-test11-5/mm/oom_kill.c	2003-12-17 07:07:53.000000000 -0800
@@ -158,7 +158,6 @@ static void __oom_kill_task(task_t *p)
 	 * all the memory it needs. That way it should be able to
 	 * exit() and clear out its resources quickly...
 	 */
-	p->time_slice = HZ;
 	p->flags |= PF_MEMALLOC | PF_MEMDIE;
 
 	/* This process has hardware access, be more careful. */


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-16 15:55               ` Chris Friesen
  2007-04-16 16:13                 ` William Lee Irwin III
  2007-04-17  0:04                 ` Peter Williams
@ 2007-04-17 13:07                 ` James Bruce
  2007-04-17 20:05                   ` William Lee Irwin III
  2 siblings, 1 reply; 712+ messages in thread
From: James Bruce @ 2007-04-17 13:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: ck

Chris Friesen wrote:
> William Lee Irwin III wrote:
> 
>> The sorts of explicit decisions I'd like to be made for these are:
>> (1) In a mixture of tasks with varying nice numbers, a given nice number
>>     corresponds to some share of CPU bandwidth. Implementations
>>     should not have the freedom to change this arbitrarily according
>>     to some intention.
> 
> The first question that comes to my mind is whether nice levels should 
> be linear or not.  I would lean towards nonlinear as it allows a wider 
> range (although of course at the expense of precision).  Maybe something 
> like "each nice level gives X times the cpu of the previous"?  I think a 
> value of X somewhere between 1.15 and 1.25 might be reasonable.

Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589

That value has the property that a nice=10 task gets 1/10th the cpu of a 
nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that 
would be fairly easy to explain to admins and users so that they can 
know what to expect from nicing tasks.
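
[ A minimal sketch of the arithmetic above, not code from any posted
  scheduler. The names NICE_STEP, nice_weight() and cpu_share() are
  hypothetical. The helper accepts nice=20 only to mirror the example
  in the text; the usual nice range ends at 19. ]

```c
#include <assert.h>

/* 10^(1/10): ten nice levels apart means a factor of ten in CPU share. */
#define NICE_STEP 1.2589254117941673

/* Hypothetical helper: weight of a task relative to a nice=0 task. */
static double nice_weight(int nice)
{
	double w = 1.0;
	int i;

	assert(nice >= -20 && nice <= 20);
	for (i = 0; i < nice; i++)	/* positive nice: shrink the weight */
		w /= NICE_STEP;
	for (i = 0; i > nice; i--)	/* negative nice: grow the weight */
		w *= NICE_STEP;
	return w;
}

/* Fraction of one CPU a task gets when the runnable weights sum to total. */
static double cpu_share(double weight, double total_weight)
{
	return weight / total_weight;
}
```

With these definitions nice_weight(10) is about 0.1 and nice_weight(20)
about 0.01, so a nice=0 hog competing with a single nice=10 hog would
get roughly 1/1.1, or about 91%, of the CPU.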

> What about also having something that looks at latency, and how latency 
> changes with niceness?

I think this would be a lot harder to pin down, since it's a function of 
all the other tasks running and their nice levels.  Do you have any of 
the RT-derived analysis models in mind?

> What about specifying the timeframe over which the cpu bandwidth is 
> measured?  I currently have a system where the application designers 
> would like it to be totally fair over a period of 1 second.  As you can 
> imagine, mainline doesn't do very well in this case.

It might be easier to specify the maximum deviation from the ideal 
bandwidth over a certain period.  I.e. something like "over a period of 
one second, each task receives within 10% of the expected bandwidth".
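
[ One way such a criterion could be checked in a test harness; a sketch
  only. within_tolerance() and its parameters are hypothetical names,
  and all times are assumed to be in the same unit (e.g. microseconds)
  and small enough that multiplying by 100 does not overflow. ]

```c
#include <assert.h>

/*
 * Hypothetical check: did a task receive within tolerance_pct percent of
 * its expected CPU time over one measurement window?
 */
static int within_tolerance(unsigned long received, unsigned long expected,
			    unsigned int tolerance_pct)
{
	unsigned long diff = received > expected ? received - expected
						 : expected - received;

	assert(tolerance_pct <= 100);
	/* diff / expected <= tolerance_pct / 100, written without division */
	return diff * 100 <= expected * tolerance_pct;
}
```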

  - Jim Bruce



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:56                     ` Nick Piggin
@ 2007-04-17 13:16                       ` Peter Williams
  2007-04-18  4:46                         ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17 13:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote:
>> Nick Piggin wrote:
>>>> Other hints that it was a bad idea was the need to transfer time slices 
>>>> between children and parents during fork() and exit().
>>> I don't see how that has anything to do with dual arrays.
>> It's totally to do with the dual arrays.  The only real purpose of the 
>> time slice in O(1) (regardless of what its perceived purpose was) was to 
>> control the switching between the arrays.
> 
> The O(1) design is pretty convoluted in that regard. In my scheduler,
> the only purpose of the arrays is to renew time slices.
> 
> The fork/exit logic is added to make interactivity better. Ingo's
> scheduler has similar equivalent logic.
> 
> 
>>> If you put
>>> a new child at the back of the queue, then your various interactive
>>> shell commands that typically do a lot of dependant forking get slowed
>>> right down behind your compile job. If you give a new child its own
>>> timeslice irrespective of the parent, then you have things like 'make'
>>> (which doesn't use a lot of CPU time) spawning off lots of high
>>> priority children.
>> This is an artefact of trying to control nice using time slices while 
>> using them for controlling array switching and whatever else they were 
>> being used for.  Priority (static and dynamic) is the best way to 
>> implement nice.
> 
> I don't like the timeslice based nice in mainline. It's too nasty
> with latencies. nicksched is far better in that regard IMO.
> 
> But I don't know how you can assert a particular way is the best way
> to do something.

I should have added "I may be wrong but I think that ...".

My opinion is based on a lot of experience with different types of 
scheduler design and the observation from gathering scheduling 
statistics while playing with these schedulers that the size of the time 
slices we're talking about is much larger than the CPU chunks most tasks 
use in any one go so time slice size has no real effect on most tasks 
and the faster CPUs become the more this becomes true.

> 
> 
>>> You need to do _something_ (Ingo's does). I don't see why this would
>>> be tied with a dual array. FWIW, mine doesn't do anything on exit()
>>> like most others, but it may need more tuning in this area.
>>>
>>>
>>>> This disregard for the dual array mechanism has prevented me from 
>>>> looking at the rest of your scheduler in any great detail so I can't 
>>>> comment on any other ideas that may be in there.
>>> Well I wasn't really asking you to review it. As I said, everyone
>>> has their own idea of what a good design does, and review can't really
>>> distinguish between the better of two reasonable designs.
>>>
>>> A fair evaluation of the alternatives seems like a good idea though.
>>> Nobody is actually against this, are they?
>> No.  It would be nice if the basic ideas that each scheduler tries to 
>> implement could be extracted and explained though.  This could lead to a 
>> melding of ideas that leads to something quite good.
>>
>>>
>>>>> I haven't looked at Con's ones for a while,
>>>>> but I believe they are also much more straightforward than mainline...
>>>> I like Con's scheduler (partly because it uses a single array) but 
>>>> mainly because it's nice and simple.  However, his earlier schedulers 
>>>> were prone to starvation (admittedly, only if you went out of your way 
>>>> to make it happen) and I tried to convince him to use the anti 
>>>> starvation mechanism in my SPA schedulers but was unsuccessful.  I 
>>>> haven't looked at his latest scheduler that sparked all this furore so 
>>>> can't comment on it.
>>> I agree starvation or unfairness is unacceptable for a new scheduler.
>>>
>>>
>>>>> For example, let's say all else is equal between them, then why would
>>>>> we go with the O(logN) implementation rather than the O(1)?
>>>> In the highly unlikely event that you can't separate them on technical 
>>>> grounds, Occam's razor recommends choosing the simplest solution. :-)
>>> O(logN) vs O(1) is technical grounds.
>> In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
>> greater than O(logN)'s k factor multiplied by logMaxN.
> 
> Yes, or even significantly greater around typical large sizes of N.

Yes.  In fact it's probably better to use the maximum number of threads 
allowed on the system for N.  We know that value don't we?
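
[ The comparison being discussed, as a sketch. ceil_log2() and
  prefer_constant_time() are hypothetical names, and the "k factors" are
  assumed to be per-operation costs measured in the same unit. ]

```c
#include <assert.h>

/* ceil(log2(n)) for n >= 1, using integer arithmetic only. */
static unsigned int ceil_log2(unsigned long n)
{
	unsigned long m;
	unsigned int bits = 0;

	assert(n >= 1);
	for (m = n - 1; m; m >>= 1)	/* count the bits needed for n - 1 */
		bits++;
	return bits;
}

/*
 * Prefer the O(1) algorithm when its constant cost does not exceed the
 * O(log N) algorithm's worst case at the largest N the system allows.
 */
static int prefer_constant_time(unsigned long k_const, unsigned long k_log,
				unsigned long max_n)
{
	return k_const <= k_log * ceil_log2(max_n);
}
```

For example, with an illustrative limit of 32768 threads the logarithmic
term is 15, so an O(1) design with k = 100 would still win against an
O(log N) design with k = 10, while one with k = 200 would not.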

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:51               ` Ingo Molnar
@ 2007-04-17 13:44                 ` Peter Williams
  2007-04-17 23:00                   ` Michael K. Edwards
  2007-04-20 20:47                 ` Bill Davidsen
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17 13:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Ingo Molnar wrote:
> * Nick Piggin <npiggin@suse.de> wrote:
> 
>>>> Maybe the progress is that more key people are becoming open to 
>>>> the idea of changing the scheduler.
>>> Could be.  All was quiet for quite a while, but when RSDL showed up, 
>>> it aroused enough interest to show that scheduling woes is on folks 
>>> radar.
>> Well I know people have had woes with the scheduler for ever (I guess 
>> that isn't going to change :P). [...]
> 
> yes, that part isnt going to change, because the CPU is a _scarce 
> resource_ that is perhaps the most frequently overcommitted physical 
> computer resource in existence, and because the kernel does not (yet) 
> track eye movements of humans to figure out which tasks are more 
> important to them. So critical human constraints are unknown to the 
> scheduler and thus complaints will always come.
> 
> The upstream scheduler thought it had enough information: the sleep 
> average. So now the attempt is to go back and _simplify_ the scheduler 
> and remove that information, and concentrate on getting fairness 
> precisely right. The magic thing about 'fairness' is that it's a pretty 
> good default policy if we decide that we simply have not enough 
> information to do an intelligent choice.
> 
> ( Lets be cautious though: the jury is still out whether people actually 
>   like this more than the current approach. While CFS feedback looks 
>   promising after a whopping 3 days of it being released [ ;-) ], the 
>   test coverage of all 'fairness centric' schedulers, even considering 
>   years of availability is less than 1% i'm afraid, and that < 1% was 
>   mostly self-selecting. )

At this point I'd like to make the observation that spa_ebs is a very 
fair scheduler if you consider "nice" to be an indication of the 
relative entitlement of tasks to CPU bandwidth.

It works by mapping nice to shares using a function very similar to the 
one for calculating p->load_weight except it's not offset by the RT 
priorities as RT is handled separately.  In theory, a runnable task's 
entitlement to CPU bandwidth at any time is the ratio of its shares to 
the total shares held by runnable tasks on the same CPU (in reality, a 
smoothed average of this sum is used to make scheduling smoother).  The 
dynamic priorities of the runnable tasks are then fiddled to try to keep 
each task's CPU bandwidth usage in proportion to its entitlement.

That's the theory anyway.

The actual implementation looks a bit different due to efficiency 
considerations.  The modifications to the above theory boil down to 
keeping a running measure of the (recent) highest CPU bandwidth use per 
share for tasks running on the CPU -- I call this the yardstick for this 
CPU.  When it's time to put a task on the run queue its dynamic 
priority is determined by comparing its CPU bandwidth per share value 
with the yardstick for its CPU.  If it's greater than the yardstick this 
value becomes the new yardstick and the task gets given the lowest 
possible dynamic priority (for its scheduling class).  If the value is 
zero it gets the highest possible priority (for its scheduling class) 
which would be MAX_RT_PRIO for a SCHED_OTHER task.  Otherwise it gets 
given a priority between these two extremes proportional to the ratio of its 
CPU bandwidth per share value and the yardstick.  Quite simple really.
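
[ A sketch of the yardstick rule as described above; this is not the
  spa_ebs source. struct cpu_yardstick and ebs_dynamic_prio() are
  hypothetical names, the linear interpolation is an assumption, and
  MAX_RT_PRIO = 100 / MAX_PRIO = 140 are used as illustrative bounds
  where a lower number means a higher priority. ]

```c
#include <assert.h>

#define MAX_RT_PRIO	100	/* best priority for a SCHED_OTHER task */
#define MAX_PRIO	140	/* one past the worst priority */

struct cpu_yardstick {
	unsigned long usage_per_share;	/* recent highest usage per share */
};

/* Map a task's CPU usage per share to a dynamic priority. */
static int ebs_dynamic_prio(struct cpu_yardstick *ys,
			    unsigned long usage_per_share)
{
	const int best = MAX_RT_PRIO, worst = MAX_PRIO - 1;

	assert(ys);
	if (usage_per_share == 0)
		return best;
	if (usage_per_share > ys->usage_per_share) {
		/* this task sets the new yardstick for its CPU */
		ys->usage_per_share = usage_per_share;
		return worst;
	}
	/* 0 < usage <= yardstick: scale linearly between the extremes */
	return best + (int)((unsigned long)(worst - best) *
			    usage_per_share / ys->usage_per_share);
}
```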

The other way in which the code deviates from the original is that (for 
a few years now) I no longer calculate CPU bandwidth usage directly. 
I've found that the overhead is less if I keep a running average of the 
size of a task's CPU bursts and the length of its scheduling cycle (i.e. 
from on CPU one time to on CPU next time) and use the ratio of these 
values as a measure of bandwidth usage.
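
[ A sketch of the burst/cycle bookkeeping, assuming an exponential
  moving average with a 1/8 weight on each new sample; the actual
  smoothing used is not stated here. struct task_stats, record_cycle()
  and usage_permille() are hypothetical names. ]

```c
#include <assert.h>

struct task_stats {
	unsigned long avg_burst;	/* smoothed on-CPU time per cycle */
	unsigned long avg_cycle;	/* smoothed on-CPU to on-CPU interval */
};

/* Exponential moving average: 7/8 old value, 1/8 new sample (assumed). */
static unsigned long ema(unsigned long avg, unsigned long sample)
{
	return avg - avg / 8 + sample / 8;
}

/* Called once per scheduling cycle with the measured lengths. */
static void record_cycle(struct task_stats *ts, unsigned long burst,
			 unsigned long cycle)
{
	assert(burst <= cycle);
	ts->avg_burst = ema(ts->avg_burst, burst);
	ts->avg_cycle = ema(ts->avg_cycle, cycle);
}

/* Bandwidth usage estimate in parts per thousand of one CPU. */
static unsigned long usage_permille(const struct task_stats *ts)
{
	if (ts->avg_cycle == 0)
		return 0;
	return ts->avg_burst * 1000 / ts->avg_cycle;
}
```

A task that repeatedly runs for 2000 time units out of every 8000 would
settle at an estimate of about 250 per thousand, i.e. a quarter of a CPU.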

Anyway it works and gives very predictable allocations of CPU bandwidth 
based on nice.

Another good feature is that (in this pure form) it's starvation free. 
However, if you fiddle with it and do things like giving bonus priority 
boosts to interactive tasks it becomes susceptible to starvation.  This 
can be fixed by using an anti starvation mechanism such as SPA's 
promotion scheme and that's what spa_ebs does.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 10:41                                     ` William Lee Irwin III
@ 2007-04-17 13:48                                       ` Peter Williams
  2007-04-18  0:27                                         ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17 13:48 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> Comments on which directions you'd like this to go in these respects
>>> would be appreciated, as I regard you as the current "project owner."
> 
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I'd do a scan through LKML from about 18 months ago looking for mention of 
>> a runtime configurable version of plugsched.  Some students at a 
>> university (in Germany, I think) posted some patches adding this feature 
>> to plugsched around about then.
> 
> Excellent. I'll go hunting for that.
> 
> 
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> I never added them to plugsched proper as I knew (from previous 
>> experience when the company I worked for posted patches with similar 
>> functionality) that Linus would like this idea less than he did the 
>> current plugsched mechanism.
> 
> Odd how the requirements ended up including that. Fickleness abounds.
> If only we knew up-front what the end would be.
> 
> 
> On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote:
>> Unfortunately, my own cache of the relevant e-mails got overwritten 
>> during a Fedora Core upgrade (I've since moved /var onto a separate 
>> drive to avoid a repetition) or I would dig them out and send them to 
>> you.  I'd provided them with copies of the company's patches to use as a 
>> guide to how to overcome the problems associated with changing 
>> schedulers on a running system (a few non trivial locking issues pop up).
>> Maybe if one of the students still reads LKML he will provide a pointer.
> 
> I was tempted to restart from scratch given Ingo's comments, but I
> reconsidered and I'll be working with your code (and the German
> students' as well). If everything has to change, so be it, but it'll
> still be a derived work. It would be ignoring precedent and failure to
> properly attribute if I did otherwise.

I can give you a patch (or set of patches) against the latest git 
vanilla kernel version if that would help.  There have been changes to 
the vanilla scheduler code since 2.6.20 so the latest patch on 
sourceforge won't apply cleanly.  I've found that implementing this as a 
series of patches rather than one big patch makes it easier for me to 
cope with changes to the underlying code.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  0:54               ` Peter Williams
@ 2007-04-17 15:52                 ` Chris Friesen
  2007-04-17 23:50                   ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Chris Friesen @ 2007-04-17 15:52 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:
> Chris Friesen wrote:
>> Scuse me if I jump in here, but doesn't the load balancer need some 
>> way to figure out a) when to run, and b) which tasks to pull and where 
>> to push them?

> Yes but both of these are independent of the scheduler discipline in force.

It is not clear to me that this is always the case, especially once you 
mix in things like resource groups.

> If
> the load balancer manages to keep the weighted (according to static 
> priority) load and distribution of priorities within the loads on the 
> CPUs roughly equal and the scheduler does a good job of ensuring 
> fairness, interactive responsiveness etc. for the tasks within a CPU 
> then the result will be good system performance within the constraints 
> set by the sys admin's use of real time priorities and nice.

Suppose I have a really high priority task running.  Another very high 
priority task wakes up and would normally preempt the first one. 
However, there happens to be another cpu available.  It seems like it 
would be a win if we moved one of those tasks to the available cpu 
immediately so they can both run simultaneously.  This would seem to 
require some communication between the scheduler and the load balancer.

Certainly the above design could introduce a lot of context switching. 
But if my goal is a scheduler that minimizes latency (even at the cost 
of throughput) then that's an acceptable price to pay.

Chris


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  3:31               ` Nick Piggin
@ 2007-04-17 17:35                 ` Matt Mackall
  0 siblings, 0 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 17:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote:
> > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote:
> > > I'd prefer if we kept a single CPU scheduler in mainline, because I
> > > think that simplifies analysis and focuses testing.
> > 
> > I think you'll find something like 80-90% of the testing will be done
> > on the default choice, even if other choices exist. So you really
> > won't have much of a problem here.
> > 
> > But when the only choice for other schedulers is to go out-of-tree,
> > then only 1% of the people will try it out and those people are
> > guaranteed to be the ones who saw scheduling problems in mainline.
> > So the alternative won't end up getting any testing on many of the
> > workloads that work fine in mainstream so their feedback won't tell
> > you very much at all.
> 
> Yeah I concede that perhaps it is the only way to get things going
> any further. But how do we decide if and when the current scheduler
> should be demoted from default, and which should replace it?

Step one is ship both in -mm. If that doesn't give us enough
confidence, ship both in mainline. If that doesn't give us enough
confidence, wait until vendors ship both. Eventually a clear picture
should emerge. If it doesn't, either the change is not significant or
no one cares.

But it really is important to be able to do controlled experiments on
this stuff with little effort. That's the recipe for getting lots of
valid feedback.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 13:07                 ` James Bruce
@ 2007-04-17 20:05                   ` William Lee Irwin III
  0 siblings, 0 replies; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 20:05 UTC (permalink / raw)
  To: James Bruce
  Cc: Chris Friesen, Willy Tarreau, Pekka Enberg, hui Bill Huey,
	Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel,
	Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
> That value has the property that a nice=10 task gets 1/10th the cpu of a 
> nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that 
> would be fairly easy to explain to admins and users so that they can 
> know what to expect from nicing tasks.
[...additional good commentary trimmed...]

Lots of good ideas here. I'll follow them.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:01                     ` Nick Piggin
  2007-04-17  8:23                       ` William Lee Irwin III
@ 2007-04-17 21:39                       ` Matt Mackall
  2007-04-17 23:23                         ` Peter Williams
  2007-04-18  3:15                         ` Nick Piggin
  1 sibling, 2 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 21:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > >> All things are not equal; they all have different properties. I like
> > 
> > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > Exactly. So we have to explore those properties and evaluate performance
> > > (in all meanings of the word). That's only logical.
> > 
> > Any chance you'd be willing to put down a few thoughts on what sorts
> > of standards you'd like to set for both correctness (i.e. the bare
> > minimum a scheduler implementation must do to be considered valid
> > beyond not oopsing) and performance metrics (i.e. things that produce
> > numbers for each scheduler you can compare to say "this scheduler is
> > better than this other scheduler at this.").
> 
> Yeah I guess that's the hard part :)
> 
> For correctness, I guess fairness is an easy one. I think that unfairness
> is basically a bug and that it would be very unfortunate to merge something
> unfair. But this is just within the context of a single runqueue... for
> better or worse, we allow some unfairness in multiprocessors for performance
> reasons of course.

I'm a big fan of fairness, but I think it's a bit early to declare it
a mandatory feature. Bounded unfairness is probably something we can
agree on, ie "if we decide to be unfair, no process suffers more than
a factor of x".
 
> Latency. Given N tasks in the system, an arbitrary task should get
> onto the CPU in a bounded amount of time (excluding events like freak
> IRQ holdoffs and such, obviously -- ie. just considering the context
> of the scheduler's state machine).

This is a slightly stronger statement than starvation-free (which is
obviously mandatory). I think you're looking for something like
"worst-case scheduling latency is proportional to the number of
runnable tasks". Which I think is quite a reasonable requirement.

I'm pretty sure the stock scheduler falls short of both of these
guarantees though.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:24                           ` Ingo Molnar
  2007-04-17  9:57                             ` William Lee Irwin III
@ 2007-04-17 22:08                             ` Matt Mackall
  2007-04-17 22:32                               ` William Lee Irwin III
  1 sibling, 1 reply; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 22:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, Davide Libenzi, Nick Piggin,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> 
> * William Lee Irwin III <wli@holomorphy.com> wrote:
> 
> > [...] Also rest assured that the tone of the critique is not hostile, 
> > and wasn't meant to sound that way.
> 
> ok :) (And i guess i was too touchy - sorry about coming out swinging.)
> 
> > Also, given the general comments it appears clear that some 
> > statistical metric of deviation from the intended behavior furthermore 
> > qualified by timescale is necessary, so this appears to be headed 
> > toward a sort of performance metric as opposed to a pass/fail test 
> > anyway. However, to even measure this at all, some statement of 
> > intention is required. I'd prefer that there be a Linux-standard 
> > semantics for nice so results are more directly comparable and so that 
> > users also get similar nice behavior from the scheduler as it varies 
> > over time and possibly implementations if users should care to switch 
> > them out with some scheduler patch or other.
> 
> yeah. If you could come up with a sane definition that also translates 
> into low overhead on the algorithm side that would be great!

How's this:

If you're running two identical CPU hog tasks A and B differing only by nice
level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
constant f(Anice - Bnice).

Other definitions make things hard to analyze and probably not
well-bounded when confronted with > 2 tasks.

I -think- this implies keeping a separate scaled CPU usage counter,
where the scaling factor is a trivial exponential function of nice
level where f(0) == 1. Then you schedule based on this scaled usage
counter rather than unscaled.

I also suspect we want to keep the exponential base small so that the
maximal difference is 10x-100x.
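
A minimal sketch of the scaled counter, to make the definition concrete.
The base (chosen here so that the two ends of the nice range differ by
100x), the tick-driven loop, and all the names are illustrative
assumptions, not code from any posted patch:

```python
BASE = 100 ** (1 / 39)  # assumed base: nice -20 vs nice 19 differ by 100x

def charge(nice):
    """Scaled cost of one tick of CPU at this nice level; charge(0) == 1."""
    return BASE ** nice

def run_hogs(nices, ticks):
    """Simulate always-runnable CPU hogs and return raw ticks per task."""
    scaled = [0.0] * len(nices)  # scaled usage counters, one per task
    cputime = [0] * len(nices)   # unscaled ticks actually received
    for _ in range(ticks):
        # Schedule on the scaled counter: least scaled usage runs next.
        nxt = min(range(len(nices)), key=lambda t: scaled[t])
        cputime[nxt] += 1
        scaled[nxt] += charge(nices[nxt])
    return cputime

a, b = run_hogs([0, 5], 20000)
ratio = a / b  # should approach BASE ** 5, a function of the nice gap only
```

Because each task is charged in proportion to its own factor, the same
ratio should hold between any pair of hogs regardless of how many others
are runnable.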

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  8:23                       ` William Lee Irwin III
@ 2007-04-17 22:23                         ` Davide Libenzi
  0 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-17 22:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, William Lee Irwin III wrote:

> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > Latency. Given N tasks in the system, an arbitrary task should get
> > onto the CPU in a bounded amount of time (excluding events like freak
> > IRQ holdoffs and such, obviously -- ie. just considering the context
> > of the scheduler's state machine).
> 
> ISTR Davide Libenzi having a scheduling latency test a number of years
> ago. Resurrecting that and tuning it to the needs of this kind of
> testing sounds relevant here. The test suite Peter Williams mentioned
> would also help.

That helped me a lot at the time. At every context switch it sampled
critical scheduler parameters for both the entering and the exiting task
(and the associated timestamps). The data was then collected from
userspace through a /dev/idontremember for analysis. It'd be very useful
to have it these days, to study what really happens under the hood
(variations in the scheduler's internal parameters and such) when those
weird loads make the scheduler unstable.



- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:08                             ` Matt Mackall
@ 2007-04-17 22:32                               ` William Lee Irwin III
  2007-04-17 22:39                                 ` Matt Mackall
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 22:32 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
>> yeah. If you could come up with a sane definition that also translates 
>> into low overhead on the algorithm side that would be great!

On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
> How's this:
> If you're running two identical CPU hog tasks A and B differing only by nice
> level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
> constant f(Anice - Bnice).
> Other definitions make things hard to analyze and probably not
> well-bounded when confronted with > 2 tasks.
> I -think- this implies keeping a separate scaled CPU usage counter,
> where the scaling factor is a trivial exponential function of nice
> level where f(0) == 1. Then you schedule based on this scaled usage
> counter rather than unscaled.
> I also suspect we want to keep the exponential base small so that the
> maximal difference is 10x-100x.

I'm already working with this as my assumed nice semantics (actually
something with a specific exponential base, suggested in other emails)
until others start saying they want something different and agree.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:32                               ` William Lee Irwin III
@ 2007-04-17 22:39                                 ` Matt Mackall
  2007-04-17 22:59                                   ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 22:39 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote:
> >> yeah. If you could come up with a sane definition that also translates 
> >> into low overhead on the algorithm side that would be great!
> 
> On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote:
> > How's this:
> > If you're running two identical CPU hog tasks A and B differing only by nice
> > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a
> > constant f(Anice - Bnice).
> > Other definitions make things hard to analyze and probably not
> > well-bounded when confronted with > 2 tasks.
> > I -think- this implies keeping a separate scaled CPU usage counter,
> > where the scaling factor is a trivial exponential function of nice
> > level where f(0) == 1. Then you schedule based on this scaled usage
> > counter rather than unscaled.
> > I also suspect we want to keep the exponential base small so that the
> > maximal difference is 10x-100x.
> 
> I'm already working with this as my assumed nice semantics (actually
> something with a specific exponential base, suggested in other emails)
> until others start saying they want something different and agree.

Good. This has a couple nice mathematical properties, including
"bounded unfairness" which I mentioned earlier. What base are you
looking at?

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:59                                   ` William Lee Irwin III
@ 2007-04-17 22:57                                     ` Matt Mackall
  2007-04-18  4:29                                       ` William Lee Irwin III
  2007-04-18  7:29                                       ` James Bruce
  0 siblings, 2 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 22:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
> >> I'm already working with this as my assumed nice semantics (actually
> >> something with a specific exponential base, suggested in other emails)
> >> until others start saying they want something different and agree.
> 
> On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote:
> > Good. This has a couple nice mathematical properties, including
> > "bounded unfairness" which I mentioned earlier. What base are you
> > looking at?
> 
> I'm working with the following suggestion:
> 
> 
> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> > Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
> > That value has the property that a nice=10 task gets 1/10th the cpu of a
> > nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
> > would be fairly easy to explain to admins and users so that they can
> > know what to expect from nicing tasks.
> 
> I'm not likely to write the testcase until this upcoming weekend, though.

So that means there's a ratio of roughly 10000:1 (10^3.9, about 7900:1)
between nice 20 and nice -19. With that sort of dynamic range, you're
likely to have non-trivial numerical accuracy issues in integer/fixed-point
math.

(Especially if your clock is jiffies-scale, which a significant number
of machines will continue to be.)
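
To put numbers on that, here is a rough sketch using the base James
suggested. The 10-bit fixed-point shift and the truncating conversion are
assumptions made for illustration; a real implementation could pick
something else:

```python
BASE = 10 ** (1 / 10)  # X = exp(ln(10)/10), about 1.2589
SHIFT = 10             # assumed number of fractional bits

def exact_weight(nice):
    """Relative CPU share of a task at this nice level (nice 0 -> 1.0)."""
    return BASE ** (-nice)

def fixed_weight(nice):
    """The same weight as a truncated fixed-point integer."""
    return int(exact_weight(nice) * (1 << SHIFT))

def relative_error(nice):
    """Fraction of the exact weight lost to truncation."""
    exact = exact_weight(nice) * (1 << SHIFT)
    return (exact - fixed_weight(nice)) / exact

# Share of nice -19 relative to nice 20: 10 ** 3.9 with this base.
dynamic_range = exact_weight(-19) / exact_weight(20)
```

With these assumptions the weights near nice 19 are only a dozen or so
units wide, so truncation alone costs several percent there, while the
weights at the negative end are essentially exact.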

I really think if we want to have vastly different ratios, we probably
want to be looking at BATCH and RT scheduling classes instead.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:39                                 ` Matt Mackall
@ 2007-04-17 22:59                                   ` William Lee Irwin III
  2007-04-17 22:57                                     ` Matt Mackall
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 22:59 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
>> I'm already working with this as my assumed nice semantics (actually
>> something with a specific exponential base, suggested in other emails)
>> until others start saying they want something different and agree.

On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote:
> Good. This has a couple nice mathematical properties, including
> "bounded unfairness" which I mentioned earlier. What base are you
> looking at?

I'm working with the following suggestion:


On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
> Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
> That value has the property that a nice=10 task gets 1/10th the cpu of a
> nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
> would be fairly easy to explain to admins and users so that they can
> know what to expect from nicing tasks.


I'm not likely to write the testcase until this upcoming weekend, though.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 13:44                 ` Peter Williams
@ 2007-04-17 23:00                   ` Michael K. Edwards
  2007-04-17 23:07                     ` William Lee Irwin III
  2007-04-18  2:39                     ` Peter Williams
  0 siblings, 2 replies; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-17 23:00 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list,
	Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> The other way in which the code deviates from the original is that (for
> a few years now) I no longer calculate CPU bandwidth usage directly.
> I've found that the overhead is less if I keep a running average of the
> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
> from on CPU one time to on CPU next time) and use the ratio of these
> values as a measure of bandwidth usage.
>
> Anyway it works and gives very predictable allocations of CPU bandwidth
> based on nice.

Works, that is, right up until you add nonlinear interactions with CPU
speed scaling.  From my perspective as an embedded platform
integrator, clock/voltage scaling is the elephant in the scheduler's
living room.  Patch in DPM (now OpPoint?) to scale the clock based on
what task is being scheduled, and suddenly the dynamic priority
calculations go wild.  Nip this in the bud by putting an RT priority
on the relevant threads (which you have to do anyway if you need
remotely audio-grade latency), and the lock affinity heuristics break,
so you have to hand-tune all the thread priorities.  Blecch.

Not to mention the likelihood that the task whose clock speed you're
trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
than the application.  (You want to crank the CPU for this task
because it runs with the RF hot, which may cost you as much power as
the rest of the platform.)  You'd better hope you can remove it from
the dynamic priority heuristics with SCHED_BATCH.  Otherwise
everything _else_ has to be RT priority (or it'll be starved by the
soft MAC) and you've basically tossed SCHED_NORMAL in the bin.  Double
blecch!

Is it too much to ask for someone with actual engineering training
(not me, unfortunately) to sit down and build a negative-feedback
control system that handles soft-real-time _and_ dynamic-priority
_and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
scaling?  And actually separates the accounting and control mechanisms
from the heuristics, so the latter can be tuned (within a well
documented stable range) to reflect the expected system usage
patterns?

It's not like there isn't a vast literature in this area over the past
decade, including some dealing specifically with clock scaling
consistent with low-latency applications.  It's a pity that people
doing academic work in this area rarely wade into LKML, even when
they're hacking on a Linux fork.  But then, there's not much economic
incentive for them to do so, and they can usually get their fill of
citation politics and dominance games without leaving their home
department.  :-P

Seriously, though.  If you're really going to put the mainline
scheduler through this kind of churn, please please pretty please knit
in per-task clock scaling (possibly even rejigged during the slice;
see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
linger mechanism to keep from taking context switch hits when you're
confident that an I/O will complete quickly.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:00                   ` Michael K. Edwards
@ 2007-04-17 23:07                     ` William Lee Irwin III
  2007-04-17 23:52                       ` Michael K. Edwards
  2007-04-18  2:39                     ` Peter Williams
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-17 23:07 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 04:00:53PM -0700, Michael K. Edwards wrote:
> Works, that is, right up until you add nonlinear interactions with CPU
> speed scaling.  From my perspective as an embedded platform
> integrator, clock/voltage scaling is the elephant in the scheduler's
> living room.  Patch in DPM (now OpPoint?) to scale the clock based on
> what task is being scheduled, and suddenly the dynamic priority
> calculations go wild.  Nip this in the bud by putting an RT priority
> on the relevant threads (which you have to do anyway if you need
> remotely audio-grade latency), and the lock affinity heuristics break,
> so you have to hand-tune all the thread priorities.  Blecch.
[...not terribly enlightening stuff trimmed...]

The ongoing scheduler work is on a much more basic level than these
affairs I'm guessing you googled. When the basics work as intended it
will be possible to move on to more advanced issues.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:23                         ` Peter Williams
@ 2007-04-17 23:19                           ` Matt Mackall
  0 siblings, 0 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-17 23:19 UTC (permalink / raw)
  To: Peter Williams
  Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 09:23:42AM +1000, Peter Williams wrote:
> Matt Mackall wrote:
> >On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> >>On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> >>>On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> >>>>>All things are not equal; they all have different properties. I like
> >>>On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> >>>>Exactly. So we have to explore those properties and evaluate performance
> >>>>(in all meanings of the word). That's only logical.
> >>>Any chance you'd be willing to put down a few thoughts on what sorts
> >>>of standards you'd like to set for both correctness (i.e. the bare
> >>>minimum a scheduler implementation must do to be considered valid
> >>>beyond not oopsing) and performance metrics (i.e. things that produce
> >>>numbers for each scheduler you can compare to say "this scheduler is
> >>>better than this other scheduler at this.").
> >>Yeah I guess that's the hard part :)
> >>
> >>For correctness, I guess fairness is an easy one. I think that unfairness
> >>is basically a bug and that it would be very unfortunate to merge 
> >>something
> >>unfair. But this is just within the context of a single runqueue... for
> >>better or worse, we allow some unfairness in multiprocessors for 
> >>performance
> >>reasons of course.
> >
> >I'm a big fan of fairness, but I think it's a bit early to declare it
> >a mandatory feature. Bounded unfairness is probably something we can
> >agree on, ie "if we decide to be unfair, no process suffers more than
> >a factor of x".
> > 
> >>Latency. Given N tasks in the system, an arbitrary task should get
> >>onto the CPU in a bounded amount of time (excluding events like freak
> >>IRQ holdoffs and such, obviously -- ie. just considering the context
> >>of the scheduler's state machine).
> >
> >This is a slightly stronger statement than starvation-free (which is
> >obviously mandatory). I think you're looking for something like
> >"worst-case scheduling latency is proportional to the number of
> >runnable tasks".
> 
> add "taking into consideration nice and/or real time priorities of 
> runnable tasks".  I.e. if a task is nice 19 it can expect to wait longer 
> to get onto the CPU than if it was nice 0.

Yes. Assuming we meet the "bounded unfairness" criterion above, this
follows.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 21:39                       ` Matt Mackall
@ 2007-04-17 23:23                         ` Peter Williams
  2007-04-17 23:19                           ` Matt Mackall
  2007-04-18  3:15                         ` Nick Piggin
  1 sibling, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17 23:23 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

Matt Mackall wrote:
> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
>> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
>>> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
>>>>> All things are not equal; they all have different properties. I like
>>> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
>>>> Exactly. So we have to explore those properties and evaluate performance
>>>> (in all meanings of the word). That's only logical.
>>> Any chance you'd be willing to put down a few thoughts on what sorts
>>> of standards you'd like to set for both correctness (i.e. the bare
>>> minimum a scheduler implementation must do to be considered valid
>>> beyond not oopsing) and performance metrics (i.e. things that produce
>>> numbers for each scheduler you can compare to say "this scheduler is
>>> better than this other scheduler at this.").
>> Yeah I guess that's the hard part :)
>>
>> For correctness, I guess fairness is an easy one. I think that unfairness
>> is basically a bug and that it would be very unfortunate to merge something
>> unfair. But this is just within the context of a single runqueue... for
>> better or worse, we allow some unfairness in multiprocessors for performance
>> reasons of course.
> 
> I'm a big fan of fairness, but I think it's a bit early to declare it
> a mandatory feature. Bounded unfairness is probably something we can
> agree on, ie "if we decide to be unfair, no process suffers more than
> a factor of x".
>  
>> Latency. Given N tasks in the system, an arbitrary task should get
>> onto the CPU in a bounded amount of time (excluding events like freak
>> IRQ holdoffs and such, obviously -- ie. just considering the context
>> of the scheduler's state machine).
> 
> This is a slightly stronger statement than starvation-free (which is
> obviously mandatory). I think you're looking for something like
> "worst-case scheduling latency is proportional to the number of
> runnable tasks".

add "taking into consideration nice and/or real time priorities of 
runnable tasks".  I.e. if a task is nice 19 it can expect to wait longer 
to get onto the CPU than if it was nice 0.

> Which I think is quite a reasonable requirement.
> 
> I'm pretty sure the stock scheduler falls short of both of these
> guarantees though.
> 

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 15:52                 ` Chris Friesen
@ 2007-04-17 23:50                   ` Peter Williams
  2007-04-18  5:43                     ` Chris Friesen
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-17 23:50 UTC (permalink / raw)
  To: Chris Friesen
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Chris Friesen wrote:
> Peter Williams wrote:
>> Chris Friesen wrote:
>>> Scuse me if I jump in here, but doesn't the load balancer need some 
>>> way to figure out a) when to run, and b) which tasks to pull and 
>>> where to push them?
> 
>> Yes but both of these are independent of the scheduler discipline in 
>> force.
> 
> It is not clear to me that this is always the case, especially once you 
> mix in things like resource groups.
> 
>> If
>> the load balancer manages to keep the weighted (according to static 
>> priority) load and distribution of priorities within the loads on the 
>> CPUs roughly equal and the scheduler does a good job of ensuring 
>> fairness, interactive responsiveness etc. for the tasks within a CPU 
>> then the result will be good system performance within the constraints 
>> set by the sys admin's use of real time priorities and nice.
> 
> Suppose I have a really high priority task running.  Another very high 
> priority task wakes up and would normally preempt the first one. 
> However, there happens to be another cpu available.  It seems like it 
> would be a win if we moved one of those tasks to the available cpu 
> immediately so they can both run simultaneously.  This would seem to 
> require some communication between the scheduler and the load balancer.

Not really; the load balancer can do this on its own, AND the decision
should be based on the STATIC priority of the task being woken.

> 
> Certainly the above design could introduce a lot of context switching. 
> But if my goal is a scheduler that minimizes latency (even at the cost 
> of throughput) then that's an acceptable price to pay.

It would actually probably reduce context switching as putting the woken 
task on the best CPU at wake up means you don't have to move it later 
on.  The wake up code already does a little bit in this direction when 
it chooses which CPU to put a newly woken task on but could do more -- 
the only real cost would be the cost of looking at more candidate CPUs 
than it currently does.
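
A rough sketch of that wake-up placement follows. The data layout, the
function name, and the "lower number means higher priority" convention
are made up for illustration; this is not the existing wake-up code:

```python
def pick_wake_cpu(running_prio, woken_prio):
    """Choose a CPU for a newly woken task using static priorities only.

    running_prio[cpu] is the static priority of the task now running on
    that CPU, or None if the CPU is idle. Lower numbers mean higher
    priority. Returns a CPU index, or None if the woken task cannot
    preempt anything and should simply be queued.
    """
    best_cpu, best_prio = None, woken_prio
    for cpu, prio in enumerate(running_prio):
        if prio is None:
            return cpu                      # an idle CPU always wins
        if prio > best_prio:                # weakest running task so far
            best_cpu, best_prio = cpu, prio
    return best_cpu

# Two high-priority tasks and one spare CPU: the woken task goes to the
# idle CPU instead of preempting the task on CPU 0.
target = pick_wake_cpu([100, None], 101)
```

The loop over candidate CPUs is the extra cost mentioned above.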

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:07                     ` William Lee Irwin III
@ 2007-04-17 23:52                       ` Michael K. Edwards
  2007-04-18  0:36                         ` Bill Huey
  0 siblings, 1 reply; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-17 23:52 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> The ongoing scheduler work is on a much more basic level than these
> affairs I'm guessing you googled. When the basics work as intended it
> will be possible to move on to more advanced issues.

OK, let me try this in smaller words for people who can't tell bitter
experience from Google hits.  CPU clock scaling for power efficiency
is already the only thing that matters about the Linux scheduler in my
world, because battery-powered device vendors in their infinite wisdom
are abandoning real RTOSes in favor of Linux now that WiFi is the "in"
thing (again).  And on the timescale that anyone will actually be
_using_ this shiny new scheduler of Ingo's, it'll be nearly the only
thing that matters about the Linux scheduler in anyone's world,
because the amount of work the CPU can get done in a given minute will
depend mostly on how intelligently it can spend its heat dissipation
budget.

Clock scaling schemes that aren't integral to the scheduler design
make a bad situation (scheduling embedded loads with shotgun
heuristics tuned for desktop CPUs) worse, because the opaque
heuristics are now being applied to distorted data.  Add a "smoothing"
scheme for the distorted data, and you may find that you have
introduced an actual control-path instability.  A small fluctuation in
the data (say, two bursts of interrupt traffic at just the right
interval) can result in a long-lasting oscillation in some task's
"dynamic priority" -- and, on a fully loaded CPU, in the time that
task actually gets.  If anything else depends on how much work this
task gets done each time around, the oscillation can easily propagate
throughout the system.  Thrash city.
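
A toy model may make the mechanism concrete. It is a plain delayed
negative-feedback loop, not a model of any real scheduler; the gain, the
one-step delay, and the names are assumptions picked to show the two
regimes:

```python
def settle(gain, steps=40, delay=1, kick=1.0):
    """Deviation of a fed-back quantity from its set point over time.

    Each step corrects the current deviation using a measurement that is
    `delay` steps old: x[n+1] = x[n] - gain * x[n - delay].
    """
    x = [0.0] * delay + [kick]       # quiet history, then one disturbance
    for _ in range(steps):
        x.append(x[-1] - gain * x[-1 - delay])
    return x

damped = settle(gain=0.2)    # gentle correction: the kick dies away
unstable = settle(gain=1.5)  # aggressive correction on stale data
```

With the gentle gain the disturbance decays; with the aggressive gain
the same single kick turns into an oscillation that keeps growing,
because every correction is based on where the value used to be.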

(If you haven't seen this happen on real production systems under what
shouldn't be a pathological load, you haven't been around long.  The
classic mechanisms that triggered oscillations in, say, early SMP
Solaris boxes haven't bitten recently, perhaps because most modern
CPUs don't lose their marbles so comprehensively on context switch.
But I got to live this nightmare again recently on ARM Linux, due to
some impressively broken application-level threading/locking "design",
whose assumptions about scheduler behavior got broken when I switched
to an NPTL toolchain.)

I don't have the training to design a scheduler that isn't vulnerable
to control-feedback oscillations.  Neither do you, if you haven't
taken (and excelled at) a control theory course, which nowadays seems
to be taught by applied math and ECE departments and too often skipped
by CS types.  But I can recognize an impending train wreck when I see
it.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 13:48                                       ` Peter Williams
@ 2007-04-18  0:27                                         ` Peter Williams
  2007-04-18  2:03                                           ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-18  0:27 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

Peter Williams wrote:
> William Lee Irwin III wrote:
>> I was tempted to restart from scratch given Ingo's comments, but I
>> reconsidered and I'll be working with your code (and the German
>> students' as well). If everything has to change, so be it, but it'll
>> still be a derived work. It would be ignoring precedent and failure to
>> properly attribute if I did otherwise.
> 
> I can give you a patch (or set of patches) against the latest git 
> vanilla kernel version if that would help.  There have been changes to 
> the vanilla scheduler code since 2.6.20 so the latest patch on 
> sourceforge won't apply cleanly.  I've found that implementing this as a 
> series of patches rather than one big patch makes it easier for me to 
> cope with changes to the underlying code.

I've just placed a single patch for plugsched against 2.6.21-rc7 updated 
to Linus's git tree as of an hour or two ago on sourceforge:

<http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>

This should at least enable you to get it to apply cleanly to the latest 
kernel sources.  Let me know if you'd also like this as a
quilt/mq-friendly patch series.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:52                       ` Michael K. Edwards
@ 2007-04-18  0:36                         ` Bill Huey
  0 siblings, 0 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-18  0:36 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: William Lee Irwin III, Peter Williams, Ingo Molnar, Nick Piggin,
	Mike Galbraith, Con Kolivas, ck list, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner,
	Bill Huey (hui)

On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote:
> On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> >The ongoing scheduler work is on a much more basic level than these
> >affairs I'm guessing you googled. When the basics work as intended it
> >will be possible to move on to more advanced issues.
... 

Will probably shouldn't have dismissed your points, but he probably means
that we can't even get at this stuff until the fundamentals are in place.

> Clock scaling schemes that aren't integral to the scheduler design
> make a bad situation (scheduling embedded loads with shotgun
> heuristics tuned for desktop CPUs) worse, because the opaque
> heuristics are now being applied to distorted data.  Add a "smoothing"
> scheme for the distorted data, and you may find that you have
> introduced an actual control-path instability.  A small fluctuation in
> the data (say, two bursts of interrupt traffic at just the right
> interval) can result in a long-lasting oscillation in some task's
> "dynamic priority" -- and, on a fully loaded CPU, in the time that
> task actually gets.  If anything else depends on how much work this
> task gets done each time around, the oscillation can easily propagate
> throughout the system.  Thrash city.

Hyperthreading issues are quite similar to clock scaling issues.
Con's infrastructure changes to move things in that direction were
rejected, as were other infrastructure changes, which further infuriated
Con and led him to drop development on RSDL and derivatives.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  0:27                                         ` Peter Williams
@ 2007-04-18  2:03                                           ` William Lee Irwin III
  2007-04-18  2:31                                             ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-18  2:03 UTC (permalink / raw)
  To: Peter Williams; +Cc: Linux Kernel Mailing List

> Peter Williams wrote:
> >William Lee Irwin III wrote:
> >>I was tempted to restart from scratch given Ingo's comments, but I
> >>reconsidered and I'll be working with your code (and the German
> >>students' as well). If everything has to change, so be it, but it'll
> >>still be a derived work. It would be ignoring precedent and failure to
> >>properly attribute if I did otherwise.
> >
> >I can give you a patch (or set of patches) against the latest git 
> >vanilla kernel version if that would help.  There have been changes to 
> >the vanilla scheduler code since 2.6.20 so the latest patch on 
> >sourceforge won't apply cleanly.  I've found that implementing this as a 
> >series of patches rather than one big patch makes it easier for me to 
> >cope with changes to the underlying code.
> 
On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote:
> I've just placed a single patch for plugsched against 2.6.21-rc7 updated 
> to Linus's git tree as of an hour or two ago on sourceforge:
> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>
> This should at least enable you to get it to apply cleanly to the latest 
> kernel sources.  Let me know if you'd also like this as a quilt/mq 
> friendly patch series?

A quilt-friendly series would be most excellent if you could arrange it.

Thanks.


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  2:03                                           ` William Lee Irwin III
@ 2007-04-18  2:31                                             ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-18  2:31 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

William Lee Irwin III wrote:
>> Peter Williams wrote:
>>> William Lee Irwin III wrote:
>>>> I was tempted to restart from scratch given Ingo's comments, but I
>>>> reconsidered and I'll be working with your code (and the German
>>>> students' as well). If everything has to change, so be it, but it'll
>>>> still be a derived work. It would be ignoring precedent and failure to
>>>> properly attribute if I did otherwise.
>>> I can give you a patch (or set of patches) against the latest git 
>>> vanilla kernel version if that would help.  There have been changes to 
>>> the vanilla scheduler code since 2.6.20 so the latest patch on 
>>> sourceforge won't apply cleanly.  I've found that implementing this as a 
>>> series of patches rather than one big patch makes it easier for me to 
>>> cope with changes to the underlying code.
> On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote:
>> I've just placed a single patch for plugsched against 2.6.21-rc7 updated 
>> to Linus's git tree as of an hour or two ago on sourceforge:
>> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch>
>> This should at least enable you to get it to apply cleanly to the latest 
>> kernel sources.  Let me know if you'd also like this as a quilt/mq 
>> friendly patch series?
> 
> A quilt-friendly series would be most excellent if you could arrange it.

Done:

<http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch-series.tar.gz>

Just untar this in the base directory of your Linux kernel source and 
Bob's your uncle.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:00                   ` Michael K. Edwards
  2007-04-17 23:07                     ` William Lee Irwin III
@ 2007-04-18  2:39                     ` Peter Williams
  1 sibling, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-18  2:39 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list,
	Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

Michael K. Edwards wrote:
> On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> The other way in which the code deviates from the original is that (for
>> a few years now) I no longer calculate CPU bandwidth usage directly.
>> I've found that the overhead is less if I keep a running average of the
>> size of a task's CPU bursts and the length of its scheduling cycle (i.e.
>> from on CPU one time to on CPU next time) and use the ratio of these
>> values as a measure of bandwidth usage.
>>
>> Anyway it works and gives very predictable allocations of CPU bandwidth
>> based on nice.
> 
> Works, that is, right up until you add nonlinear interactions with CPU
> speed scaling.  From my perspective as an embedded platform
> integrator, clock/voltage scaling is the elephant in the scheduler's
> living room.  Patch in DPM (now OpPoint?) to scale the clock based on
> what task is being scheduled, and suddenly the dynamic priority
> calculations go wild.  Nip this in the bud by putting an RT priority
> on the relevant threads (which you have to do anyway if you need
> remotely audio-grade latency), and the lock affinity heuristics break,
> so you have to hand-tune all the thread priorities.  Blecch.
> 
> Not to mention the likelihood that the task whose clock speed you're
> trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority
> than the application.  (You want to crank the CPU for this task
> because it runs with the RF hot, which may cost you as much power as
> the rest of the platform.)  You'd better hope you can remove it from
> the dynamic priority heuristics with SCHED_BATCH.  Otherwise
> everything _else_ has to be RT priority (or it'll be starved by the
> soft MAC) and you've basically tossed SCHED_NORMAL in the bin.  Double
> blecch!
> 
> Is it too much to ask for someone with actual engineering training
> (not me, unfortunately) to sit down and build a negative-feedback
> control system that handles soft-real-time _and_ dynamic-priority
> _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock
> scaling?  And actually separates the accounting and control mechanisms
> from the heuristics, so the latter can be tuned (within a well
> documented stable range) to reflect the expected system usage
> patterns?
> 
> It's not like there isn't a vast literature in this area over the past
> decade, including some dealing specifically with clock scaling
> consistent with low-latency applications.  It's a pity that people
> doing academic work in this area rarely wade into LKML, even when
> they're hacking on a Linux fork.  But then, there's not much economic
> incentive for them to do so, and they can usually get their fill of
> citation politics and dominance games without leaving their home
> department.  :-P
> 
> Seriously, though.  If you're really going to put the mainline
> scheduler through this kind of churn, please please pretty please knit
> in per-task clock scaling (possibly even rejigged during the slice;
> see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of
> linger mechanism to keep from taking context switch hits when you're
> confident that an I/O will complete quickly.

I think that this doesn't affect the basic design principles of spa_ebs 
but just means that the statistics that it uses need to be rethought. 
E.g. instead of measuring average CPU usage per burst in terms of wall 
clock time spent on the CPU, measure it in terms of CPU capacity (for 
want of a better word) used per burst.

I don't have suitable hardware for investigating this line of attack 
further, unfortunately, and have no idea what would be the best way to 
calculate this new statistic.
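
To make the idea concrete, here is a minimal sketch of the statistic
described above: a running average of burst size and of scheduling cycle
length, with their ratio used as the bandwidth estimate. It is not the
spa_ebs code. The class name, the smoothing weight `alpha` and the
`rel_freq` parameter (relative clock speed, 1.0 = full speed) are all
assumptions made for illustration; scaling the burst by `rel_freq` is just
one guess at what "CPU capacity used per burst" could mean.

```python
from dataclasses import dataclass


@dataclass
class BurstStats:
    """Hypothetical per-task statistics; names are illustrative only."""
    alpha: float = 0.25      # smoothing weight for the running averages (assumed)
    avg_burst: float = 0.0   # running average of capacity-weighted burst size
    avg_cycle: float = 0.0   # running average of scheduling cycle length

    def record(self, burst_time: float, cycle_time: float,
               rel_freq: float = 1.0) -> None:
        # Weight wall-clock time on the CPU by relative clock speed, so a
        # burst run at half speed counts as half the capacity.
        work = burst_time * rel_freq
        self.avg_burst += self.alpha * (work - self.avg_burst)
        self.avg_cycle += self.alpha * (cycle_time - self.avg_cycle)

    def usage(self) -> float:
        # Ratio of the two averages as the bandwidth estimate.
        return self.avg_burst / self.avg_cycle if self.avg_cycle > 0 else 0.0


full = BurstStats()
half = BurstStats()
for _ in range(50):
    full.record(burst_time=2.0, cycle_time=10.0)                # full speed
    half.record(burst_time=2.0, cycle_time=10.0, rel_freq=0.5)  # half speed
print(full.usage(), half.usage())
```

With identical wall-clock behaviour, the task that ran at half clock speed
is charged roughly half the bandwidth, which is the kind of correction the
paragraph above is asking for.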

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 21:39                       ` Matt Mackall
  2007-04-17 23:23                         ` Peter Williams
@ 2007-04-18  3:15                         ` Nick Piggin
  2007-04-18  3:45                           ` Mike Galbraith
  2007-04-18  4:38                           ` Matt Mackall
  1 sibling, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  3:15 UTC (permalink / raw)
  To: Matt Mackall
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > > >> All things are not equal; they all have different properties. I like
> > > 
> > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > > Exactly. So we have to explore those properties and evaluate performance
> > > > (in all meanings of the word). That's only logical.
> > > 
> > > Any chance you'd be willing to put down a few thoughts on what sorts
> > > of standards you'd like to set for both correctness (i.e. the bare
> > > minimum a scheduler implementation must do to be considered valid
> > > beyond not oopsing) and performance metrics (i.e. things that produce
> > > numbers for each scheduler you can compare to say "this scheduler is
> > > better than this other scheduler at this.").
> > 
> > Yeah I guess that's the hard part :)
> > 
> > For correctness, I guess fairness is an easy one. I think that unfairness
> > is basically a bug and that it would be very unfortunate to merge something
> > unfair. But this is just within the context of a single runqueue... for
> > better or worse, we allow some unfairness in multiprocessors for performance
> > reasons of course.
> 
> I'm a big fan of fairness, but I think it's a bit early to declare it
> a mandatory feature. Bounded unfairness is probably something we can
> agree on, ie "if we decide to be unfair, no process suffers more than
> a factor of x".

I don't know why this would be a useful feature (of course I'm talking
about processes at the same nice level). One of the big problems with
the current scheduler is that it is unfair in some corner cases. It
works OK for most people, but when it breaks down it really hurts. At
least if you start with a fair scheduler, you can alter priorities
until it satisfies your need... with an unfair one your guess is as
good as mine.

So on what basis would you allow unfairness? On the basis that it doesn't
seem to harm anyone? It doesn't seem to harm testers?

I think we should aim for something better.


> > Latency. Given N tasks in the system, an arbitrary task should get
> > onto the CPU in a bounded amount of time (excluding events like freak
> > IRQ holdoffs and such, obviously -- ie. just considering the context
> > of the scheduler's state machine).
> 
> This is a slightly stronger statement than starvation-free (which is
> obviously mandatory). I think you're looking for something like
> "worst-case scheduling latency is proportional to the number of
> runnable tasks". Which I think is quite a reasonable requirement.

Yes, bounded and proportional to.
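
As a rough illustration of "bounded and proportional to", the sketch below
simulates plain round-robin with a fixed timeslice and measures the longest
time any runnable task waits for the CPU. It is a toy model, not any of the
schedulers under discussion; the function name and parameters are
hypothetical.

```python
from collections import deque


def max_wait_round_robin(n_tasks: int, timeslice: float, rounds: int = 3) -> float:
    """Longest wait for the CPU seen by any task under plain round-robin."""
    queue = deque(range(n_tasks))
    # All tasks become runnable at time 0.
    runnable_since = {task: 0.0 for task in queue}
    now = 0.0
    worst = 0.0
    for _ in range(rounds * n_tasks):
        task = queue.popleft()
        worst = max(worst, now - runnable_since[task])
        now += timeslice              # the task uses its full slice
        runnable_since[task] = now    # and goes to the back of the queue
        queue.append(task)
    return worst


for n in (1, 4, 8):
    print(n, max_wait_round_robin(n, timeslice=10.0))
```

In this model the worst-case wait is (N - 1) timeslices, so it grows
linearly with the number of runnable tasks and never exceeds that bound.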


> I'm pretty sure the stock scheduler falls short of both of these
> guarantees though.

And I think that's what its main problems are. Its interactivity
obviously can't be too bad for most people. Its performance seems to
be pretty good.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  3:15                         ` Nick Piggin
@ 2007-04-18  3:45                           ` Mike Galbraith
  2007-04-18  3:56                             ` Nick Piggin
  2007-04-18  4:38                           ` Matt Mackall
  1 sibling, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-18  3:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> > 
> > I'm a big fan of fairness, but I think it's a bit early to declare it
> > a mandatory feature. Bounded unfairness is probably something we can
> > agree on, ie "if we decide to be unfair, no process suffers more than
> > a factor of x".
> 
> I don't know why this would be a useful feature (of course I'm talking
> about processes at the same nice level). One of the big problems with
> the current scheduler is that it is unfair in some corner cases. It
> works OK for most people, but when it breaks down it really hurts. At
> least if you start with a fair scheduler, you can alter priorities
> until it satisfies your need... with an unfair one your guess is as
> good as mine.
> 
> So on what basis would you allow unfairness? On the basis that it doesn't
> seem to harm anyone? It doesn't seem to harm testers?

Well, there's short term fair and long term fair.  Seems to me a burst
load having to always merge with a steady stream load using a short term
fairness yardstick absolutely must 'starve' relative to the steady load,
so to be long term fair, you have to add some short term unfairness.
The mainline scheduler is more long term fair (discounting the rather
obnoxious corner cases).

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  3:45                           ` Mike Galbraith
@ 2007-04-18  3:56                             ` Nick Piggin
  2007-04-18  4:29                               ` Mike Galbraith
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  3:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote:
> On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> > > 
> > > I'm a big fan of fairness, but I think it's a bit early to declare it
> > > a mandatory feature. Bounded unfairness is probably something we can
> > > agree on, ie "if we decide to be unfair, no process suffers more than
> > > a factor of x".
> > 
> > I don't know why this would be a useful feature (of course I'm talking
> > about processes at the same nice level). One of the big problems with
> > the current scheduler is that it is unfair in some corner cases. It
> > works OK for most people, but when it breaks down it really hurts. At
> > least if you start with a fair scheduler, you can alter priorities
> > until it satisfies your need... with an unfair one your guess is as
> > good as mine.
> > 
> > So on what basis would you allow unfairness? On the basis that it doesn't
> > seem to harm anyone? It doesn't seem to harm testers?
> 
> Well, there's short term fair and long term fair.  Seems to me a burst
> load having to always merge with a steady stream load using a short term
> fairness yardstick absolutely must 'starve' relative to the steady load,
> so to be long term fair, you have to add some short term unfairness.
> The mainline scheduler is more long term fair (discounting the rather
> obnoxious corner cases).

Oh yes definitely I mean long term fair. I guess it is impossible to be
completely fair so long as we have to timeshare the CPU :)

So a constant delta is fine and unavoidable. But I don't think I agree
with a constant factor: that means you can pick a time where process 1
is allowed an arbitrary T more CPU time than process 2.
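
The difference between the two bounds can be put in numbers. The sketch
below assumes two processes sharing one CPU and compares the lead of
process 1 over process 2 under a constant delta with the lead under a
constant factor. The function names are hypothetical and the model is
deliberately simplistic.

```python
def lead_with_constant_delta(total_cpu: float, delta: float) -> float:
    """Lead of process 1 over process 2 when unfairness is a bounded offset."""
    return min(delta, total_cpu)


def lead_with_constant_factor(total_cpu: float, factor: float) -> float:
    """Lead of process 1 when it always gets `factor` times process 2's CPU."""
    p2 = total_cpu / (1.0 + factor)
    return factor * p2 - p2


def cpu_needed_for_lead(target_lead: float, factor: float) -> float:
    """Total CPU time after which the constant-factor lead reaches target_lead."""
    return target_lead * (1.0 + factor) / (factor - 1.0)


for total in (30.0, 300.0, 3000.0):
    print(total,
          lead_with_constant_delta(total, delta=5.0),
          lead_with_constant_factor(total, factor=2.0))
```

The delta column stays at 5 however long the processes run, while the
factor column keeps growing: for any T, `cpu_needed_for_lead(T, factor)`
gives a point in time at which process 1 is T ahead.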



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:57                                     ` Matt Mackall
@ 2007-04-18  4:29                                       ` William Lee Irwin III
  2007-04-18  4:42                                         ` Davide Libenzi
  2007-04-18  7:29                                       ` James Bruce
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-18  4:29 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
>>> Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
>>> That value has the property that a nice=10 task gets 1/10th the cpu of a
>>> nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
>>> would be fairly easy to explain to admins and users so that they can
>>> know what to expect from nicing tasks.

On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote:
>> I'm not likely to write the testcase until this upcoming weekend, though.

On Tue, Apr 17, 2007 at 05:57:23PM -0500, Matt Mackall wrote:
> So that means there's a 10000:1 ratio between nice 20 and nice -19. In
> that sort of dynamic range, you're likely to have non-trivial
> numerical accuracy issues in integer/fixed-point math.
> (Especially if your clock is jiffies-scale, which a significant number
> of machines will continue to be.)
> I really think if we want to have vastly different ratios, we probably
> want to be looking at BATCH and RT scheduling classes instead.

100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and
even 1000**(1/39.0) ~= 1.19378 still seems weak.
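
For anyone who wants to play with the numbers, a small sketch of the
arithmetic: given a total ratio between the two extremes of the nice range
(39 steps apart), compute the per-step multiplier, and conversely the total
ratio implied by a per-step multiplier. The helper names are made up for
this example.

```python
NICE_STEPS = 39  # steps between nice -20 and nice 19


def step_ratio(total_ratio: float, steps: int = NICE_STEPS) -> float:
    """Per-nice-level multiplier that yields total_ratio across all steps."""
    return total_ratio ** (1.0 / steps)


def total_ratio(per_step: float, steps: int = NICE_STEPS) -> float:
    """Ratio between the extremes implied by a per-nice-level multiplier."""
    return per_step ** steps


print(step_ratio(100.0), step_ratio(1000.0))
# The 10x-per-10-levels proposal, 10**(1/10) per step:
print(total_ratio(10.0 ** 0.1))
```

The last line shows the dynamic range of the 1.2589-per-step proposal over
the full nice range, which is a bit under the 10000:1 figure quoted above.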

I suspect that in order to get low nice numbers strong enough without
making high nice numbers too strong something sub-exponential may need
to be used. Maybe just picking percentages outright as opposed to some
particular function.

We may also be better off defining it in terms of a share weighting as
opposed to two tasks in competition. In such a manner the extension to
N tasks is more automatic. f(n) would be a univariate function of nice
numbers and two tasks in competition with nice numbers n_1 and n_2
would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In
the exponential case f(n) = K*e**(r*n) this ends up as
1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for
other choices it's not so. f(n) = n+K for K >= 20 results in a share
weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear
in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n
when n <= 0 is highly plausible. An exponent or an additive constant
may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21,
and the ratio of shares is 420, which is still arithmetically feasible.
-10 vs. 0 and 0 vs. 10 are both 11:1.
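
A minimal sketch of the share weighting described above, using exact
fractions so the ratios can be read off directly. `f` is the piecewise
guess from the paragraph; `shares` and its `weight` parameter are names
invented here, and nothing below is taken from an actual scheduler.

```python
from fractions import Fraction


def f(nice: int) -> Fraction:
    """Piecewise weight: 1/(n+1) for n >= 0, 1-n for n <= 0 (both give 1 at 0)."""
    return Fraction(1, nice + 1) if nice >= 0 else Fraction(1 - nice)


def shares(nices, weight=f):
    """Share of each task: its weight divided by the sum of all weights."""
    weights = [weight(n) for n in nices]
    total = sum(weights)
    return [w / total for w in weights]


print(f(-19) / f(20))               # ratio between the extremes
print(f(-10) / f(0), f(0) / f(10))  # ten nice levels either side of 0
print(shares([-19, 20]))            # two tasks in competition
print(shares([0, 0, 10]))           # extends to N tasks unchanged
```

With this f the extremes are 420:1 apart, and ten levels either side of
nice 0 each work out to 11:1, i.e. close to 10:1. Passing a different
`weight` function makes it easy to compare against the exponential or
linear choices.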


-- wli


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  3:56                             ` Nick Piggin
@ 2007-04-18  4:29                               ` Mike Galbraith
  0 siblings, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-18  4:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 2007-04-18 at 05:56 +0200, Nick Piggin wrote:
> On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote:
> > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote:
> > >
> > > 
> > > So on what basis would you allow unfairness? On the basis that it doesn't
> > > seem to harm anyone? It doesn't seem to harm testers?
> > 
> > Well, there's short term fair and long term fair.  Seems to me a burst
> > load having to always merge with a steady stream load using a short term
> > fairness yardstick absolutely must 'starve' relative to the steady load,
> > so to be long term fair, you have to add some short term unfairness.
> > The mainline scheduler is more long term fair (discounting the rather
> > obnoxious corner cases).
> 
> Oh yes definitely I mean long term fair. I guess it is impossible to be
> completely fair so long as we have to timeshare the CPU :)
> 
> So a constant delta is fine and unavoidable. But I don't think I agree
> with a constant factor: that means you can pick a time where process 1
> is allowed an arbitrary T more CPU time than process 2.

Definitely.  Using constants with no consideration of what else is
running is what causes the fairness mechanism in mainline to break down
under load.

(aside: What I was experimenting with before this new scheduler came
along was to turn the sleep_avg thing into an off-cpu period thing. Once
a time slice begins execution [runqueue wait doesn't count], that task
has the right to use its slice in one go, and _anything_ that knocked
it off the cpu added to its credit.  Knocking someone else off detracts
from credit, and to get to the point where you can knock others off
costs you stored credit proportional to the dynamic priority you attain
by using it.  All tasks that have credit stay active, no favorites.)

	-Mike



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  3:15                         ` Nick Piggin
  2007-04-18  3:45                           ` Mike Galbraith
@ 2007-04-18  4:38                           ` Matt Mackall
  2007-04-18  5:00                             ` Nick Piggin
  1 sibling, 1 reply; 712+ messages in thread
From: Matt Mackall @ 2007-04-18  4:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote:
> > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote:
> > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote:
> > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote:
> > > > >> All things are not equal; they all have different properties. I like
> > > > 
> > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote:
> > > > > Exactly. So we have to explore those properties and evaluate performance
> > > > > (in all meanings of the word). That's only logical.
> > > > 
> > > > Any chance you'd be willing to put down a few thoughts on what sorts
> > > > of standards you'd like to set for both correctness (i.e. the bare
> > > > minimum a scheduler implementation must do to be considered valid
> > > > beyond not oopsing) and performance metrics (i.e. things that produce
> > > > numbers for each scheduler you can compare to say "this scheduler is
> > > > better than this other scheduler at this.").
> > > 
> > > Yeah I guess that's the hard part :)
> > > 
> > > For correctness, I guess fairness is an easy one. I think that unfairness
> > > is basically a bug and that it would be very unfortunate to merge something
> > > unfair. But this is just within the context of a single runqueue... for
> > > better or worse, we allow some unfairness in multiprocessors for performance
> > > reasons of course.
> > 
> > I'm a big fan of fairness, but I think it's a bit early to declare it
> > a mandatory feature. Bounded unfairness is probably something we can
> > agree on, ie "if we decide to be unfair, no process suffers more than
> > a factor of x".
> 
> I don't know why this would be a useful feature (of course I'm talking
> about processes at the same nice level). One of the big problems with
> the current scheduler is that it is unfair in some corner cases. It
> works OK for most people, but when it breaks down it really hurts. At
> least if you start with a fair scheduler, you can alter priorities
> until it satisfies your need... with an unfair one your guess is as
> good as mine.
> 
> So on what basis would you allow unfairness? On the basis that it doesn't
> seem to harm anyone? It doesn't seem to harm testers?

On the basis that there's only anecdotal evidence thus far that
fairness is the right approach.

It's not yet clear that a fair scheduler can do the right thing with X,
with various kernel threads, etc. without fiddling with nice levels.
Which makes it no longer "completely fair".

It's also not yet clear that a scheduler can't be taught to do the
right thing with X without fiddling with nice levels.

So I'm just not yet willing to completely rule out systems that aren't
"completely fair".

But I think we should rule out schedulers that don't have rigid bounds on
that unfairness. That's where the really ugly behavior lies.

-- 
Mathematics is the supreme nostalgia of our time.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  4:29                                       ` William Lee Irwin III
@ 2007-04-18  4:42                                         ` Davide Libenzi
  0 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18  4:42 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Matt Mackall, Ingo Molnar, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Tue, 17 Apr 2007, William Lee Irwin III wrote:

> 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and
> even 1000**(1/39.0) ~= 1.19378 still seems weak.
> 
> I suspect that in order to get low nice numbers strong enough without
> making high nice numbers too strong something sub-exponential may need
> to be used. Maybe just picking percentages outright as opposed to some
> particular function.
> 
> We may also be better off defining it in terms of a share weighting as
> opposed to two tasks in competition. In such a manner the extension to
> N tasks is more automatic. f(n) would be a univariate function of nice
> numbers and two tasks in competition with nice numbers n_1 and n_2
> would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In
> the exponential case f(n) = K*e**(r*n) this ends up as
> 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for
> other choices it's not so. f(n) = n+K for K >= 20 results in a share
> weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear
> in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n
> when n <= 0 is highly plausible. An exponent or an additive constant
> may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21,
> and the ratio of shares is 420, which is still arithmetically feasible.
> -10 vs. 0 and 0 vs. 10 are both 11:1.

This makes more sense, and the ratio at the extremes is something 
reasonable.



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 13:16                       ` Peter Williams
@ 2007-04-18  4:46                         ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  4:46 UTC (permalink / raw)
  To: Peter Williams
  Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Tue, Apr 17, 2007 at 11:16:54PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >I don't like the timeslice based nice in mainline. It's too nasty
> >with latencies. nicksched is far better in that regard IMO.
> >
> >But I don't know how you can assert a particular way is the best way
> >to do something.
> 
> I should have added "I may be wrong but I think that ...".
> 
> My opinion is based on a lot of experience with different types of 
> scheduler design and the observation from gathering scheduling 
> statistics while playing with these schedulers that the size of the time 
> slices we're talking about is much larger than the CPU chunks most tasks 
> use in any one go so time slice size has no real effect on most tasks 
> and the faster CPUs become the more this becomes true.

For desktop loads, maybe. But for things that are compute bound, the
cost of context switching I believe still gets worse as CPUs continue
to be able to execute more instructions per cycle, get clocked faster,
and get larger caches.


> >>In that case I'd go O(1) provided that the k factor for the O(1) wasn't 
> >>greater than O(logN)'s k factor multiplied by logMaxN.
> >
> >Yes, or even significantly greater around typical large sizes of N.
> 
> Yes.  In fact it's probably better to use the maximum number of threads 
> allowed on the system for N.  We know that value don't we?

Well we might be able to work it out by looking at the tunables or
amount of kernel memory available, but I guess it is hard to just
pick a number.
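
The rule of thumb being discussed can be written down as a one-line
comparison. This is only a sketch: the function name is invented, the
per-operation costs are whatever a benchmark would measure, and the base-2
logarithm assumes something like a balanced binary tree for the O(logN)
case.

```python
import math


def prefer_constant_time(k_const: float, k_log: float, max_n: int) -> bool:
    """True if the O(1) design costs no more than O(log N) at N = max_n."""
    return k_const <= k_log * math.log2(max_n)


# Hypothetical per-operation costs, with max_n standing in for the
# maximum number of threads allowed on the system.
print(prefer_constant_time(k_const=10.0, k_log=1.0, max_n=32768))
print(prefer_constant_time(k_const=20.0, k_log=1.0, max_n=32768))
```

Since log2(32768) is 15, an O(1) design up to 15 times more expensive per
operation would still win at that size under this criterion; at more
typical values of N the margin is smaller.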

I'll try running a few more benchmarks.


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  4:38                           ` Matt Mackall
@ 2007-04-18  5:00                             ` Nick Piggin
  2007-04-18  5:55                               ` Matt Mackall
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  5:00 UTC (permalink / raw)
  To: Matt Mackall
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 11:38:31PM -0500, Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote:
> > 
> > I don't know why this would be a useful feature (of course I'm talking
> > about processes at the same nice level). One of the big problems with
> > the current scheduler is that it is unfair in some corner cases. It
> > works OK for most people, but when it breaks down it really hurts. At
> > least if you start with a fair scheduler, you can alter priorities
> > until it satisfies your need... with an unfair one your guess is as
> > good as mine.
> > 
> > So on what basis would you allow unfairness? On the basis that it doesn't
> > seem to harm anyone? It doesn't seem to harm testers?
> 
> On the basis that there's only anecdotal evidence thus far that
> fairness is the right approach.
> 
> It's not yet clear that a fair scheduler can do the right thing with X,
> with various kernel threads, etc. without fiddling with nice levels.
> Which makes it no longer "completely fair".

Of course I mean SCHED_OTHER tasks at the same nice level. Otherwise
I would be arguing to make nice basically a noop.


> It's also not yet clear that a scheduler can't be taught to do the
> right thing with X without fiddling with nice levels.

Being fair doesn't prevent that. Implicit unfairness is wrong though,
because it will bite people.

What's wrong with allowing X to get more than its fair share of CPU
time by "fiddling with nice levels"? That's what they're there for.


> So I'm just not yet willing to completely rule out systems that aren't
> "completely fair".
> 
> But I think we should rule out schedulers that don't have rigid bounds on
> that unfairness. That's where the really ugly behavior lies.

Been a while since I really looked at the mainline scheduler, but I
don't think it can permanently starve something, so I don't know what
your bounded unfairness would help with.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 23:50                   ` Peter Williams
@ 2007-04-18  5:43                     ` Chris Friesen
  2007-04-18 13:00                       ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Chris Friesen @ 2007-04-18  5:43 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:
> Chris Friesen wrote:

>> Suppose I have a really high priority task running.  Another very high 
>> priority task wakes up and would normally preempt the first one. 
>> However, there happens to be another cpu available.  It seems like it 
>> would be a win if we moved one of those tasks to the available cpu 
>> immediately so they can both run simultaneously.  This would seem to 
>> require some communication between the scheduler and the load balancer.
> 
> 
> Not really; the load balancer can do this on its own AND the decision 
> should be based on the STATIC priority of the task being woken.

I guess I don't follow.  How would the load balancer know that it needs 
to run?  Running on every task wake-up seems expensive.  Also, static 
priority isn't everything.  What about the gang-scheduler concept where 
certain tasks must be scheduled simultaneously on different cpus?  What 
about a resource-group scenario where you have per-cpu resource limits, 
so that for good latency/fairness you need to force a high priority task 
to migrate to another cpu once it has consumed the cpu allocation of 
that group on the current cpu?

I can see having a generic load balancer core code, but it seems to me 
that the scheduler proper needs to have some way of triggering the load 
balancer to run, and some kind of goodness functions to indicate a) 
which tasks to move, and b) where to move them.

Chris


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  5:00                             ` Nick Piggin
@ 2007-04-18  5:55                               ` Matt Mackall
  2007-04-18  6:37                                 ` Nick Piggin
                                                   ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-18  5:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote:
> > It's also not yet clear that a scheduler can't be taught to do the
> > right thing with X without fiddling with nice levels.
> 
> Being fair doesn't prevent that. Implicit unfairness is wrong though,
> because it will bite people.
> 
> What's wrong with allowing X to get more than its fair share of CPU
> time by "fiddling with nice levels"? That's what they're there for.

Why is X special? Because it does work on behalf of other processes?
Lots of things do this. Perhaps a scheduler should focus entirely on
the implicit and directed wakeup matrix and optimizing that
instead[1].

Why are processes special? Should user A be able to get more CPU time
for his job than user B by splitting it into N parallel jobs? Should
we be fair per process, per user, per thread group, per session, per
controlling terminal? Some weighted combination of the preceding?[2]

Why is the measure CPU time? I can imagine a scheduler that weighed
memory bandwidth in the equation. Or power consumption. Or execution
unit usage.

Fairness is nice. It's simple, it's obvious, it's predictable. But
it's just not clear that it's optimal. If the question is (and it
was!) "what should the basic requirements for the scheduler be?" it's
not clear that fairness is a requirement or even how to pick a metric
for fairness that's obviously and uniquely superior.

It's instead much easier to try to recognize and rule out really bad
behaviour with bounded latencies, minimum service guarantees, etc.

[1] That's basically how Google decides to prioritize webpages, which
it seems to do moderately well. And how a number of other optimization
problems are solved.

[2] It's trivial to construct two or more perfectly reasonable and
desirable definitions of fairness that are mutually incompatible.
-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  5:55                               ` Matt Mackall
@ 2007-04-18  6:37                                 ` Nick Piggin
  2007-04-18  6:55                                   ` Matt Mackall
  2007-04-18 13:08                                 ` William Lee Irwin III
  2007-04-18 14:48                                 ` Linus Torvalds
  2 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  6:37 UTC (permalink / raw)
  To: Matt Mackall
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote:
> > > It's also not yet clear that a scheduler can't be taught to do the
> > > right thing with X without fiddling with nice levels.
> > 
> > Being fair doesn't prevent that. Implicit unfairness is wrong though,
> > because it will bite people.
> > 
> > What's wrong with allowing X to get more than its fair share of CPU
> > time by "fiddling with nice levels"? That's what they're there for.
> 
> Why is X special? Because it does work on behalf of other processes?

The high level reason is that giving it more than its fair share of
CPU allows a desktop to remain interactive under load. And it isn't
just about doing work on behalf of other processes. Mouse interrupts
are a big part of it, for example.

> Lots of things do this. Perhaps a scheduler should focus entirely on
> the implicit and directed wakeup matrix and optimizing that
> instead[1].

You could do that, and I tried a variant of it at one point. The
problem was that it leads to unexpected bad things too.

UNIX programs more or less expect fair SCHED_OTHER scheduling, and
given the principle of least surprise...


> Why are processes special? Should user A be able to get more CPU time
> for his job than user B by splitting it into N parallel jobs? Should
> we be fair per process, per user, per thread group, per session, per
> controlling terminal? Some weighted combination of the preceding?[2]

I don't know how that supports your argument for unfairness, but
processes are special only because that's how we've always done
scheduling.  I'm not precluding other groupings for fairness, though.


> Why is the measure CPU time? I can imagine a scheduler that weighed
> memory bandwidth in the equation. Or power consumption. Or execution
> unit usage.

Feel free. And I'd also argue that once you schedule for those metrics
then fairness is also important there too.


> Fairness is nice. It's simple, it's obvious, it's predictable. But
> it's just not clear that it's optimal. If the question is (and it
> was!) "what should the basic requirements for the scheduler be?" it's
> not clear that fairness is a requirement or even how to pick a metric
> for fairness that's obviously and uniquely superior.

What do you mean optimal? If your criterion is fairness, then of course
it is optimal. If your criterion is throughput, then it probably isn't.

Considering it is simple and what we've always done, measuring fairness
by CPU time per process is obvious for a general purpose scheduler.
If you accept that, then I argue that fairness is an optimal property
given that the alternative is unfairness.


> It's instead much easier to try to recognize and rule out really bad
> behaviour with bounded latencies, minimum service guarantees, etc.

It's the bad behaviour that you didn't recognize that is the problem.
If you start with explicit fairness, then unfairness will never be
one of those problems.


> [1] That's basically how Google decides to prioritize webpages, which
> it seems to do moderately well. And how a number of other optimization
> problems are solved.

This is not an optimization problem, it is a heuristic. There is no
right and wrong answer.


> [2] It's trivial to construct two or more perfectly reasonable and
> desirable definitions of fairness that are mutually incompatible.

Probably not if you use common sense, and in the context of a replacement
for the 2.6 scheduler.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  6:37                                 ` Nick Piggin
@ 2007-04-18  6:55                                   ` Matt Mackall
  2007-04-18  7:24                                     ` Nick Piggin
  2007-04-21 13:33                                     ` Bill Davidsen
  0 siblings, 2 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-18  6:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote:
> I don't know how that supports your argument for unfairness,

I never had such an argument. I like fairness.

My argument is that -you- don't have an argument for making fairness a
-requirement-.

> processes are special only because that's how we've always done
> scheduling.  I'm not precluding other groupings for fairness, though.

If you make one form of fairness a -requirement- for all acceptable
algorithms, you -are- precluding most other forms of fairness.

If you refuse to define what "fairness" means when specifying your
requirement, what's the point of requiring it?

> What do you mean optimal? If your criterion is fairness, then of course
> it is optimal. If your criterion is throughput, then it probably isn't.

I don't know what optimal behavior is. And neither do you. It may or
may not be fair. It very likely includes small deviations from fair.

> > [2] It's trivial to construct two or more perfectly reasonable and
> > desirable definitions of fairness that are mutually incompatible.
> 
> Probably not if you use common sense, and in the context of a replacement
> for the 2.6 scheduler.

Ok, trivial example. You cannot allocate equal CPU time to
processes/tasks and simultaneously allocate equal time to thread
groups. Is it common sense that a heavily-threaded app should be able
to get hugely more CPU than a well-written app? No. I don't want Joe's
stupid Java app to make my compile crawl.

On the other hand, if my heavily threaded app is, say, a voicemail
server serving 30 customers, I probably want it to get 30x the CPU of
my gzip job.
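
A minimal sketch of why the two definitions cannot both hold, using the
example above (a 30-thread app next to a single-threaded gzip). The
function names and the flat "one label per runnable thread" model are
assumptions made for illustration only.

```python
from collections import Counter

def per_task_shares(tasks):
    """Equal CPU per runnable thread; a group's share grows with its
    thread count. `tasks` holds one group label per runnable thread."""
    counts = Counter(tasks)
    return {group: n / len(tasks) for group, n in counts.items()}

def per_group_shares(tasks):
    """Equal CPU per thread group, regardless of how many threads it has."""
    groups = set(tasks)
    return {group: 1 / len(groups) for group in groups}

runnable = ["voicemail"] * 30 + ["gzip"]
print(per_task_shares(runnable))   # voicemail gets 30/31, gzip 1/31
print(per_group_shares(runnable))  # each group gets 1/2
```

Under per-task fairness the threaded app gets 30x gzip's CPU; under
per-group fairness it gets exactly the same amount. Which of the two is
"right" depends on whether the app is the voicemail server or Joe's Java
app.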

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  6:55                                   ` Matt Mackall
@ 2007-04-18  7:24                                     ` Nick Piggin
  2007-04-21 13:33                                     ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  7:24 UTC (permalink / raw)
  To: Matt Mackall
  Cc: William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 01:55:34AM -0500, Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote:
> > I don't know how that supports your argument for unfairness,
> 
> I never had such an argument. I like fairness.
> 
> My argument is that -you- don't have an argument for making fairness a
> -requirement-.

It seems easy enough that there is no point accepting unfair
behaviour like the old scheduler if we're going to go to all
this trouble to replace it. The old scheduler seems to have
bounded unfairness and bounded starvation, so let the good times
roll.


> > processes are special only because that's how we've always done
> > scheduling.  I'm not precluding other groupings for fairness, though.
> 
> If you make one form of fairness a -requirement- for all acceptable
> algorithms, you -are- precluding most other forms of fairness.
> 
> If you refuse to define what "fairness" means when specifying your
> requirement, what's the point of requiring it?

I don't refuse. I'm talking about per-process CPU time fairness.
My paragraph above was pointing out that subsequent work to
add other classes of fairness as configurable features is not
excluded, but this basic type of fairness should be included.


> > What do you mean optimal? If your criterion is fairness, then of course
> > it is optimal. If your criterion is throughput, then it probably isn't.
> 
> I don't know what optimal behavior is. And neither do you. It may or
> may not be fair. It very likely includes small deviations from fair.

You misunderstand me. There is no single "optimal" when you're talking
about fairness (or most other scheduler things). So pondering whether
fairness is optimal or not doesn't really make sense.

I'm saying it should be a basic axiom, not that it is quantitatively
better. It isn't a refutable argument. I state it because it is
what users and programs expect.

You can reject that, and fine. I guess if a scheduler comes along that
does exactly the right thing for everyone, then it is better than any
fair scheduler. So OK, while we're talking theoretical, I won't dismiss
an unfair scheduler out of hand.


> > > [2] It's trivial to construct two or more perfectly reasonable and
> > > desirable definitions of fairness that are mutually incompatible.
> > 
> > Probably not if you use common sense, and in the context of a replacement
> > for the 2.6 scheduler.
> 
> Ok, trivial example. You cannot allocate equal CPU time to
> processes/tasks and simultaneously allocate equal time to thread
> groups. Is it common sense that a heavily-threaded app should be able
> to get hugely more CPU than a well-written app? No. I don't want Joe's
> stupid Java app to make my compile crawl.
> 
> On the other hand, if my heavily threaded app is, say, a voicemail
> server serving 30 customers, I probably want it to get 30x the CPU of
> my gzip job.

So that might be a nice addition, but the base functionality is threads
simply because that's what we've always done. Just common sense.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 22:57                                     ` Matt Mackall
  2007-04-18  4:29                                       ` William Lee Irwin III
@ 2007-04-18  7:29                                       ` James Bruce
  1 sibling, 0 replies; 712+ messages in thread
From: James Bruce @ 2007-04-18  7:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: ck

Matt Mackall wrote:
> On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote:
>> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote:
>> I'm working with the following suggestion:
>>
>> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote:
>>> Nonlinear is a must IMO.  I would suggest X = exp(ln(10)/10) ~= 1.2589
>>> That value has the property that a nice=10 task gets 1/10th the cpu of a
>>> nice=0 task, and a nice=20 task gets 1/100 of nice=0.  I think that
>>> would be fairly easy to explain to admins and users so that they can
>>> know what to expect from nicing tasks.
>> I'm not likely to write the testcase until this upcoming weekend, though.
> 
> So that means there's a 10000:1 ratio between nice 20 and nice -19. In
> that sort of dynamic range, you're likely to have non-trivial
> numerical accuracy issues in integer/fixed-point math.

Well, you *are* specifying vastly different priorities.  The question is 
how many nice=20 tasks should it take to interfere with a nice=-19 task? 
  If you've only got a 100:1 ratio, 100 nice=20 tasks will take ~50% of 
the CPU away from a nice=-19 task.  I don't think that's ideal, as in my 
mind a -19 task shouldn't have to care how many nice=20 tasks there are 
(within reason).  IMHO, if a user is running a CPU hog at nice=-19, and 
expecting nice=20 tasks to run immediately, I don't think the scheduler 
is the problem.
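
The arithmetic behind that claim can be sketched as follows, assuming CPU
is divided strictly in proportion to per-task weights (a simplification;
the helper name is made up for this illustration):

```python
def high_prio_share(ratio, n_low):
    """CPU share of one high-priority task whose weight is `ratio` times
    that of each of n_low competing low-priority tasks."""
    return ratio / (ratio + n_low)

# 100:1 ratio, 100 low-priority hogs: the nice=-19 task keeps only half.
print(high_prio_share(100, 100))
# 10000:1 ratio, same load: the nice=-19 task keeps about 99%.
print(high_prio_share(10000, 100))
```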

> (Especially if your clock is jiffies-scale, which a significant number
> of machines will continue to be.)
> 
> I really think if we want to have vastly different ratios, we probably
> want to be looking at BATCH and RT scheduling classes instead.

I, like all users, can live with anything, but there should be a clear 
specification of what the user should expect.  Magic changes in the 
function at nice=0, or no real clear meaning at all (mainline), are both 
things that don't help the users to figure that out.  I like the 
exponential base because shifting all tasks up or down one nice level 
does not change the relative cpu distribution (i.e. two tasks 
{nice=-5,nice=0} get the same relative cpu distribution as if they were 
{nice=0,nice=5}).  An exponential base is the only way that property can 
hold.
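
A small sketch of that shift-invariance property, using the suggested base
X = 10^(1/10). The weight() and shares() helpers are illustrative only and
assume CPU is split in proportion to weight; they are not taken from any
scheduler patch.

```python
BASE = 10 ** (1 / 10)   # ~1.2589: ten nice levels equal a factor of 10

def weight(nice):
    """Relative weight of a task at the given nice level (nice 0 -> 1.0)."""
    return BASE ** (-nice)

def shares(nices):
    """CPU shares for a set of tasks, proportional to their weights."""
    weights = [weight(n) for n in nices]
    total = sum(weights)
    return [w / total for w in weights]

print(weight(0) / weight(10))   # roughly 10: nice=10 gets 1/10th of nice=0
print(shares([-5, 0]))          # same split as the next line
print(shares([0, 5]))
```

Shifting every task by k levels multiplies every weight by the same factor
BASE^(-k), which cancels out when the shares are normalized.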

Now, perhaps implementation issues may prevent something like the 
"1.2589" ratio rule from being realized, but I'm not sure we should 
throw it out _before_ we know that it's actually a problem.  This is the 
same sort of resistance that the timekeeping code updates faced (using 
nanoseconds everywhere instead of "natural" clock bases), but that got 
addressed eventually.

  - Jim Bruce


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-15 16:55           ` Christoph Pfister
  2007-04-15 22:14             ` S.Çağlar Onur
@ 2007-04-18  8:27             ` Ingo Molnar
  2007-04-18  8:57               ` Ingo Molnar
  2007-04-18  8:57               ` Christoph Pfister
       [not found]             ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
  2 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  8:27 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


[ i've Cc:-ed Ulrich Drepper, this CFS-triggered hang seems to have some 
  futex and pthread_cond_wait() relevance. ]

* Christoph Pfister <christophpfister@gmail.com> wrote:

> >> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
> 
> Could you try xine-ui or gxine? Because I suspect rather xine-lib for 
> freezing issues. In any way I think a gdb backtrace would be much 
> nicer - but if you can't reproduce the freeze issue with other xine 
> based players and want to run kaffeine in gdb, you need to execute 
> "gdb --args kaffeine --nofork".

update: i've reproduced one kind of a hang but i'm not sure it's the 
same hang Ismail is seeing. It was quite hard to trigger it under CFS, i 
had to do wild forward/backward button seeks on a real DVD and i mixed 
it with CPU-intense workloads on the same box. Here are the straces and 
gdb backtraces:

kaffeine thread PID 9303, waiting for other threads to do something, 
stuck in pthread_mutex_lock():

  futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...>

backtrace:

 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
 #2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
 #3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
 #4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
 #5  0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so
 #6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
 #7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #8  0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #9  0x4b55526e in QApplication::notify ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
 #11 0x4b4dd5de in QETWidget::translateWheelEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #12 0x4b4eb41d in QETWidget::translateMouseEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #13 0x4b4e9766 in QApplication::x11ProcessEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #14 0x4b4fb38b in QEventLoop::processEvents ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #15 0x4b56ce30 in QEventLoop::enterLoop ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #18 0x0806fc1a in QWidget::setUpdatesEnabled ()
 #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
 #20 0x0806f7e1 in QWidget::setUpdatesEnabled ()

Kaffeine thread 9324, seems to be in an infinite pthread_cond_wait() 
loop that does:

 futex(0xb0740b78, FUTEX_WAIT, 3559, NULL) = 0
 futex(0xb0740b5c, FUTEX_WAKE, 1)        = 0
 munmap(0xaacb1000, 1662976)             = 0
 mmap2(NULL, 1662976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xaacb1000
 gettimeofday({1176891363, 347259}, NULL) = 0
 munmap(0xab309000, 1662976)             = 0

backtrace:

 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
 #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #5  0x4a05820e in clone () from /lib/libc.so.6

Kaffeine thread 9325 does a loop of short pthread_cond_wait() futex 
sleeps:

 1176891721.419314 futex(0xb07527e8, FUTEX_WAIT, 8537, NULL) = 0 <0.011710>
 1176891721.431068 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000006>
 1176891721.431429 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000008>
 1176891721.431458 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000012>
 1176891721.431489 futex(0xb07527e8, FUTEX_WAIT, 8539, NULL) = 0 <0.007339>
 1176891721.439008 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000052>
 1176891721.439510 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000055>
 1176891721.439636 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000089>
 1176891721.439789 futex(0xb07527e8, FUTEX_WAIT, 8541, NULL) = 0 <0.007045>
 1176891721.447017 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000054>
 1176891721.447682 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000065>

backtrace:

 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
 #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #3  0xb7a04079 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #5  0x4a05820e in clone () from /lib/libc.so.6

library versions:

 xine-lib-1.1.5-1.fc7
 xine-plugin-1.0-3.fc7
 glibc-headers-2.5.90-21
 glibc-common-2.5.90-21
 glibc-2.5.90-21
 glibc-devel-2.5.90-21
 gxine-0.5.11-3.fc7
 kaffeine-0.8.3-4.fc7
 xine-0.99.4-11.lvn7
 xine-lib-extras-1.1.5-1.fc7
 gxine-mozplugin-0.5.11-3.fc7

what's weird is that all threads are in a pthread op and seem to be kind 
of busy-looping. Maybe xine-lib has some buggy use of pthread condvars 
that CFS happens to trigger? (If CFS broke futexes in general i think 
we'd be seeing far more widespread breakage.)

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  9:59     ` Ingo Molnar
  2007-04-17 11:11       ` Nick Piggin
@ 2007-04-18  8:55       ` Nick Piggin
  2007-04-18  9:33         ` Con Kolivas
  2007-04-18  9:53         ` Ingo Molnar
  1 sibling, 2 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-18  8:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > 2.6.21-rc7-cfs-v2
> > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > 535.43user 30.62system 2:23.72elapsed 393%CPU
> 
> Thanks for testing this! Could you please try this also with:
> 
>    echo 100000000 > /proc/sys/kernel/sched_granularity

507.68user 31.87system 2:18.05elapsed 390%CPU
507.99user 31.93system 2:18.09elapsed 390%CPU
507.46user 31.78system 2:18.03elapsed 390%CPU
507.68user 31.93system 2:18.11elapsed 390%CPU
507.63user 31.98system 2:18.01elapsed 390%CPU
507.83user 31.94system 2:18.28elapsed 390%CPU

> could you maybe even try a more extreme setting of:
> 
>    echo 500000000 > /proc/sys/kernel/sched_granularity

504.87user 32.13system 2:18.03elapsed 389%CPU
505.94user 32.29system 2:17.87elapsed 390%CPU
506.10user 31.90system 2:17.96elapsed 389%CPU
505.02user 32.02system 2:17.96elapsed 389%CPU
506.69user 31.96system 2:17.82elapsed 390%CPU
505.70user 31.84system 2:17.90elapsed 389%CPU


Again, for comparison 2.6.21-rc7 mainline:

508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

So looking at elapsed time, a granularity of 100ms is just behind the
mainline score. However it is using slightly less user time and
slightly more idle time, which indicates that balancing might have got
a bit less aggressive.

But anyway, it conclusively shows the efficiency impact of such tiny
timeslices.
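
For anyone who wants to redo the comparison, here is a small sketch that
averages the elapsed times posted in this thread and expresses each
configuration relative to mainline. The labels are my own shorthand, and
the granularity values are assumed to be in nanoseconds (100000000 = 100ms).

```python
def to_seconds(elapsed):
    """Convert a time(1)-style 'M:SS.ss' elapsed string to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)

# Elapsed times copied from the runs quoted above.
RUNS = {
    "cfs-v2 default": ["2:23.64", "2:23.70", "2:23.76", "2:23.76", "2:23.67", "2:23.72"],
    "cfs-v2 100ms":   ["2:18.05", "2:18.09", "2:18.03", "2:18.11", "2:18.01", "2:18.28"],
    "cfs-v2 500ms":   ["2:18.03", "2:17.87", "2:17.96", "2:17.96", "2:17.82", "2:17.90"],
    "mainline":       ["2:17.82", "2:17.84", "2:17.83", "2:17.88", "2:17.90", "2:17.95"],
}

def mean_elapsed(label):
    """Mean elapsed wall-clock time in seconds for one configuration."""
    times = [to_seconds(e) for e in RUNS[label]]
    return sum(times) / len(times)

baseline = mean_elapsed("mainline")
for label in RUNS:
    mean = mean_elapsed(label)
    delta = 100 * (mean / baseline - 1)
    print(f"{label:16s} {mean:7.2f}s  {delta:+.2f}% vs mainline")
```

By this measure the default granularity costs roughly 4% in elapsed time,
while the 100ms and 500ms settings land within a fraction of a percent of
mainline.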

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18  8:27             ` Ingo Molnar
@ 2007-04-18  8:57               ` Ingo Molnar
  2007-04-18  9:06                 ` Ingo Molnar
  2007-04-18  8:57               ` Christoph Pfister
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  8:57 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Ingo Molnar <mingo@elte.hu> wrote:

> update: i've reproduced one kind of a hang but i'm not sure it's the 
> same hang Ismail is seeing. It was quite hard to trigger it under CFS, 
> i had to do wild forward/backward button seeks on a real DVD and i 
> mixed it with CPU-intense workloads on the same box. Here are the 
> straces and gdb backtraces:

these were only the threads that showed up in htop. Here's a full 
analysis about what all threads are doing:

 Process 9303: stuck in xine_play()/pthread_mutex_lock()
 Process 9319:  stuck in pthread_cond_timedwait()
 Process 9320:  stuck in pthread_cond_timedwait()
 Process 9321: loop of ~3 msec nanosleeps
 Process 9322: loop of poll() calls every 335 msecs
 Process 9323:  stuck in pthread_cond_timedwait()
 Process 9324: stuck in a loop of 1-second futex-waits + mmap/munmap (malloc)
 Process 9325:  stuck in pthread_cond_timedwait()
 Process 9326:  stuck in pthread_cond_timedwait()
 Process 9327:  stuck in pthread_cond_timedwait()

now here's a weird thing: occasionally, when i strace one of the 
threads, i can get a single frame refreshed in the Kaffeine window - but 
the general picture does not change, the same 'stuck' state is still 
there.

most threads are sitting in:

 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a25134c in pthread_cond_timedwait@@GLIBC_2.3.2 ()   from /lib/libpthread.so.0
 #2  0xb79f9a05 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #3  0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #4  0x4a05820e in clone () from /lib/libc.so.6

9324 is looping around this place, apparently in the opengl video output 
driver, but the backtrace is not always this one:

 (gdb) bt
 #0  0x49ff7257 in memset () from /lib/libc.so.6
 #1  0x49ff1877 in calloc () from /lib/libc.so.6
 #2  0xb7a224d6 in xine_xmalloc_aligned () from /usr/lib/libxine.so.1
 #3  0xb708c8f6 in QWidget::setUpdatesEnabled ()
    from /usr/lib/xine/plugins/1.1.5/xineplug_vo_out_opengl.so
 #4  0xb7a0525a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #5  0xb78944e4 in QWidget::setUpdatesEnabled ()
    from /usr/lib/xine/plugins/1.1.5/post/xineplug_post_tvtime.so
 #6  0xb7895234 in QWidget::setUpdatesEnabled ()
    from /usr/lib/xine/plugins/1.1.5/post/xineplug_post_tvtime.so
 #7  0xad4e5439 in QWidget::setUpdatesEnabled ()
    from /usr/lib/xine/plugins/1.1.5/xineplug_decode_mpeg2.so
 #8  0xad4fa8e1 in QWidget::setUpdatesEnabled ()
    from /usr/lib/xine/plugins/1.1.5/xineplug_decode_mpeg2.so
 #9  0xb7a032d6 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #10 0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #11 0x4a05820e in clone () from /lib/libc.so.6

9321 is sitting in:

(gdb) bt
 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a2544a6 in nanosleep () from /lib/libpthread.so.0
 #2  0xb7a222fa in xine_usec_sleep () from /usr/lib/libxine.so.1
 #3  0xb7a073bb in QWidget::setUpdatesEnabled () from  /usr/lib/libxine.so.1
 #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #5  0x4a05820e in clone () from /lib/libc.so.6

9322 is in poll():

(gdb) bt
 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a04e533 in poll () from /lib/libc.so.6
 #2  0xb12e1f75 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/xineplug_ao_out_alsa.so
 #3  0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #4  0x4a05820e in clone () from /lib/libc.so.6

9303 is stuck in xine_play(), pthread_mutex_lock():

 #0  0xffffe410 in __kernel_vsyscall ()
 #1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
 #2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
 #3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
 #4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
 #5  0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so
 #6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
 #7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #8  0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #9  0x4b55526e in QApplication::notify ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
 #11 0x4b4dd5de in QETWidget::translateWheelEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #12 0x4b4eb41d in QETWidget::translateMouseEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #13 0x4b4e9766 in QApplication::x11ProcessEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #14 0x4b4fb38b in QEventLoop::processEvents ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #15 0x4b56ce30 in QEventLoop::enterLoop ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
 #18 0x0806fc1a in QWidget::setUpdatesEnabled ()
 #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
 #20 0x0806f7e1 in QWidget::setUpdatesEnabled ()

library versions:

 xine-lib-1.1.5-1.fc7
 xine-plugin-1.0-3.fc7
 glibc-headers-2.5.90-21
 glibc-common-2.5.90-21
 glibc-2.5.90-21
 glibc-devel-2.5.90-21
 gxine-0.5.11-3.fc7
 kaffeine-0.8.3-4.fc7
 xine-0.99.4-11.lvn7
 xine-lib-extras-1.1.5-1.fc7
 gxine-mozplugin-0.5.11-3.fc7

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18  8:27             ` Ingo Molnar
  2007-04-18  8:57               ` Ingo Molnar
@ 2007-04-18  8:57               ` Christoph Pfister
  2007-04-18  9:01                 ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18  8:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

Hi,

2007/4/18, Ingo Molnar <mingo@elte.hu>:
>
> [ i've Cc:-ed Ulrich Drepper, this CFS-triggered hang seems to have some
>   futex and pthread_cond_wait() relevance. ]
>
> * Christoph Pfister <christophpfister@gmail.com> wrote:
>
> > >> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
> >
> > Could you try xine-ui or gxine? Because I suspect rather xine-lib for
> > freezing issues. In any way I think a gdb backtrace would be much
> > nicer - but if you can't reproduce the freeze issue with other xine
> > based players and want to run kaffeine in gdb, you need to execute
> > "gdb --args kaffeine --nofork".
>
> update: i've reproduced one kind of a hang but i'm not sure it's the
> same hang Ismail is seeing. It was quite hard to trigger it under CFS, i
> had to do wild forward/backward button seeks on a real DVD and i mixed
> it with CPU-intense workloads on the same box. Here are the straces and
> gdb backtraces:
>
> kaffeine thread PID 9303, waiting for other threads to do something,
> stuck in pthread_mutex_lock():
>
>   futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...>
>
> backtrace:
>
>  #0  0xffffe410 in __kernel_vsyscall ()
>  #1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
>  #2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
>  #3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
>  #4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
>  #5  0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so
>  #6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
>  #7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #8  0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #9  0x4b55526e in QApplication::notify ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
>  #11 0x4b4dd5de in QETWidget::translateWheelEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #12 0x4b4eb41d in QETWidget::translateMouseEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #13 0x4b4e9766 in QApplication::x11ProcessEvent ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #14 0x4b4fb38b in QEventLoop::processEvents ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #15 0x4b56ce30 in QEventLoop::enterLoop ()   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
>  #18 0x0806fc1a in QWidget::setUpdatesEnabled ()
>  #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
>  #20 0x0806f7e1 in QWidget::setUpdatesEnabled ()
>
> Kaffeine thread 9324, seems to be in an infinite pthread_cond_wait()
> loop that does:
>
>  futex(0xb0740b78, FUTEX_WAIT, 3559, NULL) = 0
>  futex(0xb0740b5c, FUTEX_WAKE, 1)        = 0
>  munmap(0xaacb1000, 1662976)             = 0
>  mmap2(NULL, 1662976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xaacb1000
>  gettimeofday({1176891363, 347259}, NULL) = 0
>  munmap(0xab309000, 1662976)             = 0
>
> backtrace:
>
>  #0  0xffffe410 in __kernel_vsyscall ()
>  #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
>  #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
>  #3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
>  #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
>  #5  0x4a05820e in clone () from /lib/libc.so.6

This backtrace is useless - QWidget::setUpdatesEnabled() is certainly
_not_ defined in libxine. So the function names in #2 and #3 are wrong
because the addresses seem to belong to libxine.

> Kaffeine thread 9325 does a loop of short pthread_cond_wait() futex
> sleeps:
>
>  1176891721.419314 futex(0xb07527e8, FUTEX_WAIT, 8537, NULL) = 0 <0.011710>
>  1176891721.431068 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000006>
>  1176891721.431429 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000008>
>  1176891721.431458 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000012>
>  1176891721.431489 futex(0xb07527e8, FUTEX_WAIT, 8539, NULL) = 0 <0.007339>
>  1176891721.439008 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000052>
>  1176891721.439510 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000055>
>  1176891721.439636 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000089>
>  1176891721.439789 futex(0xb07527e8, FUTEX_WAIT, 8541, NULL) = 0 <0.007045>
>  1176891721.447017 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000054>
>  1176891721.447682 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000065>
>
> backtrace:
>
>  #0  0xffffe410 in __kernel_vsyscall ()
>  #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
>  #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
>  #3  0xb7a04079 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
>  #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
>  #5  0x4a05820e in clone () from /lib/libc.so.6

Ditto.

> library versions:
>
>  xine-lib-1.1.5-1.fc7
>  xine-plugin-1.0-3.fc7
>  glibc-headers-2.5.90-21
>  glibc-common-2.5.90-21
>  glibc-2.5.90-21
>  glibc-devel-2.5.90-21
>  gxine-0.5.11-3.fc7
>  kaffeine-0.8.3-4.fc7
>  xine-0.99.4-11.lvn7
>  xine-lib-extras-1.1.5-1.fc7
>  gxine-mozplugin-0.5.11-3.fc7
>
> what's weird is that all threads are in a pthread op and seem to be kind
> of busy-looping. Maybe xine-lib has some buggy use of pthread condvars
> that CFS happens to trigger? (If CFS broke futexes in general i think
> we'd be seeing far more widespread breakage.)
>
>         Ingo

Christoph


* Re: Kaffeine problem with CFS
  2007-04-18  8:57               ` Christoph Pfister
@ 2007-04-18  9:01                 ` Ingo Molnar
  2007-04-18  9:12                   ` Mike Galbraith
  2007-04-18  9:13                   ` Christoph Pfister
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  9:01 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Christoph Pfister <christophpfister@gmail.com> wrote:

> >backtrace:
> >
> > #0  0xffffe410 in __kernel_vsyscall ()
> > #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> > /lib/libpthread.so.0
> > #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > #3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
> > #5  0x4a05820e in clone () from /lib/libc.so.6
> 
> This backtrace is useless - QWidget::setUpdatesEnabled() is certainly 
> _not_ defined in libxine. So the function names in #2 and #3 are wrong 
> because the addresses seem to belong to libxine.

are the updated backtraces in the followup mail i just sent more useful? 
(I still have that stuck session running so i can do whatever debugging
you'd like to see done.)

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18  8:57               ` Ingo Molnar
@ 2007-04-18  9:06                 ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  9:06 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Ingo Molnar <mingo@elte.hu> wrote:

> these were only the threads that showed up in htop. Here's a full 
> analysis about what all threads are doing:
> 
>  Process 9303: stuck in xine_play()/pthread_mutex_lock()
>  Process 9319:  stuck in pthread_cond_timedwait()
>  Process 9320:  stuck in pthread_cond_timedwait()
>  Process 9321: loop of ~3 msec nanosleeps
>  Process 9322: loop of poll() calls every 335 msecs
>  Process 9323:  stuck in pthread_cond_timedwait()
>  Process 9324: stuck in a loop of 1-second futex-waits + mmap/munmap (malloc)
>  Process 9325:  stuck in pthread_cond_timedwait()
>  Process 9326:  stuck in pthread_cond_timedwait()
>  Process 9327:  stuck in pthread_cond_timedwait()

and here's a top snapshot:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9324 mingo     20   0  300m  59m  17m R 96.4  6.8  21:00.55 kaffeine
 9325 mingo     20   0  300m  59m  17m S  2.0  6.8   0:15.57 kaffeine
 9327 mingo     20   0  300m  59m  17m S  2.0  6.8   0:20.10 kaffeine

so 9324 doing the mpeg decoding seems to be stuck somehow?
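
A small Python sketch of how the busy thread can be picked out of a
snapshot like the one above. The snapshot text is copied from this mail;
parse_top() and the variable names are made up for illustration and are
not part of any existing tool.

```python
# Snapshot text copied verbatim from the top output above.
TOP_SNAPSHOT = """\
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9324 mingo     20   0  300m  59m  17m R 96.4  6.8  21:00.55 kaffeine
 9325 mingo     20   0  300m  59m  17m S  2.0  6.8   0:15.57 kaffeine
 9327 mingo     20   0  300m  59m  17m S  2.0  6.8   0:20.10 kaffeine
"""


def parse_top(text):
    """Turn whitespace-separated top output into a list of dicts (hypothetical helper)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split into at most len(header) fields so COMMAND may contain spaces.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


rows = parse_top(TOP_SNAPSHOT)
# The thread with the highest %CPU is the one worth attaching gdb to first.
busiest = max(rows, key=lambda row: float(row["%CPU"]))
print(busiest["PID"], busiest["S"], busiest["%CPU"])
```

On this snapshot the only thread in state R is 9324, which matches the
suspicion that the decoder thread is the one spinning.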

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18  9:01                 ` Ingo Molnar
@ 2007-04-18  9:12                   ` Mike Galbraith
  2007-04-18  9:13                   ` Christoph Pfister
  1 sibling, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-18  9:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Pfister, S.Çağlar Onur, linux-kernel,
	Michael Lothian, Christophe Thommeret, Jurgen Kofler,
	Ulrich Drepper

On Wed, 2007-04-18 at 11:01 +0200, Ingo Molnar wrote:
> * Christoph Pfister <christophpfister@gmail.com> wrote:
> 
> > >backtrace:
> > >
> > > #0  0xffffe410 in __kernel_vsyscall ()
> > > #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> > > /lib/libpthread.so.0
> > > #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
> > > #5  0x4a05820e in clone () from /lib/libc.so.6
> > 
> > This backtrace is useless - QWidget::setUpdatesEnabled() is certainly 
> > _not_ defined in libxine. So the function names in #2 and #3 are wrong 
> > because the addresses seem to belong to libxine.
> 
> are the updated backtraces in the followup mail i just sent more useful? 
> (I still have that stuck session running so i can do whatever debugging
> you'd like to see done.)

The xine website release note says there are problems with playback with
xine-lib version 1.1.5, so people encountering this may want to check to
see if they're running 1.1.5, and either upgrade to the latest, or
downgrade to 1.1.4.

<snippet from xine website>

18.04.2007   xine-lib 1.1.6   A new xine-lib version is now available.
This is mainly a bug-fix release; 1.1.5 has CD audio and DVD playback
problems and a couple of X-related build problems.

	-Mike



* Re: Kaffeine problem with CFS
  2007-04-18  9:01                 ` Ingo Molnar
  2007-04-18  9:12                   ` Mike Galbraith
@ 2007-04-18  9:13                   ` Christoph Pfister
  2007-04-18  9:17                     ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18  9:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

2007/4/18, Ingo Molnar <mingo@elte.hu>:
>
> * Christoph Pfister <christophpfister@gmail.com> wrote:
>
> > >backtrace:
> > >
> > > #0  0xffffe410 in __kernel_vsyscall ()
> > > #1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/libpthread.so.0
> > > #2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
> > > #5  0x4a05820e in clone () from /lib/libc.so.6
> >
> > This backtrace is useless - QWidget::setUpdatesEnabled() is certainly
> > _not_ defined in libxine. So the function names in #2 and #3 are wrong
> > because the addresses seem to belong to libxine.
>
> are the updated backtraces in the followup mail i just sent more useful?
> (I still have that stuck session running so i can do whatever debugging
> you'd like to see done.)

QWidget::setUpdatesEnabled() is (wrongly) present in every thread
except the main. So I'm afraid there's nothing which can be done :-/
Btw the main thread is waiting for the first frame being displayed
after the seek.
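
The "wait until the first frame is shown" step can be sketched with a
condition variable. This is a Python stand-in for the pthread pattern,
not xine-lib's actual code; FrameGate and its method names are
hypothetical. The waiter re-checks a predicate under the lock and uses a
timeout, so a missed wakeup would show up as a timeout instead of a
silent hang.

```python
import threading


class FrameGate:
    """Hypothetical helper: lets one thread wait until another has shown a frame."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame_shown = False

    def mark_frame_shown(self):
        # Called by the (hypothetical) video output thread.
        with self._cond:
            self._frame_shown = True
            self._cond.notify_all()

    def wait_for_frame(self, timeout):
        # wait_for() loops on the predicate, so spurious wakeups are harmless.
        # It returns False if the timeout expires before the flag is set.
        with self._cond:
            return self._cond.wait_for(lambda: self._frame_shown, timeout=timeout)


gate = FrameGate()
timed_out_result = gate.wait_for_frame(0.05)   # nobody has shown a frame yet

# Simulate the output thread displaying a frame a little later.
threading.Timer(0.05, gate.mark_frame_shown).start()
woken_result = gate.wait_for_frame(2.0)
print(timed_out_result, woken_result)
```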

>         Ingo

Christoph


* Re: Kaffeine problem with CFS
  2007-04-18  9:13                   ` Christoph Pfister
@ 2007-04-18  9:17                     ` Ingo Molnar
  2007-04-18  9:25                       ` Christoph Pfister
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  9:17 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Christoph Pfister <christophpfister@gmail.com> wrote:

> >are the updated backtraces in the followup mail i just sent more 
> >useful? (I still have that stuck session running so i can do whatever
> >debugging you'd like to see done.)
> 
> QWidget::setUpdatesEnabled() is (wrongly) present in every thread 
> except the main. So I'm afraid there's nothing which can be done :-/ 
> Btw the main thread is waiting for the first frame being displayed 
> after the seek.

i didn't have all the debuginfo packages installed. I installed some (but
not all yet), here's an updated backtrace:

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0
#2  0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0
#3  0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0
#4  0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0
#5  0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#6  0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570
#7  0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#8  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#9  0x4a05820e in clone () from /lib/libc.so.6

at least the dvd_plugin_free_buffer() call has been resolved now. (I'll 
hunt for the other debuginfo packages too.)

which thread would be the most interesting to you - 9324?

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18  9:17                     ` Ingo Molnar
@ 2007-04-18  9:25                       ` Christoph Pfister
  2007-04-18  9:28                         ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

2007/4/18, Ingo Molnar <mingo@elte.hu>:
>
> * Christoph Pfister <christophpfister@gmail.com> wrote:
>
> > >are the updated backtraces in the followup mail i just sent more
> > >useful? (I still have that stuck session running so i can do whatever
> > >debugging you'd like to see done.)
> >
> > QWidget::setUpdatesEnabled() is (wrongly) present in every thread
> > except the main. So I'm afraid there's nothing which can be done :-/
> > Btw the main thread is waiting for the first frame being displayed
> > after the seek.
>
> i didn't have all the debuginfo packages installed. I installed some (but
> not all yet), here's an updated backtrace:
>
> (gdb) bt
> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0
> #2  0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0
> #3  0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0
> #4  0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0
> #5  0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> #6  0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570
> #7  0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> #8  0x4a24d2db in start_thread () from /lib/libpthread.so.0
> #9  0x4a05820e in clone () from /lib/libc.so.6
>
> at least the dvd_plugin_free_buffer() call has been resolved now. (I'll
> hunt for the other debuginfo packages too.)
>
> which thread would be the most interesting to you - 9324?

The thread which should wake the main thread - but hmm ... 9303 seems
to be rather dead-locked than doing pthread_cond_timedwait() ?

>         Ingo

Christoph


* Re: Kaffeine problem with CFS
  2007-04-18  9:25                       ` Christoph Pfister
@ 2007-04-18  9:28                         ` Ingo Molnar
  2007-04-18  9:52                           ` Christoph Pfister
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  9:28 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Christoph Pfister <christophpfister@gmail.com> wrote:

> >which thread would be the most interesting to you - 9324?
> 
> The thread which should wake the main thread - but hmm ... 9303 seems 
> to be rather dead-locked than doing pthread_cond_timedwait() ?

ok, here it is, 9303 with better symbol names:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
#3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
#5  0xb7a9b0fb in KXineWidget::slotSeekToPosition ()
   from /usr/lib/kde3/libxinepart.so
#6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
#7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#8  0x4b55353b in QApplication::internalNotify ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#9  0x4b55526e in QApplication::notify ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
#11 0x4b4dd5de in QETWidget::translateWheelEvent ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#12 0x4b4eb41d in QETWidget::translateMouseEvent ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#13 0x4b4e9766 in QApplication::x11ProcessEvent ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#14 0x4b4fb38b in QEventLoop::processEvents ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#15 0x4b56ce30 in QEventLoop::enterLoop ()
   from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#18 0x0806fc1a in QWidget::setUpdatesEnabled ()
#19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
#20 0x0806f7e1 in QWidget::setUpdatesEnabled ()

does this make more sense to you?

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  8:55       ` Nick Piggin
@ 2007-04-18  9:33         ` Con Kolivas
  2007-04-18 12:14           ` Nick Piggin
  2007-04-18  9:53         ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-18  9:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wednesday 18 April 2007 18:55, Nick Piggin wrote:
> On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
> > * Nick Piggin <npiggin@suse.de> wrote:
> > > 2.6.21-rc7-cfs-v2
> > > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > > 535.43user 30.62system 2:23.72elapsed 393%CPU
> >
> > Thanks for testing this! Could you please try this also with:
> >
> >    echo 100000000 > /proc/sys/kernel/sched_granularity
>
> 507.68user 31.87system 2:18.05elapsed 390%CPU
> 507.99user 31.93system 2:18.09elapsed 390%CPU
> 507.46user 31.78system 2:18.03elapsed 390%CPU
> 507.68user 31.93system 2:18.11elapsed 390%CPU
> 507.63user 31.98system 2:18.01elapsed 390%CPU
> 507.83user 31.94system 2:18.28elapsed 390%CPU
>
> > could you maybe even try a more extreme setting of:
> >
> >    echo 500000000 > /proc/sys/kernel/sched_granularity
>
> 504.87user 32.13system 2:18.03elapsed 389%CPU
> 505.94user 32.29system 2:17.87elapsed 390%CPU
> 506.10user 31.90system 2:17.96elapsed 389%CPU
> 505.02user 32.02system 2:17.96elapsed 389%CPU
> 506.69user 31.96system 2:17.82elapsed 390%CPU
> 505.70user 31.84system 2:17.90elapsed 389%CPU
>
>
> Again, for comparison 2.6.21-rc7 mainline:
>
> 508.87user 32.47system 2:17.82elapsed 392%CPU
> 509.05user 32.25system 2:17.84elapsed 392%CPU
> 508.75user 32.26system 2:17.83elapsed 392%CPU
> 508.63user 32.17system 2:17.88elapsed 392%CPU
> 509.01user 32.26system 2:17.90elapsed 392%CPU
> 509.08user 32.20system 2:17.95elapsed 392%CPU
>
> So looking at elapsed time, a granularity of 100ms is just behind the
> mainline score. However it is using slightly less user time and
> slightly more idle time, which indicates that balancing might have got
> a bit less aggressive.
>
> But anyway, it conclusively shows the efficiency impact of such tiny
> timeslices.

See test.kernel.org for how (the now defunct) SD was performing on kernbench. 
It had low latency _and_ equivalent throughput to mainline. Set the standard 
appropriately on both counts please.

-- 
-ck


* Re: Kaffeine problem with CFS
  2007-04-18  9:28                         ` Ingo Molnar
@ 2007-04-18  9:52                           ` Christoph Pfister
  2007-04-18 10:04                             ` Christoph Pfister
  2007-04-18 10:17                             ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18  9:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

2007/4/18, Ingo Molnar <mingo@elte.hu>:
>
> * Christoph Pfister <christophpfister@gmail.com> wrote:
>
> > >which thread would be the most interesting to you - 9324?
> >
> > The thread which should wake the main thread - but hmm ... 9303 seems
> > to be rather dead-locked than doing pthread_cond_timedwait() ?
>
> ok, here it is, 9303 with better symbol names:
>
> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
> #2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
> #3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
> #4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
> #5  0xb7a9b0fb in KXineWidget::slotSeekToPosition ()
>    from /usr/lib/kde3/libxinepart.so
> #6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
> #7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #8  0x4b55353b in QApplication::internalNotify ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #9  0x4b55526e in QApplication::notify ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
> #11 0x4b4dd5de in QETWidget::translateWheelEvent ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #12 0x4b4eb41d in QETWidget::translateMouseEvent ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #13 0x4b4e9766 in QApplication::x11ProcessEvent ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #14 0x4b4fb38b in QEventLoop::processEvents ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #15 0x4b56ce30 in QEventLoop::enterLoop ()
>    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> #18 0x0806fc1a in QWidget::setUpdatesEnabled ()
> #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
> #20 0x0806f7e1 in QWidget::setUpdatesEnabled ()
>
> does this make more sense to you?

It's nearly impossible for me to find out which mutex is deadlocking.
There are 4 mutexes locked / released during xine_play (or one of the
possibly inlined functions) and to be honest I have little idea which
other thread is also involved in the deadlock (maybe some xine-lib
junkie could help you more with that).
It would be great if you could reproduce the same problem with a
xine-lib which has been compiled with debug support (so you'd get line
numbers in the back trace - that makes life _a lot_ easier and maybe I
could identify the problem that way) and the least optimization
possible ... :-)
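
For readers following along, the kind of lock-order deadlock suspected
here can be sketched in a few lines. This is an illustration in Python,
not xine-lib code: lock_a, lock_b and worker() are hypothetical
stand-ins for two of the mutexes and the two threads involved. A timed
acquire is used so the deadlock shows up as a failed attempt instead of
a hang.

```python
import threading

# Hypothetical stand-ins for two of the mutexes taken during a seek.
lock_a = threading.Lock()
lock_b = threading.Lock()

both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)
got_second = {}


def worker(name, first, second):
    with first:
        # Make sure both threads own their first lock before going on.
        both_hold_first.wait()
        # A plain acquire() would block forever at this point; the timeout
        # turns the deadlock into an observable failure.
        acquired = second.acquire(timeout=0.2)
        got_second[name] = acquired
        if acquired:
            second.release()
        # Keep 'first' held until both attempts are over, so neither thread
        # can sneak in after the other one has given up.
        both_tried_second.wait()


threads = [
    threading.Thread(target=worker, args=("t1", lock_a, lock_b)),
    threading.Thread(target=worker, args=("t2", lock_b, lock_a)),  # opposite order
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(got_second)
```

Neither thread gets its second lock. With line numbers in the backtrace
the same reasoning can be applied to the real mutexes: find which lock
each blocked thread already holds and which one it is waiting for.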

>         Ingo

Christoph


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  8:55       ` Nick Piggin
  2007-04-18  9:33         ` Con Kolivas
@ 2007-04-18  9:53         ` Ingo Molnar
  2007-04-18 12:13           ` Nick Piggin
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18  9:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan


* Nick Piggin <npiggin@suse.de> wrote:

> > > 535.43user 30.62system 2:23.72elapsed 393%CPU
> > 
> > Thanks for testing this! Could you please try this also with:
> > 
> >    echo 100000000 > /proc/sys/kernel/sched_granularity
> 
> 507.68user 31.87system 2:18.05elapsed 390%CPU
> 507.99user 31.93system 2:18.09elapsed 390%CPU

> > could you maybe even try a more extreme setting of:
> > 
> >    echo 500000000 > /proc/sys/kernel/sched_granularity

> 506.69user 31.96system 2:17.82elapsed 390%CPU
> 505.70user 31.84system 2:17.90elapsed 389%CPU

> Again, for comparison 2.6.21-rc7 mainline:
> 
> 508.87user 32.47system 2:17.82elapsed 392%CPU
> 509.05user 32.25system 2:17.84elapsed 392%CPU

thanks for testing this!

> So looking at elapsed time, a granularity of 100ms is just behind the 
> mainline score. However it is using slightly less user time and 
> slightly more idle time, which indicates that balancing might have got 
> a bit less aggressive.
> 
> But anyway, it conclusively shows the efficiency impact of such tiny 
> timeslices.

yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is 
not unexpected when going to really frequent preemption. Clearly, the 
default preemption granularity needs to be tuned up.
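
The size of the drop can be checked from the elapsed times posted
earlier in this thread. A minimal Python sketch, using the six
2.6.21-rc7-cfs-v2 runs at default granularity and the six mainline runs;
the helper names are made up for illustration:

```python
# Elapsed times (m:ss.xx) as posted by Nick for the two kernels.
CFS_V2 = ["2:23.64", "2:23.70", "2:23.76", "2:23.76", "2:23.67", "2:23.72"]
MAINLINE = ["2:17.82", "2:17.84", "2:17.83", "2:17.88", "2:17.90", "2:17.95"]


def to_seconds(elapsed):
    """Convert the 'm:ss.xx' elapsed field of time(1) output to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def mean_elapsed(runs):
    return sum(to_seconds(run) for run in runs) / len(runs)


# Relative slowdown of cfs-v2 (default granularity) against mainline.
slowdown = mean_elapsed(CFS_V2) / mean_elapsed(MAINLINE) - 1.0
print(f"cfs-v2 is {slowdown:.1%} slower than mainline on kernbench")
```

This only compares wall-clock time; it says nothing about where the
extra time goes.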

I think you said you measured a ~3msec average preemption interval per CPU?
That would suggest the average cache-thrashing cost was 120 usecs (4% of
3 msec) per 3 msec window. Taking that as a ballpark figure, to get the
difference back into the noise range we'd have to either use ~5 msec:

    echo 5000000 > /proc/sys/kernel/sched_granularity

or 15 msec:

    echo 15000000 > /proc/sys/kernel/sched_granularity

(depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i 
correctly understood your 3msec value. I'd have to know your kernbench 
workload's approximate 'steady state' context-switch rate to do a more 
accurate calculation.)
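
The ballpark arithmetic can be written out as follows. This is a rough
Python sketch under a loud assumption: each preemption costs a fixed
amount of cache-refill time, so the overhead scales as cost divided by
granularity. That linear model and the function name are illustrative
only, not something measured.

```python
SLOWDOWN = 0.04  # the ~4% kernbench drop discussed above


def overhead_at(granularity_ns, measured_interval_ns, slowdown=SLOWDOWN):
    """Estimated overhead at a new granularity (hypothetical linear model).

    Assumes the observed slowdown is entirely a fixed per-preemption cost
    paid once per measured_interval_ns.
    """
    cost_ns = slowdown * measured_interval_ns
    return cost_ns / granularity_ns


# If the measured interval really was 3 msec, each preemption costs ~120 usecs.
cost_if_3ms = SLOWDOWN * 3_000_000
print(f"per-preemption cost: {cost_if_3ms / 1000:.0f} usecs")

# 5x the measured interval, for both readings of the "3msec" figure.
print(f"15 msec granularity, 3 msec interval: {overhead_at(15_000_000, 3_000_000):.1%}")
print(f" 5 msec granularity, 1 msec interval: {overhead_at(5_000_000, 1_000_000):.1%}")
```

Either way, stretching the granularity to 5x the measured interval cuts
the estimated overhead to a fifth of the original 4%.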

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18  9:52                           ` Christoph Pfister
@ 2007-04-18 10:04                             ` Christoph Pfister
  2007-04-18 10:17                             ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18 10:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

2007/4/18, Christoph Pfister <christophpfister@gmail.com>:
> 2007/4/18, Ingo Molnar <mingo@elte.hu>:
> >
> > * Christoph Pfister <christophpfister@gmail.com> wrote:
> >
> > > >which thread would be the most interesting to you - 9324?
> > >
> > > The thread which should wake the main thread - but hmm ... 9303 seems
> > > to be rather dead-locked than doing pthread_cond_timedwait() ?
> >
> > ok, here it is, 9303 with better symbol names:
> >
> > #0  0xffffe410 in __kernel_vsyscall ()
> > #1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
> > #2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
> > #3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
> > #4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
> > #5  0xb7a9b0fb in KXineWidget::slotSeekToPosition ()
> >    from /usr/lib/kde3/libxinepart.so
> > #6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
> > #7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #8  0x4b55353b in QApplication::internalNotify ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #9  0x4b55526e in QApplication::notify ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
> > #11 0x4b4dd5de in QETWidget::translateWheelEvent ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #12 0x4b4eb41d in QETWidget::translateMouseEvent ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #13 0x4b4e9766 in QApplication::x11ProcessEvent ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #14 0x4b4fb38b in QEventLoop::processEvents ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #15 0x4b56ce30 in QEventLoop::enterLoop ()
> >    from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
> > #18 0x0806fc1a in QWidget::setUpdatesEnabled ()
> > #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
> > #20 0x0806f7e1 in QWidget::setUpdatesEnabled ()
> >
> > does this make more sense to you?
>
> It's nearly impossible for me to find out which mutex is deadlocking.
> There are 4 mutexes locked / released during xine_play (or one of the
> possibly inlined functions) and to be honest I have little idea which
> other thread is also involved in the deadlock (maybe some xine-lib
> junkie could help you more with that).
> It would be great if you could reproduce the same problem with a
> xine-lib which has been compiled with debug support (so you'd get line
> numbers in the back trace - that makes life _a lot_ easier and maybe I
> could identify the problem that way) and the least optimization
> possible ... :-)
>
> >         Ingo

Or I could try playing around a bit with your patchset and trying to
reproduce it over here. Because I already have debug builds for
xine-lib and compiling a new kernel can take place in the background
it wouldn't be much effort for me.

Christoph


* Re: Kaffeine problem with CFS
  2007-04-18  9:52                           ` Christoph Pfister
  2007-04-18 10:04                             ` Christoph Pfister
@ 2007-04-18 10:17                             ` Ingo Molnar
  2007-04-18 10:32                               ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:17 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Christoph Pfister <christophpfister@gmail.com> wrote:

> It's nearly impossible for me to find out which mutex is deadlocking.

i've disassembled the xine_play function, and here are the function 
calls in it:

  <unresolved widget call?>
 pthread_mutex_lock()
 xine_log()
  <unresolved widget call?>
 function pointer call
 right after it: pthread_mutex_lock()

this second pthread_mutex_lock() in question is the one that deadlocks. 
It comes right after that function pointer call, maybe that identifies 
it?

[some time passes]

i rebuilt the library from source and while the installed library is 
different from it, looking at the disassembly i'm quite sure it's this 
pthread_mutex_lock() in xine_play_internal():

  pthread_mutex_lock( &stream->demux_lock );

src/xine-engine/xine.c:1201

the function pointer call was:

  stream->xine->port_ticket->acquire(stream->xine->port_ticket, 1);

right before the pthread_mutex_lock() call.

> It would be great if you could reproduce the same problem with a 
> xine-lib which has been compiled with debug support (so you'd get line 
> numbers in the back trace - that makes life _a lot_ easier and maybe I 
> could identify the problem that way) and the least optimization 
> possible ... :-)

ok, i'll try that too (but it will take some more time), but given how 
hard it was for me to trigger it, i wanted to get maximum info out of it 
before having to kill the threads.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  7:56 ` Andy Whitcroft
  2007-04-17  9:32   ` Nick Piggin
@ 2007-04-18 10:22   ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:22 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan


* Andy Whitcroft <apw@shadowen.org> wrote:

> > as usual, any sort of feedback, bugreports, fixes and suggestions 
> > are more than welcome,
> 
> Pushed this through the test.kernel.org and nothing new blew up. 
> Notably the kernbench figures are within expectations even on the 
> bigger numa systems, commonly badly affected by balancing problems in 
> the scheduler.

thanks! Given the really low preemption latency/granularity default 
(roughly equivalent to 'timeslice length'), and that basically all of my 
focus was on interactivity characteristics, this is a pretty good 
result. I suspect it will be necessary to increase the default to 10 
msecs (or more) to be on the safe side. (Nick has reported a 4% 
kernbench drop so for his kernbench workload it's needed.)

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18 10:17                             ` Ingo Molnar
@ 2007-04-18 10:32                               ` Ingo Molnar
  2007-04-18 10:37                                 ` Ingo Molnar
  2007-04-18 10:53                                 ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:32 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c does 
this:

        pthread_mutex_unlock( &stream->demux_lock );

        lprintf ("sched_yield\n");

        sched_yield();
        pthread_mutex_lock( &stream->demux_lock );

why is this done? CFS has definitely changed the yield implementation so 
there could be some connection.

OTOH, in the 'hung' state none of the straces suggests any yield() call.
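
For illustration: the unlock / sched_yield() / lock sequence only lets a waiter in if the yield really gives it time to run before the lock is re-taken. One way to avoid depending on that is a FIFO ("ticket") lock, where a thread that releases and immediately re-acquires queues behind anyone already waiting. The following is a minimal sketch under that assumption, not what xine-lib does; ticket_lock, seeker and ticket_demo are hypothetical names.

```c
#include <pthread.h>
#include <sched.h>

/* FIFO ("ticket") lock built from a mutex and a condition variable. */
struct ticket_lock {
    pthread_mutex_t m;
    pthread_cond_t  c;
    unsigned long   next;     /* next ticket to hand out */
    unsigned long   serving;  /* ticket currently allowed in */
};

void ticket_lock_init(struct ticket_lock *t)
{
    pthread_mutex_init(&t->m, NULL);
    pthread_cond_init(&t->c, NULL);
    t->next = 0;
    t->serving = 0;
}

void ticket_lock_acquire(struct ticket_lock *t)
{
    pthread_mutex_lock(&t->m);
    unsigned long me = t->next++;          /* take a place in the queue */
    while (t->serving != me)
        pthread_cond_wait(&t->c, &t->m);   /* sleep until it is our turn */
    pthread_mutex_unlock(&t->m);
}

void ticket_lock_release(struct ticket_lock *t)
{
    pthread_mutex_lock(&t->m);
    t->serving++;                          /* admit the next ticket holder */
    pthread_cond_broadcast(&t->c);
    pthread_mutex_unlock(&t->m);
}

static struct ticket_lock demo_lock;
static int order[2];                       /* who entered, in order */
static int order_len;

/* Stands in for the thread that wants the lock (e.g. a seek request). */
static void *seeker(void *arg)
{
    (void)arg;
    ticket_lock_acquire(&demo_lock);
    order[order_len++] = 'S';
    ticket_lock_release(&demo_lock);
    return NULL;
}

/* Returns 1 if the waiting thread got in before the re-acquiring holder. */
int ticket_demo(void)
{
    pthread_t th;

    ticket_lock_init(&demo_lock);
    order_len = 0;
    ticket_lock_acquire(&demo_lock);       /* the "loop" thread holds ticket 0 */
    pthread_create(&th, NULL, seeker, NULL);

    for (;;) {                             /* wait until seeker took ticket 1 */
        pthread_mutex_lock(&demo_lock.m);
        unsigned long n = demo_lock.next;
        pthread_mutex_unlock(&demo_lock.m);
        if (n == 2)
            break;
        sched_yield();
    }

    ticket_lock_release(&demo_lock);
    ticket_lock_acquire(&demo_lock);       /* ticket 2: queues behind seeker */
    order[order_len++] = 'D';
    ticket_lock_release(&demo_lock);

    pthread_join(th, NULL);
    return order[0] == 'S' && order[1] == 'D';
}
```

With a plain mutex the releasing thread may win the race to re-lock; here the ordering is fixed by ticket number, at the cost of an extra mutex round trip per acquire and release.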

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18 10:32                               ` Ingo Molnar
@ 2007-04-18 10:37                                 ` Ingo Molnar
  2007-04-18 10:49                                   ` Ingo Molnar
  2007-04-18 10:53                                 ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:37 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Ingo Molnar <mingo@elte.hu> wrote:

> hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c 
> does this:

plus it does this too:

      pthread_mutex_unlock( &stream->demux_lock );
      xine_usec_sleep(100000);
      pthread_mutex_lock( &stream->demux_lock );

this would explain the nanosleep() strace entries. But the task stuck on 
demux_lock never gets the unlock event. Weird.

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18 10:37                                 ` Ingo Molnar
@ 2007-04-18 10:49                                   ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:49 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Ingo Molnar <mingo@elte.hu> wrote:

> > hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c 
> > does this:
> 
> plus it does this too:
> 
>       pthread_mutex_unlock( &stream->demux_lock );
>       xine_usec_sleep(100000);
>       pthread_mutex_lock( &stream->demux_lock );
> 
> this would explain the nanosleep() strace entries. But the task stuck 
> on demux_lock never gets the unlock event. Weird.

9303 is stuck here on demux_lock:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
#3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1

that mutex related futex is at address 0xb07409e0, but the only sign in 
the strace of that futex being touched is:

9303  futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...>

no other event ever happens on futex 0xb07409e0. Other threads don't 
touch it.
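
As background for the FUTEX_WAIT line: a futex-based mutex typically keeps a three-state word (0 unlocked, 1 locked, 2 locked with possible waiters), and a contended locker sleeps in FUTEX_WAIT with the expected value 2, which is consistent with the strace entry above. Below is a minimal, Linux-only sketch of that scheme, following the well-known design from Ulrich Drepper's "Futexes Are Tricky". It is an illustration, not glibc's actual code, and the sketch_* and futex_op names are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked and there may be waiters */
struct sketch_mutex {
    atomic_int val;
};

/* Thin wrapper around the raw futex system call. */
static long futex_op(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void sketch_lock(struct sketch_mutex *m)
{
    int c = 0;

    /* Fast path: 0 -> 1 without entering the kernel. */
    if (atomic_compare_exchange_strong(&m->val, &c, 1))
        return;

    /* Contended: mark the word as "has waiters" and sleep while it is 2. */
    if (c != 2)
        c = atomic_exchange(&m->val, 2);
    while (c != 0) {
        futex_op(&m->val, FUTEX_WAIT, 2);   /* this is what strace shows */
        c = atomic_exchange(&m->val, 2);
    }
}

void sketch_unlock(struct sketch_mutex *m)
{
    /* If the old value was 2, somebody may be asleep: reset and wake one. */
    if (atomic_fetch_sub(&m->val, 1) != 1) {
        atomic_store(&m->val, 0);
        futex_op(&m->val, FUTEX_WAKE, 1);
    }
}
```

In this scheme a sleeper only returns once some thread issues FUTEX_WAKE on the same address, so a trace with a FUTEX_WAIT and no later event on that address means the wake has not arrived yet.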

Maybe thread 9324 is the owner of that mutex, and it's looping somewhere 
that does xine_xmalloc_aligned(), with the lock held? It did this:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0
#2  0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0
#3  0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0
#4  0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0
#5  0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#6  0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570
#7  0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#8  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#9  0x4a05820e in clone () from /lib/libc.so.6

	Ingo


* Re: Kaffeine problem with CFS
  2007-04-18 10:32                               ` Ingo Molnar
  2007-04-18 10:37                                 ` Ingo Molnar
@ 2007-04-18 10:53                                 ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:53 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Ingo Molnar <mingo@elte.hu> wrote:

> hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c 
> does this:
> 
>         pthread_mutex_unlock( &stream->demux_lock );
> 
>         lprintf ("sched_yield\n");
> 
>         sched_yield();
>         pthread_mutex_lock( &stream->demux_lock );
> 
> why is this done? CFS has definitely changed the yield implementation 
> so there could be some connection.
> 
> OTOH, in the 'hung' state none of the straces suggests any yield() 
> call.

yeah, there's no yield() call in any of the straces so i'd exclude this 
as a possibility.

	Ingo


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  9:53         ` Ingo Molnar
@ 2007-04-18 12:13           ` Nick Piggin
  2007-04-18 12:49             ` Con Kolivas
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-18 12:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > So looking at elapsed time, a granularity of 100ms is just behind the 
> > mainline score. However it is using slightly less user time and 
> > slightly more idle time, which indicates that balancing might have got 
> > a bit less aggressive.
> > 
> > But anyway, it conclusively shows the efficiency impact of such tiny 
> > timeslices.
> 
> yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is 
> not unexpected when going to really frequent preemption. Clearly, the 
> default preemption granularity needs to be tuned up.
> 
> I think you said you measured ~3msec average preemption rate per CPU? 

This was just looking at ctxsw numbers from running 2 cpu hogs on the
same runqueue.

> That would suggest the average cache-trashing cost was 120 usecs per 
> every 3 msec window. Taking that as a ballpark figure, to get the 
> difference back into the noise range we'd have to either use ~5 msec:
> 
>     echo 5000000 > /proc/sys/kernel/sched_granularity
> 
> or 15 msec:
> 
>     echo 15000000 > /proc/sys/kernel/sched_granularity
> 
> (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i 
> correctly understood your 3msec value. I'd have to know your kernbench 
> workload's approximate 'steady state' context-switch rate to do a more 
> accurate calculation.)

The kernel compile (make -j8 on 4 thread system) is doing 1800 total
context switches per second (450/s per runqueue) for cfs, and 670
for mainline. Going up to 20ms granularity for cfs brings the context
switch numbers similar, but user time is still a % or so higher. I'd
be more worried about compute heavy threads which naturally don't do
much context switching.
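
The arithmetic behind these figures can be written out as a small sketch. It assumes a fixed cache-refill cost per preemption regardless of granularity (a simplification), and the function names are made up for illustration: a 4% loss at a ~3 msec preemption interval corresponds to ~120 usecs per preemption, and 450 switches/s per runqueue corresponds to one switch every ~2.2 msec.

```c
#include <stdint.h>

/*
 * Preemption overhead in parts per million of CPU time, assuming each
 * preemption costs a fixed cost_ns of cache refill (sketch assumption).
 */
uint64_t preempt_overhead_ppm(uint64_t cost_ns, uint64_t granularity_ns)
{
    return cost_ns * 1000000ULL / granularity_ns;
}

/* Average time between context switches on one runqueue, in nanoseconds. */
uint64_t switch_interval_ns(uint64_t switches_per_sec)
{
    return 1000000000ULL / switches_per_sec;
}
```

Under that assumption, preempt_overhead_ppm(120000, 3000000) gives 40000 (4%), and raising the granularity to 15 msec gives 8000 (0.8%). If the measured interval was really 1 msec, the per-preemption cost would be 40 usecs and 5 msec would reach the same 0.8%.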

Some other numbers on the same system
Hackbench:	2.6.21-rc7	cfs-v2 1ms[*]	nicksched
10 groups: Time: 1.332		0.743		0.607
20 groups: Time: 1.197		1.100		1.241
30 groups: Time: 1.754		2.376		1.834
40 groups: Time: 3.451		2.227		2.503
50 groups: Time: 3.726		3.399		3.220
60 groups: Time: 3.548		4.567		3.668
70 groups: Time: 4.206		4.905		4.314
80 groups: Time: 4.551		6.324		4.879
90 groups: Time: 7.904		6.962		5.335
100 groups: Time: 7.293		7.799		5.857
110 groups: Time: 10.595	8.728		6.517
120 groups: Time: 7.543		9.304		7.082
130 groups: Time: 8.269		10.639		8.007
140 groups: Time: 11.867	8.250		8.302
150 groups: Time: 14.852	8.656		8.662
160 groups: Time: 9.648		9.313		9.541

Mainline seems pretty inconsistent here.

lmbench 0K ctxsw latency bound to CPU0:
tasks		2.6.21-rc7	cfs-v2 1ms[*]	nicksched
2		2.59		3.42		2.50
4		3.26		3.54		3.09
8		3.01		3.64		3.22
16		3.00		3.66		3.50
32		2.99		3.70		3.49
64		3.09		4.17		3.50
128		4.80		5.58		4.74
256		5.79		6.37		5.76

cfs is noticeably disadvantaged.

[*] 500ms didn't make much difference in either test.



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  9:33         ` Con Kolivas
@ 2007-04-18 12:14           ` Nick Piggin
  2007-04-18 12:33             ` Con Kolivas
  0 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-18 12:14 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote:
> On Wednesday 18 April 2007 18:55, Nick Piggin wrote:
> > Again, for comparison 2.6.21-rc7 mainline:
> >
> > 508.87user 32.47system 2:17.82elapsed 392%CPU
> > 509.05user 32.25system 2:17.84elapsed 392%CPU
> > 508.75user 32.26system 2:17.83elapsed 392%CPU
> > 508.63user 32.17system 2:17.88elapsed 392%CPU
> > 509.01user 32.26system 2:17.90elapsed 392%CPU
> > 509.08user 32.20system 2:17.95elapsed 392%CPU
> >
> > So looking at elapsed time, a granularity of 100ms is just behind the
> > mainline score. However it is using slightly less user time and
> > slightly more idle time, which indicates that balancing might have got
> > a bit less aggressive.
> >
> > But anyway, it conclusively shows the efficiency impact of such tiny
> > timeslices.
> 
> See test.kernel.org for how (the now defunct) SD was performing on kernbench. 
> It had low latency _and_ equivalent throughput to mainline. Set the standard 
> appropriately on both counts please.

I can give it a run. Got an updated patch against -rc7?



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 12:14           ` Nick Piggin
@ 2007-04-18 12:33             ` Con Kolivas
  2007-04-18 21:49               ` Con Kolivas
  0 siblings, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-18 12:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wednesday 18 April 2007 22:14, Nick Piggin wrote:
> On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote:
> > On Wednesday 18 April 2007 18:55, Nick Piggin wrote:
> > > Again, for comparison 2.6.21-rc7 mainline:
> > >
> > > 508.87user 32.47system 2:17.82elapsed 392%CPU
> > > 509.05user 32.25system 2:17.84elapsed 392%CPU
> > > 508.75user 32.26system 2:17.83elapsed 392%CPU
> > > 508.63user 32.17system 2:17.88elapsed 392%CPU
> > > 509.01user 32.26system 2:17.90elapsed 392%CPU
> > > 509.08user 32.20system 2:17.95elapsed 392%CPU
> > >
> > > So looking at elapsed time, a granularity of 100ms is just behind the
> > > mainline score. However it is using slightly less user time and
> > > slightly more idle time, which indicates that balancing might have got
> > > a bit less aggressive.
> > >
> > > But anyway, it conclusively shows the efficiency impact of such tiny
> > > timeslices.
> >
> > See test.kernel.org for how (the now defunct) SD was performing on
> > kernbench. It had low latency _and_ equivalent throughput to mainline.
> > Set the standard appropriately on both counts please.
>
> I can give it a run. Got an updated patch against -rc7?

I said I wasn't pursuing it but since you're offering, the rc6 patch should 
apply ok.

http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch

-- 
-ck


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 12:13           ` Nick Piggin
@ 2007-04-18 12:49             ` Con Kolivas
  2007-04-19  3:28               ` Nick Piggin
  0 siblings, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-18 12:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wednesday 18 April 2007 22:13, Nick Piggin wrote:
> On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote:
> > * Nick Piggin <npiggin@suse.de> wrote:
> > > So looking at elapsed time, a granularity of 100ms is just behind the
> > > mainline score. However it is using slightly less user time and
> > > slightly more idle time, which indicates that balancing might have got
> > > a bit less aggressive.
> > >
> > > But anyway, it conclusively shows the efficiency impact of such tiny
> > > timeslices.
> >
> > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is
> > not unexpected when going to really frequent preemption. Clearly, the
> > default preemption granularity needs to be tuned up.
> >
> > I think you said you measured ~3msec average preemption rate per CPU?
>
> This was just looking at ctxsw numbers from running 2 cpu hogs on the
> same runqueue.
>
> > That would suggest the average cache-trashing cost was 120 usecs per
> > every 3 msec window. Taking that as a ballpark figure, to get the
> > difference back into the noise range we'd have to either use ~5 msec:
> >
> >     echo 5000000 > /proc/sys/kernel/sched_granularity
> >
> > or 15 msec:
> >
> >     echo 15000000 > /proc/sys/kernel/sched_granularity
> >
> > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i
> > correctly understood your 3msec value. I'd have to know your kernbench
> > workload's approximate 'steady state' context-switch rate to do a more
> > accurate calculation.)
>
> The kernel compile (make -j8 on 4 thread system) is doing 1800 total
> context switches per second (450/s per runqueue) for cfs, and 670
> for mainline. Going up to 20ms granularity for cfs brings the context
> switch numbers similar, but user time is still a % or so higher. I'd
> be more worried about compute heavy threads which naturally don't do
> much context switching.

While kernel compiles are nice and easy to do I've seen enough criticism of 
them in the past to wonder about their usefulness as a standard benchmark on 
their own.

>
> Some other numbers on the same system
> Hackbench:	2.6.21-rc7	cfs-v2 1ms[*]	nicksched
> 10 groups: Time: 1.332		0.743		0.607
> 20 groups: Time: 1.197		1.100		1.241
> 30 groups: Time: 1.754		2.376		1.834
> 40 groups: Time: 3.451		2.227		2.503
> 50 groups: Time: 3.726		3.399		3.220
> 60 groups: Time: 3.548		4.567		3.668
> 70 groups: Time: 4.206		4.905		4.314
> 80 groups: Time: 4.551		6.324		4.879
> 90 groups: Time: 7.904		6.962		5.335
> 100 groups: Time: 7.293		7.799		5.857
> 110 groups: Time: 10.595	8.728		6.517
> 120 groups: Time: 7.543		9.304		7.082
> 130 groups: Time: 8.269		10.639		8.007
> 140 groups: Time: 11.867	8.250		8.302
> 150 groups: Time: 14.852	8.656		8.662
> 160 groups: Time: 9.648		9.313		9.541

Hackbench even more so. In a prolonged discussion with Rusty Russell on this 
issue, he suggested hackbench was more of a pass/fail benchmark to ensure there 
was no starvation scenario that never ended, and that very little value should 
be placed on the actual results returned from it.

Wli's concerns regarding some sort of standard framework for a battery of 
accepted, meaningful benchmarks come to mind as important, rather than ones 
that highlight one over the other. So while interesting for their own 
endpoints, I certainly wouldn't put either benchmark as some sort of 
yardstick for a "winner". Note I'm not saying that we shouldn't be looking at 
them per se, but since the whole drive for a new scheduler is trying to be 
more objective we need to start expanding the range of benchmarks. Even 
though I don't feel the need to have SD in the "race" I guess it stands for 
more data to compare what is possible/where as well.

> Mainline seems pretty inconsistent here.
>
> lmbench 0K ctxsw latency bound to CPU0:
> tasks
> 2		2.59		3.42		2.50
> 4		3.26		3.54		3.09
> 8		3.01		3.64		3.22
> 16		3.00		3.66		3.50
> 32		2.99		3.70		3.49
> 64		3.09		4.17		3.50
> 128		4.80		5.58		4.74
> 256		5.79		6.37		5.76
>
> cfs is noticeably disadvantaged.
>
> [*] 500ms didn't make much difference in either test.

-- 
-ck


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  5:43                     ` Chris Friesen
@ 2007-04-18 13:00                       ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-18 13:00 UTC (permalink / raw)
  To: Chris Friesen
  Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas,
	linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Chris Friesen wrote:
> Peter Williams wrote:
>> Chris Friesen wrote:
> 
>>> Suppose I have a really high priority task running.  Another very 
>>> high priority task wakes up and would normally preempt the first one. 
>>> However, there happens to be another cpu available.  It seems like it 
>>> would be a win if we moved one of those tasks to the available cpu 
>>> immediately so they can both run simultaneously.  This would seem to 
>>> require some communication between the scheduler and the load balancer.
>>
>>
>> Not really the load balancer can do this on its own AND the decision 
>> should be based on the STATIC priority of the task being woken.
> 
> I guess I don't follow.  How would the load balancer know that it needs 
> to run?  Running on every task wake-up seems expensive.  Also, static 
> priority isn't everything.  What about the gang-scheduler concept where 
> certain tasks must be scheduled simultaneously on different cpus?  What 
> about a resource-group scenario where you have per-cpu resource limits, 
> so that for good latency/fairness you need to force a high priority task 
> to migrate to another cpu once it has consumed the cpu allocation of 
> that group on the current cpu?
> 
> I can see having a generic load balancer core code, but it seems to me 
> that the scheduler proper needs to have some way of triggering the load 
> balancer to run,

It doesn't have to be closely coupled with the load balancer to do 
this.  It just needs to know where the trigger is.

> and some kind of goodness functions to indicate a) 
> which tasks to move, and b) where to move them.

That's the load balancer's job and even if you use dynamic priority for 
load balancing it still wouldn't need to be closely coupled.  The load 
balancer would just need to know how to find a process's dynamic priority.

In fact, in the current setup, the load balancer decides how much load 
needs to be moved based on the static load on the CPUs but uses dynamic 
priority (to a large degree) to decide which tasks to move.  This is due 
more to computational efficiency considerations than to any deliberate 
design (I suspect): because tasks are stored on the runqueue in 
dynamic priority order, looking at processes in that order is the 
most efficient strategy.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: Kaffeine problem with CFS
       [not found]               ` <19a3b7a80704180555q4e0b26d5x54bbf34b4cd9d33e@mail.gmail.com>
@ 2007-04-18 13:05                 ` S.Çağlar Onur
  2007-04-18 13:21                 ` Christoph Pfister
  1 sibling, 0 replies; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-18 13:05 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret,
	Jurgen Kofler, Ulrich Drepper


On Wed, 18 Apr 2007, Christoph Pfister wrote:
> > Okay - so here are some results (it's strange that gdb goes nuts
> > inside the xine_play call). I have three bts (seems to be fairly easy
> > to reproduce that behaviour over here): Twice while playing an audio
> > cd and once while playing a normal file. The hang usually ends if you
> > wait long enough (something around 30 secs over here).

I can confirm this: the freeze ends after a wait of ~20-30 secs if 
Kaffeine is the only active process. I didn't notice that before, most 
probably because the CPU was busy compiling a kernel at that time...

Now i'm testing Ingo's msleep patch + xine-lib-1.1.6...

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  5:55                               ` Matt Mackall
  2007-04-18  6:37                                 ` Nick Piggin
@ 2007-04-18 13:08                                 ` William Lee Irwin III
  2007-04-18 19:48                                   ` Davide Libenzi
  2007-04-18 14:48                                 ` Linus Torvalds
  2 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-18 13:08 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas,
	Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote:
> Why are processes special? Should user A be able to get more CPU time
> for his job than user B by splitting it into N parallel jobs? Should
> we be fair per process, per user, per thread group, per session, per
> controlling terminal? Some weighted combination of the preceding?[2]

On a side note, I think a combination of all of the above is a very
good idea, plus process groups (pgrp's). All the make -j loads should
come up in one pgrp of one session for one user and hence should be
automatically kept isolated in its own corner by such policies. Thread
bombs, forkbombs, and so on get handled too, which is good when on e.g.
a compileserver and someone rudely spawns too many tasks.

Thinking of the scheduler as a CPU bandwidth allocator, this means
handing out shares of CPU bandwidth to all users on the system, which
in turn hand out shares of bandwidth to all sessions, which in turn
hand out shares of bandwidth to all process groups, which in turn hand
out shares of bandwidth to all thread groups, which in turn hand out
shares of bandwidth to threads. The event handlers for the scheduler
need not deal with this apart from task creation and exit and various
sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.).
They just determine what the scheduler sees as ->load_weight or some
analogue of ->static_prio, though it is possible to do this by means of
data structure organization instead of numerical prioritization. It'd
probably have to be calculated on the fly by, say, doing fixpoint
arithmetic something like
    user_share(p)*session_share(p)*pgrp_share(p)*tgrp_share(p)*task_share(p)
so that readjusting the shares of aggregates doesn't have to traverse
lists and remains O(1). Each of the share computations can instead just
do some analogue of the calculation p->load_weight/rq->raw_weighted_load
in fixpoint, though precision issues with this make me queasy. There is
maybe a slight nasty point in that the ->raw_weighted_load analogue for
users or whatever the highest level chosen is ends up being global. One
might as well get users in there and omit intermediate levels if any are
to be omitted so that the truly global state is as read-only as possible.

I suppose jacking up the fixpoint precision to 128-bit or 256-bit all
below the radix point (our max is 1.0 after all) until precision issues
vanish can be done but the idea of that much number crunching in the
scheduler makes me rather uncomfortable. I hope u64 or u32 [2] can be
gotten away with as far as fixpoint goes.
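
To make the precision concern concrete, here is a minimal sketch of the share product in 16.16 fixed point using u32 values with u64 intermediates. The helper names and the choice of 16 fractional bits are assumptions for illustration only, not a proposed implementation.

```c
#include <stdint.h>

#define FIX_SHIFT 16
#define FIX_ONE   (1u << FIX_SHIFT)    /* 1.0 in 16.16 fixed point */

/* weight / total as a fixed-point fraction in [0, FIX_ONE]. */
uint32_t fix_share(uint32_t weight, uint32_t total)
{
    return (uint32_t)(((uint64_t)weight << FIX_SHIFT) / total);
}

/* Product of two fixed-point fractions, truncating the low bits. */
uint32_t fix_mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> FIX_SHIFT);
}

/*
 * Effective share of one task given its share at each of five levels:
 * user, session, pgrp, thread group, task.
 */
uint32_t effective_share(const uint32_t level[5])
{
    uint32_t s = FIX_ONE;

    for (int i = 0; i < 5; i++)
        s = fix_mul(s, level[i]);
    return s;
}
```

With a share of 1/2 at every level the result is exactly FIX_ONE/32, but with a share of 1/1000 at every level the product truncates to 0 after the second multiplication, which is the kind of underflow that pushes toward wider fixpoint formats.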


-- wli


* Re: Kaffeine problem with CFS
       [not found]               ` <19a3b7a80704180555q4e0b26d5x54bbf34b4cd9d33e@mail.gmail.com>
  2007-04-18 13:05                 ` S.Çağlar Onur
@ 2007-04-18 13:21                 ` Christoph Pfister
  2007-04-18 13:25                   ` S.Çağlar Onur
  2007-04-18 15:08                   ` Ingo Molnar
  1 sibling, 2 replies; 712+ messages in thread
From: Christoph Pfister @ 2007-04-18 13:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

2007/4/18, Christoph Pfister <christophpfister@gmail.com>:
> [ Sorry for accidentally dropping CCs ]
>
> 2007/4/18, Christoph Pfister <christophpfister@gmail.com>:
> > 2007/4/18, Ingo Molnar <mingo@elte.hu>:
> > >
> > > * Christoph Pfister <christophpfister@gmail.com> wrote:
> > >
> > > > Or I could try playing around a bit with your patchset and trying to
> > > > reproduce it over here. Because I already have debug builds for
> > > > xine-lib and compiling a new kernel can take place in the background
> > > > it wouldn't be much effort for me.
> > >
> > > that would be great :) Here are the URLs for it. CFS is based on
> > > v2.6.21-rc7:
> > >
> > >   http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.21-rc7.tar.bz2
> > >
> > > And the CFS patch is at:
> > >
> > >   http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.patch
> > >
> > > rebuild your kernel as usual and boot into it. No extra configuration is
> > > needed, you'll get CFS by default.
> > >
> > > if this kernel builds/boots fine for you then you might also want to
> > > send me a quick note about how it feels, interactivity-wise. And of
> > > course i'm interested in any sort of feedback about problems as well.
> > > I'd like to make CFS as media-playback friendly as possible, so if
> > > there's any problem in that area it would be nice for me to know about
> > > it as soon as possible.
> > >
> > >         Ingo
> >
> > Okay - so here are some results (it's strange that gdb goes nuts
> > inside the xine_play call). I have three bts (seems to be fairly easy
> > to reproduce that behaviour over here): Twice while playing an audio
> > cd and once while playing a normal file. The hang usually ends if you
> > wait long enough (something around 30 secs over here).

<big snip>

> > Christoph
> >
> >
> > PS: Haven't analyzed them yet - but doing so now :-)
>
> Ok - one nice thing: In all those bts demux_loop is at demux.c:285 -
> meaning that demux_lock is held and xine_play is waiting for it ...
> The lock should be temporarily released with a sched_yield so that
> the main thread can access it. As you wrote the implementation of this
> function seems to have changed a bit - so I'll replace it with a short
> sleep and try again ...
>
> Christoph

Replacing the sched_yield in demux.c with a usleep(10) stopped those
seeking hangs here (at least I was able to pull the slider back and
forth for 2 mins without trouble, compared to the few secs I needed
earlier to get a hang).

Christoph
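
The locking pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions, not xine-lib's code: demux_lock is the name used in the thread, while stop_requested and seek_once() are hypothetical helpers added for the example. The worker keeps the mutex while it works and opens a short window on every pass; sleeping in that window blocks the worker, so the waiting thread can take the lock, whereas sched_yield() may return immediately if the scheduler does not consider the caller due to give way.

```c
#define _DEFAULT_SOURCE		/* for usleep() under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* demux_lock is named in the thread; everything else is hypothetical. */
static pthread_mutex_t demux_lock = PTHREAD_MUTEX_INITIALIZER;
static bool stop_requested;	/* protected by demux_lock */

/* Worker: holds the lock while "demuxing", opens a window on each pass. */
static void *demux_loop(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&demux_lock);
	while (!stop_requested) {
		/* ... demux one chunk with the lock held ... */
		pthread_mutex_unlock(&demux_lock);
		usleep(10);	/* was sched_yield(); a sleep really blocks */
		pthread_mutex_lock(&demux_lock);
	}
	pthread_mutex_unlock(&demux_lock);
	return NULL;
}

/* Controller: roughly what a seek has to do. Returns true when done. */
static bool seek_once(void)
{
	pthread_t worker;
	int err;

	stop_requested = false;	/* no worker is running at this point */
	err = pthread_create(&worker, NULL, demux_loop, NULL);
	assert(err == 0);

	pthread_mutex_lock(&demux_lock);	/* wait for the window */
	stop_requested = true;
	pthread_mutex_unlock(&demux_lock);

	err = pthread_join(worker, NULL);
	assert(err == 0);
	return true;
}
```

Calling seek_once() from main() starts the worker, takes the lock through the window and shuts the worker down again. A condition variable would probably be the more robust long-term fix, since it does not depend on timing at all.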

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18 13:21                 ` Christoph Pfister
@ 2007-04-18 13:25                   ` S.Çağlar Onur
  2007-04-18 15:48                     ` Ingo Molnar
  2007-04-18 15:08                   ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-18 13:25 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret,
	Jurgen Kofler, Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 1073 bytes --]

On Wed, 18 Apr 2007, Christoph Pfister wrote:
> Replacing the sched_yield in demux.c with a usleep(10) stopped those
> seeking hangs here (at least I was able to pull the slider back and
> forth for 2 mins without trouble, compared to the few secs I needed
> earlier to get a hang).

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3785,7 +3785,7 @@ asmlinkage long sys_sched_yield(void)
        _raw_spin_unlock(&rq->lock);
        preempt_enable_no_resched();
 
-       schedule();
+       msleep(1);
 
        return 0;
 }

which Ingo sent me to try, also has the same effect for me. I can no longer
reproduce the hangs with that patch applied on top of CFS, while one console
checks out SVN repos and another compiles a small test program.

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  5:55                               ` Matt Mackall
  2007-04-18  6:37                                 ` Nick Piggin
  2007-04-18 13:08                                 ` William Lee Irwin III
@ 2007-04-18 14:48                                 ` Linus Torvalds
  2007-04-18 15:23                                   ` Matt Mackall
                                                     ` (2 more replies)
  2 siblings, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-18 14:48 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner



On Wed, 18 Apr 2007, Matt Mackall wrote:
> 
> Why is X special? Because it does work on behalf of other processes?
> Lots of things do this. Perhaps a scheduler should focus entirely on
> the implicit and directed wakeup matrix and optimizing that
> instead[1].

I 100% agree - the perfect scheduler would indeed take into account where 
the wakeups come from, and try to "weigh" processes that help other 
processes make progress more. That would naturally give server processes 
more CPU power, because they help others.

I don't believe for a second that "fairness" means "give everybody the 
same amount of CPU". That's a totally illogical measure of fairness. All 
processes are _not_ created equal.

That said, even trying to do "fairness by effective user ID" would 
probably already do a lot. In a desktop environment, X would get as much 
CPU time as the user processes, simply because it's in a different 
protection domain (and that's really what "effective user ID" means: it's 
not about "users", it's really about "protection domains").

And "fairness by euid" is probably a hell of a lot easier to do than 
trying to figure out the wakeup matrix.

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18 13:21                 ` Christoph Pfister
  2007-04-18 13:25                   ` S.Çağlar Onur
@ 2007-04-18 15:08                   ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 15:08 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Christoph Pfister <christophpfister@gmail.com> wrote:

> Replacing the sched_yield in demux.c with a usleep(10) stopped those
> seeking hangs here (at least I was able to pull the slider back and
> forth for 2 mins without trouble, compared to the few secs I needed
> earlier to get a hang).

great - thanks for figuring it out!

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 14:48                                 ` Linus Torvalds
@ 2007-04-18 15:23                                   ` Matt Mackall
  2007-04-18 17:22                                     ` Linus Torvalds
                                                       ` (2 more replies)
  2007-04-19  3:18                                   ` Nick Piggin
  2007-04-21 13:40                                   ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Bill Davidsen
  2 siblings, 3 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-18 15:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
> And "fairness by euid" is probably a hell of a lot easier to do than 
> trying to figure out the wakeup matrix.

For the record, you actually don't need to track a whole NxN matrix
(or do the implied O(n**3) matrix inversion!) to get to the same
result. You can converge on the same node weightings (ie dynamic
priorities) by applying a damped function at each transition point
(directed wakeup, preemption, fork, exit).

The trouble with any scheme like this is that it needs careful tuning
of the damping factor to converge rapidly and not oscillate and
precise numerical attention to the transition functions so that the sum of
dynamic priorities is conserved.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18 13:25                   ` S.Çağlar Onur
@ 2007-04-18 15:48                     ` Ingo Molnar
  2007-04-18 16:07                       ` William Lee Irwin III
  2007-04-18 21:08                       ` S.Çağlar Onur
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 15:48 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: Christoph Pfister, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> -       schedule();
> +       msleep(1);

> which Ingo sent me to try, also has the same effect for me. I can no
> longer reproduce the hangs with that patch applied on top of CFS, while
> one console checks out SVN repos and another compiles a small test
> program.

great! Could you please unapply the hack above and try the proper fix 
below, does this one solve the hangs too?

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq 
 
 /*
  * sched_yield() support is very simple via the rbtree, we just
- * dequeue and enqueue the task, which causes the task to
- * roundrobin to the end of the tree:
+ * dequeue the task and move it to the rightmost position, which
+ * causes the task to roundrobin to the end of the tree.
  */
 static void requeue_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
-	enqueue_task_fair(rq, p);
+	/*
+	 * Temporarily insert at the last position of the tree:
+	 */
+	p->fair_key = LLONG_MAX;
+	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
+
+	/*
+	 * Update the key to the real value, so that when all other
+	 * tasks from before the rightmost position have executed,
+	 * this task is picked up again:
+	 */
+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
@@ -380,7 +391,10 @@ static void task_tick_fair(struct rq *rq
 	 * Dequeue and enqueue the task to update its
 	 * position within the tree:
 	 */
-	requeue_task_fair(rq, curr);
+	dequeue_task_fair(rq, curr);
+	curr->on_rq = 0;
+	enqueue_task_fair(rq, curr);
+	curr->on_rq = 1;
 
 	/*
 	 * Reschedule if another task tops the current one.
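
The idea behind the yield fix above can be sketched with a toy run queue. This is a minimal illustration, not the kernel code: a sorted array stands in for the rbtree, all toy_* names are hypothetical, and the real key is simply saved and restored here, whereas the patch recomputes it from rq->fair_clock, p->wait_runtime and p->nice_offset. The yielding task is inserted with the largest possible key so that it lands behind every other task, and its key field is then set back to the real value without moving it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_TASKS 8

/* Hypothetical toy types; a sorted array stands in for the rbtree. */
struct toy_task {
	long long fair_key;	/* sort key: smaller keys run first */
	int id;
};

struct toy_rq {
	struct toy_task *queue[MAX_TASKS];	/* ordered at insert time */
	size_t nr;
};

/* Insert p behind every queued task whose key is <= p's key. */
static void toy_enqueue(struct toy_rq *rq, struct toy_task *p)
{
	size_t i = rq->nr;

	assert(rq->nr < MAX_TASKS);
	while (i > 0 && rq->queue[i - 1]->fair_key > p->fair_key) {
		rq->queue[i] = rq->queue[i - 1];
		i--;
	}
	rq->queue[i] = p;
	rq->nr++;
}

/* Remove p from the queue and close the gap. */
static void toy_dequeue(struct toy_rq *rq, struct toy_task *p)
{
	size_t i = 0;

	while (i < rq->nr && rq->queue[i] != p)
		i++;
	assert(i < rq->nr);
	for (; i + 1 < rq->nr; i++)
		rq->queue[i] = rq->queue[i + 1];
	rq->nr--;
}

/* Yield: requeue at the rightmost position, then restore the real key. */
static void toy_yield(struct toy_rq *rq, struct toy_task *p)
{
	long long real_key = p->fair_key;

	toy_dequeue(rq, p);
	p->fair_key = LLONG_MAX;	/* temporarily the largest key */
	toy_enqueue(rq, p);
	p->fair_key = real_key;		/* position stays, key is real again */
}
```

With three tasks keyed 10, 20 and 30, a yield by the task with key 10 leaves the order 20, 30, 10 while its fair_key still reads 10. A plain dequeue and enqueue would have put it straight back at the front, which matches the behaviour the patch is meant to cure.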

^ permalink raw reply	[flat|nested] 712+ messages in thread

* CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (13 preceding siblings ...)
  2007-04-17  7:56 ` Andy Whitcroft
@ 2007-04-18 15:58 ` Christian Hesse
  2007-04-18 16:46   ` Ingo Molnar
  14 siblings, 1 reply; 712+ messages in thread
From: Christian Hesse @ 2007-04-18 15:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 591 bytes --]

Hi Ingo and all,

On Friday 13 April 2007, Ingo Molnar wrote:
> as usual, any sort of feedback, bugreports, fixes and suggestions are
> more than welcome,

I just gave CFS a try on my system. From a user's point of view it looks good 
so far. Thanks for your work.

However I found a problem: When trying to suspend a system patched with 
suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the ESC key 
results in a message that it tries to abort suspend, but then still hangs.

I cced suspend2 devel list, perhaps Nigel is interested as well.
-- 
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18 15:48                     ` Ingo Molnar
@ 2007-04-18 16:07                       ` William Lee Irwin III
  2007-04-18 16:14                         ` Ingo Molnar
  2007-04-18 21:08                       ` S.Çağlar Onur
  1 sibling, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-18 16:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, Christoph Pfister, linux-kernel,
	Michael Lothian, Christophe Thommeret, Jurgen Kofler,
	Ulrich Drepper

On Wed, Apr 18, 2007 at 05:48:11PM +0200, Ingo Molnar wrote:
>  static void requeue_task_fair(struct rq *rq, struct task_struct *p)
>  {
>  	dequeue_task_fair(rq, p);
>  	p->on_rq = 0;
> -	enqueue_task_fair(rq, p);
> +	/*
> +	 * Temporarily insert at the last position of the tree:
> +	 */
> +	p->fair_key = LLONG_MAX;
> +	__enqueue_task_fair(rq, p);
>  	p->on_rq = 1;
> +
> +	/*
> +	 * Update the key to the real value, so that when all other
> +	 * tasks from before the rightmost position have executed,
> +	 * this task is picked up again:
> +	 */
> +	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;

At this point you might as well call the requeue operation something
having to do with yield. I also suspect what goes on during the timer
tick may eventually become something different from requeueing the
current task, and furthermore class-dependent.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-18 16:07                       ` William Lee Irwin III
@ 2007-04-18 16:14                         ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 16:14 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: S.Çağlar Onur, Christoph Pfister, linux-kernel,
	Michael Lothian, Christophe Thommeret, Jurgen Kofler,
	Ulrich Drepper


* William Lee Irwin III <wli@holomorphy.com> wrote:

> At this point you might as well call the requeue operation something 
> having to do with yield. [...]

agreed - i've just done a requeue_task -> yield_task rename in my tree. 
(patch below)

> [...] I also suspect what goes on during the timer tick may eventually 
> become something different from requeueing the current task, and 
> furthermore class-dependent.

it already is, scheduler tick processing is done in class->task_tick().

	Ingo
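
The per-class dispatch that this rename touches can be pictured with a small userspace sketch. It is only an illustration under stated assumptions: the toy_* types and the counters are hypothetical, and the two handlers do nothing but count calls. What it shows is the shape of the change: sys_sched_yield() no longer goes through a requeue_task() wrapper but calls the yield_task hook of the current task's class directly, and the RT class keeps its requeue function behind that hook.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Hypothetical, cut-down stand-in for struct sched_class. */
struct toy_sched_class {
	const char *name;
	void (*yield_task)(struct toy_task *p);
};

struct toy_task {
	const struct toy_sched_class *sched_class;
	int yields;	/* calls that reached the fair-class handler */
	int requeues;	/* calls that reached the RT-class handler */
};

/* Toy handlers: they only record that they were called. */
static void yield_task_fair(struct toy_task *p)
{
	p->yields++;
}

static void requeue_task_rt(struct toy_task *p)
{
	p->requeues++;
}

static const struct toy_sched_class toy_fair_class = {
	.name		= "fair",
	.yield_task	= yield_task_fair,
};

/* As in the patch, the RT class reuses its requeue function. */
static const struct toy_sched_class toy_rt_class = {
	.name		= "rt",
	.yield_task	= requeue_task_rt,
};

/* Core side: dispatch through the class of the yielding task. */
static void toy_sched_yield(struct toy_task *curr)
{
	assert(curr->sched_class != NULL);
	assert(curr->sched_class->yield_task != NULL);
	curr->sched_class->yield_task(curr);
}
```

The core function never needs to know which policy it is talking to; adding another class only means filling in another table.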

---
 include/linux/sched.h |    2 +-
 kernel/sched.c        |    7 +------
 kernel/sched_fair.c   |    4 ++--
 kernel/sched_rt.c     |    2 +-
 4 files changed, 5 insertions(+), 10 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -796,7 +796,7 @@ struct sched_class {
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p);
-	void (*requeue_task) (struct rq *rq, struct task_struct *p);
+	void (*yield_task) (struct rq *rq, struct task_struct *p);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -560,11 +560,6 @@ static void dequeue_task(struct rq *rq, 
 	p->on_rq = 0;
 }
 
-static void requeue_task(struct rq *rq, struct task_struct *p)
-{
-	p->sched_class->requeue_task(rq, p);
-}
-
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
@@ -3773,7 +3768,7 @@ asmlinkage long sys_sched_yield(void)
 	schedstat_inc(rq, yld_cnt);
 	if (rq->nr_running == 1)
 		schedstat_inc(rq, yld_act_empty);
-	requeue_task(rq, current);
+	current->sched_class->yield_task(rq, current);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -273,7 +273,7 @@ static void dequeue_task_fair(struct rq 
  * dequeue the task and move it to the rightmost position, which
  * causes the task to roundrobin to the end of the tree.
  */
-static void requeue_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
@@ -509,7 +509,7 @@ static void task_init_fair(struct rq *rq
 struct sched_class fair_sched_class __read_mostly = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
-	.requeue_task		= requeue_task_fair,
+	.yield_task		= yield_task_fair,
 
 	.check_preempt_curr	= check_preempt_curr_fair,
 
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -165,7 +165,7 @@ static void task_init_rt(struct rq *rq, 
 static struct sched_class rt_sched_class __read_mostly = {
 	.enqueue_task		= enqueue_task_rt,
 	.dequeue_task		= dequeue_task_rt,
-	.requeue_task		= requeue_task_rt,
+	.yield_task		= requeue_task_rt,
 
 	.check_preempt_curr	= check_preempt_curr_rt,
 

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
@ 2007-04-18 16:46   ` Ingo Molnar
  2007-04-18 20:45     ` CFS and suspend2: hang in atomic copy Christian Hesse
  2007-04-19  9:32     ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 16:46 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Christian Hesse <mail@earthworm.de> wrote:

> Hi Ingo and all,
> 
> On Friday 13 April 2007, Ingo Molnar wrote:
> > as usual, any sort of feedback, bugreports, fixes and suggestions are
> > more than welcome,
> 
> I just gave CFS a try on my system. From a user's point of view it 
> looks good so far. Thanks for your work.

you are welcome!

> However I found a problem: When trying to suspend a system patched 
> with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the 
> ESC key results in a message that it tries to abort suspend, but then 
> still hangs.

i took a quick look at suspend2 and it makes some use of yield(). 
There's a bug in CFS's yield code, i've attached a patch that should fix 
it, does it make any difference to the hang?

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq 
 
 /*
  * sched_yield() support is very simple via the rbtree, we just
- * dequeue and enqueue the task, which causes the task to
- * roundrobin to the end of the tree:
+ * dequeue the task and move it to the rightmost position, which
+ * causes the task to roundrobin to the end of the tree.
  */
 static void requeue_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
-	enqueue_task_fair(rq, p);
+	/*
+	 * Temporarily insert at the last position of the tree:
+	 */
+	p->fair_key = LLONG_MAX;
+	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
+
+	/*
+	 * Update the key to the real value, so that when all other
+	 * tasks from before the rightmost position have executed,
+	 * this task is picked up again:
+	 */
+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
@@ -380,7 +391,10 @@ static void task_tick_fair(struct rq *rq
 	 * Dequeue and enqueue the task to update its
 	 * position within the tree:
 	 */
-	requeue_task_fair(rq, curr);
+	dequeue_task_fair(rq, curr);
+	curr->on_rq = 0;
+	enqueue_task_fair(rq, curr);
+	curr->on_rq = 1;
 
 	/*
 	 * Reschedule if another task tops the current one.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 15:23                                   ` Matt Mackall
@ 2007-04-18 17:22                                     ` Linus Torvalds
  2007-04-18 17:48                                       ` [ck] " Mark Glines
                                                         ` (4 more replies)
  2007-04-18 19:05                                     ` Davide Libenzi
  2007-04-18 19:13                                     ` Michael K. Edwards
  2 siblings, 5 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-18 17:22 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner



On Wed, 18 Apr 2007, Matt Mackall wrote:
>
> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
> > And "fairness by euid" is probably a hell of a lot easier to do than 
> > trying to figure out the wakeup matrix.
> 
> For the record, you actually don't need to track a whole NxN matrix
> (or do the implied O(n**3) matrix inversion!) to get to the same
> result.

I'm sure you can do things differently, but the reason I think "fairness 
by euid" is actually worth looking at is that it's pretty much the 
*identical* issue that we'll have with "fairness by virtual machine" and a 
number of other "container" issues.

The fact is:

 - "fairness" is *not* about giving everybody the same amount of CPU time 
   (scaled by some niceness level or not). Anybody who thinks that is 
   "fair" is just being silly and hasn't thought it through.

 - "fairness" is multi-level. You want to be fair to threads within a 
   thread group (where "process" may be one good approximation of what a 
   "thread group" is, but not necessarily the only one).

   But you *also* want to be fair in between those "thread groups", and 
   then you want to be fair across "containers" (where "user" may be one 
   such container).

So I claim that anything that cannot be fair by user ID is actually really 
REALLY unfair. I think it's absolutely humongously STUPID to call 
something the "Completely Fair Scheduler", and then just be fair on a 
thread level. That's not fair AT ALL! It's the antithesis of being fair!

So if you have 2 users on a machine running CPU hogs, you should *first* 
try to be fair among users. If one user then runs 5 programs, and the 
other one runs just 1, then the *one* program should get 50% of the CPU 
time (the user's fair share), and the five programs should get 10% of CPU 
time each. And if one of them uses two threads, each thread should get 5%.

So you should see one thread get 50% CPU (single thread of one user), 4 
threads get 10% CPU (their fair share of that user's time), and 2 threads 
get 5% CPU (the fair share within that thread group!).

Any scheduling argument that just considers the above to be "7 threads 
total" and gives each thread 14% of CPU time "fairly" is *anything* but 
fair. It's a joke if that kind of scheduler then calls itself CFS!

And yes, that's largely what the current scheduler will do, but at least 
the current scheduler doesn't claim to be fair! So the current scheduler 
is a lot *better* if only in the sense that it doesn't make ridiculous 
claims that aren't true!

			Linus
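
The arithmetic in the example above can be checked with a short sketch of multi-level sharing. It is an illustration only, not a scheduler design: struct share_node, add_child() and distribute() are hypothetical names, and every group at a level is assumed to carry equal weight. Each node receives a share and splits it evenly among its children, so the levels are machine, user, program and thread.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical node: the machine, a user, a program or a thread. */
struct share_node {
	const char *name;
	double share;	/* fraction of total CPU, set by distribute() */
	size_t nr_children;
	struct share_node *children[MAX_CHILDREN];
};

/* Attach child to parent. */
static void add_child(struct share_node *parent, struct share_node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
}

/* Give node its share, then split it evenly among its children. */
static void distribute(struct share_node *node, double share)
{
	size_t i;

	node->share = share;
	for (i = 0; i < node->nr_children; i++)
		distribute(node->children[i],
			   share / (double)node->nr_children);
}
```

Building the tree from the mail (two users; one runs a single program, the other runs five, one of which has two threads) and calling distribute() on the root with 1.0 gives 0.5 for the lone program, 0.1 for each of the five, and 0.05 for each of the two threads. A flat per-thread split of the same seven threads would give each 1/7, roughly 14%.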

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:22                                     ` Linus Torvalds
@ 2007-04-18 17:48                                       ` Mark Glines
  2007-04-18 19:27                                         ` Chris Friesen
  2007-04-18 17:49                                       ` Ingo Molnar
                                                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 712+ messages in thread
From: Mark Glines @ 2007-04-18 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, Bill Huey, Mike Galbraith,
	Peter Williams, William Lee Irwin III, linux-kernel, ck list,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven

On Wed, 18 Apr 2007 10:22:59 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So if you have 2 users on a machine running CPU hogs, you should
> *first* try to be fair among users. If one user then runs 5 programs,
> and the other one runs just 1, then the *one* program should get 50%
> of the CPU time (the user's fair share), and the five programs should
> get 10% of CPU time each. And if one of them uses two threads, each
> thread should get 5%.

This sounds great, to me.

One minor question: is it even possible to be completely fair on SMP?
For instance, if you have a 2-way SMP box running 3 applications, one of
which has 2 threads, will the threaded app have an advantage here?  (The
current system seems to try to keep each thread on a specific CPU, to
reduce cache thrashing, which means threads and processes alike each
get 50% of the CPU.)

Mark

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:22                                     ` Linus Torvalds
  2007-04-18 17:48                                       ` [ck] " Mark Glines
@ 2007-04-18 17:49                                       ` Ingo Molnar
  2007-04-18 17:59                                         ` Ingo Molnar
  2007-04-18 19:23                                         ` Linus Torvalds
  2007-04-18 18:02                                       ` William Lee Irwin III
                                                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The fact is:
> 
>  - "fairness" is *not* about giving everybody the same amount of CPU 
>    time (scaled by some niceness level or not). Anybody who thinks 
>    that is "fair" is just being silly and hasn't thought it through.

yeah, very much so.

But note that most of the reported CFS interactivity wins, as surprising 
as it might be, were due to fairness between _the same user's tasks_. In 
the typical case, 99% of the desktop CPU time is executed either as X 
(root user) or under the uid of the logged in user, and X is just one 
task. Even with a bad hack of making X super-high-prio, interactivity as 
experienced by users still sucks without having fairness between the 
other 100-200 user tasks that a desktop system is typically using.

'renicing X to -10' is a broken way of achieving: 'root uid should get 
its share of CPU time too, no matter how many user tasks are running'. 
We can do this much cleaner by saying: 'each uid, if it has any tasks 
running, should get its fair share of CPU time, independently of the 
number of tasks it is running'.

In that sense 'fairness' is not global (and in fact it is almost _never_ 
a global property, as X runs under root uid [*]), it is only the most 
lowlevel scheduling machinery that can then be built upon. Higher-level 
controls to allocate CPU power between groups of tasks very much make 
sense - but according to the CFS interactivity test results i got from 
people so far, they very much need this basic fairness machinery 
_within_ the uid group too. So 'fairness' is still a powerful lower 
level scheduling concept. And this all makes lots of sense to me.

One purpose of doing the hierarchical scheduling classes stuff was to 
enable such higher scope task group decisions too. Next i'll try to 
figure out whether 'task group bandwidth' logic should live right within 
sched_fair.c itself, or whether it should be layered separately as a 
sched_group.c. Intuitively i'd say it should live within sched_fair.c.

	Ingo

[*] There are distributions where X does not run under root uid anymore.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:49                                       ` Ingo Molnar
@ 2007-04-18 17:59                                         ` Ingo Molnar
  2007-04-18 19:40                                           ` Linus Torvalds
  2007-04-18 19:23                                         ` Linus Torvalds
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner


* Ingo Molnar <mingo@elte.hu> wrote:

> In that sense 'fairness' is not global (and in fact it is almost 
> _never_ a global property, as X runs under root uid [*]), it is only 
> the most lowlevel scheduling machinery that can then be built upon. 
> [...]

perhaps a more fitting term would be 'precise group-scheduling'. Within 
the lowest level task group entity (be that thread group or uid group, 
etc.) 'precise scheduling' is equivalent to 'fairness'.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:22                                     ` Linus Torvalds
  2007-04-18 17:48                                       ` [ck] " Mark Glines
  2007-04-18 17:49                                       ` Ingo Molnar
@ 2007-04-18 18:02                                       ` William Lee Irwin III
  2007-04-18 18:12                                         ` Ingo Molnar
  2007-04-18 18:36                                       ` Diego Calleja
  2007-04-19  0:37                                       ` Peter Williams
  4 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-18 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote:
> So I claim that anything that cannot be fair by user ID is actually really 
> REALLY unfair. I think it's absolutely humongously STUPID to call 
> something the "Completely Fair Scheduler", and then just be fair on a 
> thread level. That's not fair AT ALL! It's the antithesis of being fair!
> So if you have 2 users on a machine running CPU hogs, you should *first* 
> try to be fair among users. If one user then runs 5 programs, and the 
> other one runs just 1, then the *one* program should get 50% of the CPU 
> time (the user's fair share), and the five programs should get 10% of CPU 
> time each. And if one of them uses two threads, each thread should get 5%.
> So you should see one thread get 50% CPU (single thread of one user), 4 
> threads get 10% CPU (their fair share of that user's time), and 2 threads 
> get 5% CPU (the fair share within that thread group!).
> Any scheduling argument that just considers the above to be "7 threads 
> total" and gives each thread 14% of CPU time "fairly" is *anything* but 
> fair. It's a joke if that kind of scheduler then calls itself CFS!

I don't think it's completely fair [sic] to come down on it that hard.
It does largely achieve the sort of fairness it set out for itself as
its design goal. One should also note that the queueing mechanism is
more than flexible enough to handle prioritization by a number of
different methods, and the large precision of its priorities is useful
there. So a rather broad variety of policies can be implemented by
changing the ->fair_key calculations.

In some respects, the vast priority space and very high clock precision
are two of its most crucial advantages.


On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote:
> And yes, that's largely what the current scheduler will do, but at least 
> the current scheduler doesn't claim to be fair! So the current scheduler 
> is a lot *better* if only in the sense that it doesn't make ridiculous 
> claims that aren't true!

The name chosen was somewhat buzzwordy. I'd have named it something more
descriptive of the algorithm, though what's implemented in the current
dynamic priority (i.e. ->fair_key) calculations is somewhat difficult
to categorize precisely.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 18:02                                       ` William Lee Irwin III
@ 2007-04-18 18:12                                         ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 18:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner


* William Lee Irwin III <wli@holomorphy.com> wrote:

> It does largely achieve the sort of fairness it set out for itself as 
> its design goal. One should also note that the queueing mechanism is 
> more than flexible enough to handle prioritization by a number of 
> different methods, and the large precision of its priorities is useful 
> there. So a rather broad variety of policies can be implemented by 
> changing the ->fair_key calculations.

yeah. Note that i concentrated on the bit that makes the largest 
interactivity improvement: to implement "precise scheduling" (a.k.a. 
complete fairness) between the 100+ user tasks that do a complex 
scheduling dance on a typical desktop on various workloads.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:22                                     ` Linus Torvalds
                                                         ` (2 preceding siblings ...)
  2007-04-18 18:02                                       ` William Lee Irwin III
@ 2007-04-18 18:36                                       ` Diego Calleja
  2007-04-19  0:37                                       ` Peter Williams
  4 siblings, 0 replies; 712+ messages in thread
From: Diego Calleja @ 2007-04-18 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner

El Wed, 18 Apr 2007 10:22:59 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> escribió:

> So if you have 2 users on a machine running CPU hogs, you should *first* 
> try to be fair among users. If one user then runs 5 programs, and the 
> other one runs just 1, then the *one* program should get 50% of the CPU 
> time (the users fair share), and the five programs should get 10% of CPU 
> time each. And if one of them uses two threads, each thread should get 5%.


"Fairness between users" was implemented long time ago by rik van riel
(http://surriel.com/patches/2.4/2.4.19-fairsched). Some people have been
asking for functionality like that for a long time, e.g. universities that
want to keep the gcc processes of one student who is trying to learn how
fork() works from starving the processes of the rest of the students.

But they don't only want "fairness between users"; they also want "priorities
between users and/or groups of users", e.g. "the 'students' group shouldn't
starve the 'admins' group".

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 15:23                                   ` Matt Mackall
  2007-04-18 17:22                                     ` Linus Torvalds
@ 2007-04-18 19:05                                     ` Davide Libenzi
  2007-04-18 19:13                                     ` Michael K. Edwards
  2 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18 19:05 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, Matt Mackall wrote:

> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
> > And "fairness by euid" is probably a hell of a lot easier to do than 
> > trying to figure out the wakeup matrix.
> 
> For the record, you actually don't need to track a whole NxN matrix
> (or do the implied O(n**3) matrix inversion!) to get to the same
> result. You can converge on the same node weightings (ie dynamic
> priorities) by applying a damped function at each transition point
> (directed wakeup, preemption, fork, exit).
> 
> The trouble with any scheme like this is that it needs careful tuning
> of the damping factor to converge rapidly and not oscillate and
> precise numerical attention to the transition functions so that the sum of
> dynamic priorities is conserved.

Doing that within the time constraints imposed by a scheduler is the 
interesting part, given also that the size (and membership) of the matrix 
is dynamic.
Also, a "wakeup matrix" (if the name correctly pictures what it is for) 
would help with latencies and priority inheritance, but not with 
global fairness.
The maniacal fairness focus we're seeing now is due to the fact that 
mainline can have extremely unfair behaviour under certain conditions. 
IMO fairness, although important, should not be the main objective of the 
scheduler rewrite. Simplification and predictability should have higher 
priority, with interactivity achievements bound by decent fairness 
constraints.




- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 15:23                                   ` Matt Mackall
  2007-04-18 17:22                                     ` Linus Torvalds
  2007-04-18 19:05                                     ` Davide Libenzi
@ 2007-04-18 19:13                                     ` Michael K. Edwards
  2 siblings, 0 replies; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-18 19:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On 4/18/07, Matt Mackall <mpm@selenic.com> wrote:
> For the record, you actually don't need to track a whole NxN matrix
> (or do the implied O(n**3) matrix inversion!) to get to the same
> result. You can converge on the same node weightings (ie dynamic
> priorities) by applying a damped function at each transition point
> (directed wakeup, preemption, fork, exit).
>
> The trouble with any scheme like this is that it needs careful tuning
> of the damping factor to converge rapidly and not oscillate and
> precise numerical attention to the transition functions so that the sum of
> dynamic priorities is conserved.

That would be the control theory approach.  And yes, you have to get
both the theoretical transfer function and the numerics right.  It
sometimes helps to use a control-systems framework like the classic
Takagi-Sugeno-Kang fuzzy logic controller; get the numerics right once
and for all, and treat the heuristics as data, not logic.  (I haven't
worked in this area in almost twenty years, but Google -- yes, I do
use Google+brain for fact-checking; what do you do? -- says that
people are still doing active research on TSK models, and solid
fixed-point reference implementations are readily available.)  That
seems like an attractive strategy here because you could easily embed
the control engine in the kernel and load rule sets dynamically.  Done
right, that could give most of the advantages of pluggable schedulers
(different heuristic strokes for different folks) without diluting the
tester pool for the actual engine code.

(Of course, different scheduling strategies require different input
data, and you might not want the overhead of collecting data that your
chosen heuristics won't use.  But that's not much different from the
netfilter situation, and is obviously a solvable problem, if anyone
cares to put that much work in.  The people who ought to be funding
this kind of work are Sun and IBM, who don't have a chance on the
desktop and are in big trouble in the database tier; their future as
processor vendors depends on being able to service presentation-tier
and business-logic-tier loads efficiently on their massively
multi-core chips.  MIPS should pitch in too, on behalf of licensees
like Cavium who need more predictable behavior on multi-core embedded
Linux.)

Note also that you might not even want to persistently prioritize
particular processes or process groups.  You might want a heuristic
that notices that some task (say, the X server) often responds to
being awakened by doing a little work and then unblocking the task
that awakened it.  When it is pinged from some highly interactive
task, you want it to jump the scheduler queue just long enough to
unblock the interactive task, which may mean letting it flush some
work out of its internal queue.  But otherwise you want to batch
things up until there's too much "scheduler pressure" behind it, then
let it work more or less until it runs out of things to do, because
its working set is so large that repeatedly scheduling it in and out
is hell on caches.

(Priority inheritance is the classic solution to the
blocked-high-priority-task problem _in_isolation_.  It is not without
its pitfalls, especially when the designer of the "server" didn't
expect to lose his timeslice instantly on releasing the lock.  True
priority inheritance is probably not something you want to inflict on
a non-real-time system, but you do need some urgency heuristic.  What
a "fuzzy logic" framework does for you is to let you combine competing
heuristics in a way that remains amenable to analysis using control
theory techniques.)

What does any of this have to do with "fairness"?  Nothing whatsoever!
 There's work that has to be done, and choosing when to do it is
almost entirely a matter of staying out of the way of more urgent work
while minimizing the task's negative impact on the rest of the system.
 Does that mean that the X server is "special", kind of the way that
latency-sensitive A/V applications are "special", and belongs in a
separate scheduler class?  No.  Nowadays, workloads where the kernel
has any idea what tasks belong to what "users" are the exception, not
the norm.  The X server is the canary in the coal mine, and a
scheduler that won't do the right thing for X without hand tweaking
won't do the right thing for other eyeball-driven,
multiple-tiers-on-one-box scenarios either.

If you want fairness among users to the extent that their demands
_compete_, you might as well partition the whole machine, and have a
separate fairness-oriented scheduler (let's call it a "hypervisor")
that lives outside the kernel.  (Talk about two students running gcc
on the same shell server, with more important people also doing things
on the same system, is so 1990's!)  Not that the design of scheduler
heuristics shouldn't include "fairness"-like considerations; but
they're probably only interesting as a fallback for when the scheduler
has no idea what it ought to schedule next.

So why is Ingo's scheduler apparently working well for desktop loads?
I haven't tried it or even looked at its code, but from its marketing
I would guess that it effectively penalizes tasks whose I/O requests
can be serviced from (or directed to) cache long enough to actually
consume a whole timeslice.  This is prima facie evidence that their
_current_behavior_ is non-interactive.  Presumably this penalty
expires quickly when the task again asks for information that is not
readily at hand (or writes data that the system is not willing to
cache) -- which usually implies either actual user interaction or a
change of working set, both of which deserve an "urgency premium".

The mainline scheduler seems to contain various heuristics that
mistake a burst of non-interactive _activity_ for a persistently
non-interactive _task_.  Take them away in the name of "fairness", and
the system adapts more quickly to the change of working set implied by
a change of user focus.  There are probably fewer pathological load
patterns too, since manual knob-turning uninformed by control theory
is a lot less likely to get you into trouble when there are few knobs
and no deliberately inserted long-time-constant feedback paths.  But
you can't say there are _no_ pathological load patterns, or even that
the major economic drivers of the Linux ecosystem don't generate them,
until you do some authentic engineering analysis.

In short (too late!) -- alternate schedulers are fun to experiment
with, and the sort of people who would actually try out patches
floated on LKML may find that they improve their desktop experience,
hosting farm throughput, etc.  But even if the mainline scheduler is a
hack atop a kludge covering a crock, it's more or less what production
applications have expected since the last major architectural shift
(NPTL).  There's just no sense in replacing it until you can either
add real value (say, integral clock scaling for power efficiency, with
a reasonable "spinning reserve" for peaking load) or demonstrate
stability by engineering analysis instead of trial and error.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:49                                       ` Ingo Molnar
  2007-04-18 17:59                                         ` Ingo Molnar
@ 2007-04-18 19:23                                         ` Linus Torvalds
  2007-04-18 19:56                                           ` Davide Libenzi
  1 sibling, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-18 19:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner



On Wed, 18 Apr 2007, Ingo Molnar wrote:
> 
> But note that most of the reported CFS interactivity wins, as surprising 
> as it might be, were due to fairness between _the same user's tasks_.

And *ALL* of the CFS interactivity *losses* and complaints have been 
because it did the wrong thing _between different users' tasks_.

So what's your point? Your point was that when people try it out as a 
single user, it is indeed fair. But that's no point at all, since it 
totally missed _my_ point.

The problem with X scheduling is exactly that "other user" kind of thing.

The problem with kernel thread starvation due to user threads getting all 
the CPU time is exactly the same issue.

As long as you think that all threads are equal, and should be treated 
equally, you CANNOT make it work well. People can say "ok, you can renice 
X", but the whole problem stems from the fact that you're trying to be 
fair based on A TOTALLY INVALID NOTION of what "fair" is.

> In the typical case, 99% of the desktop CPU time is executed either as X 
> (root user) or under the uid of the logged in user, and X is just one 
> task.

So? You are ignoring the argument again. You're totally bringing up a red 
herring:

> Even with a bad hack of making X super-high-prio, interactivity as 
> experienced by users still sucks without having fairness between the 
> other 100-200 user tasks that a desktop system is typically using.

I didn't say that you should be *unfair* within one user group. What kind 
of *idiotic* argument are you trying to put forth?

OF COURSE you should be fair "within the user group". Nobody contests that 
the "other 100-200 user tasks" should be scheduled fairly _amongst 
themselves_. 

The only point I had was that you cannot just lump all threads together 
and say "these threads are equally important". The 100-200 user tasks may 
be equally important, and should get equal amounts of preference, but that 
has absolutely _zero_ bearing on the _single_ task run in another 
"scheduling group", ie by other users or by X.

I'm not arguing against fairness. I'm arguing against YOUR notion of 
fairness, which is obviously bogus. It is *not* fair to try to give out 
CPU time evenly!

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:48                                       ` [ck] " Mark Glines
@ 2007-04-18 19:27                                         ` Chris Friesen
  2007-04-19  0:49                                           ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: Chris Friesen @ 2007-04-18 19:27 UTC (permalink / raw)
  To: Mark Glines
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Bill Huey,
	Mike Galbraith, Peter Williams, William Lee Irwin III,
	linux-kernel, ck list, Thomas Gleixner, Andrew Morton,
	Arjan van de Ven

Mark Glines wrote:

> One minor question: is it even possible to be completely fair on SMP?
> For instance, if you have a 2-way SMP box running 3 applications, one of
> which has 2 threads, will the threaded app have an advantage here?  (The
> current system seems to try to keep each thread on a specific CPU, to
> reduce cache thrashing, which means threads and processes alike each
> get 50% of the CPU.)

I think the ideal in this case would be to have both threads on one cpu, 
with the other app on the other cpu.  This gives inter-process fairness 
while minimizing the amount of task migration required.

More interesting is the case of three processes on a 2-cpu system.  Do 
we constantly migrate one of them back and forth to ensure that each of 
them gets 66% of a cpu?

Chris

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:59                                         ` Ingo Molnar
@ 2007-04-18 19:40                                           ` Linus Torvalds
  2007-04-18 19:43                                             ` Ingo Molnar
                                                               ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-18 19:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner



On Wed, 18 Apr 2007, Ingo Molnar wrote:
> 
> perhaps a more fitting term would be 'precise group-scheduling'. Within 
> the lowest level task group entity (be that thread group or uid group, 
> etc.) 'precise scheduling' is equivalent to 'fairness'.

Yes. Absolutely. Except I think that at least if you're going to name 
something "complete" (or "perfect" or "precise"), you should also admit 
that groups can be hierarchical.

The "threads in a process" thing is a great example of a hierarchical 
group. Imagine if X was running as a collection of threads - then each 
server thread would no longer be more important than the clients! But if 
you have a mix of "bags of threads" and "single process" kind 
applications, then very arguably the single thread in a single traditional 
process should get as much time as the "bag of threads" process gets 
total.

So it really should be a hierarchical notion, where each thread is owned 
by one "process", and each process is owned by one "user", and each user 
is in one "virtual machine" - there's at least three different levels to 
this, and you'd want to schedule this thing top-down: virtual machines 
should be given CPU time "fairly" (which doesn't need to mean "equally", 
of course - nice-values could very well work at that level too), and then 
within each virtual machine users or "scheduling groups" should be 
scheduled fairly, and then within each scheduling group the processes 
should be scheduled, and within each process threads should equally get 
their fair share at _that_ level.

And no, I don't think we necessarily need to do something quite that 
elaborate. But I think that's the kind of "obviously good goal" to keep in 
mind. Can we perhaps _approximate_ something like that by other means? 

For example, maybe we can approximate it by spreading out the statistics: 
right now you have things like

 - last_ran, wait_runtime, sum_wait_runtime..

be per-thread things. Maybe some of those can be spread out, so that you 
put a part of them in the "struct vm_struct" thing (to approximate 
processes), part of them in the "struct user" struct (to approximate the 
user-level thing), and part of it in a per-container thing for when/if we 
support that kind of thing?

IOW, I don't think the scheduling "groups" have to be explicit boxes or 
anything like that. I suspect you can make do with just heuristics that 
penalize the same "struct user" and "struct vm_struct" for getting too much 
scheduling time, and you'll get the same _effect_. 

And I don't think it's wrong to look at the "one hundred processes by the 
same user" case as being an important case. But it should not be the 
*only* case or even necessarily the *main* case that matters. I think a 
benchmark that literally does

	pid_t pid = fork();
	if (pid < 0)
		exit(1);
	if (pid) {
		if (setuid(500) < 0)
			exit(2);
		for (;;)
			/* Do nothing */;
	}
	if (setuid(501) < 0)
		exit(3);
	fork();
	for (;;)
		/* Do nothing in two processes */;

and I think that it's a really valid benchmark: if the scheduler gives 25% 
of time to each of the two processes of user 501, and 50% to user 500, 
then THAT is a good scheduler.

If somebody wants to actually write and test the above as a test-script, 
and add it to a collection of scheduler tests, I think that could be a 
good thing.

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:40                                           ` Linus Torvalds
@ 2007-04-18 19:43                                             ` Ingo Molnar
  2007-04-18 20:07                                             ` Davide Libenzi
  2007-04-18 21:04                                             ` Ingo Molnar
  2 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> For example, maybe we can approximate it by spreading out the 
> statistics: right now you have things like
> 
>  - last_ran, wait_runtime, sum_wait_runtime..
> 
> be per-thread things. [...]

yes, yes, yes! :) My thinking is "struct sched_group" embedded into 
_arbitrary_ other resource containers and abstractions, which 
sched_group's are then in a simple hierarchy and are driven by the core 
scheduling machinery.

> [...] Maybe some of those can be spread out, so that you put a part of 
> them in the "struct vm_struct" thing (to approximate processes), part 
> of them in the "struct user" struct (to approximate the user-level 
> thing), and part of it in a per-container thing for when/if we support 
> that kind of thing?

yes.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 13:08                                 ` William Lee Irwin III
@ 2007-04-18 19:48                                   ` Davide Libenzi
  0 siblings, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18 19:48 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, William Lee Irwin III wrote:

> Thinking of the scheduler as a CPU bandwidth allocator, this means
> handing out shares of CPU bandwidth to all users on the system, which
> in turn hand out shares of bandwidth to all sessions, which in turn
> hand out shares of bandwidth to all process groups, which in turn hand
> out shares of bandwidth to all thread groups, which in turn hand out
> shares of bandwidth to threads. The event handlers for the scheduler
> need not deal with this apart from task creation and exit and various
> sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.).

Yes, it really becomes a hierarchical problem once you consider user and 
processes. Top level sees a "user" can be scheduled (put itself on the 
virtual run queue), and passes the ball to the "process" scheduler inside 
the "user" container, down to maybe "threads". With all the "key" 
calculation parameters kept at each level (with up-propagation).



- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:23                                         ` Linus Torvalds
@ 2007-04-18 19:56                                           ` Davide Libenzi
  2007-04-18 20:11                                             ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Wed, 18 Apr 2007, Linus Torvalds wrote:

> I'm not arguing against fairness. I'm arguing against YOUR notion of 
> fairness, which is obviously bogus. It is *not* fair to try to give out 
> CPU time evenly!

"Perhaps on the rare occasion pursuing the right course demands an act of 
 unfairness, unfairness itself can be the right course?"



- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:40                                           ` Linus Torvalds
  2007-04-18 19:43                                             ` Ingo Molnar
@ 2007-04-18 20:07                                             ` Davide Libenzi
  2007-04-18 21:48                                               ` Ingo Molnar
  2007-04-18 21:04                                             ` Ingo Molnar
  2 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Wed, 18 Apr 2007, Linus Torvalds wrote:

> For example, maybe we can approximate it by spreading out the statistics: 
> right now you have things like
> 
>  - last_ran, wait_runtime, sum_wait_runtime..
> 
> be per-thread things. Maybe some of those can be spread out, so that you 
> put a part of them in the "struct vm_struct" thing (to approximate 
> processes), part of them in the "struct user" struct (to approximate the 
> user-level thing), and part of it in a per-container thing for when/if we 
> support that kind of thing?

I think Ingo's idea of a new sched_group to contain the generic 
parameters needed for the "key" calculation, works better than adding more 
fields to existing structures (that would, of course, host pointers to it). 
Otherwise I can already see the struct_signal being the target for other 
unrelated fields :)



- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:56                                           ` Davide Libenzi
@ 2007-04-18 20:11                                             ` Linus Torvalds
  2007-04-19  0:22                                               ` Davide Libenzi
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-18 20:11 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner



On Wed, 18 Apr 2007, Davide Libenzi wrote:
> 
> "Perhaps on the rare occasion pursuing the right course demands an act of 
>  unfairness, unfairness itself can be the right course?"

I don't think that's the right issue.

It's just that "fairness" != "equal".

Do you think it "fair" to pay everybody the same regardless of how good a 
job they do? I don't think anybody really believes that. 

Equating "fair" and "equal" is simply a very fundamental mistake. They're 
not the same thing. Never have been, and never will.

Now, there's no question that "equal" is much easier to implement, if only 
because it's a lot easier to agree what it means. "Equal parts" is 
something everybody can agree on. "Fair parts" automatically involves a 
balancing act, and people will invariably count things differently and 
thus disagree about what is "fair" and what is not.

I don't think we can ever get a "perfect" setup for that reason, but I 
think we can get something that at least gets reasonably close, at least 
for the obvious cases.

So my suggested test-case of running one process as one user and two 
processes as another one has a fairly "obviously correct" solution if you 
have just one CPU, and you can probably be pretty fair in practice on 
two CPU's (there's an obvious theoretical solution, whether you can get 
there with a practical algorithm is another thing). On three or more 
CPU's, you obviously wouldn't even *want* to be fair, since you can very 
naturally just give a CPU to each..

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 16:46   ` Ingo Molnar
@ 2007-04-18 20:45     ` Christian Hesse
  2007-04-18 21:16       ` Ingo Molnar
  2007-04-19  9:32     ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen
  1 sibling, 1 reply; 712+ messages in thread
From: Christian Hesse @ 2007-04-18 20:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


On Wednesday 18 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > On Friday 13 April 2007, Ingo Molnar wrote:
> > > as usual, any sort of feedback, bugreports, fixes and suggestions are
> > > more than welcome,
> >
> > When trying to suspend a system patched
> > with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the
> > ESC key results in a message that it tries to abort suspend, but then
> > still hangs.
>
> i took a quick look at suspend2 and it makes some use of yield().
> There's a bug in CFS's yield code, i've attached a patch that should fix
> it, does it make any difference to the hang?

This patch should apply cleanly against what? The second hunk is ignored as it 
has already been applied. Is this correct?

But no, it does not change anything. Let me know if you have any other patches 
to test.
-- 
Regards,
Chris


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:40                                           ` Linus Torvalds
  2007-04-18 19:43                                             ` Ingo Molnar
  2007-04-18 20:07                                             ` Davide Libenzi
@ 2007-04-18 21:04                                             ` Ingo Molnar
  2 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > perhaps a more fitting term would be 'precise group-scheduling'. 
> > Within the lowest level task group entity (be that thread group or 
> > uid group, etc.) 'precise scheduling' is equivalent to 'fairness'.
> 
> Yes. Absolutely. Except I think that at least if you're going to name 
> something "complete" (or "perfect" or "precise"), you should also 
> admit that groups can be hierarchical.

yes. Am i correct to sum up your impression as:

 " Ingo, for you the hierarchy still appears to be an after-thought,
   while in practice it's easily the most important thing! Why are you
   so hung up about 'fairness', it makes no sense!"

right?

and you would definitely be right if you suggested that i neglected the 
'group scheduling' aspects of CFS (except for a minimalistic nice level 
implementation, which is a poor-man's-non-automatic-group-scheduling), 
but i very much know it's important and i'll definitely fix it for -v4.

But please let me explain my reasons for my different focus:

yes, group scheduling in practice is the most important first-layer 
thing, and without it any of the other 'CFS wins' can easily be useless.

Firstly, i have not neglected the group scheduling related CFS 
regressions at all, mainly because there _is_ already a quick hack to 
check whether group scheduling would solve these regressions: renice. 
And it was tried in both of the two CFS regression cases i'm aware of: 
Mike's X starvation problem and Willy's "kevents starvation with 
thousands of scheddos tasks running" problem. And in both cases, 
applying the renice hack [which should be properly and automatically 
implemented as uid group scheduling] fixed the regression for them! So i 
was not worried at all, group scheduling _provably solves_ these CFS 
regressions. I rather concentrated on the CFS regressions that were much 
less clear.

But PLEASE believe me: even with perfect cross-group CPU allocation but 
with a simple non-heuristic scheduler underlying it, you can _easily_ 
get a sucky desktop experience! I know it because i tried it and others 
tried it too. (in fact the first version of sched_fair.c was tick based 
and low-res, and it sucked)

Two more things were needed:

  - the high precision of nsec/64-bit accounting
    ('reliability of scheduling')

  - extremely even time-distribution of CPU power 
    ('determinism/smoothness, human perception')

(i'm expanding on these two concepts further below)

take out any of these and group scheduling or not, you are easily going 
to have a sucky desktop! (We know that from years of experiments: many 
people tried to rip out the unfairness from the scheduler and there were 
always nasty corner cases that 'should' have worked but didnt.)

Without these we'd in essence start again at square one, just at a 
different square, this time with another group of people being 
irritated!

But the biggest and hardest to achieve _wins_ of CFS are _NOT_ achieved 
via a simple 'get rid of the unfairness of the upstream scheduler and 
apply group scheduling'. (I know that because i tried it before and 
because others tried it before, for many many years.) You will _easily_ 
get sucky desktop experience. The other two things are very much needed 
too:

 - the high precision of nsec/64-bit accounting, and the many
   corner-cases this solves. (For example on a typical desktop there are
   _lots_ of timing-driven workloads that are in essence 'invisible' to
   low-resolution, timer-tick based accounting and are heavily skewed.)

 - extremely even time-distribution of CPU power. CFS behaves pretty
   well even under the dreaded 'make -jN in an xterm' kernel build
   workload as reported by Mark Lord, because it also distributes CPU
   power in a _finegrained_ way. A shell prompt under CFS still behaves
   acceptably on a single-CPU testbox of mine with a "make -j50"
   workload. (yes, fifty) Humans react a lot more negatively to sudden
   changes in application behavior ('lags', pauses, short hangs) than
   they react to fine, gradual, all-encompassing slowdowns. This is a
   key property of CFS.

  ( Otherwise renicing X to -10 would have solved most of the
    interactivity complaints against the vanilla scheduler, otherwise
    renicing X to -10 would have fixed Mike's setup under SD (it didnt)
    while it worked much better under CFS, otherwise Gene wouldnt have
    found CFS markedly better than SD, etc., etc. So getting rid of the
    heuristics is less than 50% of the road to the perfect desktop
    scheduler. )

and i claim that these were the really hard bits, and i spent most of 
the CFS coding only on getting _these_ details 100% right under various 
workloads, and it makes a night and day difference _even without any 
group scheduling help_.

and note another reason here: group scheduling _masks_ many other 
scheduling deficiencies that are possible in a scheduler. So since CFS 
doesnt do group scheduling, i get a _fuller_ picture of the behavior of 
the core "precise scheduling" engine. At the initial stage i didnt want 
to hide bugs by masking them via group scheduling, especially because 
the renice workaround/hack was available.

Guess how nice it all will get if we also add group scheduling to the 
mix, and people wouldnt have to add nasty and fragile renice based 
hacks, it will 'just work' out of the box?
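
The earlier point about nsec/64-bit accounting can be illustrated with a toy model. This is only a sketch, not code from the CFS patch: the task names, the 1 ms tick (HZ=1000) and the burst timings are assumptions chosen for illustration, and the burst is assumed to fit inside one tick period. A timing-driven task that wakes shortly after every tick and sleeps again before the next one is never on the CPU when the tick fires, so tick-based accounting charges it nothing, while nanosecond accounting sees its real share.

```python
TICK_NS = 1_000_000  # assumed 1 ms timer tick (HZ=1000)


def account(n_ticks, burst_start_ns=100_000, burst_len_ns=400_000):
    """Compare tick-sampled and exact CPU accounting for two tasks.

    In every tick period the hypothetical 'timer_task' runs for
    burst_len_ns, starting burst_start_ns after the tick; the 'hog'
    task runs for the rest of the period.
    """
    sampled = {"timer_task": 0, "hog": 0}
    exact = {"timer_task": 0, "hog": 0}
    burst_end_ns = burst_start_ns + burst_len_ns
    for _ in range(n_ticks):
        # Tick-based: the interrupt fires at offset 0 of the period and
        # charges the whole period to whoever is running at that instant.
        on_cpu = "timer_task" if burst_start_ns <= 0 < burst_end_ns else "hog"
        sampled[on_cpu] += TICK_NS
        # Exact: charge each task the nanoseconds it actually ran.
        exact["timer_task"] += burst_len_ns
        exact["hog"] += TICK_NS - burst_len_ns
    return sampled, exact


sampled, exact = account(n_ticks=1000)
print("tick-sampled (ns):", sampled)
print("exact (ns):       ", exact)
```

In this model the timer-driven task uses 40% of the CPU but is invisible to the tick-sampled figures, which is the kind of skew the message describes.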

	Ingo

* Re: Kaffeine problem with CFS
  2007-04-18 15:48                     ` Ingo Molnar
  2007-04-18 16:07                       ` William Lee Irwin III
@ 2007-04-18 21:08                       ` S.Çağlar Onur
  2007-04-18 21:12                         ` Ingo Molnar
  2007-04-20 19:31                         ` Bill Davidsen
  1 sibling, 2 replies; 712+ messages in thread
From: S.Çağlar Onur @ 2007-04-18 21:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Pfister, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

[-- Attachment #1: Type: text/plain, Size: 792 bytes --]

On Wed, 18 Apr 2007, Ingo Molnar wrote:
> * S.Çağlar Onur <caglar@pardus.org.tr> wrote:
> > -       schedule();
> > +       msleep(1);
> >
> > which Ingo sends me to try also has the same effect on me. I cannot
> > reproduce hangs anymore with that patch applied top of CFS while one
> > console checks out SVN repos and other one compiles a small test
> > software.
>
> great! Could you please unapply the hack above and try the proper fix
> below, does this one solve the hangs too?

Instead of that one, i tried CFSv3 and i cannot reproduce the hang anymore, 
Thanks!...

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

* Re: Kaffeine problem with CFS
  2007-04-18 21:08                       ` S.Çağlar Onur
@ 2007-04-18 21:12                         ` Ingo Molnar
  2007-04-20 19:31                         ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 21:12 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: Christoph Pfister, linux-kernel, Michael Lothian, Linus Torvalds,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> > great! Could you please unapply the hack above and try the proper 
> > fix below, does this one solve the hangs too?
> 
> Instead of that one, i tried CFSv3 and i cannot reproduce the hang 
> anymore, Thanks!...

cool, thanks for the quick turnaround!

	Ingo

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 20:45     ` CFS and suspend2: hang in atomic copy Christian Hesse
@ 2007-04-18 21:16       ` Ingo Molnar
  2007-04-18 21:57         ` Christian Hesse
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 21:16 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Christian Hesse <mail@earthworm.de> wrote:

> > i took a quick look at suspend2 and it makes some use of yield(). 
> > There's a bug in CFS's yield code, i've attached a patch that should 
> > fix it, does it make any difference to the hang?
> 
> This patch should apply cleanly against what? The second hunk is 
> ignored as it has already been applied. Is this correct?

hm, i think you might have had one of the earlier CFS patches.

> But no, it does not change anything. Let me know if you have any other 
> patches to test.

could you try the -v3 patch i released a few hours ago:

   http://redhat.com/~mingo/cfs-scheduler/

although probably your suspend2 problem is still not fixed, it's worth a 
try nevertheless. Which suspend2 patch did you apply, and was it against 
-rc6 or -rc7?

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 20:07                                             ` Davide Libenzi
@ 2007-04-18 21:48                                               ` Ingo Molnar
  2007-04-18 23:30                                                 ` Davide Libenzi
  2007-04-19  6:52                                                 ` Mike Galbraith
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 21:48 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> I think Ingo's idea of a new sched_group to contain the generic 
> parameters needed for the "key" calculation, works better than adding 
> more fields to existing structures (that would, of course, host 
> pointers to it). Otherwise I can already see the struct_signal being 
> the target for other unrelated fields :)

yeah. Another detail is that for global containers like uids, the 
statistics will have to be percpu_alloc()-ed, both for correctness 
(runqueues are per CPU) and for performance.

That's one reason why i dont think it's necessarily a good idea to 
group-schedule threads, we dont really want to do a per thread group 
percpu_alloc().

In fact for threads the _reverse_ problem exists, threaded apps tend to 
_strive_ for more performance - hence their desperation of using the 
threaded programming model to begin with ;) (just think of media 
playback apps which are typically multithreaded)

I dont think threads are all that different. Also, the 
resource-conserving act of using CLONE_VM to share the VM (and to use a 
different programming environment like Java) should not be 'punished' by 
forcing the thread group to be accounted as a single, shared entity 
against other 'fat' tasks.

so my current impression is that we want per UID accounting to solve the 
X problem, the kernel threads problem and the many-users problem, but 
i'd not want to do it for threads just yet because for them there's not 
really any apparent problem to be solved.
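
The per-CPU statistics mentioned above for global containers like uids can be pictured with a toy model. This is for illustration only: the class and method names are hypothetical, and it is ordinary Python rather than kernel code. Each uid group keeps one counter slot per CPU, so a CPU doing its accounting only writes its own slot, and a group-wide figure is obtained by summing the slots when it is needed.

```python
class UidGroupStats:
    """Hypothetical per-uid statistics with one slot per CPU."""

    def __init__(self, nr_cpus):
        self.exec_ns = [0] * nr_cpus  # per-CPU runtime, in nanoseconds

    def charge(self, cpu, delta_ns):
        # Only the slot belonging to the accounting CPU is written,
        # mirroring how per-CPU data avoids cross-CPU contention.
        self.exec_ns[cpu] += delta_ns

    def total(self):
        # Readers that need the group-wide figure sum over all CPUs.
        return sum(self.exec_ns)


stats = UidGroupStats(nr_cpus=2)
stats.charge(cpu=0, delta_ns=3_000_000)
stats.charge(cpu=1, delta_ns=1_500_000)
stats.charge(cpu=0, delta_ns=500_000)
print(stats.exec_ns, stats.total())
```

The cost of this layout is one slot per CPU for every group, which is why doing it for every thread group would be expensive.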

	Ingo

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 12:33             ` Con Kolivas
@ 2007-04-18 21:49               ` Con Kolivas
  0 siblings, 0 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-18 21:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wednesday 18 April 2007 22:33, Con Kolivas wrote:
> On Wednesday 18 April 2007 22:14, Nick Piggin wrote:
> > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote:
> > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote:
> > > > Again, for comparison 2.6.21-rc7 mainline:
> > > >
> > > > 508.87user 32.47system 2:17.82elapsed 392%CPU
> > > > 509.05user 32.25system 2:17.84elapsed 392%CPU
> > > > 508.75user 32.26system 2:17.83elapsed 392%CPU
> > > > 508.63user 32.17system 2:17.88elapsed 392%CPU
> > > > 509.01user 32.26system 2:17.90elapsed 392%CPU
> > > > 509.08user 32.20system 2:17.95elapsed 392%CPU
> > > >
> > > > So looking at elapsed time, a granularity of 100ms is just behind the
> > > > mainline score. However it is using slightly less user time and
> > > > slightly more idle time, which indicates that balancing might have
> > > > got a bit less aggressive.
> > > >
> > > > But anyway, it conclusively shows the efficiency impact of such tiny
> > > > timeslices.
> > >
> > > See test.kernel.org for how (the now defunct) SD was performing on
> > > kernbench. It had low latency _and_ equivalent throughput to mainline.
> > > Set the standard appropriately on both counts please.
> >
> > I can give it a run. Got an updated patch against -rc7?
>
> I said I wasn't pursuing it but since you're offering, the rc6 patch should
> apply ok.
>
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch

Oh and if you go to the effort of trying you may as well try the timeslice 
tweak to see what effect it has on SD as well.

/proc/sys/kernel/rr_interval

100 is the highest.
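
For reading the quoted kernbench lines above, the %CPU column can be recomputed from the other three fields as (user + system) / elapsed. The helper below is a small sketch that assumes the lines follow the `time`-style "NNuser NNsystem M:SS.ssElapsed" layout shown in the quote; the function name is made up for this example.

```python
import re


def cpu_percent(time_line):
    """Recompute the %CPU figure from a 'time'-style summary line."""
    m = re.match(r"([\d.]+)user ([\d.]+)system (\d+):([\d.]+)elapsed", time_line)
    user_s, system_s = float(m.group(1)), float(m.group(2))
    elapsed_s = int(m.group(3)) * 60 + float(m.group(4))
    return 100.0 * (user_s + system_s) / elapsed_s


line = "508.87user 32.47system 2:17.82elapsed 392%CPU"
print(f"{cpu_percent(line):.1f}%")
```

For the first mainline line this gives a value between 392% and 393%, consistent with the printed 392%CPU if that figure is truncated. Less user time over the same elapsed time lowers this ratio, which is what "slightly more idle time" refers to.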

-- 
-ck

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 21:16       ` Ingo Molnar
@ 2007-04-18 21:57         ` Christian Hesse
  2007-04-18 22:02           ` Ingo Molnar
  2007-04-18 22:16           ` CFS and suspend2: hang in atomic copy Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Christian Hesse @ 2007-04-18 21:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

On Wednesday 18 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > > i took a quick look at suspend2 and it makes some use of yield().
> > > There's a bug in CFS's yield code, i've attached a patch that should
> > > fix it, does it make any difference to the hang?
> >
> > This patch should apply cleanly against what? The second hunk is
> > ignored as it has already been applied. Is this correct?
>
> hm, i think you might have had one of the earlier CFS patches.

You are right.

> > But no, it does not change anything. Let me know if you have any other
> > patches to test.
>
> could you try the -v3 patch i released a few hours ago:
>
>    http://redhat.com/~mingo/cfs-scheduler/
>
> although probably your suspend2 problem is still not fixed, it's worth a
> try nevertheless. Which suspend2 patch did you apply, and was it against
> -rc6 or -rc7?

You are right again. ;-)

Linux 2.6.21-rc7
Suspend2 2.2.9.11 (applies cleanly to -rc7)
CFS v3 (without any additional patches)

And it still hangs on suspend.
-- 
Regards,
Chris

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 21:57         ` Christian Hesse
@ 2007-04-18 22:02           ` Ingo Molnar
  2007-04-18 22:22             ` Christian Hesse
                               ` (2 more replies)
  2007-04-18 22:16           ` CFS and suspend2: hang in atomic copy Ingo Molnar
  1 sibling, 3 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 22:02 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Christian Hesse <mail@earthworm.de> wrote:

> > although probably your suspend2 problem is still not fixed, it's 
> > worth a try nevertheless. Which suspend2 patch did you apply, and 
> > was it against -rc6 or -rc7?
> 
> You are right again. ;-)
> 
> Linux 2.6.21-rc7
> Suspend2 2.2.9.11 (applies cleanly to -rc7)
> CFS v3 (without any additional patches)
> 
> And it still hangs on suspend.

what's the easiest way for me to try suspend2? Apply the patch, reboot 
into the kernel, then execute what command to suspend? (there's a 
confusing mishmash of initiators of all the suspend variants. Can i drive 
this by echoing to /sys/power/state?)

	Ingo

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 21:57         ` Christian Hesse
  2007-04-18 22:02           ` Ingo Molnar
@ 2007-04-18 22:16           ` Ingo Molnar
  2007-04-18 23:12             ` Christian Hesse
  2007-04-19  6:41             ` Ingo Molnar
  1 sibling, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-18 22:16 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Christian Hesse <mail@earthworm.de> wrote:

> Linux 2.6.21-rc7
> Suspend2 2.2.9.11 (applies cleanly to -rc7)
> CFS v3 (without any additional patches)
> 
> And it still hangs on suspend.

i just tried the same and it suspended+resumed just fine:

Restarting tasks ... done.
Suspend2 debugging info:
- Suspend core   : 2.2.9.12
- Kernel Version : 2.6.21-rc7-CFS-v3
- Compiler vers. : 4.0
- Attempt number : 2
- Parameters     : 0 81920 0 0 0 0
- Overall expected compression percentage: 0.
- Compressor is 'lzf'.
  Compressed 31133696 bytes into 14880587 (52 percent compression).
- SwapAllocator active.
  Swap available for image: 512036 pages.
- FileAllocator inactive.
- I/O speed: Write 76 MB/s, Read 42 MB/s.
- Extra pages    : 18 used/500.

could you send me your .config?

	Ingo

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:02           ` Ingo Molnar
@ 2007-04-18 22:22             ` Christian Hesse
  2007-04-19  1:37               ` [Suspend2-devel] " Nigel Cunningham
  2007-04-18 22:56             ` Bob Picco
  2007-04-19  1:52             ` [Suspend2-devel] " Nigel Cunningham
  2 siblings, 1 reply; 712+ messages in thread
From: Christian Hesse @ 2007-04-18 22:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 1088 bytes --]

On Thursday 19 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > > although probably your suspend2 problem is still not fixed, it's
> > > worth a try nevertheless. Which suspend2 patch did you apply, and
> > > was it against -rc6 or -rc7?
> >
> > You are right again. ;-)
> >
> > Linux 2.6.21-rc7
> > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > CFS v3 (without any additional patches)
> >
> > And it still hangs on suspend.
>
> what's the easiest way for me to try suspend2? Apply the patch, reboot
> into the kernel, then execute what command to suspend? (there's a
> confusing mishmash of initiators of all the suspend variants. Can i drive
> this by echoing to /sys/power/state?)

Perhaps you have to install suspend2-userui as well for the output (I'm not 
sure whether it works without). Then you can trigger the suspend by echoing 
to /sys/power/suspend2/do_suspend.
Useful information can be found in the Howto:

http://www.suspend2.net/HOWTO

I dropped some ccs to not abuse Linus and friends.
-- 
Regards,
Chris

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:02           ` Ingo Molnar
  2007-04-18 22:22             ` Christian Hesse
@ 2007-04-18 22:56             ` Bob Picco
  2007-04-19  1:43               ` [Suspend2-devel] " Nigel Cunningham
  2007-04-19  6:29               ` Ingo Molnar
  2007-04-19  1:52             ` [Suspend2-devel] " Nigel Cunningham
  2 siblings, 2 replies; 712+ messages in thread
From: Bob Picco @ 2007-04-18 22:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel

Ingo Molnar wrote:	[Wed Apr 18 2007, 06:02:28PM EDT]
> 
> * Christian Hesse <mail@earthworm.de> wrote:
> 
> > > although probably your suspend2 problem is still not fixed, it's 
> > > worth a try nevertheless. Which suspend2 patch did you apply, and 
> > > was it against -rc6 or -rc7?
> > 
> > You are right again. ;-)
> > 
> > Linux 2.6.21-rc7
> > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > CFS v3 (without any additional patches)
> > 
> > And it still hangs on suspend.
> 
> what's the easiest way for me to try suspend2? Apply the patch, reboot 
> into the kernel, then execute what command to suspend? (there's a 
> confusing mishmash of initiators of all the suspend variants. Can i drive 
> this by echoing to /sys/power/state?)
> 
> 	Ingo
I had hoped to collect more data with CFS V2. It crashes in
scale_nice_down for s2ram when attempting to disable_nonboot_cpus. 
Part of the traceback looks like this (typed by hand, with obvious omissions):

scale_nice_down
update_stats_wait_end - not shown in traceback because inlined
pick_next_task_fair
migration_call
task_rq_lock
notifier_call_chain
_cpu_down
disable_nonboot_cpus
...

This is standard -rc7 with V2 CFS applied. It could be a completely
unrelated issue. I'll attempt to debug further tomorrow.

bob

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:16           ` CFS and suspend2: hang in atomic copy Ingo Molnar
@ 2007-04-18 23:12             ` Christian Hesse
  2007-04-19  6:28               ` Ingo Molnar
  2007-04-19  6:41             ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Christian Hesse @ 2007-04-18 23:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


[-- Attachment #1.1: Type: text/plain, Size: 1040 bytes --]

On Thursday 19 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > Linux 2.6.21-rc7
> > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > CFS v3 (without any additional patches)
> >
> > And it still hangs on suspend.
>
> i just tried the same and it suspended+resumed just fine:
>
> Restarting tasks ... done.
> Suspend2 debugging info:
> - Suspend core   : 2.2.9.12
> - Kernel Version : 2.6.21-rc7-CFS-v3
> - Compiler vers. : 4.0
> - Attempt number : 2
> - Parameters     : 0 81920 0 0 0 0
> - Overall expected compression percentage: 0.
> - Compressor is 'lzf'.
>   Compressed 31133696 bytes into 14880587 (52 percent compression).
> - SwapAllocator active.
>   Swap available for image: 512036 pages.
> - FileAllocator inactive.
> - I/O speed: Write 76 MB/s, Read 42 MB/s.
> - Extra pages    : 18 used/500.
>
> could you send me your .config?

My config is attached.

I now got some error message from my system:

http://www.eworm.de/tmp/cfs-suspend.jpg
-- 
Regards,
Chris

[-- Attachment #1.2: config-2.6.21-rc7-r1 --]
[-- Type: text/plain, Size: 49289 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc7-r1
# Wed Apr 18 22:25:20 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_IKPATCHES=y
CONFIG_IKPATCHES_PROC=y
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
# CONFIG_HPET_TIMER is not set
CONFIG_NR_CPUS=2
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_MCE is not set
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_OLD_INTERFACE=y
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
# CONFIG_VMSPLIT_3G is not set
CONFIG_VMSPLIT_3G_OPT=y
# CONFIG_VMSPLIT_2G is not set
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0xB0000000
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_IRQBALANCE=y
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
CONFIG_PHYSICAL_START=0x100000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set

#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PRINTK_NOSAVE is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
# CONFIG_SOFTWARE_SUSPEND is not set
CONFIG_SUSPEND_SMP=y
CONFIG_SUSPEND2_CORE=y

#
# Image Storage (you need at least one allocator)
#
# CONFIG_SUSPEND2_FILE is not set
CONFIG_SUSPEND2_SWAP=y

#
# General Options
#
CONFIG_SUSPEND2_CRYPTO=y
CONFIG_SUSPEND2_DEFAULT_RESUME2="/dev/hda2"
# CONFIG_SUSPEND2_KEEP_IMAGE is not set
CONFIG_SUSPEND2_REPLACE_SWSUSP=y
# CONFIG_SUSPEND2_CHECKSUM is not set
CONFIG_SUSPEND_SHARED=y
CONFIG_SUSPEND2=y

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
# CONFIG_ACPI_PROCFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set

#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
# CONFIG_CPU_FREQ_STAT is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
# CONFIG_CPU_FREQ_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_SPEEDSTEP_ICH is not set
# CONFIG_X86_SPEEDSTEP_SMI is not set
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
# CONFIG_X86_E_POWERSAVER is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCI_MSI is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set

#
# PCCARD (PCMCIA/CardBus) support
#
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
# CONFIG_PCMCIA_LOAD_CIS is not set
# CONFIG_PCMCIA_IOCTL is not set
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
CONFIG_PCCARD_NONSTATIC=y

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
# CONFIG_BINFMT_AOUT is not set
# CONFIG_BINFMT_MISC is not set

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK is not set
CONFIG_NF_CONNTRACK_ENABLED=y
CONFIG_NF_CONNTRACK_SUPPORT=y
# CONFIG_IP_NF_CONNTRACK_SUPPORT is not set
CONFIG_NF_CONNTRACK=y
# CONFIG_NF_CT_ACCT is not set
# CONFIG_NF_CONNTRACK_MARK is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CT_PROTO_SCTP is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
CONFIG_NETFILTER_XTABLES=y
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
# CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
CONFIG_NETFILTER_XT_MATCH_STATE=y
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV4=y
# CONFIG_NF_CONNTRACK_PROC_COMPAT is not set
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_IPRANGE is not set
# CONFIG_IP_NF_MATCH_TOS is not set
CONFIG_IP_NF_MATCH_RECENT=y
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_TTL is not set
# CONFIG_IP_NF_MATCH_OWNER is not set
# CONFIG_IP_NF_MATCH_ADDRTYPE is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
# CONFIG_IP_NF_TARGET_ULOG is not set
CONFIG_NF_NAT=y
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
# CONFIG_IP_NF_TARGET_REDIRECT is not set
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_SAME is not set
# CONFIG_NF_NAT_SNMP_BASIC is not set
# CONFIG_NF_NAT_FTP is not set
# CONFIG_NF_NAT_IRC is not set
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
# CONFIG_NF_NAT_SIP is not set
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_NF_CONNTRACK_IPV6=y
# CONFIG_IP6_NF_QUEUE is not set
CONFIG_IP6_NF_IPTABLES=y
# CONFIG_IP6_NF_MATCH_RT is not set
# CONFIG_IP6_NF_MATCH_OPTS is not set
# CONFIG_IP6_NF_MATCH_FRAG is not set
# CONFIG_IP6_NF_MATCH_HL is not set
# CONFIG_IP6_NF_MATCH_OWNER is not set
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set
# CONFIG_IP6_NF_MATCH_AH is not set
# CONFIG_IP6_NF_MATCH_MH is not set
# CONFIG_IP6_NF_MATCH_EUI64 is not set
CONFIG_IP6_NF_FILTER=y
CONFIG_IP6_NF_TARGET_LOG=y
CONFIG_IP6_NF_TARGET_REJECT=y
# CONFIG_IP6_NF_MANGLE is not set
# CONFIG_IP6_NF_RAW is not set

#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set

#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
# CONFIG_BT_SCO is not set
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
# CONFIG_BT_BNEP is not set
CONFIG_BT_HIDP=y

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=y
# CONFIG_BT_HCIUSB_SCO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
CONFIG_CFG80211=y
CONFIG_CFG80211_WEXT_COMPAT=y
CONFIG_NL80211=y
CONFIG_WIRELESS_EXT=y
# CONFIG_MAC80211 is not set
CONFIG_IEEE80211=y
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=y
# CONFIG_IEEE80211_CRYPT_CCMP is not set
# CONFIG_IEEE80211_CRYPT_TKIP is not set
# CONFIG_IEEE80211_SOFTMAC is not set
# CONFIG_IEEE80211_RADIOTAP is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_SYS_HYPERVISOR is not set

#
# Connector - unified userspace <-> kernelspace linker
#
# CONFIG_CONNECTOR is not set

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_SONY_LAPTOP is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECS=y
# CONFIG_BLK_DEV_DELKIN is not set
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
# CONFIG_BLK_DEV_IDEACPI is not set
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
# CONFIG_IDE_GENERIC is not set
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
# CONFIG_IDEPCI_SHARE_IRQ is not set
# CONFIG_BLK_DEV_OFFBOARD is not set
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT8213 is not set
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_BLK_DEV_TC86C001 is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set

#
# PCMCIA SCSI adapter support
#
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set

#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
# CONFIG_ATA is not set

#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_MIRROR is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y

#
# Device Drivers
#
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_OHCI1394=y

#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
CONFIG_IEEE1394_SBP2_PHYS_DMA=y
CONFIG_IEEE1394_ETH1394=y
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Macintosh device drivers
#
# CONFIG_MAC_EMUMOUSEBTN is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
CONFIG_DUMMY=y
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#

#
# Ethernet (10 or 100Mbit)
#
# CONFIG_NET_ETHERNET is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
# CONFIG_NET_WIRELESS_RTNETLINK is not set

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
# CONFIG_PCMCIA_WAVELAN is not set
# CONFIG_PCMCIA_NETWAVE is not set

#
# Wireless 802.11 Frequency Hopping cards support
#
# CONFIG_PCMCIA_RAYCS is not set

#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set

#
# Wireless 802.11b Pcmcia/Cardbus cards support
#
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set

#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
CONFIG_HOSTAP=y
# CONFIG_HOSTAP_FIRMWARE is not set
# CONFIG_HOSTAP_PLX is not set
# CONFIG_HOSTAP_PCI is not set
CONFIG_HOSTAP_CS=y
CONFIG_NET_WIRELESS=y
CONFIG_IPW3945=m
# CONFIG_IPW3945_DEBUG is not set
# CONFIG_IPW3945_MONITOR is not set
# CONFIG_IPW3945_PROMISCUOUS is not set

#
# PCMCIA network device support
#
# CONFIG_NET_PCMCIA is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
CONFIG_PPP=y
# CONFIG_PPP_MULTILINK is not set
# CONFIG_PPP_FILTER is not set
CONFIG_PPP_ASYNC=y
# CONFIG_PPP_SYNC_TTY is not set
CONFIG_PPP_DEFLATE=y
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
# CONFIG_PPPOE is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=y
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=m
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=y
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_UINPUT is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_CONSOLE is not set
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
# CONFIG_HW_RANDOM is not set
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set

#
# I2C support
#
CONFIG_I2C=y
# CONFIG_I2C_CHARDEV is not set

#
# I2C Algorithms
#
# CONFIG_I2C_ALGOBIT is not set
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set

#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PASEMI is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
# CONFIG_SENSORS_EEPROM is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set

#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Hardware Monitoring support
#
# CONFIG_HWMON is not set
# CONFIG_HWMON_VID is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_SM501 is not set

#
# Multimedia devices
#
CONFIG_VIDEO_DEV=y
# CONFIG_VIDEO_V4L1 is not set
# CONFIG_VIDEO_V4L1_COMPAT is not set
CONFIG_VIDEO_V4L2=y

#
# Video Capture Adapters
#

#
# Video Capture Adapters
#
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_HELPER_CHIPS_AUTO is not set

#
# Encoders/decoders and other helper chips
#

#
# Audio decoders
#
# CONFIG_VIDEO_TDA9840 is not set
# CONFIG_VIDEO_TEA6415C is not set
# CONFIG_VIDEO_TEA6420 is not set
# CONFIG_VIDEO_MSP3400 is not set
# CONFIG_VIDEO_CS53L32A is not set
# CONFIG_VIDEO_TLV320AIC23B is not set
# CONFIG_VIDEO_WM8775 is not set
# CONFIG_VIDEO_WM8739 is not set

#
# Video decoders
#
# CONFIG_VIDEO_OV7670 is not set
# CONFIG_VIDEO_SAA711X is not set
# CONFIG_VIDEO_TVP5150 is not set

#
# Video and audio decoders
#
# CONFIG_VIDEO_CX25840 is not set

#
# MPEG video encoders
#
# CONFIG_VIDEO_CX2341X is not set

#
# Video encoders
#
# CONFIG_VIDEO_SAA7127 is not set

#
# Video improvement chips
#
# CONFIG_VIDEO_UPD64031A is not set
# CONFIG_VIDEO_UPD64083 is not set
# CONFIG_VIDEO_VIVI is not set
# CONFIG_VIDEO_SAA5246A is not set
# CONFIG_VIDEO_SAA5249 is not set
# CONFIG_VIDEO_SAA7134 is not set
# CONFIG_VIDEO_HEXIUM_ORION is not set
# CONFIG_VIDEO_HEXIUM_GEMINI is not set
# CONFIG_VIDEO_CX88 is not set
# CONFIG_VIDEO_CAFE_CCIC is not set

#
# V4L USB devices
#
# CONFIG_VIDEO_PVRUSB2 is not set
# CONFIG_VIDEO_USBVISION is not set

#
# Radio Adapters
#
# CONFIG_RADIO_GEMTEK_PCI is not set
# CONFIG_RADIO_MAXIRADIO is not set
# CONFIG_RADIO_MAESTRO is not set
# CONFIG_USB_DSBR is not set

#
# Digital Video Broadcasting Devices
#
CONFIG_DVB=y
CONFIG_DVB_CORE=y
# CONFIG_DVB_CORE_ATTACH is not set

#
# Supported SAA7146 based PCI Adapters
#

#
# Supported USB Adapters
#
# CONFIG_DVB_USB is not set
# CONFIG_DVB_TTUSB_BUDGET is not set
# CONFIG_DVB_TTUSB_DEC is not set
CONFIG_DVB_CINERGYT2=y
# CONFIG_DVB_CINERGYT2_TUNING is not set

#
# Supported FlexCopII (B2C2) Adapters
#
# CONFIG_DVB_B2C2_FLEXCOP is not set

#
# Supported BT878 Adapters
#

#
# Supported Pluto2 Adapters
#
# CONFIG_DVB_PLUTO2 is not set

#
# Supported DVB Frontends
#

#
# Customise DVB Frontends
#
# CONFIG_DVB_FE_CUSTOMISE is not set

#
# DVB-S (satellite) frontends
#
# CONFIG_DVB_STV0299 is not set
# CONFIG_DVB_CX24110 is not set
# CONFIG_DVB_CX24123 is not set
# CONFIG_DVB_TDA8083 is not set
# CONFIG_DVB_MT312 is not set
# CONFIG_DVB_VES1X93 is not set
# CONFIG_DVB_S5H1420 is not set
# CONFIG_DVB_TDA10086 is not set

#
# DVB-T (terrestrial) frontends
#
# CONFIG_DVB_SP8870 is not set
# CONFIG_DVB_SP887X is not set
# CONFIG_DVB_CX22700 is not set
# CONFIG_DVB_CX22702 is not set
# CONFIG_DVB_L64781 is not set
# CONFIG_DVB_TDA1004X is not set
# CONFIG_DVB_NXT6000 is not set
# CONFIG_DVB_MT352 is not set
# CONFIG_DVB_ZL10353 is not set
# CONFIG_DVB_DIB3000MB is not set
# CONFIG_DVB_DIB3000MC is not set
# CONFIG_DVB_DIB7000M is not set
# CONFIG_DVB_DIB7000P is not set

#
# DVB-C (cable) frontends
#
# CONFIG_DVB_VES1820 is not set
# CONFIG_DVB_TDA10021 is not set
# CONFIG_DVB_STV0297 is not set

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
# CONFIG_DVB_NXT200X is not set
# CONFIG_DVB_OR51211 is not set
# CONFIG_DVB_OR51132 is not set
# CONFIG_DVB_BCM3510 is not set
# CONFIG_DVB_LGDT330X is not set

#
# Tuners/PLL support
#
# CONFIG_DVB_TDA826X is not set
# CONFIG_DVB_TUNER_QT1010 is not set
# CONFIG_DVB_TUNER_MT2060 is not set
# CONFIG_DVB_TUNER_LGH06XF is not set

#
# Miscellaneous devices
#
# CONFIG_DVB_LNBP21 is not set
# CONFIG_DVB_ISL6421 is not set
# CONFIG_DVB_TUA6100 is not set
# CONFIG_USB_DABUSB is not set

#
# Graphics support
#
# CONFIG_BACKLIGHT_LCD_SUPPORT is not set
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
# CONFIG_FB_DDC is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set

#
# Frambuffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

#
# Logo configuration
#
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_FB_SPLASH is not set

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_HWDEP=y
CONFIG_SND_RAWMIDI=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_OSSEMUL=y
# CONFIG_SND_MIXER_OSS is not set
CONFIG_SND_PCM_OSS=y
# CONFIG_SND_PCM_OSS_PLUGINS is not set
CONFIG_SND_RTCTIMER=y
# CONFIG_SND_DYNAMIC_MINORS is not set
# CONFIG_SND_SUPPORT_OLD_API is not set
# CONFIG_SND_VERBOSE_PROCFS is not set
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# USB devices
#
CONFIG_SND_USB_AUDIO=y
# CONFIG_SND_USB_USX2Y is not set

#
# PCMCIA devices
#
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set

#
# SoC audio support
#
# CONFIG_SND_SOC is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set

#
# HID Devices
#
CONFIG_HID=y
# CONFIG_HID_DEBUG is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_EHCI_BIG_ENDIAN_MMIO is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set
# CONFIG_USB_GTCO is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
CONFIG_USB_USBNET=y
CONFIG_USB_NET_CDCETHER=y
# CONFIG_USB_NET_DM9601 is not set
# CONFIG_USB_NET_GL620A is not set
# CONFIG_USB_NET_NET1080 is not set
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
# CONFIG_USB_NET_CDC_SUBSET is not set
# CONFIG_USB_NET_ZAURUS is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
# CONFIG_USB_SERIAL_GENERIC is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_AIRPRIME is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP2101 is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
CONFIG_USB_SERIAL_GARMIN=y
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
CONFIG_USB_SERIAL_PL2303=y
# CONFIG_USB_SERIAL_HP4X is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
CONFIG_MMC=y
# CONFIG_MMC_DEBUG is not set
CONFIG_MMC_BLOCK=y
CONFIG_MMC_SDHCI=y
# CONFIG_MMC_WBSD is not set
# CONFIG_MMC_TIFM_SD is not set

#
# LED devices
#
# CONFIG_NEW_LEDS is not set

#
# LED drivers
#

#
# LED Triggers
#

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
CONFIG_EDAC=y

#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
# CONFIG_EDAC_AMD76X is not set
# CONFIG_EDAC_E7XXX is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82875P is not set
# CONFIG_EDAC_I82860 is not set
# CONFIG_EDAC_R82600 is not set
CONFIG_EDAC_POLL=y

#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set

#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set

#
# DMA Clients
#

#
# DMA Devices
#

#
# Auxiliary Display support
#

#
# Virtualization
#
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
# CONFIG_KVM_AMD is not set

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_FS_XATTR is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_FS_POSIX_ACL is not set
CONFIG_XFS_FS=m
# CONFIG_XFS_QUOTA is not set
# CONFIG_XFS_SECURITY is not set
# CONFIG_XFS_POSIX_ACL is not set
# CONFIG_XFS_RT is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=y
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y

#
# Distributed Lock Manager
#
# CONFIG_DLM is not set

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
# CONFIG_KPROBES is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_KERNEL is not set
CONFIG_LOG_BUF_SHIFT=15
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_EARLY_PRINTK=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_LZF=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_TEST is not set

#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_GEODE is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
# CONFIG_CRC16 is not set
CONFIG_CRC32=y
# CONFIG_LIBCRC32C is not set
CONFIG_DYN_PAGEFLAGS=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 21:48                                               ` Ingo Molnar
@ 2007-04-18 23:30                                                 ` Davide Libenzi
  2007-04-19  8:00                                                   ` Ingo Molnar
  2007-04-19 17:39                                                   ` Bernd Eckenfels
  2007-04-19  6:52                                                 ` Mike Galbraith
  1 sibling, 2 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-18 23:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Wed, 18 Apr 2007, Ingo Molnar wrote:

> That's one reason why i dont think it's necessarily a good idea to 
> group-schedule threads, we dont really want to do a per thread group 
> percpu_alloc().

It is still not clear to me how much overhead this will bring to the 
table, but I think (like Linus was pointing out) the hierarchy should look 
like:

Top (VCPU maybe?)
    User
        Process
            Thread

The "run_queue" concept (and data) that is now bound to a CPU needs to be 
replicated in:

ROOT <- VCPUs add themselves here
    VCPU <- USERs add themselves here
        USER <- PROCs add themselves here
            PROC <- THREADs add themselves here
                THREAD (ultimate fine grained scheduling unit)

So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking up a 
new task would mean:

VCPU = ROOT->lookup();
USER = VCPU->lookup();
PROC = USER->lookup();
THREAD = PROC->lookup();

Run-time statistics should propagate back the other way around.
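
To make the lookup chain above a bit more concrete, here is a minimal
Python sketch (not kernel code). It assumes every level keeps a plain
list as its "run_queue" and that lookup() simply picks the child with
the least accumulated run time; the Entity class, pick_next() and
account() are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Entity:
    """One node of the hierarchy: ROOT, VCPU, USER, PROC or THREAD."""
    name: str
    runtime: int = 0  # accumulated run time, in arbitrary ticks
    parent: Optional["Entity"] = field(default=None, repr=False)
    run_queue: List["Entity"] = field(default_factory=list, repr=False)

    def add(self, child: "Entity") -> "Entity":
        # Children "add themselves" to the run queue of the level above.
        child.parent = self
        self.run_queue.append(child)
        return child

    def lookup(self) -> "Entity":
        # Assumed policy: pick the child that has run the least so far.
        return min(self.run_queue, key=lambda e: e.runtime)


def pick_next(root: Entity) -> Entity:
    # VCPU = ROOT->lookup(); USER = VCPU->lookup(); ... down to a THREAD,
    # which is the only kind of node with an empty run queue.
    node = root
    while node.run_queue:
        node = node.lookup()
    return node


def account(thread: Entity, ticks: int) -> None:
    # Run-time statistics propagate back up, from THREAD to ROOT.
    node: Optional[Entity] = thread
    while node is not None:
        node.runtime += ticks
        node = node.parent


root = Entity("ROOT")
vcpu = root.add(Entity("VCPU0"))
alice = vcpu.add(Entity("alice"))
bob = vcpu.add(Entity("bob"))
proc_a = alice.add(Entity("alice/proc"))
proc_b = bob.add(Entity("bob/proc"))
t1 = proc_a.add(Entity("alice/proc/t1"))
t2 = proc_a.add(Entity("alice/proc/t2"))
t3 = proc_b.add(Entity("bob/proc/t1"))

for _ in range(8):
    account(pick_next(root), 1)

# After 8 ticks each user has 4; alice's two threads have 2 each.
print(t1.runtime, t2.runtime, t3.runtime)
```

The point of the sketch is only that fairness is decided level by
level: the two users split the VCPU evenly no matter how many threads
each one runs. It says nothing about the cost of doing this per CPU.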


> In fact for threads the _reverse_ problem exists, threaded apps tend to 
> _strive_ for more performance - hence their desperation of using the 
> threaded programming model to begin with ;) (just think of media 
> playback apps which are typically multithreaded)

The same user nicing two different multi-threaded processes would expect a 
predictable CPU distribution too. Doing that efficiently (the old per-cpu 
run-queue is pretty nice from many POVs) is the real challenge.



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 20:11                                             ` Linus Torvalds
@ 2007-04-19  0:22                                               ` Davide Libenzi
  2007-04-19  0:30                                                 ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-19  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ingo Molnar, Matt Mackall, Nick Piggin,
	William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, Linus Torvalds wrote:

> On Wed, 18 Apr 2007, Davide Libenzi wrote:
> > 
> > "Perhaps on the rare occasion pursuing the right course demands an act of 
> >  unfairness, unfairness itself can be the right course?"
> 
> I don't think that's the right issue.
> 
> It's just that "fairness" != "equal".
> 
> Do you think it "fair" to pay everybody the same regardless of how good a 
> job they do? I don't think anybody really believes that. 
> 
> Equating "fair" and "equal" is simply a very fundamental mistake. They're 
> not the same thing. Never have been, and never will.

I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :)



- Davide




* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  0:22                                               ` Davide Libenzi
@ 2007-04-19  0:30                                                 ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-19  0:30 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner



On Wed, 18 Apr 2007, Davide Libenzi wrote:
> 
> I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :)

Ahh, I'm clearly not cultured enough, I didn't catch that reference.

	Linus "yes, I've seen the movie, but it
		 apparently left more of a mark in other people" Torvalds


* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 17:22                                     ` Linus Torvalds
                                                         ` (3 preceding siblings ...)
  2007-04-18 18:36                                       ` Diego Calleja
@ 2007-04-19  0:37                                       ` Peter Williams
  4 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-19  0:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Mike Galbraith,
	Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

Linus Torvalds wrote:
> 
> On Wed, 18 Apr 2007, Matt Mackall wrote:
>> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
>>> And "fairness by euid" is probably a hell of a lot easier to do than 
>>> trying to figure out the wakeup matrix.
>> For the record, you actually don't need to track a whole NxN matrix
>> (or do the implied O(n**3) matrix inversion!) to get to the same
>> result.
> 
> I'm sure you can do things differently, but the reason I think "fairness 
> by euid" is actually worth looking at is that it's pretty much the 
> *identical* issue that we'll have with "fairness by virtual machine" and a 
> number of other "container" issues.
> 
> The fact is:
> 
>  - "fairness" is *not* about giving everybody the same amount of CPU time 
>    (scaled by some niceness level or not). Anybody who thinks that is 
>    "fair" is just being silly and hasn't thought it through.
> 
>  - "fairness" is multi-level. You want to be fair to threads within a 
>    thread group (where "process" may be one good approximation of what a 
>    "thread group" is, but not necessarily the only one).
> 
>    But you *also* want to be fair in between those "thread groups", and 
>    then you want to be fair across "containers" (where "user" may be one 
>    such container).
> 
> So I claim that anything that cannot be fair by user ID is actually really 
> REALLY unfair. I think it's absolutely humongously STUPID to call 
> something the "Completely Fair Scheduler", and then just be fair on a 
> thread level. That's not fair AT ALL! It's the antithesis of being fair!
> 
> So if you have 2 users on a machine running CPU hogs, you should *first* 
> try to be fair among users. If one user then runs 5 programs, and the 
> other one runs just 1, then the *one* program should get 50% of the CPU 
> time (the user's fair share), and the five programs should get 10% of CPU 
> time each. And if one of them uses two threads, each thread should get 5%.
> 
> So you should see one thread get 50% CPU (single thread of one user), 4 
> threads get 10% CPU (their fair share of that user's time), and 2 threads 
> get 5% CPU (the fair share within that thread group!).
> 
> Any scheduling argument that just considers the above to be "7 threads 
> total" and gives each thread 14% of CPU time "fairly" is *anything* but 
> fair. It's a joke if that kind of scheduler then calls itself CFS!
> 
> And yes, that's largely what the current scheduler will do, but at least 
> the current scheduler doesn't claim to be fair! So the current scheduler 
> is a lot *better* if only in the sense that it doesn't make ridiculous 
> claims that aren't true!
> 
> 			Linus

Sounds a lot like the PLFS (process level fair sharing) scheduler in 
Aurema's ARMTech (for whom I used to work).  The "fair" in the title is 
a bit misleading as it's all about unfair scheduling in order to meet 
specific policies.  But it's based on the principle that if you can 
allocate CPU bandwidth "fairly" (which really means in proportion to 
the entitlement each process is allocated) then you can allocate CPU 
bandwidth "fairly" between higher level entities such as process 
groups, users groups and so on by subdividing the entitlements downwards.
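
As a small worked example of subdividing entitlements downwards, the
Python sketch below reproduces the numbers from the quoted mail (two
users, one running a single program, the other five programs, one of
which has two threads). It assumes equal entitlements at every level;
subdivide() and the tuple layout are hypothetical and have nothing to
do with how PLFS is actually implemented.

```python
from fractions import Fraction


def subdivide(node, share=Fraction(1)):
    """Split `share` equally among children; return {leaf name: share}."""
    name, children = node
    if not children:
        return {name: share}
    shares = {}
    for child in children:
        shares.update(subdivide(child, share / len(children)))
    return shares


# (name, [children]) tuples: machine -> users -> programs -> threads
machine = ("machine", [
    ("user1", [("user1/prog", [])]),
    ("user2", [
        ("user2/prog1", []),
        ("user2/prog2", []),
        ("user2/prog3", []),
        ("user2/prog4", []),
        ("user2/prog5", [("user2/prog5/t1", []),
                         ("user2/prog5/t2", [])]),
    ]),
])

shares = subdivide(machine)
for leaf, share in shares.items():
    print(f"{leaf:16s} {float(share):.0%}")
```

This gives 50% to the single program, 10% to each of the four single
threaded programs and 5% to each of the two threads, i.e. the split
described in the quoted text rather than 14% per thread.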

The tricky part of implementing this was the fact that not all entities 
at the various levels have sufficient demand for CPU bandwidth to use 
their entitlements and this in turn means that the entities above them 
will have difficulty using their entitlements even if others of their 
subordinates have sufficient demand (because their entitlements will be 
too small).  The trick is to have a measure of each entity's demand for 
CPU bandwidth and use that to modify the way entitlement is divided 
among subordinates.
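
One generic way to let demand modify the division is a "water
filling" pass: offer each subordinate its weighted share, cap it at
its demand, and hand the surplus back to those that still want more.
The Python sketch below shows that idea only. It is an assumption made
for illustration, not the algorithm ARMTech uses, and divide(), the
weights and the demand figures are all hypothetical.

```python
from fractions import Fraction


def divide(capacity, children):
    """children: {name: (weight, demand)} -> {name: allocation}.

    Each round offers the remaining capacity in proportion to weight.
    Children whose demand fits inside their offer are capped at their
    demand; whatever they leave unused is re-offered to the others.
    """
    alloc = {name: Fraction(0) for name in children}
    active = set(children)
    remaining = Fraction(capacity)
    while active and remaining > 0:
        total_weight = sum(children[n][0] for n in active)
        offers = {n: remaining * children[n][0] / total_weight
                  for n in active}
        satisfied = {n for n in active if children[n][1] <= offers[n]}
        if not satisfied:
            # Everybody left wants more than its share: split by weight.
            for n in active:
                alloc[n] = offers[n]
            break
        for n in satisfied:
            alloc[n] = children[n][1]
            remaining -= children[n][1]
        active -= satisfied
    return alloc


# Three subordinates with equal entitlement but very different demand.
result = divide(1, {
    "idle":   (1, Fraction(1, 10)),
    "medium": (1, Fraction(1, 2)),
    "hog":    (1, Fraction(1)),
})
print({name: float(share) for name, share in result.items()})
```

Here "idle" only wants 10%, so the 90% left over is split between the
other two at 45% each instead of leaving a third of the parent's
entitlement unused. The hard part described above, measuring demand in
the first place, is simply taken as given in this sketch.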

As a first guess, an entity's CPU bandwidth usage is an indicator of 
demand but doesn't take into account unmet demand due to tasks waiting 
on a run queue for access to the CPU.  On the other hand, usage 
plus time waiting on the queue isn't a good measure of demand either 
(although it's probably a good upper bound) as it's unlikely that the 
task would have used the same amount of CPU as the waiting time if it 
had gone straight to the CPU.

But my main point is that it is possible to build schedulers that can 
achieve higher level scheduling policies.  Versions of PLFS work on 
Windows from user space by twiddling process priorities.  Part of my 
more recent work at Aurema had been involved in patching Linux's 
scheduler so that nice worked more predictably so that we could release 
a user space version of PLFS for Linux.  The other part was to add hard 
CPU bandwidth caps for processes so that ARMTech could enforce hard CPU 
bandwidth caps on higher level entities (as this can't be done without 
the kernel being able to do it at that level).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 19:27                                         ` Chris Friesen
@ 2007-04-19  0:49                                           ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-19  0:49 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Mark Glines, Linus Torvalds, Matt Mackall, Nick Piggin,
	Bill Huey, Mike Galbraith, William Lee Irwin III, linux-kernel,
	ck list, Thomas Gleixner, Andrew Morton, Arjan van de Ven

Chris Friesen wrote:
> Mark Glines wrote:
> 
>> One minor question: is it even possible to be completely fair on SMP?
>> For instance, if you have a 2-way SMP box running 3 applications, one of
>> which has 2 threads, will the threaded app have an advantage here?  (The
>> current system seems to try to keep each thread on a specific CPU, to
>> reduce cache thrashing, which means threads and processes alike each
>> get 50% of the CPU.)
> 
> I think the ideal in this case would be to have both threads on one cpu, 
> with the other app on the other cpu.  This gives inter-process fairness 
> while minimizing the amount of task migration required.

Solving this sort of issue was one of the reasons for the smpnice patches.

> 
> More interesting is the case of three processes on a 2-cpu system.  Do 
> we constantly migrate one of them back and forth to ensure that each of 
> them gets 66% of a cpu?

Depends how keen you are on fairness.  Unless the processes are long term 
continuously active tasks that never sleep it's probably not an issue as 
they'll probably move around enough in the long term for them each to 
get 66% over the long term.
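
For what it's worth, the 66% figure is easy to check with a toy
Python model. It assumes equal time slices and that on every slice
the two CPUs run whichever tasks have had the least CPU so far, which
is an idealisation and not what the load balancer does; simulate() is
a hypothetical helper.

```python
def simulate(n_tasks=3, n_cpus=2, slices=300):
    """Return how many slices each task ran on an n_cpus machine."""
    ran = [0] * n_tasks
    for _ in range(slices):
        # Run the n_cpus tasks that have received the least CPU so far.
        by_least_run = sorted(range(n_tasks), key=lambda t: ran[t])
        for task in by_least_run[:n_cpus]:
            ran[task] += 1
    return ran


counts = simulate()
print([f"{c / 300:.1%}" for c in counts])
```

With three always-runnable tasks and two CPUs each task ends up
running in 200 of the 300 slices, i.e. two thirds of a CPU, at the
price of one task changing CPU on nearly every slice.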

Exact load balancing for real work loads (where tasks are coming and 
going, sleeping and waking semi randomly and over relatively brief 
periods) is probably unattainable because by the time you've worked out 
the ideal placement of the currently runnable tasks on the available 
CPUs it's all changed and the solution is invalid.  The best you can 
hope for is that the change isn't so great as to completely invalidate the 
solution and the changes you make as a result are an improvement on the 
current allocation of processes to CPUs.

The above probably doesn't hold for some systems such as those large 
super computer jobs that run for several days but they're probably best 
served by explicit allocation of processes to CPUs using the process 
affinity mechanism.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:22             ` Christian Hesse
@ 2007-04-19  1:37               ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-19  1:37 UTC (permalink / raw)
  To: Christian Hesse; +Cc: Ingo Molnar, linux-kernel, suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]

Hi.

On Thu, 2007-04-19 at 00:22 +0200, Christian Hesse wrote:
> On Thursday 19 April 2007, Ingo Molnar wrote:
> > * Christian Hesse <mail@earthworm.de> wrote:
> > > > although probably your suspend2 problem is still not fixed, it's
> > > > worth a try nevertheless. Which suspend2 patch did you apply, and
> > > > was it against -rc6 or -rc7?
> > >
> > > You are right again. ;-)
> > >
> > > Linux 2.6.21-rc7
> > > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > > CFS v3 (without any additional patches)
> > >
> > > And it still hangs on suspend.
> >
> > what's the easiest way for me to try suspend2? Apply the patch, reboot
> > into the kernel, then execute what command to suspend? (there's a
> > confusing mishmash of initiators of all the suspend variants. Can i drive
> > this by echoing to /sys/power/state?)
> 
> Perhaps you have to install suspend2-userui as well for the output (I'm not 
> sure whether it works without). Then you can trigger the suspend by echoing 
> to /sys/power/suspend2/do_suspend.
> Useful information can be found in the Howto:
> 
> http://www.suspend2.net/HOWTO
> 
> I dropped some ccs to not abuse Linus and friends.

You can suspend and resume without it.

Regards,

Nigel



* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:56             ` Bob Picco
@ 2007-04-19  1:43               ` Nigel Cunningham
  2007-04-19  6:29               ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-19  1:43 UTC (permalink / raw)
  To: Bob Picco
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1649 bytes --]

Hi.

On Wed, 2007-04-18 at 18:56 -0400, Bob Picco wrote:
> Ingo Molnar wrote:	[Wed Apr 18 2007, 06:02:28PM EDT]
> > 
> > * Christian Hesse <mail@earthworm.de> wrote:
> > 
> > > > although probably your suspend2 problem is still not fixed, it's 
> > > > worth a try nevertheless. Which suspend2 patch did you apply, and 
> > > > was it against -rc6 or -rc7?
> > > 
> > > You are right again. ;-)
> > > 
> > > Linux 2.6.21-rc7
> > > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > > CFS v3 (without any additional patches)
> > > 
> > > And it still hangs on suspend.
> > 
> > what's the easiest way for me to try suspend2? Apply the patch, reboot 
> > into the kernel, then execute what command to suspend? (there's a 
> > confusing mishmash of initiators of all the suspend variants. Can i drive 
> > this by echoing to /sys/power/state?)
> > 
> > 	Ingo
> I had hoped to collect more data with CFS V2. It crashes in
> scale_nice_down for s2ram when attempting to disable_nonboot_cpus. 
> So part of the traceback looks like (typed by hand with obvious omissions):
> 
> scale_nice_down
> update_stats_wait_end - not shown in traceback because inlined
> pick_next_task_fair
> migration_call
> task_rq_lock
> notifier_call_chain
> _cpu_down
> disable_nonboot_cpus
> ...
> 
> This is standard -rc7 with V2 CFS applied. It could be a completely
> unrelated issue. I'll attempt to debug further tomorrow.

That - and Christian's other reply with the jpg - look to me more like
this is an interaction between CFS and cpu hotplugging than Suspend2
itself. Can you also reproduce this with swsusp?

Regards,

Nigel



* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:02           ` Ingo Molnar
  2007-04-18 22:22             ` Christian Hesse
  2007-04-18 22:56             ` Bob Picco
@ 2007-04-19  1:52             ` Nigel Cunningham
  2007-04-19  7:04               ` Ingo Molnar
  2 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-19  1:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1160 bytes --]

Hi.

On Thu, 2007-04-19 at 00:02 +0200, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> 
> > > although probably your suspend2 problem is still not fixed, it's 
> > > worth a try nevertheless. Which suspend2 patch did you apply, and 
> > > was it against -rc6 or -rc7?
> > 
> > You are right again. ;-)
> > 
> > Linux 2.6.21-rc7
> > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > CFS v3 (without any additional patches)
> > 
> > And it still hangs on suspend.
> 
> what's the easiest way for me to try suspend2? Apply the patch, reboot 
> into the kernel, then execute what command to suspend? (there's a 
> confusing mishmash of initiators of all the suspend variants. Can i drive 
> this by echoing to /sys/power/state?)

From subsequent emails, I think you already got your answer, but just in
case...

Yes, if you enabled "Replace swsusp by default" and you already had it
set up for getting swsusp to resume. If not, and you're using an
initrd/ramfs, you'll need to modify it to run
"echo > /sys/power/suspend2/do_resume" after /sys and /proc are mounted
but prior to mounting / and so on.

Regards,

Nigel



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  8:44                 ` Ingo Molnar
@ 2007-04-19  2:20                   ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-19  2:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>>> And my scheduler for example cuts down the amount of policy code and 
>>> code size significantly.
>> Yours is one of the smaller patches mainly because you perpetuate (or 
>> you did in the last one I looked at) the (horrible to my eyes) dual 
>> array (active/expired) mechanism.  That this idea was bad should have 
>> been apparent to all as soon as the decision was made to excuse some 
>> tasks from being moved from the active array to the expired array.  
>> This essentially meant that there would be circumstances where extreme 
>> unfairness (to the extent of starvation in some cases) could occur -- the 
>> very thing that the mechanism was originally designed to prevent (as far 
>> as I can gather).  Right about then in the development of the O(1) 
>> scheduler alternative solutions should have been sought.
> 
> in hindsight i'd agree.

Hindsight's a wonderful place isn't it :-) and, of course, it's where I 
was making my comments from.

> But back then we were clearly not ready for 
> fine-grained accurate statistics + trees (cpus are a lot faster at more 
> complex arithmetics today, plus people still believed that low-res can 
> be done well enough), and taking out any of these two concepts from CFS
> would result in a similarly complex runqueue implementation.

I disagree.  The single priority array with a promotion mechanism that I 
use in the SPA schedulers can do the job of avoiding starvation with no 
measurable increase in the overhead.  Fairness, nice, good interactive 
responsiveness can then be managed by how you determine tasks' dynamic 
priorities.
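
To illustrate the general idea of promotion (this is a generic aging
sketch in Python, not the SPA scheduler code), assume a single array
of FIFO slots where slot 0 is the highest priority and every waiting
task is moved up one slot per tick. NUM_PRIOS, the class and the
promotion interval are all hypothetical.

```python
from collections import deque

NUM_PRIOS = 5  # hypothetical number of slots; 0 is the highest priority


class SinglePriorityArray:
    def __init__(self):
        self.slots = [deque() for _ in range(NUM_PRIOS)]

    def enqueue(self, task, prio):
        self.slots[prio].append(task)

    def pick_next(self):
        # Take the first task from the highest non-empty slot.
        for slot in self.slots:
            if slot:
                return slot.popleft()
        return None

    def promote(self):
        # Move every waiting task up exactly one slot; tasks already
        # in slot 0 stay there, ahead of the newly promoted ones.
        for prio in range(1, NUM_PRIOS):
            self.slots[prio - 1].extend(self.slots[prio])
            self.slots[prio].clear()


array = SinglePriorityArray()
array.enqueue("batch", NUM_PRIOS - 1)   # lowest priority task

ran = []
for tick in range(6):
    array.enqueue(f"hi{tick}", 0)       # a new top priority task every tick
    ran.append(array.pick_next())
    array.promote()

print(ran)
```

Even with a fresh top priority task arriving on every tick, "batch"
reaches slot 0 after four promotions and runs on the fifth tick, so
it cannot be starved indefinitely. Without the promote() call it
would never run in this loop.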

> Also, the 
> array switch was just thought of as another piece of 'if the 
> heuristics go wrong, we fall back to an array switch' logic, right in 
> line with the other heuristics. And you have to accept it, mainline's 
> ability to auto-renice make -j jobs (and other CPU hogs) was quite a 
> plus for developers, so it had (and probably still has) quite some 
> inertia.

I agree, it wasn't totally useless especially for the average user.  My 
main problem with it was that the effect of "nice" wasn't consistent or 
predictable enough for reliable resource allocation.

I also agree with the aims of the various heuristics i.e. you have to be 
unfair and give some tasks preferential treatment in order to give the 
users the type of responsiveness that they want.  It's just a shame that 
it got broken in the process but as you say it's easier to see these 
things in hindsight than in the middle of the melee.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 14:48                                 ` Linus Torvalds
  2007-04-18 15:23                                   ` Matt Mackall
@ 2007-04-19  3:18                                   ` Nick Piggin
  2007-04-19  5:14                                     ` Andrew Morton
  2007-04-21 13:40                                   ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Bill Davidsen
  2 siblings, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-19  3:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 18 Apr 2007, Matt Mackall wrote:
> > 
> > Why is X special? Because it does work on behalf of other processes?
> > Lots of things do this. Perhaps a scheduler should focus entirely on
> > the implicit and directed wakeup matrix and optimizing that
> > instead[1].
> 
> I 100% agree - the perfect scheduler would indeed take into account where 
> the wakeups come from, and try to "weigh" processes that help other 
> processes make progress more. That would naturally give server processes 
> more CPU power, because they help others.
> 
> I don't believe for a second that "fairness" means "give everybody the 
> same amount of CPU". That's a totally illogical measure of fairness. All 
> processes are _not_ created equal.

I believe that unless the kernel is told of these inequalities, then it
must schedule fairly.

And yes, by fairly, I mean fairly among all threads as a base resource
class, because that's what Linux has always done (and if you aggregate
into higher classes, you still need that per-thread scheduling).

So I'm not excluding extra scheduling classes like per-process, per-user,
but among any class of equal schedulable entities, fair scheduling is the
only option because the alternative of unfairness is just insane.


> That said, even trying to do "fairness by effective user ID" would 
> probably already do a lot. In a desktop environment, X would get as much 
> CPU time as the user processes, simply because it's in a different 
> protection domain (and that's really what "effective user ID" means: it's 
> not about "users", it's really about "protection domains").
> 
> And "fairness by euid" is probably a hell of a lot easier to do than 
> trying to figure out the wakeup matrix.

Well my X server has an euid of root, which would mean my X clients can
cause X to do work and eat into root's resources. Or as Ingo said, X
may not be running as root. Seems like just another hack to try to
implicitly solve the X problem and probably create a lot of others
along the way.

All fairness issues aside, in the context of keeping a very heavily
loaded desktop interactive, X is special. That you are trying to think
up funny rules that would implicitly give X better priority is kind of
indicative of that.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 12:49             ` Con Kolivas
@ 2007-04-19  3:28               ` Nick Piggin
  0 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-19  3:28 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds,
	Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	Steve Fox, Nishanth Aravamudan

On Wed, Apr 18, 2007 at 10:49:45PM +1000, Con Kolivas wrote:
> On Wednesday 18 April 2007 22:13, Nick Piggin wrote:
> >
> > The kernel compile (make -j8 on 4 thread system) is doing 1800 total
> > context switches per second (450/s per runqueue) for cfs, and 670
> > for mainline. Going up to 20ms granularity for cfs brings the context
> > switch numbers similar, but user time is still a % or so higher. I'd
> > be more worried about compute heavy threads which naturally don't do
> > much context switching.
> 
> While kernel compiles are nice and easy to do I've seen enough criticism of 
> them in the past to wonder about their usefulness as a standard benchmark on 
> their own.

Actually it is a real workload for most kernel developers including you
no doubt :)

The criticisms of kernbench for the kernel are probably fair in that
kernel compiles don't exercise a lot of kernel functionality (page
allocator and fault paths mostly, IIRC). However as far as I'm concerned,
they're great for testing the CPU scheduler, because it doesn't actually
matter whether you're running in userspace or kernel space for a context
switch to blow your caches. The results are quite stable.

You could actually make up a benchmark that hurts a whole lot more from
context switching, but I figure that kernbench is a real world thing
that shows it up quite well.


> > Some other numbers on the same system
> > Hackbench:          2.6.21-rc7   cfs-v2 1ms[*]   nicksched
> >  10 groups: Time:      1.332         0.743         0.607
> >  20 groups: Time:      1.197         1.100         1.241
> >  30 groups: Time:      1.754         2.376         1.834
> >  40 groups: Time:      3.451         2.227         2.503
> >  50 groups: Time:      3.726         3.399         3.220
> >  60 groups: Time:      3.548         4.567         3.668
> >  70 groups: Time:      4.206         4.905         4.314
> >  80 groups: Time:      4.551         6.324         4.879
> >  90 groups: Time:      7.904         6.962         5.335
> > 100 groups: Time:      7.293         7.799         5.857
> > 110 groups: Time:     10.595         8.728         6.517
> > 120 groups: Time:      7.543         9.304         7.082
> > 130 groups: Time:      8.269        10.639         8.007
> > 140 groups: Time:     11.867         8.250         8.302
> > 150 groups: Time:     14.852         8.656         8.662
> > 160 groups: Time:      9.648         9.313         9.541
> 
> Hackbench even more so. In a prolonged discussion with Rusty Russell on this 
> issue, he suggested hackbench was more a pass/fail benchmark to ensure there 
> was no starvation scenario that never ended, and very little value should be 
> placed on the actual results returned from it.

Yeah, cfs seems to do a little worse than nicksched here, but I
include the numbers not because I think that is significant, but to
show mainline's poor characteristics.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  3:18                                   ` Nick Piggin
@ 2007-04-19  5:14                                     ` Andrew Morton
  2007-04-19  6:38                                       ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Andrew Morton @ 2007-04-19  5:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar,
	ck list, Bill Huey, linux-kernel, Arjan van de Ven,
	Thomas Gleixner

On Thu, 19 Apr 2007 05:18:07 +0200 Nick Piggin <npiggin@suse.de> wrote:

> And yes, by fairly, I mean fairly among all threads as a base resource
> class, because that's what Linux has always done

Yes, there are potential compatibility problems.  Example: a machine with
100 busy httpd processes and suddenly a big gzip starts up from console or
cron.

Under current kernels, that gzip will take ages and the httpds will take a
1% slowdown, which may well be exactly the behaviour which is desired.

If we were to schedule by UID then the gzip suddenly gets 50% of the CPU
and those httpd's all take a 50% hit, which could be quite serious.

That's simple to fix via nicing, but people have to know to do that, and
there will be a transition period where some disruption is possible.
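
The arithmetic behind the httpd/gzip example can be checked with a small
sketch. It assumes a single CPU, 100 always-runnable httpd tasks under one
UID and one gzip task under a different UID, and equal weights everywhere.
The function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Share of one CPU a task gets when every runnable task is an equal
 * scheduling entity (per-thread fairness). */
double per_task_share(int nr_tasks)
{
	return 1.0 / nr_tasks;
}

/* Share of one CPU a task gets when the CPU is first split equally
 * between UIDs and then equally between the tasks of that UID. */
double per_uid_task_share(int nr_uids, int nr_tasks_in_uid)
{
	return 1.0 / nr_uids / nr_tasks_in_uid;
}

/* Fractional loss of CPU share when going from 'before' to 'after'. */
double slowdown(double before, double after)
{
	return 1.0 - after / before;
}

/* Floating-point comparison helper with a small tolerance. */
int close_to(double a, double b)
{
	double d = a - b;

	if (d < 0)
		d = -d;
	return d < 1e-9;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to run the demo */
int main(void)
{
	double httpd_alone = per_task_share(100);

	/* Per-thread fairness: gzip is just the 101st task. */
	printf("per-thread: httpd slowdown %.4f\n",
	       slowdown(httpd_alone, per_task_share(101)));

	/* Per-UID fairness: two UIDs, gzip alone in one of them. */
	printf("per-UID:    gzip share %.4f, httpd slowdown %.4f\n",
	       per_uid_task_share(2, 1),
	       slowdown(httpd_alone, per_uid_task_share(2, 100)));
	return 0;
}
#endif
```

Under these assumptions each httpd loses about 1/101 of its share with
per-thread fairness, and half of its share with per-UID fairness, which
matches the 1% and 50% figures above.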


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 23:12             ` Christian Hesse
@ 2007-04-19  6:28               ` Ingo Molnar
  2007-04-19 20:32                 ` Christian Hesse
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  6:28 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Christian Hesse <mail@earthworm.de> wrote:

> I now got some error message from my system:
> 
> http://www.eworm.de/tmp/cfs-suspend.jpg

ah, this pinpoints a bug: for performance reasons pick_next_task() 
assumes that the runqueue is not empty - which is true for schedule(), 
but not in migrate_dead_tasks(). Does the patch below fix the crash for 
you?

	Ingo

---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned 
 	struct task_struct *next;
 
 	for (;;) {
+		if (!rq->nr_running)
+			break;
 		next = pick_next_task(rq, rq->curr);
 		if (!next)
 			break;
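
The failure mode described above can be modelled in user space. This is
only a sketch: struct rq here is a hypothetical stand-in with an array of
tasks, and the *_sketch functions are illustrative names, not the kernel's
pick_next_task() or migrate_dead_tasks(). It shows why a pick routine that
assumes a non-empty queue needs the caller to check nr_running first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define RQ_CAPACITY 8

struct task {
	int pid;
};

/* Hypothetical stand-in for a runqueue: a stack of runnable tasks. */
struct rq {
	struct task *tasks[RQ_CAPACITY];
	unsigned int nr_running;
};

/* Fast-path pick: assumes rq->nr_running > 0.  With an empty queue the
 * index would wrap around and read outside the array. */
struct task *pick_next_task_sketch(struct rq *rq)
{
	return rq->tasks[rq->nr_running - 1];
}

/* Drain the queue, checking for emptiness before every pick, in the
 * same shape as the patched loop.  Returns the number of tasks removed. */
unsigned int drain_rq_sketch(struct rq *rq)
{
	unsigned int moved = 0;

	for (;;) {
		struct task *next;

		if (!rq->nr_running)
			break;
		next = pick_next_task_sketch(rq);
		if (!next)
			break;
		rq->nr_running--;
		moved++;
	}
	return moved;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to run the demo */
int main(void)
{
	struct task a = { 1 }, b = { 2 };
	struct rq rq = { .tasks = { &a, &b }, .nr_running = 2 };

	printf("drained %u tasks\n", drain_rq_sketch(&rq));
	/* A second call sees an empty queue and never calls the pick. */
	printf("drained %u tasks\n", drain_rq_sketch(&rq));
	return 0;
}
#endif
```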

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:56             ` Bob Picco
  2007-04-19  1:43               ` [Suspend2-devel] " Nigel Cunningham
@ 2007-04-19  6:29               ` Ingo Molnar
  2007-04-19 11:10                 ` Bob Picco
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  6:29 UTC (permalink / raw)
  To: Bob Picco
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel


* Bob Picco <bob.picco@hp.com> wrote:

> I had hoped to collect more data with CFS V2. It crashes in 
> scale_nice_down for s2ram when attempting to disable_nonboot_cpus. So 
> part of traceback looks like (typed by hand with obvious omissions):
> 
> scale_nice_down
> update_stats_wait_end - not shown in traceback because inlined
> pick_next_task_fair
> migration_call
> task_rq_lock
> notifier_call_chain
> _cpu_down
> disable_nonboot_cpus

ok, this looks similar to the jpeg Christian did. Does the patch below 
fix the crash for you?

	Ingo

---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned 
 	struct task_struct *next;
 
 	for (;;) {
+		if (!rq->nr_running)
+			break;
 		next = pick_next_task(rq, rq->curr);
 		if (!next)
 			break;

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  5:14                                     ` Andrew Morton
@ 2007-04-19  6:38                                       ` Ingo Molnar
  2007-04-19  7:57                                         ` William Lee Irwin III
                                                           ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  6:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Arjan van de Ven, Thomas Gleixner


* Andrew Morton <akpm@linux-foundation.org> wrote:

> > And yes, by fairly, I mean fairly among all threads as a base 
> > resource class, because that's what Linux has always done
> 
> Yes, there are potential compatibility problems.  Example: a machine 
> with 100 busy httpd processes and suddenly a big gzip starts up from 
> console or cron.
> 
> Under current kernels, that gzip will take ages and the httpds will 
> take a 1% slowdown, which may well be exactly the behaviour which is 
> desired.
> 
> If we were to schedule by UID then the gzip suddenly gets 50% of the 
> CPU and those httpd's all take a 50% hit, which could be quite 
> serious.
> 
> That's simple to fix via nicing, but people have to know to do that, 
> and there will be a transition period where some disruption is 
> possible.

hmmmm. How about the following then: default to nice -10 for all 
(SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ 
special: root already has disk space reserved to it, root has special 
memory allocation allowances, etc. I don't see a reason why we couldn't by 
default make all root tasks have nice -10. This would be instantly loved 
by sysadmins i suspect ;-)

(distros that go the extra mile of making Xorg run under non-root could 
also go one extra foot and renice that X server to -10.)

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:16           ` CFS and suspend2: hang in atomic copy Ingo Molnar
  2007-04-18 23:12             ` Christian Hesse
@ 2007-04-19  6:41             ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  6:41 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel


* Ingo Molnar <mingo@elte.hu> wrote:

> i just tried the same and it suspended+resumed just fine:
> 
> Restarting tasks ... done.
> Suspend2 debugging info:
> - Suspend core   : 2.2.9.12
> - Kernel Version : 2.6.21-rc7-CFS-v3

the key difference was that i should have attempted to sw-suspend to 
disk on an SMP box - that's where the bug triggered.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 21:48                                               ` Ingo Molnar
  2007-04-18 23:30                                                 ` Davide Libenzi
@ 2007-04-19  6:52                                                 ` Mike Galbraith
  2007-04-19  7:09                                                   ` Ingo Molnar
  2007-04-19  7:14                                                   ` Mike Galbraith
  1 sibling, 2 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-19  6:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin,
	William Lee Irwin III, Peter Williams, Con Kolivas, ck list,
	Bill Huey, Linux Kernel Mailing List, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote:

> so my current impression is that we want per UID accounting to solve the 
> X problem, the kernel threads problem and the many-users problem, but 
> i'd not want to do it for threads just yet because for them there's not 
> really any apparent problem to be solved.

If you really mean UID vs EUID as Linus mentioned, I suppose I could
learn to login as !root, and set KDE up to always give me root shells.

With a heavily reniced X (perfectly fine), that should indeed solve my
daily usage pattern nicely (always need godmode for shells, but not for
mozilla and ilk. 50/50 split automatic without renice of entire gui)

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy
  2007-04-19  1:52             ` [Suspend2-devel] " Nigel Cunningham
@ 2007-04-19  7:04               ` Ingo Molnar
  2007-04-19  9:05                 ` Nigel Cunningham
  2007-04-24 20:23                 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  7:04 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven


* Nigel Cunningham <nigel@nigel.suspend2.net> wrote:

> From subsequent emails, I think you already got your answer, but just 
> in case...
> 
> Yes, if you enabled "Replace swsusp by default" and you already had it 
> set up for getting swsusp to resume. If not, and you're using an 
> initrd/ramfs, you'll need to modify it to
> "echo > /sys/power/suspend2/do_resume" after /sys and /proc are mounted
> but prior to mounting / and so on.

yeah, went with the default suggested by your patch:

   CONFIG_SUSPEND2_REPLACE_SWSUSP=y

and it was pretty easy to set things up. I used "echo disk > 
/sys/power/state" to trigger it.

In hindsight it was all pretty straightforward and suspend2 worked 
beautifully on a UP and on an SMP system i tried. So in exchange for 
suspend2 folks debugging a bug in CFS here's some suspend2 review 
feedback ;) Any plans about moving suspend2 to the upstream kernel? It 
should be pretty easy for it to co-exist with the current swsuspend 
code.

The patch has quite some size:

 89 files changed, 16452 insertions(+), 69 deletions(-)

that should obviously be split up into more than a dozen sub-patches, 
and fed to lkml with the small ones first. (unless it already is split 
up?)

i cannot comment on the kernel/power/ bits (they are way too large 
anyway), other than that they look pretty clean visually, but the 
lowlevel arch and generic kernel bits look sane in detail too, sans a 
few mostly trivial cleanliness issues:

+int suspend2_faulted = 0;
+EXPORT_SYMBOL(suspend2_faulted);

should be done via the pagefault notifier chain mechanism. Also, all the 
exports you added should be EXPORT_SYMBOL_GPL().

this:

-               ClearPageReserved(virt_to_page(addr));
-               init_page_count(virt_to_page(addr));
+               //ClearPageReserved(virt_to_page(addr));
+               //init_page_count(virt_to_page(addr));

looks like there's a buglet in there still somewhere?

+       if(PageHighMem(page))
+               return 0;

coding style.

+       BUG_ON( test_suspend_state(SUSPEND_RUNNING) &&  /* Suspend2, that is */

make this a WARN_ON() or a WARN_ON_ONCE() - that way you have a chance 
to even get feedback from users, instead of a 'uhm, X froze' report.

+#define FREEZER_OFF 0
+#define FREEZER_USERSPACE_FROZEN 1
+#define FREEZER_FULLY_ON 2

should be:

+#define FREEZER_OFF			0
+#define FREEZER_USERSPACE_FROZEN	1
+#define FREEZER_FULLY_ON		2

(you want your reviewers to have a pleasant time reading your code :)

+#define NETLINK_SUSPEND2_USERUI        20      /* For suspend2's userui */

IIRC userui was at the center of suspend2 merge flames, right? So you 
might want to layer it on top of a less flashy suspend2-core and thus get 
90% of your patch upstream?

+++ linux/mm/vmscan.c

the MM impact looks quite nontrivial. But i suspect this is unavoidable, 
because you zap portions of the pagecache on the way to disk, so when it 
comes back it results in a different pagecache (new lru lists, etc.), 
right?

+++ linux/lib/dyn_pageflags.c

shouldn't this be in mm/dyn_pageflags.c? Plus it would be nice to use 
some other core kernel user for this infrastructure. (but it's not a 
necessity i guess)

but ... again, the patch looks sane all around.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  6:52                                                 ` Mike Galbraith
@ 2007-04-19  7:09                                                   ` Ingo Molnar
  2007-04-19  7:32                                                     ` Mike Galbraith
  2007-04-19  7:14                                                   ` Mike Galbraith
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  7:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel


* Mike Galbraith <efault@gmx.de> wrote:

> With a heavily reniced X (perfectly fine), that should indeed solve my 
> daily usage pattern nicely (always need godmode for shells, but not 
> for mozilla and ilk. 50/50 split automatic without renice of entire 
> gui)

how about the first-approximation solution i suggested in the previous 
mail: to add a per UID default nice level? (With this default defaulting 
to '-10' for all root-owned processes, and defaulting to '0' for 
everything else.) That would solve most of the current CFS regressions 
at hand.

	Ingo
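
The per-UID default suggested above amounts to a simple mapping from UID
to starting nice level. A minimal sketch follows, assuming root is UID 0.
The function and macro names are hypothetical and do not come from any
posted patch.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

#define ROOT_DEFAULT_NICE	(-10)	/* proposed default for root-owned tasks */
#define USER_DEFAULT_NICE	0	/* default for everything else */

/* Default nice level a newly created task would start with, chosen
 * purely from the owning UID. */
int default_nice_for_uid(uid_t uid)
{
	return uid == 0 ? ROOT_DEFAULT_NICE : USER_DEFAULT_NICE;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to run the demo */
int main(void)
{
	printf("uid 0    -> nice %d\n", default_nice_for_uid(0));
	printf("uid 1000 -> nice %d\n", default_nice_for_uid(1000));
	return 0;
}
#endif
```

Note that this only sets a starting point. It does not group tasks by
UID, so it is a different mechanism from the per-UID fairness discussed
elsewhere in the thread.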

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  6:52                                                 ` Mike Galbraith
  2007-04-19  7:09                                                   ` Ingo Molnar
@ 2007-04-19  7:14                                                   ` Mike Galbraith
  1 sibling, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-19  7:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin,
	William Lee Irwin III, Peter Williams, Con Kolivas, ck list,
	Bill Huey, Linux Kernel Mailing List, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

On Thu, 2007-04-19 at 08:52 +0200, Mike Galbraith wrote:
> On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote:
> 
> > so my current impression is that we want per UID accounting to solve the 
> > X problem, the kernel threads problem and the many-users problem, but 
> > i'd not want to do it for threads just yet because for them there's not 
> > really any apparent problem to be solved.
> 
> If you really mean UID vs EUID as Linus mentioned, I suppose I could
> learn to login as !root, and set KDE up to always give me root shells.
> 
> With a heavily reniced X (perfectly fine), that should indeed solve my
> daily usage pattern nicely (always need godmode for shells, but not for
> mozilla and ilk. 50/50 split automatic without renice of entire gui)

Backward, needs to be EUID as Linus suggested.  Kernel builds etc along
with reniced X in root's bucket, surfing and whatnot in Joe-User's
bucket.

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  7:09                                                   ` Ingo Molnar
@ 2007-04-19  7:32                                                     ` Mike Galbraith
  2007-04-19 16:55                                                       ` Davide Libenzi
  0 siblings, 1 reply; 712+ messages in thread
From: Mike Galbraith @ 2007-04-19  7:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > With a heavily reniced X (perfectly fine), that should indeed solve my 
> > daily usage pattern nicely (always need godmode for shells, but not 
> > for mozilla and ilk. 50/50 split automatic without renice of entire 
> > gui)
> 
> how about the first-approximation solution i suggested in the previous 
> mail: to add a per UID default nice level? (With this default defaulting 
> to '-10' for all root-owned processes, and defaulting to '0' for 
> everything else.) That would solve most of the current CFS regressions 
> at hand.

That would make my kernel builds etc interfere with my other self's
surfing and whatnot.  With it by EUID, when I'm surfing or whatnot, the
X portion of my Joe-User activity pushes the compile portion of root
down in bandwidth utilization automagically, which is exactly the right
thing, because the root me is not as important as the Joe-User me using
the GUI at that time.  If the idea of X disturbing root upsets some,
they can move X to another UID.  Generally, it seems perfect for here.

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  6:38                                       ` Ingo Molnar
@ 2007-04-19  7:57                                         ` William Lee Irwin III
  2007-04-19 11:50                                           ` Peter Williams
  2007-04-19  8:33                                         ` Nick Piggin
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
  2 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-19  7:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Arjan van de Ven, Thomas Gleixner

* Andrew Morton <akpm@linux-foundation.org> wrote:
>> Yes, there are potential compatibility problems.  Example: a machine 
>> with 100 busy httpd processes and suddenly a big gzip starts up from 
>> console or cron.
[...]

On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote:
> hmmmm. How about the following then: default to nice -10 for all 
> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ 
> special: root already has disk space reserved to it, root has special 
> memory allocation allowances, etc. I don't see a reason why we couldn't by 
> default make all root tasks have nice -10. This would be instantly loved 
> by sysadmins i suspect ;-)
> (distros that go the extra mile of making Xorg run under non-root could 
> also go one extra foot and renice that X server to -10.)

I'd further recommend making priority levels accessible to kernel threads
that are not otherwise accessible to processes, both above and below
user-available priority levels. Basically, if you can get SCHED_RR and
SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN
scheduler class can coexist with SCHED_OTHER in like fashion, but with
availability of higher and lower priorities than any userspace process
is allowed, and potentially some differing scheduling semantics. In such
a manner nonessential background processing intended not to ever disturb
userspace can be given priorities appropriate to it (perhaps even con's
SCHED_IDLEPRIO would make sense), and other, urgent processing can be
given priority over userspace altogether.

I believe root's default priority can be adjusted in userspace as
things now stand somewhere in /etc/ but I'm not sure of the specifics.
Word is that it's somewhere in /etc/security/limits.conf.


-- wli

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 23:30                                                 ` Davide Libenzi
@ 2007-04-19  8:00                                                   ` Ingo Molnar
  2007-04-19 15:43                                                     ` Davide Libenzi
  2007-04-21 14:09                                                     ` Bill Davidsen
  2007-04-19 17:39                                                   ` Bernd Eckenfels
  1 sibling, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  8:00 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > That's one reason why i dont think it's necessarily a good idea to 
> > group-schedule threads, we dont really want to do a per thread group 
> > percpu_alloc().
> 
> It is still not clear to me how much overhead this will bring to the 
> table, but I think (like Linus was pointing out) the hierarchy should 
> look like:
> 
> Top (VCPU maybe?)
>     User
>         Process
>             Thread
> 
> The "run_queue" concept (and data) that is now bound to a CPU needs to be 
> replicated in:
> 
> ROOT <- VCPUs add themselves here
>     VCPU <- USERs add themselves here
>         USER <- PROCs add themselves here
>             PROC <- THREADs add themselves here
>                 THREAD (ultimate fine grained scheduling unit)
> 
> So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking 
> up a new task would mean:
> 
> VCPU = ROOT->lookup();
> USER = VCPU->lookup();
> PROC = USER->lookup();
> THREAD = PROC->lookup();
> 
> Run-time statistics should propagate back the other way around.

yeah, but this looks quite bad from an overhead POV ... i think we can 
do a lot simpler to solve X and kernel threads prioritization.

> > In fact for threads the _reverse_ problem exists, threaded apps tend 
> > to _strive_ for more performance - hence their desperation of using 
> > the threaded programming model to begin with ;) (just think of media 
> > playback apps which are typically multithreaded)
> 
> The same user nicing two different multi-threaded processes would 
> expect a predictable CPU distribution too. [...]

i disagree that the user 'would expect' this. Some users might. Others 
would say: 'my 10-thread rendering engine is more important than a 
1-thread job because it's using 10 threads for a reason'. And the CFS 
feedback so far strengthens this point: the default behavior of treating 
the thread as a single scheduling (and CPU time accounting) unit works 
pretty well on the desktop.

think about it in another, 'kernel policy' way as well: we'd like to 
_encourage_ more parallel user applications. Hurting them by accounting 
all threads together sends the exact opposite message.

> [...] Doing that efficiently (the old per-cpu run-queue is pretty nice 
> from many POVs) is the real challenge.

yeah.

	Ingo
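
The multi-level lookup quoted above (ROOT, VCPU, USER, PROC, THREAD) can
be sketched as a tree walk. This is only an illustration under stated
assumptions: every level uses the same policy (pick the child with the
least accumulated run time), every non-thread node has at least one
runnable child, and struct sched_node and all function names are
hypothetical. It also makes the overhead concern visible, since each pick
costs one lookup per level and each charge touches every ancestor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CHILDREN 4

struct sched_node {
	const char *name;
	unsigned long runtime;		/* accumulated run time, arbitrary units */
	struct sched_node *parent;
	struct sched_node *children[MAX_CHILDREN];
	int nr_children;		/* 0 for a thread (leaf) */
};

/* Attach child under parent; returns 0 on success, -1 if parent is full. */
int node_add(struct sched_node *parent, struct sched_node *child)
{
	if (parent->nr_children >= MAX_CHILDREN)
		return -1;
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
	return 0;
}

/* One level of lookup: the child with the least run time so far.
 * Ties go to the child that was added first. */
struct sched_node *node_lookup(struct sched_node *node)
{
	struct sched_node *best = NULL;
	int i;

	for (i = 0; i < node->nr_children; i++) {
		struct sched_node *c = node->children[i];

		if (!best || c->runtime < best->runtime)
			best = c;
	}
	return best;
}

/* Walk from the root down to a thread, one lookup per level. */
struct sched_node *pick_thread(struct sched_node *root)
{
	struct sched_node *n = root;

	while (n && n->nr_children > 0)
		n = node_lookup(n);
	return n;
}

/* Run-time statistics propagate back up: charge the thread and every
 * ancestor with the time it just ran. */
void charge_runtime(struct sched_node *thread, unsigned long delta)
{
	struct sched_node *n;

	for (n = thread; n; n = n->parent)
		n->runtime += delta;
}

#ifdef DEMO_MAIN	/* build with -DDEMO_MAIN to run the demo */
int main(void)
{
	struct sched_node root = { "root" }, vcpu = { "vcpu0" };
	struct sched_node ua = { "userA" }, ub = { "userB" };
	struct sched_node pa = { "procA" }, pb = { "procB" };
	struct sched_node t1 = { "A/t1" }, t2 = { "A/t2" }, t3 = { "B/t3" };
	int i;

	node_add(&root, &vcpu);
	node_add(&vcpu, &ua);
	node_add(&vcpu, &ub);
	node_add(&ua, &pa);
	node_add(&ub, &pb);
	node_add(&pa, &t1);
	node_add(&pa, &t2);
	node_add(&pb, &t3);

	for (i = 0; i < 4; i++) {
		struct sched_node *t = pick_thread(&root);

		printf("run %s\n", t->name);
		charge_runtime(t, 10);
	}
	return 0;
}
#endif
```

With two threads under userA and one under userB, the picks alternate
between the two users rather than between the three threads, which is the
per-user distribution the hierarchy is meant to give.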

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  6:38                                       ` Ingo Molnar
  2007-04-19  7:57                                         ` William Lee Irwin III
@ 2007-04-19  8:33                                         ` Nick Piggin
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
  2 siblings, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-19  8:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Matt Mackall,
	William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven,
	Thomas Gleixner

On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > And yes, by fairly, I mean fairly among all threads as a base 
> > > resource class, because that's what Linux has always done
> > 
> > Yes, there are potential compatibility problems.  Example: a machine 
> > with 100 busy httpd processes and suddenly a big gzip starts up from 
> > console or cron.
> > 
> > Under current kernels, that gzip will take ages and the httpds will 
> > take a 1% slowdown, which may well be exactly the behaviour which is 
> > desired.
> > 
> > If we were to schedule by UID then the gzip suddenly gets 50% of the 
> > CPU and those httpd's all take a 50% hit, which could be quite 
> > serious.
> > 
> > That's simple to fix via nicing, but people have to know to do that, 
> > and there will be a transition period where some disruption is 
> > possible.
> 
> hmmmm. How about the following then: default to nice -10 for all 
> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ 
> special: root already has disk space reserved to it, root has special 
> memory allocation allowances, etc. I don't see a reason why we couldn't by 
> default make all root tasks have nice -10. This would be instantly loved 
> by sysadmins i suspect ;-)

I have no problem with doing fancy new fairness classes and things.

But considering that we _need_ to have per-thread fairness and that
is also what the current scheduler has and what we need to do well for
obvious reasons, the best path to take is to get per-thread scheduling
up to a point where it is able to replace the current scheduler, then
look at more complex things after that.
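
For reference, the numbers in Andrew's httpd/gzip example quoted above
work out as follows. This is only a sketch of the arithmetic, not code
from any posted patch: the two helper names are made up, every task is
assumed to be CPU-bound with equal weight, and "schedule by UID" is
assumed to mean an even split between users first and then an even
split among each user's tasks.

```c
#include <assert.h>

/*
 * Percentage of one CPU that a single task gets when all runnable
 * tasks are treated equally (per-thread fairness).
 */
static double per_thread_share(int total_tasks)
{
	assert(total_tasks > 0);
	return 100.0 / total_tasks;
}

/*
 * Percentage of one CPU that a single task gets when the CPU is
 * first divided evenly between users and each user's slice is then
 * divided evenly among that user's tasks (assumed per-UID model).
 */
static double per_uid_share(int nr_users, int tasks_of_this_user)
{
	assert(nr_users > 0 && tasks_of_this_user > 0);
	return 100.0 / nr_users / tasks_of_this_user;
}
```

With 100 httpds and one gzip, per_thread_share(101) gives each task
roughly 0.99% (each httpd loses about 1% of what it had), while
per_uid_share(2, 1) gives gzip 50% and per_uid_share(2, 100) leaves
each httpd 0.5%, i.e. half of the 1% it had before.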
 

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 13:01             ` Willy Tarreau
  2007-04-14 13:27               ` Willy Tarreau
  2007-04-15  7:54               ` Mike Galbraith
@ 2007-04-19  9:01               ` Ingo Molnar
  2007-04-19 12:54                 ` Willy Tarreau
  2007-04-19 17:32                 ` Gene Heskett
  2 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19  9:01 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> Good idea. The machine I'm typing from now has 1000 scheddos running 
> at +19, and 12 gears at nice 0. [...]

> From time to time, one of the 12 aligned gears will quickly perform a 
> full quarter of round while others slowly turn by a few degrees. In 
> fact, while I don't know this process's CPU usage pattern, there's 
> something useful in it: it allows me to visually see when processes 
> accelerate/decelerate. [...]

cool idea - i have just tried this and it rocks - you can easily see the 
'nature' of CPU time distribution just via visual feedback. (Is there 
any easy way to start up 12 glxgears fully aligned, or does one always 
have to mouse around to get them into proper position?)

btw., i am using another method to quickly judge X's behavior: i started 
the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth 
opengl-rendered snow fall on the desktop background. That gives me an 
idea about how well X is scheduling under various workloads, without 
having to instrument it explicitly.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy
  2007-04-19  7:04               ` Ingo Molnar
@ 2007-04-19  9:05                 ` Nigel Cunningham
  2007-04-24 20:23                 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
  1 sibling, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-19  9:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 6210 bytes --]

Hi Ingo.

On Thu, 2007-04-19 at 09:04 +0200, Ingo Molnar wrote:
> * Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> 
> > From subsequent emails, I think you already got your answer, but just 
> > in case...
> > 
> > Yes, if you enabled "Replace swsusp by default" and you already had it 
> > set up for getting swsusp to resume. If not, and you're using an 
> > > initrd/ramfs, you'll need to modify it to do
> > > "echo > /sys/power/suspend2/do_resume" after /sys and /proc are
> > > mounted but prior to mounting / and so on.
> 
> yeah, went with the default suggested by your patch:
> 
>    CONFIG_SUSPEND2_REPLACE_SWSUSP=y
> 
> and it was pretty easy to set things up. I used "echo disk > 
> /sys/power/state" to trigger it.
> 
> In hindsight it was all pretty straightforward and suspend2 worked 
> beautifully on an UP and on an SMP system i tried. So in exchange for 
> suspend2 folks debugging a bug in CFS here's some suspend2 review 
> feedback ;) Any plans about moving suspend2 to the upstream kernel? It 
> should be pretty easy for it to co-exist with the current swsuspend 
> code.

I really would like to get it into Linus' tree but Pavel doesn't want it
(obviously!) and I haven't got together enough of a case yet to convince
Andrew. I have yet another here's-why-I-think-it-should-be-merged email in
the works (poor Andrew!) but there are too many other things on my plate
at the mo.

> The patch has quite some size:
> 
>  89 files changed, 16452 insertions(+), 69 deletions(-)
> 
> that should obviously be split up into more than a dozen sub-patches, 
> and fed to lkml with the small ones first. (unless it already is split 
> up?)

Right. A good portion (~2000 lines) of that is documentation.

> i cannot comment on the kernel/power/ bits (they are way too large 
> anyway), other than that they look pretty clean visually, but the 
> lowlevel arch and generic kernel bits look sane in detail too, sans a 
> few mostly trivial cleanliness issues:
> 
> +int suspend2_faulted = 0;
> +EXPORT_SYMBOL(suspend2_faulted);
> 
> should be done via the pagefault notifier chain mechanism. Also, all the 
> exports you added should be EXPORT_SYMBOL_GPL().

I'll look at that, but I'm not sure if it's a good idea - this is for
during the atomic copy & restore, when DEBUG_PAGEALLOC is enabled on
x86. Other things might touch memory in ways we don't want. It's only
needed for slab pages that get unmapped but not freed.

As far as the module exports go, I'm not expecting them to get merged. I
like building Suspend2 as modules (it helps speed the development
cycle), and see it as potentially useful for embedded but IMO there are
too many export symbols to make merging that code a possibility. This is
why they're all in one file rather than sprinkled through the files that
define the symbols.

> this:
> 
> -               ClearPageReserved(virt_to_page(addr));
> -               init_page_count(virt_to_page(addr));
> +               //ClearPageReserved(virt_to_page(addr));
> +               //init_page_count(virt_to_page(addr));
> 
> looks like there's a buglet in there still somewhere?

Yeah. When I was recently debugging, I found that cpu hotplugging is
using something marked __init which is causing the machine to
spontaneously reboot when cpus are replugged if DEBUG_PAGEALLOC is
enabled. Haven't had the time to get back to it, and also need some help
with the approach (what makes the machine reboot in this case instead of
oopsing, and how do I stop it?).

> +       if(PageHighMem(page))
> +               return 0;
> 
> coding style.

Oh. The space missing after the if? Ok.

> +       BUG_ON( test_suspend_state(SUSPEND_RUNNING) &&  /* Suspend2, that is */
> 
> make this a WARN_ON() or a WARN_ON_ONCE() - that way you have a chance 
> to even get feedback from users, instead of a 'uhm, X froze' report.
> 
> +#define FREEZER_OFF 0
> +#define FREEZER_USERSPACE_FROZEN 1
> +#define FREEZER_FULLY_ON 2
> 
> should be:
> 
> +#define FREEZER_OFF			0
> +#define FREEZER_USERSPACE_FROZEN	1
> +#define FREEZER_FULLY_ON		2
> 
> (you want your reviewers to have a pleasant time reading your code :)

Ok.

> +#define NETLINK_SUSPEND2_USERUI        20      /* For suspend2's userui */
> 
> IIRC userui was at the center of suspend2 merge flames, right? So you 
> might want to layer it ontop a less flashy suspend2-core and thus get 
> 90% of your patch upstream?

Ok. I've just separated that into its own file/module, so that will be
straightforward to do.

> +++ linux/mm/vmscan.c
> 
> the MM impact looks quite nontrivial. But i suspect this is unavoidable, 
> because you zap portions of the pagecache on the way to disk, so when it 
> comes back it results in a different pagecache (new lru lists, etc.), 
> right?

The modifications do three things.

First, we're seeking to keep the LRU static while we're suspending.
I originally sought to achieve that by avoiding entering the vmscan.c
logic (not as drastic as it sounds - Suspend2 is the only thing
running!). I think it was Nick who said he'd rather see the pages
unlinked and kept safe that way, so now I do that. Oh, as part of this,
I separated out the code from shrink_inactive_list that returns isolated
pages, since relink_lru_lists uses it too.

The other part is (prior to the above) seeking to get in a situation
where we have enough memory available to do the cycle. I used to use
shrink_all_zones, but some users with lots of Highmem were finding
issues that led me to take a more per-zone based approach, hence
shrink_one_zone.

The last thing is a patch Rafael recently posted to improve kswapd
freezing.

> +++ linux/lib/dyn_pageflags.c
> 
> shouldnt this be in mm/dyn_pageflags.c? Plus it would be nice to use 
> some other core kernel user for this infrastructure. (but it's not a 
> necessity i guess)

Yeah, I guess mm makes more sense.

> but ... again, the patch looks sane all around.

Thanks for the feedback. Although I'd like to see Suspend2 merged, I've
been feeling a bit like it's a lost cause. It's nice to get some
encouragement.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-18 16:46   ` Ingo Molnar
  2007-04-18 20:45     ` CFS and suspend2: hang in atomic copy Christian Hesse
@ 2007-04-19  9:32     ` Esben Nielsen
  2007-04-19 10:11       ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Esben Nielsen @ 2007-04-19  9:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel



On Wed, 18 Apr 2007, Ingo Molnar wrote:

>
> * Christian Hesse <mail@earthworm.de> wrote:
>
>> Hi Ingo and all,
>>
>> On Friday 13 April 2007, Ingo Molnar wrote:
>>> as usual, any sort of feedback, bugreports, fixes and suggestions are
>>> more than welcome,
>>
>> I just gave CFS a try on my system. From a user's point of view it
>> looks good so far. Thanks for your work.
>
> you are welcome!
>
>> However I found a problem: When trying to suspend a system patched
>> with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the
>> ESC key results in a message that it tries to abort suspend, but then
>> still hangs.
>
> i took a quick look at suspend2 and it makes some use of yield().
> There's a bug in CFS's yield code, i've attached a patch that should fix
> it, does it make any difference to the hang?
>
> 	Ingo
>
> Index: linux/kernel/sched_fair.c
> ===================================================================
> --- linux.orig/kernel/sched_fair.c
> +++ linux/kernel/sched_fair.c
> @@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq
>
> /*
>  * sched_yield() support is very simple via the rbtree, we just
> - * dequeue and enqueue the task, which causes the task to
> - * roundrobin to the end of the tree:
> + * dequeue the task and move it to the rightmost position, which
> + * causes the task to roundrobin to the end of the tree.
>  */
> static void requeue_task_fair(struct rq *rq, struct task_struct *p)
> {
> 	dequeue_task_fair(rq, p);
> 	p->on_rq = 0;
> -	enqueue_task_fair(rq, p);
> +	/*
> +	 * Temporarily insert at the last position of the tree:
> +	 */
> +	p->fair_key = LLONG_MAX;
> +	__enqueue_task_fair(rq, p);
> 	p->on_rq = 1;
> +
> +	/*
> +	 * Update the key to the real value, so that when all other
> +	 * tasks from before the rightmost position have executed,
> +	 * this task is picked up again:
> +	 */
> +	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;

I don't think it is safe to change the key after inserting the element in the 
tree. You end up with an unsorted tree, where new entries end up in 
wrong places "randomly".
I think a better approach would be to keep track of the rightmost entry, 
set the key to the rightmost's key +1 and then simply insert it there.
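
A small illustration of the problem described above, using a sorted
array as a stand-in for the rbtree (an assumption made to keep the
sketch short; the names timeline_insert, timeline_sorted and
insert_last_then_rekey are hypothetical and not from the patch). Once a
key is overwritten in place, the entry's position no longer matches its
key, and later ordered inserts work against a structure that is no
longer sorted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_ENTRIES 16

struct timeline {
	long long key[MAX_ENTRIES];	/* meant to stay in ascending order */
	size_t nr;
};

/* Insert a key at its sorted position, as an ordered tree would. */
static void timeline_insert(struct timeline *t, long long key)
{
	size_t i;

	assert(t->nr < MAX_ENTRIES);
	/* shift larger keys one slot to the right */
	for (i = t->nr; i > 0 && t->key[i - 1] > key; i--)
		t->key[i] = t->key[i - 1];
	t->key[i] = key;
	t->nr++;
}

/* Return 1 if the keys are still in ascending order, 0 otherwise. */
static int timeline_sorted(const struct timeline *t)
{
	size_t i;

	for (i = 1; i < t->nr; i++)
		if (t->key[i - 1] > t->key[i])
			return 0;
	return 1;
}

/*
 * The pattern under discussion: park an entry at the far right with
 * a huge key, then overwrite that key in place without re-sorting.
 */
static void insert_last_then_rekey(struct timeline *t, long long real_key)
{
	timeline_insert(t, LLONG_MAX);
	t->key[t->nr - 1] = real_key;	/* position no longer matches key */
}
```

Inserting 10, 30, 20 and then calling insert_last_then_rekey(&t, 15)
leaves the keys as 10, 20, 30, 15, so timeline_sorted() reports 0.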

Esben

>

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-19  9:32     ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen
@ 2007-04-19 10:11       ` Ingo Molnar
  2007-04-19 10:18         ` Ingo Molnar
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19 10:11 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel


* Esben Nielsen <nielsen.esben@googlemail.com> wrote:

> >+	/*
> >+	 * Temporarily insert at the last position of the tree:
> >+	 */
> >+	p->fair_key = LLONG_MAX;
> >+	__enqueue_task_fair(rq, p);
> >	p->on_rq = 1;
> >+
> >+	/*
> >+	 * Update the key to the real value, so that when all other
> >+	 * tasks from before the rightmost position have executed,
> >+	 * this task is picked up again:
> >+	 */
> >+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
> 
> I don't think it is safe to change the key after inserting the element in 
> the tree. You end up with an unsorted tree, where new entries 
> end up in wrong places "randomly".

yeah, indeed. I hoped that once this rightmost entry is removed (as soon 
as it gets scheduled next time) the tree goes back to a correct shape, 
but that's not the case - the left sub-tree and the right sub-tree are 
merged by the rbtree code with the assumption that the entry had a 
correct key.

> I think a better approach would be to keep track of the rightmost 
> entry, set the key to the rightmost's key +1 and then simply insert it 
> there.

yeah. I had that implemented at a stage but was trying to be too clever 
for my own good ;-)

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-19 10:11       ` Ingo Molnar
@ 2007-04-19 10:18         ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19 10:18 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel


* Ingo Molnar <mingo@elte.hu> wrote:

> > I think a better approach would be to keep track of the rightmost 
> > entry, set the key to the rightmost's key +1 and then simply insert 
> > it there.
> 
> yeah. I had that implemented at a stage but was trying to be too 
> clever for my own good ;-)

i have fixed it via the patch below. (I'm using rb_last() because that 
way the normal scheduling codepaths are not burdened with the 
maintenance of a rightmost entry.)

	Ingo

---
 kernel/sched.c      |    3 ++-
 kernel/sched_fair.c |   24 +++++++++++++-----------
 2 files changed, 15 insertions(+), 12 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3806,7 +3806,8 @@ asmlinkage long sys_sched_yield(void)
 	schedstat_inc(rq, yld_cnt);
 	if (rq->nr_running == 1)
 		schedstat_inc(rq, yld_act_empty);
-	current->sched_class->yield_task(rq, current);
+	else
+		current->sched_class->yield_task(rq, current);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -275,21 +275,23 @@ static void dequeue_task_fair(struct rq 
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
+	struct rb_node *entry;
+	struct task_struct *last;
+
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
+
 	/*
-	 * Temporarily insert at the last position of the tree:
+	 * Temporarily insert at the last position of the tree.
+	 * The key will be updated back to (near) its old value
+	 * when the task gets scheduled.
 	 */
-	p->fair_key = LLONG_MAX;
+	entry = rb_last(&rq->tasks_timeline);
+	last = rb_entry(entry, struct task_struct, run_node);
+
+	p->fair_key = last->fair_key + 1;
 	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
-
-	/*
-	 * Update the key to the real value, so that when all other
-	 * tasks from before the rightmost position have executed,
-	 * this task is picked up again:
-	 */
-	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
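
The approach in the patch above can be modelled outside the kernel
roughly as follows. This is a sketch under assumptions, not the kernel
code: it uses a plain unbalanced binary search tree instead of the
kernel rbtree, and tree_last(), tree_insert() and yield_requeue() are
made-up names. Like rb_last(), tree_last() returns NULL for an empty
tree, which is the case the sketch has to handle explicitly; the posted
patch appears to avoid it by not calling yield_task() when nr_running
is 1.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	long long key;
	struct node *left, *right;
};

/* Rightmost (largest-key) node, or NULL for an empty tree. */
static struct node *tree_last(struct node *root)
{
	if (!root)
		return NULL;
	while (root->right)
		root = root->right;
	return root;
}

/* Plain (unbalanced) ordered insert; returns the possibly new root. */
static struct node *tree_insert(struct node *root, struct node *n)
{
	struct node **link = &root;

	n->left = n->right = NULL;
	while (*link)
		link = (n->key < (*link)->key) ? &(*link)->left
					       : &(*link)->right;
	*link = n;
	return root;
}

/*
 * Requeue an already-dequeued node behind everything still queued:
 * pick a key one past the current rightmost key, then do a normal
 * ordered insert so the tree stays sorted.
 */
static struct node *yield_requeue(struct node *root, struct node *n)
{
	struct node *last = tree_last(root);

	if (last)	/* empty tree: nobody to yield to, keep the key */
		n->key = last->key + 1;
	return tree_insert(root, n);
}
```

Because the new key is chosen before the insert, the node ends up where
its key says it should be, and no key is rewritten while the node is in
the tree.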

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: CFS and suspend2: hang in atomic copy
  2007-04-19  6:29               ` Ingo Molnar
@ 2007-04-19 11:10                 ` Bob Picco
  0 siblings, 0 replies; 712+ messages in thread
From: Bob Picco @ 2007-04-19 11:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Bob Picco, Christian Hesse, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, suspend2-devel

Ingo Molnar wrote:	[Thu Apr 19 2007, 02:29:36AM EDT]
> 
> * Bob Picco <bob.picco@hp.com> wrote:
> 
> > I had hoped to collect more data with CFS V2. It crashes in 
> > scale_nice_down for s2ram when attempting to disable_nonboot_cpus. So 
> > part of traceback looks like (typed by hand with obvious omissions):
> > 
> > scale_nice_down
> > update_stats_wait_end - not shown in traceback because inlined
> > pick_next_task_fair
> > migration_call
> > task_rq_lock
> > notifier_call_chain
> > _cpu_down
> > disable_nonboot_cpus
> 
> ok, this looks similar to the jpeg Christian did. Does the patch below 
> fix the crash for you?
> 
> 	Ingo
> 
> ---
>  kernel/sched.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned 
>  	struct task_struct *next;
>  
>  	for (;;) {
> +		if (!rq->nr_running)
> +			break;
>  		next = pick_next_task(rq, rq->curr);
>  		if (!next)
>  			break;
This patch repairs s2ram issue. 

Thanks.

bob

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  7:57                                         ` William Lee Irwin III
@ 2007-04-19 11:50                                           ` Peter Williams
  2007-04-20  5:26                                             ` William Lee Irwin III
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-19 11:50 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>>> Yes, there are potential compatibility problems.  Example: a machine 
>>> with 100 busy httpd processes and suddenly a big gzip starts up from 
>>> console or cron.
> [...]
> 
> On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote:
>> hmmmm. How about the following then: default to nice -10 for all 
>> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ 
>> special: root already has disk space reserved to it, root has special 
>> memory allocation allowances, etc. I dont see a reason why we couldnt by 
>> default make all root tasks have nice -10. This would be instantly loved 
>> by sysadmins i suspect ;-)
>> (distros that go the extra mile of making Xorg run under non-root could 
>> also go another extra one foot to renice that X server to -10.)
> 
> I'd further recommend making priority levels accessible to kernel threads
> that are not otherwise accessible to processes, both above and below
> user-available priority levels. Basically, if you can get SCHED_RR and
> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN
> scheduler class can coexist with SCHED_OTHER in like fashion, but with
> availability of higher and lower priorities than any userspace process
> is allowed, and potentially some differing scheduling semantics. In such
> a manner nonessential background processing intended not to ever disturb
> userspace can be given priorities appropriate to it (perhaps even con's
> SCHED_IDLEPRIO would make sense), and other, urgent processing can be
> given priority over userspace altogether.
> 
> I believe root's default priority can be adjusted in userspace as
> things now stand somewhere in /etc/ but I'm not sure of the specifics.
> Word is somewhere in /etc/security/limits.conf

This is sounding very much like System V Release 4 (and descendants) 
except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that 
are in system mode dynamic priorities in the SCHED_SYS range (to avoid 
priority inversion, I believe).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Renice X for cpu schedulers
  2007-04-19  6:38                                       ` Ingo Molnar
  2007-04-19  7:57                                         ` William Lee Irwin III
  2007-04-19  8:33                                         ` Nick Piggin
@ 2007-04-19 11:59                                         ` Con Kolivas
  2007-04-19 12:42                                           ` Peter Williams
                                                             ` (3 more replies)
  2 siblings, 4 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-19 11:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall,
	William Lee Irwin III, Peter Williams, Mike Galbraith, ck list,
	Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner

Ok, there are 3 known schedulers currently being "promoted" as solid 
replacements for the mainline scheduler which address most of the issues with 
mainline (and about 10 other ones not currently being promoted). The main way 
they do this is through attempting to maintain solid fairness. There is 
enough evidence mounting now from the numerous test cases fixed by much 
fairer designs that this is the way forward for a general purpose cpu 
scheduler which is what linux needs. 

Interactivity of just about everything that needs low latency (ie audio and 
video players) is easily managed by maintaining low latency between wakeups 
and scheduling of all these low cpu users. The one fly in the ointment for 
linux remains X. I am still, to this moment, completely and utterly stunned 
at why everyone is trying to find increasingly complex unique ways to manage 
X when all it needs is more cpu[1]. Now most of these are actually very good 
ideas about _extra_ features that would be desirable in the long run for 
linux, but given the ludicrous simplicity of renicing X I cannot fathom why 
people keep promoting these alternatives. At the time of 2.6.0 coming out we 
were desperately trying to get half decent interactivity within a reasonable 
time frame to release 2.6.0 without rewiring the whole scheduler. So I 
tweaked the crap out of the tunables that were already there[2].

So let's hear from the 3 people who generated the schedulers under the 
spotlight. These are recent snippets and by no means the only time these 
comments have been said. Without sounding too bold, we do know a thing or two 
about scheduling.

CFS:
On Thursday 19 April 2007 16:38, Ingo Molnar wrote:
> hmmmm. How about the following then: default to nice -10 for all
> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_
> special: root already has disk space reserved to it, root has special
> memory allocation allowances, etc. I dont see a reason why we couldnt by
> default make all root tasks have nice -10. This would be instantly loved
> by sysadmins i suspect ;-)
>
> (distros that go the extra mile of making Xorg run under non-root could
> also go another extra one foot to renice that X server to -10.)

Nicksched:
On Wednesday 18 April 2007 15:00, Nick Piggin wrote:
> What's wrong with allowing X to get more than its fair share of CPU
> time by "fiddling with nice levels"? That's what they're there for.

and

Staircase-Deadline:
On Thursday 19 April 2007 09:59, Con Kolivas wrote:
> Remember to renice X to -10 for nicest desktop behaviour :)


[1]The one caveat I can think of is that when you share X sessions across 
multiple users -with a fair cpu scheduler-, having them all nice 0 also makes 
the distribution of cpu across the multiple users very even and smooth, 
without the expense of burning away the other person's cpu time they'd like 
for compute intensive non gui things. If you make a scheduler that always 
favours X this becomes impossible. I've had enough users offlist ask me to 
help them set up multiuser environments just like this with my schedulers 
because they were unable to do it with mainline, even with SCHED_BATCH, short 
of nicing everything +19. This makes the argument for not favouring X within 
the scheduler with tweaks even stronger.

[2] Nick was promoting renicing X with his Nicksched alternative at the time 
of 2.6.0 and while we were not violently opposed to renicing X, Nicksched was 
still very new on the scene and didn't have the luxury of extended testing 
and reiteration in time for 2.6 and he put the project on hold for some time 
after that. The correctness of his views on renicing certainly has become 
more obvious over time.


So yes go ahead and think up great ideas for other ways of metering out cpu 
bandwidth for different purposes, but for X, given the absurd simplicity of 
renicing, why keep fighting it? Again I reiterate that most users of SD have 
not found the need to renice X anyway except if they stick to old habits of 
make -j4 on uniprocessor and the like, and I expect that those on CFS and 
Nicksched would also have similar experiences.

-- 
-ck

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
@ 2007-04-19 12:42                                           ` Peter Williams
  2007-04-19 13:20                                             ` Peter Williams
  2007-04-19 13:17                                           ` Mark Lord
                                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-19 12:42 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list,
	Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner

Con Kolivas wrote:
> Ok, there are 3 known schedulers currently being "promoted" as solid 
> replacements for the mainline scheduler which address most of the issues with 
> mainline (and about 10 other ones not currently being promoted). The main way 
> they do this is through attempting to maintain solid fairness. There is 
> enough evidence mounting now from the numerous test cases fixed by much 
> fairer designs that this is the way forward for a general purpose cpu 
> scheduler which is what linux needs. 
> 
> Interactivity of just about everything that needs low latency (ie audio and 
> video players) is easily managed by maintaining low latency between wakeups 
> and scheduling of all these low cpu users.

On a "fair" scheduler these will all get high priority (and good 
response) because their CPU bandwidth usage will be much smaller than 
their entitlement and the scheduler will be trying to help them "catch 
up".  So (as you say) they shouldn't be a problem.
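
The "catch up" behaviour described here can be sketched with a toy
picker. It is only an illustration under simplifying assumptions (every
task has the same entitlement, so the task furthest behind is simply
the one that has used the least CPU); struct toy_task and
pick_most_behind() are hypothetical names and do not correspond to any
of the schedulers discussed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	const char *name;
	long used;	/* CPU time consumed so far, arbitrary units */
	int runnable;	/* non-zero if the task wants the CPU now */
};

/*
 * Pick the runnable task that is furthest below an equal share of
 * the CPU time handed out so far. With equal entitlements that is
 * the runnable task with the smallest 'used' value. Returns NULL
 * if nothing is runnable.
 */
static struct toy_task *pick_most_behind(struct toy_task *tasks, size_t nr)
{
	struct toy_task *best = NULL;
	size_t i;

	assert(tasks || nr == 0);
	for (i = 0; i < nr; i++) {
		if (!tasks[i].runnable)
			continue;
		if (!best || tasks[i].used < best->used)
			best = &tasks[i];
	}
	return best;
}
```

A player that has used 50 units wins over a hog that has used 950
whenever both are runnable, which is the effect of a low-CPU task
getting good response without any explicit interactivity bonus.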

> The one fly in the ointment for 
> linux remains X. I am still, to this moment, completely and utterly stunned 
> at why everyone is trying to find increasingly complex unique ways to manage 
> X when all it needs is more cpu[1]. Now most of these are actually very good 
> ideas about _extra_ features that would be desirable in the long run for 
> linux, but given the ludicrous simplicity of renicing X I cannot fathom why 
> people keep promoting these alternatives. At the time of 2.6.0 coming out we 
> were desperately trying to get half decent interactivity within a reasonable 
> time frame to release 2.6.0 without rewiring the whole scheduler. So I 
> tweaked the crap out of the tunables that were already there[2].

X's needs are more complex than that (from my observations) in that the 
part of X that processes input doesn't use much CPU but the part that 
does output can be quite a heavy user of CPU (e.g. do a "ls -lR /" in an 
xterm and watch X chew up the CPU).  At the same time, the part of X 
that processes input needs quick responsiveness as it's part of the 
interactive chain where this is less so for the output part.

Where X comes unstuck in the current scheduler is that when the output 
part goes on one of its CPU storms it ceases to look like an interactive 
task and gets given lower priority.  Ironically, this doesn't affect the 
output part of X but it does affect the input part and is manifest as 
crappy interactive response.  One wonders whether modifying X so that it 
has two threads: one for output and one for input; that could be 
scheduled separately might help.  I guess it would depend on whether 
there is insufficient independence between the two halves.

Part of this issue is that giving X a high static priority runs the risk 
of the CPU hog output part disrupting scheduling of other important 
tasks.  So don't give it too big a boost.

> 
> So let's hear from the 3 people who generated the schedulers under the 
> spotlight. These are recent snippets and by no means the only time these 
> comments have been said. Without sounding too bold, we do know a thing or two 
> about scheduling.
> 
> CFS:
> On Thursday 19 April 2007 16:38, Ingo Molnar wrote:
>> hmmmm. How about the following then: default to nice -10 for all
>> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_
>> special: root already has disk space reserved to it, root has special
>> memory allocation allowances, etc. I dont see a reason why we couldnt by
>> default make all root tasks have nice -10. This would be instantly loved
>> by sysadmins i suspect ;-)

It's worth noting that the -10 mentioned is roughly equivalent (in the 
old scheduler) to restoring interactive task status to X in those cases 
where it loses it due to a CPU storm in its output part.

>>
>> (distros that go the extra mile of making Xorg run under non-root could
>> also go another extra one foot to renice that X server to -10.)
> 
> Nicksched:
> On Wednesday 18 April 2007 15:00, Nick Piggin wrote:
>> What's wrong with allowing X to get more than its fair share of CPU
>> time by "fiddling with nice levels"? That's what they're there for.
> 
> and
> 
> Staircase-Deadline:
> On Thursday 19 April 2007 09:59, Con Kolivas wrote:
>> Remember to renice X to -10 for nicest desktop behaviour :)

I'd like to add the EBS scheduler (posted by Aurema Pty Ltd a couple of 
years back) to this list as it also recommended running X at nice -5 to -10.

Also some of the "interactive bonus" mechanisms in my SPA schedulers 
could be removed if X was reniced.  In fact, with a reniced X the 
spa_svr (server oriented scheduler which attempts to minimise the time 
tasks spend on the queue waiting for CPU access and which doesn't have 
interactive bonuses) might be usable on a work station.

> 
> 
> [1]The one caveat I can think of is that when you share X sessions across 
> multiple users -with a fair cpu scheduler-, having them all nice 0 also makes 
> the distribution of cpu across the multiple users very even and smooth, 
> without the expense of burning away the other person's cpu time they'd like 
> for compute intensive non gui things. If you make a scheduler that always 
> favours X this becomes impossible. I've had enough users offlist ask me to 
> help them set up multiuser environments just like this with my schedulers 
> because they were unable to do it with mainline, even with SCHED_BATCH, short 
> of nicing everything +19. This makes the argument for not favouring X within 
> the scheduler with tweaks even stronger.
> 
> [2] Nick was promoting renicing X with his Nicksched alternative at the time 
> of 2.6.0 and while we were not violently opposed to renicing X, Nicksched was 
> still very new on the scene and didn't have the luxury of extended testing 
> and reiteration in time for 2.6 and he put the project on hold for some time 
> after that. The correctness of his views on renicing certainly have become 
> more obvious over time.
> 
> 
> So yes go ahead and think up great ideas for other ways of metering out cpu 
> bandwidth for different purposes, but for X, given the absurd simplicity of 
> renicing, why keep fighting it? Again I reiterate that most users of SD have 
> not found the need to renice X anyway except if they stick to old habits of 
> make -j4 on uniprocessor and the like, and I expect that those on CFS and 
> Nicksched would also have similar experiences.
> 

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  9:01               ` Ingo Molnar
@ 2007-04-19 12:54                 ` Willy Tarreau
  2007-04-19 15:18                   ` Ingo Molnar
  2007-04-19 17:32                 ` Gene Heskett
  1 sibling, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-19 12:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Hi Ingo,

On Thu, Apr 19, 2007 at 11:01:44AM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > Good idea. The machine I'm typing from now has 1000 scheddos running 
> > at +19, and 12 gears at nice 0. [...]
> 
> > From time to time, one of the 12 aligned gears will quickly perform a 
> > full quarter of round while others slowly turn by a few degrees. In 
> > fact, while I don't know this process's CPU usage pattern, there's 
> > something useful in it : it allows me to visually see when process 
> > accelerate/decelerate. [...]
> 
> cool idea - i have just tried this and it rocks - you can easily see the 
> 'nature' of CPU time distribution just via visual feedback. (Is there 
> any easy way to start up 12 glxgears fully aligned, or does one always 
> have to mouse around to get them into proper position?)

-- Replying quickly, I'm short on time --

You can certainly script it with -geometry. But it is the wrong application
for this purpose, because you benchmark X more than glxgears itself. What would
be better is something like a line rotating 360 degrees and doing some short
work between each degree, so that X is not solicited much, and the CPU
is spent mostly in the processes themselves.

Benchmarking interactions between X and multiple clients is a completely
different test IMHO. Glxgears is between those two, making it inappropriate
for scheduler tuning.
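
The "short stuff between each degree" idea can be sketched as a step
function that burns a fixed amount of CPU time and reports how long that
took in wall-clock time; the gap between the two is time the task spent
waiting for the CPU. This is only an illustration: the name burn_step_us()
and the choice of CLOCK_THREAD_CPUTIME_ID are assumptions, not anything
posted in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the given clock and return its value in nanoseconds. */
static int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    int rc = clock_gettime(id, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * One "degree" of the rotation: consume run_us microseconds of CPU time
 * in the calling thread and return the wall-clock time that took, in
 * microseconds. On an idle machine the result is close to run_us; under
 * load it grows by however long the scheduler kept the thread off the CPU.
 */
static int64_t burn_step_us(int64_t run_us)
{
    int64_t wall_start = clock_ns(CLOCK_MONOTONIC);
    int64_t cpu_start = clock_ns(CLOCK_THREAD_CPUTIME_ID);

    while (clock_ns(CLOCK_THREAD_CPUTIME_ID) - cpu_start < run_us * 1000)
        ; /* spin until enough CPU time has been charged to this thread */

    return (clock_ns(CLOCK_MONOTONIC) - wall_start) / 1000;
}
```

A drawing loop would call burn_step_us() once per degree and advance the
line afterwards, so a stalled task shows up as a line that stops moving.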

Regards,
Willy


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
  2007-04-19 12:42                                           ` Peter Williams
@ 2007-04-19 13:17                                           ` Mark Lord
  2007-04-19 15:10                                             ` Con Kolivas
  2007-04-20  3:57                                             ` Nick Piggin
  2007-04-19 18:16                                           ` Gene Heskett
  2007-04-19 19:26                                           ` Ray Lee
  3 siblings, 2 replies; 712+ messages in thread
From: Mark Lord @ 2007-04-19 13:17 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Con Kolivas wrote:
> So yes go ahead and think up great ideas for other ways of metering out cpu 
> bandwidth for different purposes, but for X, given the absurd simplicity of 
> renicing, why keep fighting it? Again I reiterate that most users of SD have 
> not found the need to renice X anyway except if they stick to old habits of 
> make -j4 on uniprocessor and the like, and I expect that those on CFS and 
> Nicksched would also have similar experiences.

Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
on my 2GHz P-M single-core non-HT machine with SD.

But with the very first posted version of CFS by Ingo,
I can do "make -j2" no problem and still have a nicely interactive desktop.

-ml

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 12:42                                           ` Peter Williams
@ 2007-04-19 13:20                                             ` Peter Williams
  2007-04-19 14:22                                               ` Lee Revell
  0 siblings, 1 reply; 712+ messages in thread
From: Peter Williams @ 2007-04-19 13:20 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list,
	Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner

Peter Williams wrote:
> Con Kolivas wrote:
>> Ok, there are 3 known schedulers currently being "promoted" as solid 
>> replacements for the mainline scheduler which address most of the 
>> issues with mainline (and about 10 other ones not currently being 
>> promoted). The main way they do this is through attempting to maintain 
>> solid fairness. There is enough evidence mounting now from the 
>> numerous test cases fixed by much fairer designs that this is the way 
>> forward for a general purpose cpu scheduler which is what linux needs.
>> Interactivity of just about everything that needs low latency (ie 
>> audio and video players) are easily managed by maintaining low latency 
>> between wakeups and scheduling of all these low cpu users.
> 
> On a "fair" scheduler these will all get high priority (and good 
> response) because their CPU bandwidth usage will be much smaller than 
> their entitlement and the scheduler will be trying to help them "catch 
> up".  So (as you say) they shouldn't be a problem.
> 
>> The one fly in the ointment for linux remains X. I am still, to this 
>> moment, completely and utterly stunned at why everyone is trying to 
>> find increasingly complex unique ways to manage X when all it needs is 
>> more cpu[1]. Now most of these are actually very good ideas about 
>> _extra_ features that would be desirable in the long run for linux, 
>> but given the ludicrous simplicity of renicing X I cannot fathom why 
>> people keep promoting these alternatives. At the time of 2.6.0 coming 
>> out we were desperately trying to get half decent interactivity within 
>> a reasonable time frame to release 2.6.0 without rewiring the whole 
>> scheduler. So I tweaked the crap out of the tunables that were already 
>> there[2].
> 
> X's needs are more complex than that (from my observations) in that the 
> part of X that processes input doesn't use much CPU but the part that 
> does output can be quite a heavy user of CPU (e.g. do a "ls -lR /" in an 
> xterm and watch X chew up the CPU).  At the same time, the part of X 
> that processes input needs quick responsiveness as it's part of the 
> interactive chain where this is less so for the output part.
> 
> Where X comes unstuck in the current scheduler is that when the output 
> part goes on one of its CPU storms it ceases to look like an interactive 
> task and gets given lower priority.  Ironically, this doesn't affect the 
> output part of X but it does affect the input part and is manifest as 
> crappy interactive response.  One wonders whether modifying X so that it 
> has two threads: one for output and one for input; that could be 
> scheduled separately might help.  I guess it would depend on whether 
> there is insufficient independence between the two halves.

I forgot to make my point here and that was that if X could be split in 
two neither half would need to be reniced.  As a very low CPU bandwidth 
user the input half would get along just fine like the other interactive 
tasks that you mention.  And the output part isn't adversely 
affected by not having a boost so it would get along just fine as well 
and you don't want it having a boost when it's in a CPU storm anyway.

Of course, this would not work if the interdependence between the two 
halves is such that the equivalent of priority inversion occurs between 
the two threads.  However, that might be solved by making the division 
between the two halves on a dimension other than the input/output one.
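
As a rough illustration of why a low-bandwidth input thread would not need
a boost: in a toy "fair" model the scheduler always runs whichever task has
used the least CPU so far, so a thread that rarely runs is always near the
front. The sketch below is that toy model only. The struct and function
names are invented, and it ignores nice levels, sleeping tasks and
everything else a real scheduler has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy task: just a label and the CPU time it has consumed so far. */
struct toy_task {
    const char *name;
    uint64_t used_ns;
};

/*
 * Return the index of the runnable task that is furthest behind, i.e.
 * the one with the least CPU time used. Ties go to the lowest index.
 */
static size_t pick_next(const struct toy_task *tasks, size_t n)
{
    size_t best = 0;
    size_t i;

    assert(tasks != NULL && n > 0);
    for (i = 1; i < n; i++)
        if (tasks[i].used_ns < tasks[best].used_ns)
            best = i;
    return best;
}

/* Charge a task for the CPU time it just ran. */
static void account(struct toy_task *t, uint64_t ran_ns)
{
    assert(t != NULL);
    t->used_ns += ran_ns;
}
```

With an input thread that has used 1 ms, an output thread in a CPU storm at
500 ms and a compile at 400 ms, pick_next() chooses the input thread without
any renicing, and the storming output thread is picked last.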

Peter
PS I think that the tasks most likely to be adversely affected by X's 
CPU storms (enough to annoy the user) are audio streamers so when you're 
doing tests to determine the best nice value for X I suggest that would 
be a good criterion.  Video streamers are also susceptible but glitches 
in video don't seem to annoy users as much as audio ones.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 13:20                                             ` Peter Williams
@ 2007-04-19 14:22                                               ` Lee Revell
  2007-04-20  1:32                                                 ` Michael K. Edwards
  0 siblings, 1 reply; 712+ messages in thread
From: Lee Revell @ 2007-04-19 14:22 UTC (permalink / raw)
  To: Peter Williams
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> PS I think that the tasks most likely to be adversely affected by X's
> CPU storms (enough to annoy the user) are audio streamers so when you're
> doing tests to determine the best nice value for X I suggest that would
> be a good criterion.  Video streamers are also susceptible but glitches
> in video don't seem to annoy users as much as audio ones.

IMHO audio streamers should use SCHED_FIFO thread for time critical
work.  I think it's insane to expect the scheduler to figure out that
these processes need low latency when they can just be explicit about
it.  "Professional" audio software does it already, on Linux as well
as other OS...
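
Being explicit about it amounts to something like the following in the
time-critical thread. This is a hedged sketch, not code from any particular
audio application: the helper names are invented, the priority passed in is
up to the caller, and the call fails (typically with EPERM) unless the
process has CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Clamp a requested priority into the valid SCHED_FIFO range. */
static int clamp_fifo_priority(int wanted)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    assert(lo != -1 && hi != -1 && lo <= hi);
    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}

/*
 * Switch the calling thread to SCHED_FIFO at (roughly) the wanted
 * priority. On Linux a pid of 0 means the calling thread. Returns 0 on
 * success or the errno value on failure.
 */
static int become_fifo(int wanted)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = clamp_fifo_priority(wanted);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return errno;
    return 0;
}
```

An audio thread would call become_fifo() once at startup and fall back to
normal scheduling, with a warning, if it returns non-zero.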

Lee

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 13:17                                           ` Mark Lord
@ 2007-04-19 15:10                                             ` Con Kolivas
  2007-04-19 16:15                                               ` Mark Lord
  2007-04-20  3:57                                             ` Nick Piggin
  1 sibling, 1 reply; 712+ messages in thread
From: Con Kolivas @ 2007-04-19 15:10 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thursday 19 April 2007 23:17, Mark Lord wrote:
> Con Kolivas wrote:
> > So yes go ahead and think up great ideas for other ways of metering out cpu
>
> > bandwidth for different purposes, but for X, given the absurd simplicity
> > of renicing, why keep fighting it? Again I reiterate that most users of
> > SD have not found the need to renice X anyway except if they stick to old
> > habits of make -j4 on uniprocessor and the like, and I expect that those
> > on CFS and Nicksched would also have similar experiences.
>
> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
> on my 2GHz P-M single-core non-HT machine with SD.
>
> But with the very first posted version of CFS by Ingo,
> I can do "make -j2" no problem and still have a nicely interactive desktop.

Cool. Then there's clearly a bug with SD that manifests on your machine as it 
should not have that effect at all (and doesn't on other people's machines). 
I suggest trying the latest version which fixes some bugs.

Thanks.

-- 
-ck

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 12:54                 ` Willy Tarreau
@ 2007-04-19 15:18                   ` Ingo Molnar
  2007-04-19 17:34                     ` Gene Heskett
                                       ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-19 15:18 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner


* Willy Tarreau <w@1wt.eu> wrote:

> You can certainly script it with -geometry. But it is the wrong 
> application for this matter, because you benchmark X more than 
> glxgears itself. What would be better is something like a line 
> rotating 360 degrees and doing some short stuff between each degree, 
> so that X is not much solicited, but the CPU would be spent more on 
> the processes themselves.

at least on my setup glxgears goes via DRI/DRM so there's no X 
scheduling inbetween at all, and the visual appearance of glxgears is a 
direct function of its scheduling.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  8:00                                                   ` Ingo Molnar
@ 2007-04-19 15:43                                                     ` Davide Libenzi
  2007-04-21 14:09                                                     ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Davide Libenzi @ 2007-04-19 15:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
	Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

On Thu, 19 Apr 2007, Ingo Molnar wrote:

> i disagree that the user 'would expect' this. Some users might. Others 
> would say: 'my 10-thread rendering engine is more important than a 
> 1-thread job because it's using 10 threads for a reason'. And the CFS 
> feedback so far strengthens this point: the default behavior of treating 
> the thread as a single scheduling (and CPU time accounting) unit works 
> pretty well on the desktop.
> 
> think about it in another, 'kernel policy' way as well: we'd like to 
> _encourage_ more parallel user applications. Hurting them by accounting 
> all threads together sends the exact opposite message.

There are counter arguments too. Like, not every user knows if a certain 
process is MT or not. I agree though that doing accounting and fairness at 
a depth lower than USER is messy, and not only for performance.
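
To put numbers on the two positions, here is the arithmetic for the
10-thread application against a 1-thread job, under assumptions that are
mine and not from the thread: one CPU, every thread CPU-bound, all at the
same nice level. The function names are illustrative only.

```c
#include <assert.h>

/*
 * Share of the CPU that an application with app_threads runnable threads
 * gets against other_threads competing threads, when every thread is an
 * equal scheduling unit.
 */
static double share_per_thread(int app_threads, int other_threads)
{
    assert(app_threads > 0 && other_threads >= 0);
    return (double)app_threads / (double)(app_threads + other_threads);
}

/*
 * Share of the CPU the same application gets when the CPU is first split
 * evenly between processes, regardless of how many threads each one has.
 */
static double share_per_process(int other_processes)
{
    assert(other_processes >= 0);
    return 1.0 / (double)(1 + other_processes);
}
```

Per-thread accounting gives the 10-thread application 10/11 of the CPU
(about 91%) and the single-threaded job 1/11; per-process accounting gives
each of them half.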


- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 15:10                                             ` Con Kolivas
@ 2007-04-19 16:15                                               ` Mark Lord
  2007-04-19 18:21                                                 ` Gene Heskett
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Mark Lord @ 2007-04-19 16:15 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Con Kolivas wrote:
> On Thursday 19 April 2007 23:17, Mark Lord wrote:
>> Con Kolivas wrote:
>>> So yes go ahead and think up great ideas for other ways of metering out cpu
>>
>>> bandwidth for different purposes, but for X, given the absurd simplicity
>>> of renicing, why keep fighting it? Again I reiterate that most users of
>>> SD have not found the need to renice X anyway except if they stick to old
>>> habits of make -j4 on uniprocessor and the like, and I expect that those
>>> on CFS and Nicksched would also have similar experiences.
>> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
>> on my 2GHz P-M single-core non-HT machine with SD.
>>
>> But with the very first posted version of CFS by Ingo,
>> I can do "make -j2" no problem and still have a nicely interactive desktop.
> 
> Cool. Then there's clearly a bug with SD that manifests on your machine as it 
> should not have that effect at all (and doesn't on other people's machines). 
> I suggest trying the latest version which fixes some bugs.

SD just doesn't do nearly as good as the stock scheduler, or CFS, here.

I'm quite likely one of the few single-CPU/non-HT testers of this stuff.
If it should ever get more widely used I think we'd hear a lot more complaints.

Cheers

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  7:32                                                     ` Mike Galbraith
@ 2007-04-19 16:55                                                       ` Davide Libenzi
  2007-04-20  5:16                                                         ` Mike Galbraith
  0 siblings, 1 reply; 712+ messages in thread
From: Davide Libenzi @ 2007-04-19 16:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel

On Thu, 19 Apr 2007, Mike Galbraith wrote:

> On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > 
> > > With a heavily reniced X (perfectly fine), that should indeed solve my 
> > > daily usage pattern nicely (always need godmode for shells, but not 
> > > for mozilla and ilk. 50/50 split automatic without renice of entire 
> > > gui)
> > 
> > how about the first-approximation solution i suggested in the previous 
> > mail: to add a per UID default nice level? (With this default defaulting 
> > to '-10' for all root-owned processes, and defaulting to '0' for 
> > everything else.) That would solve most of the current CFS regressions 
> > at hand.
> 
> That would make my kernel builds etc interfere with my other self's
> surfing and whatnot.  With it by EUID, when I'm surfing or whatnot, the
> X portion of my Joe-User activity pushes the compile portion of root
> down in bandwidth utilization automagically, which is exactly the right
> thing, because the root me in not as important as the Joe-User me using
> the GUI at that time.  If the idea of X disturbing root upsets some,
> they can move X to another UID.  Generally, it seems perfect for here.

Now guys, I did not follow the whole lengthy and feisty thread, but IIRC 
Con's scheduler has been attacked because, among other arguments, it 
required X to be reniced. This happened like a month ago IINM.
I did not have time to look at Con's scheduler, and I only had a brief 
look at Ingo's one (looks very promising IMO, but so was the initial O(1) 
post before all the corner-case fixes went in).
But this is not about technical merit, this is about applying the same 
rules of judgement to others as well as to ourselves.
We went from a "renicing X to -10 is bad because the scheduler should 
be able to correctly handle the problem w/out additional external plugs" 
to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads 
class, on top of all the tasks owned by root" [1].
From a spectator POV like myself in this case, this looks rather "unfair".



[1] I think, before and now, that that's more a duct tape patch than a 
    real solution. OTOH if the "solution" is gonna be another maze of 
    macros and heuristics filled with pretty bad corner cases, I may 
    prefer the former.


- Davide



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  9:01               ` Ingo Molnar
  2007-04-19 12:54                 ` Willy Tarreau
@ 2007-04-19 17:32                 ` Gene Heskett
  1 sibling, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-19 17:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Thursday 19 April 2007, Ingo Molnar wrote:
>* Willy Tarreau <w@1wt.eu> wrote:
>> Good idea. The machine I'm typing from now has 1000 scheddos running
>> at +19, and 12 gears at nice 0. [...]
>>
>> From time to time, one of the 12 aligned gears will quickly perform a
>> full quarter of round while others slowly turn by a few degrees. In
>> fact, while I don't know this process's CPU usage pattern, there's
>> something useful in it : it allows me to visually see when process
>> accelerate/decelerate. [...]
>
>cool idea - i have just tried this and it rocks - you can easily see the
>'nature' of CPU time distribution just via visual feedback. (Is there
>any easy way to start up 12 glxgears fully aligned, or does one always
>have to mouse around to get them into proper position?)
>
>btw., i am using another method to quickly judge X's behavior: i started
>the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth
>opengl-rendered snow fall on the desktop background. That gives me an
>idea about how well X is scheduling under various workloads, without
>having to instrument it explicitly.
>
Yes, it's a cute idea, till you switch away from that screen to check progress 
on something else, like to compose this message.

===========
5913 frames in 5.0 seconds = 1182.499 FPS
6238 frames in 5.0 seconds = 1247.556 FPS
11380 frames in 5.0 seconds = 2275.905 FPS
10691 frames in 5.0 seconds = 2138.173 FPS
8707 frames in 5.0 seconds = 1741.305 FPS
10669 frames in 5.0 seconds = 2133.708 FPS
11392 frames in 5.0 seconds = 2278.037 FPS
11379 frames in 5.0 seconds = 2275.711 FPS
11310 frames in 5.0 seconds = 2261.861 FPS
11386 frames in 5.0 seconds = 2277.081 FPS
11292 frames in 5.0 seconds = 2258.353 FPS
11352 frames in 5.0 seconds = 2270.297 FPS
11415 frames in 5.0 seconds = 2282.886 FPS
11406 frames in 5.0 seconds = 2281.037 FPS
11483 frames in 5.0 seconds = 2296.533 FPS
11510 frames in 5.0 seconds = 2301.883 FPS
11123 frames in 5.0 seconds = 2224.266 FPS
8980 frames in 5.0 seconds = 1795.861 FPS
=======
The over 2000fps reports were while I was either looking at htop, or starting 
this message, both on different screens.  htop said it was using 95+ % of the 
cpu even when its display was going to /dev/null.  So 'Kewl' doesn't seem to 
get us apples to apples numbers we can go to the window and bet 
win-place-show based on them alone.

FWIW, running the nvidia-9755 drivers here.

So if we are going to use that as a judgement operator, it obviously needs 
some intelligently applied scaling before the numbers are worth more than a 
subjective feel.

>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The confusion of a staff member is measured by the length of his memos.
		-- New York Times, Jan. 20, 1981

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 15:18                   ` Ingo Molnar
@ 2007-04-19 17:34                     ` Gene Heskett
  2007-04-19 18:45                     ` Willy Tarreau
  2007-04-19 23:52                     ` Jan Knutar
  2 siblings, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-19 17:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Thursday 19 April 2007, Ingo Molnar wrote:
>* Willy Tarreau <w@1wt.eu> wrote:
>> You can certainly script it with -geometry. But it is the wrong
>> application for this matter, because you benchmark X more than
>> glxgears itself. What would be better is something like a line
>> rotating 360 degrees and doing some short stuff between each degree,
>> so that X is not much solicited, but the CPU would be spent more on
>> the processes themselves.
>
>at least on my setup glxgears goes via DRI/DRM so there's no X
>scheduling inbetween at all, and the visual appearance of glxgears is a
>direct function of its scheduling.
>
>	Ingo

That doesn't appear to be the case here Ingo. Even when I know the rest of the 
system is lagged, glxgears continues to show very smooth and steady movement.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yow!  I just went below the poverty line!

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 23:30                                                 ` Davide Libenzi
  2007-04-19  8:00                                                   ` Ingo Molnar
@ 2007-04-19 17:39                                                   ` Bernd Eckenfels
  1 sibling, 0 replies; 712+ messages in thread
From: Bernd Eckenfels @ 2007-04-19 17:39 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.64.0704181515290.25880@alien.or.mcafeemobile.com> you wrote:
> Top (VCPU maybe?)
>    User
>        Process
>            Thread

The problem with that is, that not all Schedulers might work on the User
level. You can think of Batch/Job, Parent, Group, Session or namespace
level. That would be IMHO a generic Top, with no need for a level above.

Greetings
Bernd

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
  2007-04-19 12:42                                           ` Peter Williams
  2007-04-19 13:17                                           ` Mark Lord
@ 2007-04-19 18:16                                           ` Gene Heskett
  2007-04-19 21:35                                             ` Michael K. Edwards
  2007-04-19 22:47                                             ` Con Kolivas
  2007-04-19 19:26                                           ` Ray Lee
  3 siblings, 2 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-19 18:16 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thursday 19 April 2007, Con Kolivas wrote:

[and I snipped a good overview]

>So yes go ahead and think up great ideas for other ways of metering out cpu
>bandwidth for different purposes, but for X, given the absurd simplicity of
>renicing, why keep fighting it? Again I reiterate that most users of SD have
>not found the need to renice X anyway except if they stick to old habits of
>make -j4 on uniprocessor and the like, and I expect that those on CFS and
>Nicksched would also have similar experiences.

FWIW folks, I have never touched X's niceness, it's running at the default -1 
for all of my so-called 'tests', and I have another set to be rebooted to 
right now.  And yes, my kernel makeit script uses -j4 by default, and has 
used -j8 just for effects, which weren't all that different from what I 
expected in 'abusing' a UP system that way.  The system DID remain usable, 
not snappy, but usable.

I tried re-nicing X a while back, and the rest of the system suffered in 
quite obvious ways for even 1 + or - from its default, which felt pretty 
bad from this user's perspective.

It is my considered opinion (yeah I know, I'm just a leaf in the hurricane of 
this list) that if X has to be re-niced from the 1 point advantage it's had 
for ages, then something is basically wrong with the overall scheduling, cpu or 
i/o, or both in combination.  FWIW I'm using cfq for i/o.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Moore's Constant:
	Everybody sets out to do something, and everybody
	does something, but no one does what he sets out to do.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 16:15                                               ` Mark Lord
@ 2007-04-19 18:21                                                 ` Gene Heskett
  2007-04-20  0:17                                                 ` Con Kolivas
  2007-04-20  1:17                                                 ` Ed Tomlinson
  2 siblings, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-19 18:21 UTC (permalink / raw)
  To: Mark Lord
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thursday 19 April 2007, Mark Lord wrote:
>Con Kolivas wrote:
>> On Thursday 19 April 2007 23:17, Mark Lord wrote:
>>> Con Kolivas wrote:
>>>> So yes go ahead and think up great ideas for other ways of metering out cpu
>>>
>>>> bandwidth for different purposes, but for X, given the absurd simplicity
>>>> of renicing, why keep fighting it? Again I reiterate that most users of
>>>> SD have not found the need to renice X anyway except if they stick to
>>>> old habits of make -j4 on uniprocessor and the like, and I expect that
>>>> those on CFS and Nicksched would also have similar experiences.
>>>
>>> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
>>> on my 2GHz P-M single-core non-HT machine with SD.
>>>
>>> But with the very first posted version of CFS by Ingo,
>>> I can do "make -j2" no problem and still have a nicely interactive
>>> desktop.
>>
>> Cool. Then there's clearly a bug with SD that manifests on your machine as
>> it should not have that effect at all (and doesn't on other people's
>> machines). I suggest trying the latest version which fixes some bugs.
>
>SD just doesn't do nearly as good as the stock scheduler, or CFS, here.

I found the early SD's much friendlier here, but I also think that at that 
point I was comparing SD to stock 2.6.21-rc5 and 6, and to say that it sucked 
would be a slight understatement.

>I'm quite likely one of the few single-CPU/non-HT testers of this stuff.
>If it should ever get more widely used I think we'd hear a lot more
> complaints.

I'm in that row of seats too Mark.  Someday I have to build a new box, that's 
all there is to it...

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Lots of folks confuse bad management with destiny.
		-- Frank Hubbard

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 15:18                   ` Ingo Molnar
  2007-04-19 17:34                     ` Gene Heskett
@ 2007-04-19 18:45                     ` Willy Tarreau
  2007-04-21 10:31                       ` Ingo Molnar
  2007-04-19 23:52                     ` Jan Knutar
  2 siblings, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-19 18:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Thu, Apr 19, 2007 at 05:18:03PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <w@1wt.eu> wrote:
> 
> > You can certainly script it with -geometry. But it is the wrong 
> > application for this matter, because you benchmark X more than 
> > glxgears itself. What would be better is something like a line 
> > rotating 360 degrees and doing some short stuff between each degree, 
> > so that X is not much solicited, but the CPU would be spent more on
> > the processes themselves.
> 
> at least on my setup glxgears goes via DRI/DRM so there's no X 
> scheduling inbetween at all, and the visual appearance of glxgears is a 
> direct function of its scheduling.

OK, I thought that something looking like a clock would be useful, especially
if we could tune the amount of CPU spent per task instead of being limited by
graphics drivers.

I searched freshmeat for a clock and found "orbitclock" by Jeremy Weatherford,
which was exactly what I was looking for :
  - small
  - C only
  - X11 only
  - needed less than 5 minutes and no knowledge of X11 for the complete hack !
  => Kudos to its author, sincerely !

I hacked it a bit to make it accept two parameters :
  -R <run_time_in_microsecond> : time spent burning CPU cycles at each round
  -S <sleep_time_in_microsecond> : time spent getting a rest

It now advances what it thinks is a second at each iteration, which makes it
easy to compare its progress with other instances (there are seconds,
minutes and hours, so it's easy to visually count up to around 43200).

The modified code is here :

  http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz

What is interesting to note is that it's easy to make X work a lot (99%) by
using 0 as the sleeping time, and it's easy to make the process work a lot
by using large values for the running time associated with very low values
(or 0) for the sleep time.
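
As a rough illustration of the -R/-S behaviour described above, here is a
minimal sketch of a burn-then-rest loop. It is not the actual orbitclock
code: the function names (now_us, burn_cpu, rest, run_rounds) are made up
for this example, and the real program would redraw the clock where the
tick counter is incremented.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

/* Microseconds elapsed on the monotonic clock. */
static long long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Spin for at least run_us microseconds (the -R value); returns time spent. */
static long long burn_cpu(long long run_us)
{
	long long start = now_us();
	volatile unsigned long sink = 0;

	while (now_us() - start < run_us)
		sink++;
	return now_us() - start;
}

/* Sleep for sleep_us microseconds (the -S value); 0 means no sleep at all. */
static void rest(long long sleep_us)
{
	struct timespec ts;

	if (sleep_us <= 0)
		return;
	ts.tv_sec = sleep_us / 1000000;
	ts.tv_nsec = (sleep_us % 1000000) * 1000;
	nanosleep(&ts, NULL);
}

/* Each round burns, rests, then advances one "second" on the clock. */
static unsigned long run_rounds(unsigned long rounds, long long run_us,
				long long sleep_us)
{
	unsigned long ticks = 0;

	while (ticks < rounds) {
		burn_cpu(run_us);
		rest(sleep_us);
		ticks++;	/* hypothetical: redraw the clock hands here */
	}
	return ticks;
}
```

With sleep_us set to 0 the loop never blocks voluntarily, so how far the
tick counter gets in a fixed wall-clock interval depends only on how much
CPU the scheduler hands to the process.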

Ah, and it supports -geometry ;-)

It could become a useful scheduler benchmark !

Have fun !
Willy


* Re: Renice X for cpu schedulers
  2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
                                                             ` (2 preceding siblings ...)
  2007-04-19 18:16                                           ` Gene Heskett
@ 2007-04-19 19:26                                           ` Ray Lee
  2007-04-19 22:56                                             ` Con Kolivas
  2007-04-20  4:09                                             ` Nick Piggin
  3 siblings, 2 replies; 712+ messages in thread
From: Ray Lee @ 2007-04-19 19:26 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote:
> The one fly in the ointment for
> linux remains X. I am still, to this moment, completely and utterly stunned
> at why everyone is trying to find increasingly complex unique ways to manage
> X when all it needs is more cpu[1].
[...and hence should be reniced]

The problem is that X is not unique. There's postgresql, memcached,
mysql, db2, a little embedded app I wrote... all of these perform work
on behalf of another process. It's just most *noticeable* with X, as
pretty much everyone is running that.

If we had some way for the scheduler to decide to donate part of a
client process's time slice to the server it just spoke to (with an
exponential dampening factor -- take 50% from the client, give 25% to
the server, toss the rest on the floor), that -- from my naive point
of view -- would be a step toward fixing the underlying issue. Or I
might be spouting crap, who knows.
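
To make the arithmetic of that idea concrete, here is a toy sketch of a
single donation step. It is not taken from any scheduler; the struct and
function names are invented for illustration, and the 50%/25% split is
simply the example figure given above.

```c
#include <assert.h>

/* Result of one donation step, all values in microseconds. */
struct slice_donation {
	unsigned int client_left_us;	/* what the client keeps */
	unsigned int server_gain_us;	/* what the server receives */
	unsigned int discarded_us;	/* tossed on the floor as dampening */
};

/*
 * Take 50% of the client's remaining slice, hand half of that (25% of
 * the original) to the server, and throw the rest away.
 */
static struct slice_donation donate_slice(unsigned int client_slice_us)
{
	struct slice_donation d;
	unsigned int taken = client_slice_us / 2;

	d.server_gain_us = taken / 2;
	d.discarded_us = taken - d.server_gain_us;
	d.client_left_us = client_slice_us - taken;
	return d;
}
```

For a 100 us slice this leaves the client with 50 us, gives the server
25 us and discards 25 us, so the total handed out shrinks on every hop.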

The problem is real, though, and not limited to X.

While I have the floor, thank you, Con, for all your work.

Ray

* Re: CFS and suspend2: hang in atomic copy
  2007-04-19  6:28               ` Ingo Molnar
@ 2007-04-19 20:32                 ` Christian Hesse
  0 siblings, 0 replies; 712+ messages in thread
From: Christian Hesse @ 2007-04-19 20:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

On Thursday 19 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > I now got some error message from my system:
> >
> > http://www.eworm.de/tmp/cfs-suspend.jpg
>
> ah, this pinpoints a bug: for performance reasons pick_next_task()
> assumes that the runqueue is not empty - which is true for schedule(),
> but not in migrate_dead_tasks(). Does the patch below fix the crash for
> you?
>
>  kernel/sched.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned
>  	struct task_struct *next;
>
>  	for (;;) {
> +		if (!rq->nr_running)
> +			break;
>  		next = pick_next_task(rq, rq->curr);
>  		if (!next)
>  			break;

Suspend works perfectly with this patch. Thanks a lot and keep up the good 
work!
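
For readers who want to see the shape of the bug outside the kernel, here
is a toy userspace model of the guard added by the patch quoted above.
Everything in it (toy_rq, toy_pick_next_task, toy_drain) is hypothetical
and only mirrors the idea: the fast-path pick assumes a non-empty queue,
so a drain loop has to check nr_running before each pick.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RQ_MAX 8

struct toy_task {
	int pid;
};

struct toy_rq {
	struct toy_task *tasks[TOY_RQ_MAX];
	unsigned int nr_running;
};

/* Fast path: the caller must guarantee the queue is not empty. */
static struct toy_task *toy_pick_next_task(struct toy_rq *rq)
{
	assert(rq->nr_running > 0);
	return rq->tasks[--rq->nr_running];
}

/* Drain loop with the same guard as the patch; returns tasks drained. */
static unsigned int toy_drain(struct toy_rq *rq)
{
	unsigned int drained = 0;

	for (;;) {
		if (!rq->nr_running)
			break;
		toy_pick_next_task(rq);
		drained++;
	}
	return drained;
}
```

Without the nr_running check, the last iteration would call the pick
function on an empty queue, which is the situation the patch avoids.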
-- 
Regards,
Chris

* Re: Renice X for cpu schedulers
  2007-04-19 18:16                                           ` Gene Heskett
@ 2007-04-19 21:35                                             ` Michael K. Edwards
  2007-04-19 22:47                                             ` Con Kolivas
  1 sibling, 0 replies; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-19 21:35 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, Gene Heskett <gene.heskett@gmail.com> wrote:
> Having tried re-nicing X a while back, and having the rest of the system
> suffer in quite obvious ways for even 1 + or - from its default felt pretty
> bad from this user's perspective.
>
> It is my considered opinion (yeah I know, I'm just a leaf in the hurricane of
> this list) that if X has to be re-niced from the 1 point advantage it's had
> for ages, then something is basically wrong with the overall scheduling, cpu or
> i/o, or both in combination.  FWIW I'm using cfq for i/o.

I think I just realized why the X server is such a problem.  If it
gets preempted when it's not actually selecting/polling over a set of
fds that includes the input devices, the scheduler doesn't know that
it's a good candidate for scheduling when data arrives on those
devices.  (That's all that any of these dynamic priority heuristics
really seem to do -- weight the scheduler towards switching to
conspicuously I/O bound tasks when they become runnable, without the
forced preemption on lock release that would result from a true
priority inheritance mechanism.)

One way of looking at this is that "fairness-driven" scheduling is a
poor man's priority ceiling protocol for I/O bound workloads, with the
implicit priority of an fd or lock given by how desperately the reader
side needs more data in order to accomplish anything.  "Nice" on a
task is sort of an indirect way of boosting or dropping the base
priority of the fds it commonly waits on.  I recognize this is a
drastic oversimplification, and possibly even a misrepresentation of
the design _intent_; but I think it's fairly accurate in terms of the
design _effect_.

The event-driven, non-threaded design of the X server makes it
particularly vulnerable to "non-interactive behavior" penalties, which
is appropriate to the extent that it's an output device having trouble
keeping up with rendering -- in fact, that's exactly the throttling
mechanism you need in order to exert back-pressure on the X client.
(Trying to exert back-pressure over Linux's local domain sockets seems
to be like pushing on a rope, but that's a different problem.)  That
same event-driven design would prioritize input events just fine --
except the scheduler won't wake the task in order to deliver them,
because as far as it's concerned the X server is getting more than
enough I/O to keep it busy.  It's not only not blocked on the input
device, it isn't even selecting on it at the moment that its timeslice
expires -- so no amount of poor-man's PCP emulation is going to help.

What "more negative nice on the X server than on any CPU-bound
process" seems to do is to put the X server on a hair-trigger,
boosting its dynamic priority in a render-limited scenario (on some
graphics cards!) just enough to cancel the penalty for non-interactive
behavior.  It's forced to share _some_ CPU cycles, but nobody else is
allowed a long enough timeslice to keep the X server off the CPU (and
insensitive to input events) for long.  Not terribly efficient in
terms of context switch / cache eviction overhead, but certainly
friendlier to the PEBCAK (who is clearly putting totally inappropriate
load on a single-threaded CPU by running both a local X server and
non-SCHED_BATCH compute jobs) than a frozen mouse cursor.

So what's the right answer?  Not special-casing the X server, that's
for sure.  If this analysis is correct (and as of now it's pure
speculation), any event-driven application that does compute work
opportunistically in the absence of user interaction is vulnerable to
the same overzealous squelching.  I wouldn't design a new application
that way, of course -- user interaction belongs in a separate thread
on any UNIX-legacy system which assigns priorities to threads of
control instead of to patterns of activity.  But all sorts of Linux
applications have been designed to implicitly elevate artificial
throughput benchmarks over user responsiveness -- that has been the
UNIX way at least since SVR4, and Linux's history of expensive thread
switches prior to NPTL didn't help.

If you want responsiveness when the CPU is oversubscribed -- and I for
one do, which is one reason why I abandoned the Linux desktop once
both Microsoft and Apple figured out how to make hyperthreading work
in their favor -- you should probably think about how to get it
without rewriting half of userspace.  IMHO, dinking around with
"fairness", as if there were any relationship these days between UIDs
or process groups or any other control structure and the work that's
trying to flow through the system, is not going to get you there.

If this were my problem, I might start by attaching urgency to
behavior instead of to thread ID, which demands a scheduler queue
built around a data structure with a cheap decrease-key operation.
I'd figure out how to propagate this urgency not just along lock
chains but also along chains of fds that need flushing (or refilling)
-- even if the reader (or writer) got preempted for unrelated reasons.
Tie appropriate urgency to audio and input devices, and SCHED_FIFO
can pretty much go away along with nice -10 X.
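
As a sketch of what "a data structure with a cheap decrease-key
operation" could look like, here is an indexed binary min-heap keyed by
an urgency value (smaller means more urgent). This is an assumption made
for illustration: a binary heap gives O(log n) decrease-key, other heaps
do better, and none of these names come from any real scheduler.

```c
#include <assert.h>

#define MAX_TASKS 64

struct urgency_heap {
	int key[MAX_TASKS];	/* key[i]: urgency of heap slot i */
	int id[MAX_TASKS];	/* id[i]: task stored in heap slot i */
	int pos[MAX_TASKS];	/* pos[t]: heap slot of task t, or -1 */
	int size;
};

static void heap_init(struct urgency_heap *h)
{
	int t;

	h->size = 0;
	for (t = 0; t < MAX_TASKS; t++)
		h->pos[t] = -1;
}

static void heap_swap(struct urgency_heap *h, int a, int b)
{
	int k = h->key[a], t = h->id[a];

	h->key[a] = h->key[b];
	h->id[a] = h->id[b];
	h->key[b] = k;
	h->id[b] = t;
	h->pos[h->id[a]] = a;
	h->pos[h->id[b]] = b;
}

static void sift_up(struct urgency_heap *h, int i)
{
	while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {
		heap_swap(h, i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
}

static void sift_down(struct urgency_heap *h, int i)
{
	for (;;) {
		int l = 2 * i + 1, r = l + 1, m = i;

		if (l < h->size && h->key[l] < h->key[m])
			m = l;
		if (r < h->size && h->key[r] < h->key[m])
			m = r;
		if (m == i)
			break;
		heap_swap(h, i, m);
		i = m;
	}
}

static void heap_insert(struct urgency_heap *h, int task, int key)
{
	int i;

	assert(task >= 0 && task < MAX_TASKS);
	assert(h->pos[task] == -1 && h->size < MAX_TASKS);
	i = h->size++;
	h->key[i] = key;
	h->id[i] = task;
	h->pos[task] = i;
	sift_up(h, i);
}

/* Raise a task's urgency in place, e.g. when input arrives on its fd. */
static void heap_decrease_key(struct urgency_heap *h, int task, int new_key)
{
	int i = h->pos[task];

	assert(i >= 0 && new_key <= h->key[i]);
	h->key[i] = new_key;
	sift_up(h, i);
}

/* Remove and return the most urgent task. */
static int heap_pop_min(struct urgency_heap *h)
{
	int task;

	assert(h->size > 0);
	task = h->id[0];
	heap_swap(h, 0, h->size - 1);	/* move the last entry to the root */
	h->size--;
	h->pos[task] = -1;
	sift_down(h, 0);
	return task;
}
```

The pos[] index is what makes decrease-key practical: the queue can find
a task's slot directly instead of searching for it when its urgency
changes.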

Then I would use the fact that taking an uncontended futex is
impressively cheap, ask Ulrich for an extra class of thread-private
recursive mutexes streamlined for the never-contended case, and
encourage application developers to use them to bracket any short
section that would prefer not to have its cache footprint evicted out
from under it.  Unless things get pretty nasty, when you're too
impatient to wait for a task to block, you want to preempt it at a
local minimum in its working set; the easy way to do this is to have a
per-CPU "soft preemption" timer that causes a per-CPU kernel thread to
attempt to take the foreground task's thread-private mutex.  The
foreground task will then block on the next entry into a "cache-hot"
section, and you can remember its appetite for CPU cycles by the
urgency on its thread-private futex.

For better or for worse, this is far more work than _I'm_ likely to do
on a volunteer basis, and whether or not this analysis is right, it is
guaranteed to fail with -ENOPATCH.

Oh, by the way -- maybe someone should look at whether a little
backpressure on /tmp/.X11-unix/X0 and friends helps, too.  (Evidently
it works right over pipes, or else hardly anything on Linux would
DWIM.)  That might succeed in papering over the problem, without
requiring actual design effort before wading into the code.  Then
again, I haven't actually looked at the local domain socket code, so
for all I know there's already backpressure there but X.org's
excessive cleverness defeats it.

Cheers,
- Michael

* Re: Renice X for cpu schedulers
  2007-04-19 18:16                                           ` Gene Heskett
  2007-04-19 21:35                                             ` Michael K. Edwards
@ 2007-04-19 22:47                                             ` Con Kolivas
  2007-04-20  2:00                                               ` Gene Heskett
                                                                 ` (2 more replies)
  1 sibling, 3 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-19 22:47 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Friday 20 April 2007 04:16, Gene Heskett wrote:
> On Thursday 19 April 2007, Con Kolivas wrote:
>
> [and I snipped a good overview]
>
> >So yes go ahead and think up great ideas for other ways of metering out
> > cpu bandwidth for different purposes, but for X, given the absurd
> > simplicity of renicing, why keep fighting it? Again I reiterate that most
> > users of SD have not found the need to renice X anyway except if they
> > stick to old habits of make -j4 on uniprocessor and the like, and I
> > expect that those on CFS and Nicksched would also have similar
> > experiences.
>
> FWIW folks, I have never touched x's niceness, it's running at the default
> -1 for all of my so-called 'tests', and I have another set to be rebooted
> to right now.  And yes, my kernel makeit script uses -j4 by default, and
> has used -j8 just for effects, which weren't all that different from what I
> expected in 'abusing' a UP system that way.  The system DID remain usable,
> not snappy, but usable.

Gene, you're agreeing with me. You've shown that you're very happy with a fair 
distribution of cpu and leaving X at nice 0.
>
> Having tried re-nicing X a while back, and having the rest of the system
> suffer in quite obvious ways for even 1 + or - from its default felt pretty
> bad from this user's perspective.
>
> It is my considered opinion (yeah I know, I'm just a leaf in the hurricane
> of this list) that if X has to be re-niced from the 1 point advantage it's
> had for ages, then something is basically wrong with the overall scheduling,
> cpu or i/o, or both in combination.  FWIW I'm using cfq for i/o.

It's those who want X to have an unfair advantage that want it to do 
something "special". Your agreement that it works fine at nice 0 shows you 
don't want it to have an unfair advantage. Others who want it to have an 
unfair advantage _can_ renice it if they desire. But if the cpu scheduler 
gives X an unfair advantage within the kernel by default then you have _no_ 
choice. If you leave the choice up to userspace (renice or not) then both 
parties get their way. If you put it into the kernel only one party wins and 
there is no way for the Genes (and Cons) of this world to get it back.
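
For reference, the userspace choice described above comes down to a
single system call. The following is a minimal sketch using the POSIX
setpriority()/getpriority() calls; the wrapper names are made up for this
example. Note the assumption about privileges: giving a process a
negative nice value (as people do for X) normally requires root or
CAP_SYS_NICE on Linux, while raising the nice value does not.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Set the nice value of a process; pid 0 means the calling process. */
static int renice_process(pid_t pid, int nice_value)
{
	return setpriority(PRIO_PROCESS, (id_t)pid, nice_value);
}

/*
 * Read the nice value of a process. getpriority() can legitimately
 * return -1, so errno has to be cleared first and checked afterwards.
 */
static int current_nice(pid_t pid, int *nice_out)
{
	int val;

	errno = 0;
	val = getpriority(PRIO_PROCESS, (id_t)pid);
	if (val == -1 && errno != 0)
		return -1;
	*nice_out = val;
	return 0;
}
```

This is the same thing renice(1) does from the shell, so the policy stays
entirely in the hands of whoever runs the machine.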

Your opinion is as valuable as everyone else's, Gene. It is hard to get people
to speak on as frightening a playground as the linux kernel mailing list, so
please do.

-- 
-ck

* Re: Renice X for cpu schedulers
  2007-04-19 19:26                                           ` Ray Lee
@ 2007-04-19 22:56                                             ` Con Kolivas
  2007-04-20  0:20                                               ` Michael K. Edwards
  2007-04-20  0:56                                               ` Ray Lee
  2007-04-20  4:09                                             ` Nick Piggin
  1 sibling, 2 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-19 22:56 UTC (permalink / raw)
  To: ray-gmail
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Friday 20 April 2007 05:26, Ray Lee wrote:
> On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote:
> > The one fly in the ointment for
> > linux remains X. I am still, to this moment, completely and utterly
> > stunned at why everyone is trying to find increasingly complex unique
> > ways to manage X when all it needs is more cpu[1].
>
> [...and hence should be reniced]
>
> The problem is that X is not unique. There's postgresql, memcached,
> mysql, db2, a little embedded app I wrote... all of these perform work
> on behalf of another process. It's just most *noticeable* with X, as
> pretty much everyone is running that.
>
> If we had some way for the scheduler to decide to donate part of a
> client process's time slice to the server it just spoke to (with an
> exponential dampening factor -- take 50% from the client, give 25% to
> the server, toss the rest on the floor), that -- from my naive point
> of view -- would be a step toward fixing the underlying issue. Or I
> might be spouting crap, who knows.
>
> The problem is real, though, and not limited to X.
>
> While I have the floor, thank you, Con, for all your work.

You're welcome and thanks for taking the floor to speak. I would say you have
actually agreed with me though. X is not unique, it's just the most obvious
case, so let's not design the cpu scheduler around the problem with X. Same
goes for every other application. The choice to hand out differential cpu
usage to the applications that seem to need it should be up to the users. The
donation idea has been done before in some fashion or other in things like
"back-boost", which Linus himself tried in the 2.5.X days. It worked lovely
till it did the wrong thing and wreaked havoc. As is shown repeatedly, the
workarounds, the tweaks, the bonuses and the decisions on who to give
advantage to, when made by the cpu scheduler, are also its undoing, as it
can't always get them right. The consequences of getting it wrong, on the
other hand, are disastrous. The cpu scheduler core is a cpu bandwidth and
latency proportionator and should be nothing more or less.

-- 
-ck

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 15:18                   ` Ingo Molnar
  2007-04-19 17:34                     ` Gene Heskett
  2007-04-19 18:45                     ` Willy Tarreau
@ 2007-04-19 23:52                     ` Jan Knutar
  2007-04-20  5:05                       ` Willy Tarreau
  2 siblings, 1 reply; 712+ messages in thread
From: Jan Knutar @ 2007-04-19 23:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Willy Tarreau, Nick Piggin, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Thursday 19 April 2007 18:18, Ingo Molnar wrote:
> * Willy Tarreau <w@1wt.eu> wrote:
> > You can certainly script it with -geometry. But it is the wrong
> > application for this matter, because you benchmark X more than
> > glxgears itself. What would be better is something like a line
> > rotating 360 degrees and doing some short stuff between each
> > degree, so that X is not much solicited, but the CPU would be
> > spent more on the processes themselves.
>
> at least on my setup glxgears goes via DRI/DRM so there's no X
> scheduling inbetween at all, and the visual appearance of glxgears is
> a direct function of its scheduling.

How much of the subjective interactiveness-feel of the desktop is at the 
mercy of the X server's scheduling and not the cpu scheduler?

I've noticed that video playback is significantly smoother and resistant 
to other load, when using MPlayer's opengl output, especially if 
"heavy" programs are running at the same time. Especially firefox and 
ksysguard seem to have found a way to cause video through Xv to look 
annoyingly jittery.

* Re: Renice X for cpu schedulers
  2007-04-19 16:15                                               ` Mark Lord
  2007-04-19 18:21                                                 ` Gene Heskett
@ 2007-04-20  0:17                                                 ` Con Kolivas
  2007-04-20  1:17                                                 ` Ed Tomlinson
  2 siblings, 0 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-20  0:17 UTC (permalink / raw)
  To: Mark Lord
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Friday 20 April 2007 02:15, Mark Lord wrote:
> Con Kolivas wrote:
> > On Thursday 19 April 2007 23:17, Mark Lord wrote:
> >> Con Kolivas wrote:
> >>> So yes go ahead and think up great ideas for other ways of metering out cpu
> >>> bandwidth for different purposes, but for X, given the absurd
> >>> simplicity of renicing, why keep fighting it? Again I reiterate that
> >>> most users of SD have not found the need to renice X anyway except if
> >>> they stick to old habits of make -j4 on uniprocessor and the like, and
> >>> I expect that those on CFS and Nicksched would also have similar
> >>> experiences.
> >>
> >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
> >> on my 2GHz P-M single-core non-HT machine with SD.
> >>
> >> But with the very first posted version of CFS by Ingo,
> >> I can do "make -j2" no problem and still have a nicely interactive
> >> desktop.
> >
> > Cool. Then there's clearly a bug with SD that manifests on your machine
> > as it should not have that effect at all (and doesn't on other people's
> > machines). I suggest trying the latest version which fixes some bugs.
>
> SD just doesn't do nearly as good as the stock scheduler, or CFS, here.
>
> I'm quite likely one of the few single-CPU/non-HT testers of this stuff.
> If it should ever get more widely used I think we'd hear a lot more
> complaints.

You are not really one of the few. A lot of my own work is done on a single
core Pentium M 1.7GHz laptop. I am not endowed with truckloads of hardware
like all the paid developers are. I recall extreme frustration myself when a
developer a few years ago (around 2002) said he couldn't reproduce poor
behaviour on his 4GB ram 4 x Xeon machine. Even today if I add up every
machine I have at my disposal at home and at work, it doesn't amount to that
many cpus and that much ram.

-- 
-ck

* Re: Renice X for cpu schedulers
  2007-04-19 22:56                                             ` Con Kolivas
@ 2007-04-20  0:20                                               ` Michael K. Edwards
  2007-04-20  5:34                                                 ` Bill Huey
  2007-04-20  0:56                                               ` Ray Lee
  1 sibling, 1 reply; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-20  0:20 UTC (permalink / raw)
  To: Con Kolivas
  Cc: ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote:
> The cpu scheduler core is a cpu bandwidth and latency
> proportionator and should be nothing more or less.

Not really.  The CPU scheduler is (or ought to be) what electric
utilities call an economic dispatch mechanism -- a real-time
controller whose goal is to service competing demands cost-effectively
from a limited supply, without compromising system stability.

If you live in the 1960's, coal and nuclear (and a little bit of
fig-leaf hydro) are all you have, it takes you twelve hours to bring
plants on and off line, and there's no live operational control or
pricing signal between you and your customers.  So you're stuck
running your system at projected peak + operating margin, dumping
excess power as waste heat most of the time, and browning or blacking
people out willy-nilly when there's excess demand.  Maybe you get to
trade off shedding the loads with the worst transmission efficiency
against degrading the customers with the most tolerance for brownouts
(or the least regulatory clout).  That's life without modern economic
dispatch.

If you live in 2007, natural gas and (outside the US) better control
over nuclear plants give you more ability to ramp supply up and down
with demand on something like a 15-minute cycle.  Better yet, you can
store a little energy "in the grid" to smooth out instantaneous demand
fluctuations; if you're lucky, you also have enough fast-twitch hydro
(thanks, Canada!) that you can run your coal and lame-ass nuclear very
close to base load even when gas is expensive, and even pump water
back uphill when demand dips.  (Coal is nasty stuff and a worse
contributor by far to radiation exposure than nuclear generation; but
on current trends it's going to last a lot longer than oil and gas,
and it's a lot easier to stockpile next to the generator.)

Best of all, you have industrial customers who will trade you live
control (within limits) over when and how much power they take in
return for a lower price per unit energy.  Some of them will even dump
power back into the grid when you ask them to.  So now the biggest
challenge in making supply and demand meet (in the short term) is to
damp all the different ways that a control feedback path might result
in an oscillation -- or in runaway pricing.  Because there's always
some asshole greedhead who will gamble with system stability in order
to game the pricing mechanism.  Lots of 'em, if you're in California
and your legislature is so dumb, or so bought, that they let the
asshole greedheads design the whole system so they can game it to the
max.  (But that's a whole 'nother rant.)

Embedded systems are already in 2007, and the mainline Linux scheduler
frankly sucks on them, because it thinks it's back in the 1960's with
a fixed supply and captive demand, pissing away "CPU bandwidth" as
waste heat.  Not to say it's an easy problem; even academics with a
dozen publications in this area don't seem to be able to model energy
usage to the nearest big O, let alone design a stable economic
dispatch engine.  But it helps to acknowledge what the problem is:
even in a 1960's raised-floor screaming-air-conditioners
screw-the-power-bill machine room, you can't actually run a
half-decent CPU flat out any more without burning it to a crisp.

You can act ignorant and let the PMIC brown you out when it has to.
Or you can start coping in mainline the way that organizations big
enough (and smart enough) to feel the heat in their pocketbooks do in
their pet kernels.  (Boo on Google for not sharing, and props to IBM
for doing their damnedest.)  And guess what?  The system will actually
get simpler, and stabler, and faster, and easier to maintain, because
it'll be based on a real theory of operation with equations and things
instead of a bunch of opaque, undocumented shotgun heuristics.

This hypothetical economic-dispatch scheduler will still _have_
heuristics, of course -- you can't begin to model a modern CPU
accurately on-line.  But they will be contained in _data_ rather than
_code_, and issues of numerical stability will be separated cleanly
from the rule set.  You'll be able to characterize the rule set's
domain of stability, given a conservative set of assumptions about the
feedback paths in the system under control, with the sort of
techniques they teach in the engineering schools that none of us (me
included) seem to have attended.  (I went to school thinking I was
going to be a physicist.  Wishful thinking -- but I was young and
stupid.  What's your excuse?  ;-)

OK, it feels better to have that off my chest.  Apologies to those
readers -- doubtless the vast majority of LKML, including everyone
else in this thread -- for whom it's irrelevant, pseudo-learned
pontification with no patch attached.  And my sincere thanks to Ingo,
Con, and really everyone else CC'ed, without whom Linux wouldn't be as
good as it is (really quite good, all things considered) and wouldn't
contribute as much as it does to my own livelihood.

Cheers,
- Michael

* Re: Renice X for cpu schedulers
  2007-04-19 22:56                                             ` Con Kolivas
  2007-04-20  0:20                                               ` Michael K. Edwards
@ 2007-04-20  0:56                                               ` Ray Lee
  1 sibling, 0 replies; 712+ messages in thread
From: Ray Lee @ 2007-04-20  0:56 UTC (permalink / raw)
  To: Con Kolivas
  Cc: ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Con Kolivas wrote:
> You're welcome and thanks for taking the floor to speak. I would say you have
> actually agreed with me though. X is not unique, it's just the most obvious
> case, so let's not design the cpu scheduler around the problem with X. Same
> goes for every other application. The choice to hand out differential cpu
> usage to the applications that seem to need it should be up to the users. The
> donation idea has been done before in some fashion or other in things like
> "back-boost", which Linus himself tried in the 2.5.X days. It worked lovely
> till it did the wrong thing and wreaked havoc.

<nod> I know. I came to the party late, or I would have played with it back
then. Perhaps you could correct me, but it seems his back-boost didn't do
any dampening, which means the system could get into nasty capture scenarios,
where two processes bouncing messages back and forth could take over the
scheduler and starve out the rest. It seems pretty obvious in hindsight
that something without exponential dampening would allow feedback loops.
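
A toy model makes the difference visible. This is only a sketch of the
feedback argument, not a reconstruction of the actual back-boost code:
the function name and the idea of passing a "boost" along with each
message are assumptions made for the example.

```c
#include <assert.h>

/*
 * Two processes bounce messages back and forth, and each hop passes on
 * keep_num/keep_den of the boost that arrived with the message.
 * Returns the boost still in circulation after the given number of hops.
 */
static unsigned int boost_after_hops(unsigned int boost_us, unsigned int hops,
				     unsigned int keep_num,
				     unsigned int keep_den)
{
	unsigned int i;

	assert(keep_den != 0 && keep_num <= keep_den);
	for (i = 0; i < hops; i++)
		boost_us = boost_us * keep_num / keep_den;
	return boost_us;
}
```

With keep_num/keep_den = 1/1 (no dampening) a 1024 us boost is still
1024 us after any number of hops, so a ping-pong pair can hold on to it
indefinitely. With 1/2 it is down to 1 us after ten hops and gone after
eleven.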

Regardless, perhaps we are in agreement. I just don't like the idea of having
to guess how much work postgresql is going to be doing on my client processes'
behalf. Worse, I don't necessarily want it to have that -10 priority when
it's going and updating statistics or whatnot, or any other housekeeping
activity that shouldn't make a noticeable impact on the rest of the system.
Worst, I'm leery of the idea that if I get its nice level wrong, that I'm
going to be affecting the overall throughput of the server.

All of which are only hypothetical worries, granted.

Anyway, I'll shut up now. Thanks again for stickin' with it.

Ray

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 16:15                                               ` Mark Lord
  2007-04-19 18:21                                                 ` Gene Heskett
  2007-04-20  0:17                                                 ` Con Kolivas
@ 2007-04-20  1:17                                                 ` Ed Tomlinson
  2007-04-20  1:27                                                   ` Linus Torvalds
  2 siblings, 1 reply; 712+ messages in thread
From: Ed Tomlinson @ 2007-04-20  1:17 UTC (permalink / raw)
  To: Mark Lord
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thursday 19 April 2007 12:15, Mark Lord wrote:
> Con Kolivas wrote:
> > On Thursday 19 April 2007 23:17, Mark Lord wrote:
> >> Con Kolivas wrote:
> >>> So yes go ahead and think up great ideas for other ways of metering out cpu
> >>> bandwidth for different purposes, but for X, given the absurd simplicity
> >>> of renicing, why keep fighting it? Again I reiterate that most users of
> >>> SD have not found the need to renice X anyway except if they stick to old
> >>> habits of make -j4 on uniprocessor and the like, and I expect that those
> >>> on CFS and Nicksched would also have similar experiences.
> >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
> >> on my 2GHz P-M single-core non-HT machine with SD.
> >>
> >> But with the very first posted version of CFS by Ingo,
> >> I can do "make -j2" no problem and still have a nicely interactive desktop.
> > 
> > Cool. Then there's clearly a bug with SD that manifests on your machine as it 
> > should not have that effect at all (and doesn't on other people's machines). 
> > I suggest trying the latest version which fixes some bugs.
> 
> SD just doesn't do nearly as good as the stock scheduler, or CFS, here.
> 
> I'm quite likely one of the few single-CPU/non-HT testers of this stuff.
> If it should ever get more widely used I think we'd hear a lot more complaints.

amd64 UP here.  SD with several makes running works just fine.

Ed Tomlinson

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-20  1:17                                                 ` Ed Tomlinson
@ 2007-04-20  1:27                                                   ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-20  1:27 UTC (permalink / raw)
  To: Ed Tomlinson
  Cc: Mark Lord, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner



On Thu, 19 Apr 2007, Ed Tomlinson wrote:
> > 
> > SD just doesn't do nearly as good as the stock scheduler, or CFS, here.
> > 
> > I'm quite likely one of the few single-CPU/non-HT testers of this stuff.
> > If it should ever get more widely used I think we'd hear a lot more complaints.
> 
> amd64 UP here.  SD with several makes running works just fine.

The thing is, it probably depends *heavily* on just how much work the X 
server ends up doing. Fast video hardware? The X server doesn't need to 
busy-wait much. Not a lot of eye-candy? The X server is likely fast enough 
even with a slower card that it still gets sufficient CPU time and isn't 
getting dinged by any balancing. DRI vs non-DRI? Which window manager 
(maybe some of the user-visible lags come from there..) etc etc.

Anyway, I'd ask people to look a bit at the current *regressions* instead 
of spending all their time on something that won't even be merged before 
2.6.21 is released, and we thus have some more pressing issues. Please?

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 14:22                                               ` Lee Revell
@ 2007-04-20  1:32                                                 ` Michael K. Edwards
  2007-04-20  5:25                                                   ` Bill Huey
  0 siblings, 1 reply; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-20  1:32 UTC (permalink / raw)
  To: Lee Revell
  Cc: Peter Williams, Con Kolivas, Ingo Molnar, Andrew Morton,
	Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, Lee Revell <rlrevell@joe-job.com> wrote:
> IMHO audio streamers should use SCHED_FIFO thread for time critical
> work.  I think it's insane to expect the scheduler to figure out that
> these processes need low latency when they can just be explicit about
> it.  "Professional" audio software does it already, on Linux as well
> as other OS...

It is certainly true that SCHED_FIFO is currently necessary in the
layers of an audio application lying closest to the hardware, if you
don't want to throw a monstrous hardware ring buffer at the problem.
See the alsa-devel archives for a patch to aplay (sched_setscheduler
plus some cleanups) that converts it from "unsafe at any speed" (on a
non-RT kernel) to a rock-solid 18ms round trip from PCM in to PCM out.
 (The hardware and driver aren't terribly exotic for an SoC, and the
measurement was done with aplay -C | aplay -P -- on a
not-particularly-tuned CONFIG_PREEMPT kernel with a 12ms+ peak
scheduling latency according to cyclictest.  A similar test via
/dev/dsp, done through a slightly modified OSS emulation layer to the
same driver, measures at 40ms and is probably tuned too
conservatively.)
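
For readers who have not used it, a minimal sketch of what asking for
SCHED_FIFO looks like from userspace, here through Python's wrapper
around sched_setscheduler(2). This is not the aplay patch referred to
above; the helper name and the priority value of 10 are assumptions for
illustration only, and the call needs suitable privileges (CAP_SYS_NICE
or an RLIMIT_RTPRIO allowance) to succeed.

```python
import os


def request_fifo(priority=10):
    """Try to move the calling process to SCHED_FIFO.

    Returns True on success, False if the platform lacks the call or
    the process is not allowed to use a real-time policy.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False              # platform without sched_setscheduler
    try:
        # pid 0 means "the caller"; priority 10 is an arbitrary example.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:               # includes PermissionError
        return False
    return True


if __name__ == "__main__":
    print("SCHED_FIFO granted:", request_fifo())
```

Only the thread that actually feeds the device would normally do this;
a SCHED_FIFO task that never blocks will starve everything below it.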

Note that SCHED_FIFO may be less necessary on an -rt kernel, but I
haven't had that option on the embedded hardware I've been working
with lately.  Ingo, please please pretty please pick a -stable branch
one of these days and provide a git repo with -rt integrated against
that branch.  Then I could port our chip support to it -- all of which
will be GPLed after the impending code review -- after which I might
have a prayer of strong-arming our chip vendor into porting their WiFi
driver onto -rt.  It's really a much more interesting scheduler use
case than make -j200 under X, because it's a best-effort
SCHED_BATCH-ish load that wants to be temporally clustered for power
management reasons.

(Believe it or not, a stable -rt branch with a clock-scaling-aware
scheduler is the one thing that might lead to this major WiFi vendor's
GPLing their driver core.  They're starting to see the light on the
biz dev side, and the nature of the devices their chip will go in
makes them somewhat less concerned about the regulatory fig leaf
aspect of a closed-source driver; but they would have to port off of
the third-party real-time executive embedded within the driver, and
mainline's task and timer granularity won't cut it.  I can't even get
more detail about _why_ it won't cut it unless there's some remotely
supportable -rt base they could port to.)

But I think SCHED_FIFO on a chain of tasks is fundamentally not the
right way to handle low audio latency.  The object with a low latency
requirement isn't the task, it's the device.  When it's starting to
get urgent to deliver more data to the device, the task that it's
waiting on should slide up the urgency scale; and if it's waiting on
something else, that something else should slide up the scale; and so
forth.  Similarly, responding to user input is urgent; so when user
input is available (by whatever mechanism), the task that's waiting
for it should slide up the urgency scale, etc.
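
A rough sketch of the "slide up the urgency scale" idea, assuming each
task is blocked on at most one other task. The task names, the integer
urgency scale and the function name are all hypothetical; nothing here
corresponds to an existing kernel interface.

```python
def propagate_urgency(waits_on, urgency, start, level):
    """Raise `start` and everything it transitively waits on to `level`.

    waits_on maps a task name to the task it is blocked on (or None).
    urgency maps task names to their current urgency; it is updated in
    place and never lowered.  The seen-set guards against wait cycles.
    """
    seen = set()
    task = start
    while task is not None and task not in seen:
        seen.add(task)
        urgency[task] = max(urgency.get(task, 0), level)
        task = waits_on.get(task)
    return urgency


# The audio device is running low, so the task feeding it becomes urgent,
# and so does whatever that task is waiting on, and so on down the chain.
waits_on = {
    "playback": "decoder",
    "decoder": "file_reader",
    "file_reader": None,
    "make": None,
}
urgency = {name: 0 for name in waits_on}
propagate_urgency(waits_on, urgency, "playback", level=5)
print(urgency)
```

The unrelated "make" task keeps its old urgency, which is the point:
only the chain the device is actually waiting on gets pulled forward.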

In practice, you probably don't want to burden desktop Linux with
priority inheritance where you don't have to.  Priority queues with
algorithmically efficient decrease-key operations (Fibonacci heaps and
their ilk) are complicated to implement and have correspondingly high
constant factors.  (However, a sufficiently clever heuristic for
assigning quasi-static task priorities would usually short-circuit the
priority cascade; if you can keep N small in the
tasks-with-unpredictable-priority queue, you can probably use a
simpler flavor with O(log N) decrease-key.  Ask someone who knows more
about data structures than I do.)
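
One "simpler flavor" with O(log N) decrease-key is a plain binary heap
that keeps a map from each task to its slot in the array. The sketch
below is only an illustration of that data structure, with made-up
class and task names, and with lower keys meaning more urgent; it says
nothing about how a real run queue would be organised.

```python
class IndexedMinHeap:
    """Binary min-heap with a position map, giving O(log N) decrease-key."""

    def __init__(self):
        self._heap = []   # list of [key, task]
        self._pos = {}    # task -> index in self._heap

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] >= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] < self._heap[smallest][0]:
                    smallest = child
            if smallest == i:
                return
            self._swap(i, smallest)
            i = smallest

    def push(self, task, key):
        self._heap.append([key, task])
        self._pos[task] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def decrease_key(self, task, new_key):
        """Make `task` more urgent; the map finds its slot in O(1)."""
        i = self._pos[task]
        if new_key > self._heap[i][0]:
            raise ValueError("new key is larger than the current key")
        self._heap[i][0] = new_key
        self._sift_up(i)

    def pop(self):
        """Remove and return the (task, key) pair with the smallest key."""
        last = len(self._heap) - 1
        self._swap(0, last)
        key, task = self._heap.pop()
        del self._pos[task]
        if self._heap:
            self._sift_down(0)
        return task, key


queue = IndexedMinHeap()
queue.push("make", 20)
queue.push("xserver", 10)
queue.push("audio", 15)
queue.decrease_key("make", 5)     # "make" suddenly becomes the most urgent
print(queue.pop())
```

The position map is what turns decrease-key from an O(N) search into a
lookup plus one sift-up, at the cost of updating the map on every swap.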

More importantly, non-real-time application coders aren't very smart
about grouping data structure accesses on one side or the other of a
system call that is likely to release a lock and let something else
run, flushing application data out of cache.  (Kernel coders aren't
always smart about this either; see LKML threads a few weeks ago about
racy, and cache-stall-prone, f_pos handling in VFS.)  So switching
tasks immediately on lock release is usually the wrong thing to do if
letting the task run a little longer would allow it to reach a point
where it has to block anyway.

Anyway, I already described the urgency-driven strategy to the extent
that I've thought it out, elsewhere in this thread.  I only held this
draft back because I wanted to double-check my latency measurements.

Cheers,
- Michael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 22:47                                             ` Con Kolivas
@ 2007-04-20  2:00                                               ` Gene Heskett
  2007-04-20  2:01                                               ` Gene Heskett
  2007-04-20  5:24                                               ` Mike Galbraith
  2 siblings, 0 replies; 712+ messages in thread
From: Gene Heskett @ 2007-04-20  2:00 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thursday 19 April 2007, Con Kolivas wrote:
>On Friday 20 April 2007 04:16, Gene Heskett wrote:
>> On Thursday 19 April 2007, Con Kolivas wrote:
>>
>> [and I snipped a good overview]
>>
>> >So yes go ahead and think up great ideas for other ways of metering out
>> > cpu bandwidth for different purposes, but for X, given the absurd
>> > simplicity of renicing, why keep fighting it? Again I reiterate that
>> > most users of SD have not found the need to renice X anyway except if
>> > they stick to old habits of make -j4 on uniprocessor and the like, and I
>> > expect that those on CFS and Nicksched would also have similar
>> > experiences.
>>
>> FWIW folks, I have never touched x's niceness, its running at the default
>> -1 for all of my so-called 'tests', and I have another set to be rebooted
>> to right now.  And yes, my kernel makeit script uses -j4 by default, and
>> has used -j8 just for effects, which weren't all that different from what
>> I expected in 'abusing' a UP system that way.  The system DID remain
>> usable, not snappy, but usable.
>
>Gene, you're agreeing with me. You've shown that you're very happy with a
> fair distribution of cpu and leaving X at nice 0.

I was quite happy till Ingo's first patch came out, and it was even better, 
but I over-wrote it, and we're still figuring out just exactly what the magic 
twanger was that made it all click for me.  OTOH, I don't think that patch 
passed muster with Mike G., either.  We have obviously different workloads, 
and critical points in them.

>> Having tried re-nicing X a while back, and having the rest of the system
>> suffer in quite obvious ways for even 1 + or - from its default felt
>> pretty bad from this user's perspective.
>>
>> It is my considered opinion (yeah I know, I'm just a leaf in the hurricane
>> of this list) that if X has to be re-niced from the 1 point advantage it's
>> had for ages, then something is basically wrong with the overall scheduling,
>> cpu or i/o, or both in combination.  FWIW I'm using cfq for i/o.
>
>It's those who want X to have an unfair advantage that want it to do
>something "special". Your agreement that it works fine at nice 0 shows you
>don't want it to have an unfair advantage. Others who want it to have an
>unfair advantage _can_ renice it if they desire. But if the cpu scheduler
>gives X an unfair advantage within the kernel by default then you have _no_
>choice. If you leave the choice up to userspace (renice or not) then both
>parties get their way. If you put it into the kernel only one party wins and
>there is no way for the Genes (and Cons) of this world to get it back.
>
>Your opinion is as valuable as everyone else's Gene. It is hard to get people
>to speak on as frightening a playground as the linux kernel mailing list so
>please do.

In the FWIW category, htop has always told me that x is running at -1, not 
zero.  Now, I have NDI where this is actually set at, so I'd have to ask 
stupid questions here if I did wanna play with it.  Which I really don't, the 
last time I tried to -5 x, kde got a whole lot LESS responsive.  But heck, 
2.6.2 was freshly minted then too and I've long since forgot how I went about 
that unless I used htop to change it, the most likely scenario that I can 
picture at this late date. 

As for speaking my mind, yes, and I've been slapped down a few times, as much 
because I do a lot of bitching and microscopic amounts of patch submission. 
The only patch I ever submitted was for something in the floppy driver, way 
back in the middle of 2.2 days, rejected because I didn't know how to use the 
tools correctly.  I didn't, so it was a shrug and my feelings weren't hurt.

Some see that as an unbalanced set of books and I'm aware of it.  OTOH, I 
think I do a pretty good job of playing the canary here, and that should be 
worth something if for no other reason than I can turn into a burr under 
somebody's saddle when things go all aglay.  But I figure if it's happening to 
me, then if I don't fuss, and that gotcha gets into a distro kernel, there 
are gonna be a hell of a lot more folks than me trying to grab the 
microphone.

BTW, I'm glad you are feeling well enough to get into this again.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
There cannot be a crisis next week.  My schedule is already full.
		-- Henry Kissinger

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 13:17                                           ` Mark Lord
  2007-04-19 15:10                                             ` Con Kolivas
@ 2007-04-20  3:57                                             ` Nick Piggin
  2007-04-21 14:55                                               ` Mark Lord
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-20  3:57 UTC (permalink / raw)
  To: Mark Lord
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thu, Apr 19, 2007 at 09:17:25AM -0400, Mark Lord wrote:
> Con Kolivas wrote:
> >So yes go ahead and think up great ideas for other ways of metering out cpu 
> >bandwidth for different purposes, but for X, given the absurd simplicity 
> >of renicing, why keep fighting it? Again I reiterate that most users of SD 
> >have not found the need to renice X anyway except if they stick to old 
> >habits of make -j4 on uniprocessor and the like, and I expect that those 
> >on CFS and Nicksched would also have similar experiences.
> 
> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
> on my 2GHz P-M single-core non-HT machine with SD.

Is this with or without X reniced?


> But with the very first posted version of CFS by Ingo,
> I can do "make -j2" no problem and still have a nicely interactive desktop.

How well does cfs run if you have the granularity set to something
like 30ms (30000000)?

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 19:26                                           ` Ray Lee
  2007-04-19 22:56                                             ` Con Kolivas
@ 2007-04-20  4:09                                             ` Nick Piggin
  2007-04-24 15:50                                               ` Ray Lee
  1 sibling, 1 reply; 712+ messages in thread
From: Nick Piggin @ 2007-04-20  4:09 UTC (permalink / raw)
  To: ray-gmail
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Thu, Apr 19, 2007 at 12:26:03PM -0700, Ray Lee wrote:
> On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote:
> >The one fly in the ointment for
> >linux remains X. I am still, to this moment, completely and utterly stunned
> >at why everyone is trying to find increasingly complex unique ways to 
> >manage
> >X when all it needs is more cpu[1].
> [...and hence should be reniced]
> 
> The problem is that X is not unique. There's postgresql, memcached,
> mysql, db2, a little embedded app I wrote... all of these perform work
> on behalf of another process. It's just most *noticeable* with X, as
> pretty much everyone is running that.

But for most of those apps, we don't actually care if they do fairly
degrade in performance as other loads on the system ramp up. However
the user prefers X to be given priority in these situations. Whether
that is the design of X, x clients, or the human condition really
doesn't matter two hoots to the scheduler.


> If we had some way for the scheduler to decide to donate part of a
> client process's time slice to the server it just spoke to (with an
> exponential dampening factor -- take 50% from the client, give 25% to
> the server, toss the rest on the floor), that -- from my naive point
> of view -- would be a step toward fixing the underlying issue. Or I
> might be spouting crap, who knows.

Firstly, lots of clients in your list are remote. X usually isn't.
However for X, a syscall or something to donate time might not be
such a bad idea... but given a couple of X clients and a server
against a parallel make, this is probably just going to make the
clients slow down as well without giving enough priority to the
server.

X isn't special so much because it does work on behalf of others
(as you said, lots of things do that). It is special simply because
we _want_ rendering to have priority on the CPU (if you shifted CPU
intensive rendering to the clients, you'd most likely want to give
them priority too); nice, right?


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 23:52                     ` Jan Knutar
@ 2007-04-20  5:05                       ` Willy Tarreau
  0 siblings, 0 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-20  5:05 UTC (permalink / raw)
  To: Jan Knutar
  Cc: linux-kernel, Ingo Molnar, Nick Piggin, Linus Torvalds,
	Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner

On Fri, Apr 20, 2007 at 02:52:38AM +0300, Jan Knutar wrote:
> On Thursday 19 April 2007 18:18, Ingo Molnar wrote:
> > * Willy Tarreau <w@1wt.eu> wrote:
> > > You can certainly script it with -geometry. But it is the wrong
> > > application for this matter, because you benchmark X more than
> > > glxgears itself. What would be better is something like a line
> > > rotating 360 degrees and doing some short stuff between each
> > > degree, so that X is not much solicited, but the CPU would be
> > > spent more on the processes themselves.
> >
> > at least on my setup glxgears goes via DRI/DRM so there's no X
> > scheduling inbetween at all, and the visual appearance of glxgears is
> > a direct function of its scheduling.
> 
> How much of the subjective interactiveness-feel of the desktop is at the 
> mercy of the X server's scheduling and not the cpu scheduler?

probably a lot. Hence the reason why I wanted something visually noticeable
but using far less X resources than glxgears. The modified orbitclock is
perfect IMHO.

Regards,
Willy


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 16:55                                                       ` Davide Libenzi
@ 2007-04-20  5:16                                                         ` Mike Galbraith
  0 siblings, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-20  5:16 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Ingo Molnar, linux-kernel

On Thu, 2007-04-19 at 09:55 -0700, Davide Libenzi wrote:
> On Thu, 19 Apr 2007, Mike Galbraith wrote:
> 
> > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote:
> > > * Mike Galbraith <efault@gmx.de> wrote:
> > > 
> > > > With a heavily reniced X (perfectly fine), that should indeed solve my 
> > > > daily usage pattern nicely (always need godmode for shells, but not 
> > > > for mozilla and ilk. 50/50 split automatic without renice of entire 
> > > > gui)
> > > 
> > > how about the first-approximation solution i suggested in the previous 
> > > mail: to add a per UID default nice level? (With this default defaulting 
> > > to '-10' for all root-owned processes, and defaulting to '0' for 
> > > everything else.) That would solve most of the current CFS regressions 
> > > at hand.
> > 
> > That would make my kernel builds etc interfere with my other self's
> > surfing and whatnot.  With it by EUID, when I'm surfing or whatnot, the
> > X portion of my Joe-User activity pushes the compile portion of root
> > down in bandwidth utilization automagically, which is exactly the right
> > thing, because the root me is not as important as the Joe-User me using
> > the GUI at that time.  If the idea of X disturbing root upsets some,
> > they can move X to another UID.  Generally, it seems perfect for here.
> 
> Now guys, I did not follow the whole lengthy and feisty thread, but IIRC 
> Con's scheduler has been attacked because, among other arguments, it was 
> requiring X to be reniced. This happened like a month ago IINM.

I don't object to renicing X if you want it to receive _more_ than its
fair share. I do object to having to renice X in order for it to _get_
its fair share.  That's what I attacked.

> I did not have time to look at Con's scheduler, and I only had a brief 
> look at Ingo's one (looks very promising IMO, but so was the initial O(1) 
> post before all the corner-cases fixes went in).
> But this is not about technical merit, this is about applying the same 
> rules of judgement to others as well as to ourselves.

I'm running the same tests with CFS that I ran for RSDL/SD.  It falls
short in one key area (to me) in that X+client cannot yet split my box
50/50 with two concurrent tasks.  In the CFS case, renicing both X and
client does work, but it should not be necessary IMHO.  With RSDL/SD
renicing didn't help.

> We went from a "renicing X to -10 is bad because the scheduler should 
> be able to correctly handle the problem w/out additional external plugs" 
> to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads 
> class, on top of all the tasks owned by root" [1].
> From a spectator POV like myself in this case, this looks rather "unfair".

Well, for me, the renicing I mentioned above is only interesting as a
way to improve long term fairness with schedulers with no history.

I found Linus' EUID idea intriguing in that by putting the server
together with a steady load in one 'fair' domain, and clients in
another, X can, if prioritized to empower it to do so, modulate the
steady load in its domain (but can't starve it!), the clients modulate
X, and the steady load gets it all when X and clients are idle.  The
nice level of X determines to what _extent_ X can modulate the constant
load rather like a mixer slider.  The synchronous (I'm told) nature of
X/client then becomes kind of an asset to the desktop instead of a
liability.

The specific case I was thinking about is the X+Gforce test where both
RSDL and CFS fail to provide fairness (as defined by me;).  X and Gforce
are mostly not concurrent.  The make -j2 I put them up against are
mostly concurrent.  I don't call giving 1/3 of my CPU to X+Client fair
at _all_, but that's what you'll get if your fairstick of the instant
generally can't see the fourth competing task.  Seemed pretty cool to me
because it creates the missing connection between client and server,
though also likely complicated (and maybe full of perils, who knows).
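
The 1/3 figure can be worked out under idealized assumptions: X and its
client strictly alternate (never runnable at the same instant), and the
make jobs are always runnable. The function names are made up for the
illustration; this is arithmetic, not a description of what either
scheduler does internally.

```python
from fractions import Fraction


def pair_share_instantaneous(concurrent_hogs):
    """Share of the CPU the X+client pair gets from instantaneous fairness.

    X and the client alternate, so only one of them is runnable at any
    instant.  The scheduler sees concurrent_hogs + 1 tasks, and the pair
    has to split that single slot between its two members.
    """
    return Fraction(1, concurrent_hogs + 1)


def pair_share_counting_all(concurrent_hogs):
    """Share of the pair if X and the client count as two tasks, each
    entitled to a full share whether or not they overlap in time."""
    return Fraction(2, concurrent_hogs + 2)


hogs = 2  # make -j2
print(pair_share_instantaneous(hogs), pair_share_counting_all(hogs))
```

With make -j2 the first figure is the 1/3 complained about above and the
second is the 50/50 split being asked for; the gap is exactly the
"fourth competing task" that an instantaneous view cannot see.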

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-19 22:47                                             ` Con Kolivas
  2007-04-20  2:00                                               ` Gene Heskett
  2007-04-20  2:01                                               ` Gene Heskett
@ 2007-04-20  5:24                                               ` Mike Galbraith
  2 siblings, 0 replies; 712+ messages in thread
From: Mike Galbraith @ 2007-04-20  5:24 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Gene Heskett, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Fri, 2007-04-20 at 08:47 +1000, Con Kolivas wrote:

> It's those who want X to have an unfair advantage that want it to do 
> something "special".

I hope you're not lumping me in with "those".  If X + client had been
able to get their fair share and do so in the low latency manner they
need, I would have been one of the carrots instead of being the stick.

	-Mike


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-20  1:32                                                 ` Michael K. Edwards
@ 2007-04-20  5:25                                                   ` Bill Huey
  2007-04-20  7:12                                                     ` Michael K. Edwards
  0 siblings, 1 reply; 712+ messages in thread
From: Bill Huey @ 2007-04-20  5:25 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar,
	Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall,
	William Lee Irwin III, Mike Galbraith, ck list, linux-kernel,
	Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Thu, Apr 19, 2007 at 06:32:15PM -0700, Michael K. Edwards wrote:
> But I think SCHED_FIFO on a chain of tasks is fundamentally not the
> right way to handle low audio latency.  The object with a low latency
> requirement isn't the task, it's the device.  When it's starting to
> get urgent to deliver more data to the device, the task that it's
> waiting on should slide up the urgency scale; and if it's waiting on
> something else, that something else should slide up the scale; and so
> forth.  Similarly, responding to user input is urgent; so when user
> input is available (by whatever mechanism), the task that's waiting
> for it should slide up the urgency scale, etc.

DSP operations, particularly with digital synthesis, tend to max out
the CPU doing vector operations on as many processors as they can get
hold of. In a live, performance-critical application, it's important
to be able to deliver a protected amount of CPU to a thread doing that
work, as well as to respond to external input such as controllers, etc.

> In practice, you probably don't want to burden desktop Linux with
> priority inheritance where you don't have to.  Priority queues with
> algorithmically efficient decrease-key operations (Fibonacci heaps and
> their ilk) are complicated to implement and have correspondingly high
> constant factors.  (However, a sufficiently clever heuristic for
> assigning quasi-static task priorities would usually short-circuit the
> priority cascade; if you can keep N small in the
> tasks-with-unpredictable-priority queue, you can probably use a
> simpler flavor with O(log N) decrease-key.  Ask someone who knows more
> about data structures than I do.)

These are app issues and not really something that's mutable in the
kernel per se with regard to the -rt patch.

> More importantly, non-real-time application coders aren't very smart
> about grouping data structure accesses on one side or the other of a
> system call that is likely to release a lock and let something else
> run, flushing application data out of cache.  (Kernel coders aren't
> always smart about this either; see LKML threads a few weeks ago about
> racy, and cache-stall-prone, f_pos handling in VFS.)  So switching
> tasks immediately on lock release is usually the wrong thing to do if
> letting the task run a little longer would allow it to reach a point
> where it has to block anyway.

I have Solaris style adaptive locks in my tree with my lockstat patch
under -rt. I've also modified my lockstat patch to track readers
correctly now with rwsem and the like to see where the single reader
limitation in the rtmutex blows it.

So far I've seen less than 10 percent of in-kernel contention events
actually worth spinning on and the rest of the stats imply that the
mutex owner in question is either preempted or blocked on something
else.
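
The adaptive-lock policy described above can be sketched roughly as
follows. This is only an illustration, not the lockstat or -rt code:
the names OwnerState, should_spin and spin_fraction are hypothetical,
and the policy shown (spin only while the lock owner is actually on a
CPU) is the textbook Solaris-style rule, assumed here for the example.

```python
from enum import Enum


class OwnerState(Enum):
    """What the lock owner is doing when a waiter hits contention."""
    RUNNING = "running"      # owner is executing on some CPU right now
    PREEMPTED = "preempted"  # owner is runnable but not on a CPU
    BLOCKED = "blocked"      # owner is sleeping on something else


def should_spin(owner_state):
    """Spinning only pays off if the owner can make progress and
    release the lock soon, i.e. it is currently running."""
    return owner_state is OwnerState.RUNNING


def spin_fraction(events):
    """Fraction of contention events for which spinning is chosen."""
    events = list(events)
    if not events:
        return 0.0
    return sum(should_spin(e) for e in events) / len(events)
```

Feeding spin_fraction() a log of observed owner states gives the kind
of "worth spinning on" percentage quoted above; any sample input is
made up for illustration and is not measured data.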

I've been trying to get folks to try this on a larger machine than my
2x AMD64 box so that there is more data regarding Linux contention
and overscheduling in -rt.
 
> Anyway, I already described the urgency-driven strategy to the extent
> that I've thought it out, elsewhere in this thread.  I only held this
> draft back because I wanted to double-check my latency measurements.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 11:50                                           ` Peter Williams
@ 2007-04-20  5:26                                             ` William Lee Irwin III
  2007-04-20  6:16                                               ` Peter Williams
  0 siblings, 1 reply; 712+ messages in thread
From: William Lee Irwin III @ 2007-04-20  5:26 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
>> I'd further recommend making priority levels accessible to kernel threads
>> that are not otherwise accessible to processes, both above and below
>> user-available priority levels. Basically, if you can get SCHED_RR and
>> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN
>> scheduler class can coexist with SCHED_OTHER in like fashion, but with
>> availability of higher and lower priorities than any userspace process
>> is allowed, and potentially some differing scheduling semantics. In such
>> a manner nonessential background processing intended not to ever disturb
>> userspace can be given priorities appropriate to it (perhaps even con's
>> SCHED_IDLEPRIO would make sense), and other, urgent processing can be
>> given priority over userspace altogether.

On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote:
> This is sounding very much like System V Release 4 (and descendants) 
> except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that 
> are in system mode dynamic priorities in the SCHED_SYS range (to avoid 
> priority inversion, I believe).

Descriptions of that are probably where I got the idea (hurrah for OS
textbooks). It makes a fair amount of sense. Not sure what the take on
the specific precedent is. The only content here is expanding the
priority range with ranges above and below for the exclusive use of
ultra-privileged tasks, so it's really trivial. Actually it might be so
trivial it should just be some permission checks in the SCHED_OTHER
renicing code.
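
A rough sketch of what such a permission check might look like. The
-20..19 range is the usual nice range; the extended range, the function
name renice_allowed and its parameters are all hypothetical and only
illustrate the idea of reserving levels above and below userspace.

```python
USER_NICE_MIN, USER_NICE_MAX = -20, 19   # the familiar nice range
KERN_NICE_MIN, KERN_NICE_MAX = -30, 29   # hypothetical extended range


def renice_allowed(requested, current, is_kernel_thread, privileged):
    """Return True if the task may move from `current` to `requested`."""
    if is_kernel_thread:
        # Kernel threads may use the whole extended range.
        return KERN_NICE_MIN <= requested <= KERN_NICE_MAX
    if not USER_NICE_MIN <= requested <= USER_NICE_MAX:
        # Levels outside the user range are reserved, even for root.
        return False
    if requested < current and not privileged:
        # Raising priority (lowering nice) needs privilege.
        return False
    return True
```

With these assumed ranges, a kernel thread could be placed at -25 or
+25, while a userspace task would be refused both regardless of
privilege.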


-- wli


* Re: Renice X for cpu schedulers
  2007-04-20  0:20                                               ` Michael K. Edwards
@ 2007-04-20  5:34                                                 ` Bill Huey
  0 siblings, 0 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-20  5:34 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Con Kolivas, ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, linux-kernel,
	Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Thu, Apr 19, 2007 at 05:20:53PM -0700, Michael K. Edwards wrote:
> Embedded systems are already in 2007, and the mainline Linux scheduler
> frankly sucks on them, because it thinks it's back in the 1960's with
> a fixed supply and captive demand, pissing away "CPU bandwidth" as
> waste heat.  Not to say it's an easy problem; even academics with a
> dozen publications in this area don't seem to be able to model energy
> usage to the nearest big O, let alone design a stable economic
> dispatch engine.  But it helps to acknowledge what the problem is:
> even in a 1960's raised-floor screaming-air-conditioners
> screw-the-power-bill machine room, you can't actually run a
> half-decent CPU flat out any more without burning it to a crisp.
> stupid.  What's your excuse?  ;-)

It's now possible to QoS significant parts of the kernel since we now
have a deadline mechanism in place. In the original 2.4 kernel, TimeSys's
irq-thread allowed for the processing of skbuffs in a thread under a CPU
reservation run category, which was used to provide QoS, I believe. This
basic mechanism can now be generalized to many places in the kernel and
put under scheduler control.

It's just a matter of who is going to take on this task, and when.

bill



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-20  5:26                                             ` William Lee Irwin III
@ 2007-04-20  6:16                                               ` Peter Williams
  0 siblings, 0 replies; 712+ messages in thread
From: Peter Williams @ 2007-04-20  6:16 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds,
	Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
	linux-kernel, Arjan van de Ven, Thomas Gleixner

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> I'd further recommend making priority levels accessible to kernel threads
>>> that are not otherwise accessible to processes, both above and below
>>> user-available priority levels. Basically, if you can get SCHED_RR and
>>> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN
>>> scheduler class can coexist with SCHED_OTHER in like fashion, but with
>>> availability of higher and lower priorities than any userspace process
>>> is allowed, and potentially some differing scheduling semantics. In such
>>> a manner nonessential background processing intended not to ever disturb
>>> userspace can be given priorities appropriate to it (perhaps even con's
>>> SCHED_IDLEPRIO would make sense), and other, urgent processing can be
>>> given priority over userspace altogether.
> 
> On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote:
>> This is sounding very much like System V Release 4 (and descendants) 
>> except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that 
>> are in system mode dynamic priorities in the SCHED_SYS range (to avoid 
>> priority inversion, I believe).
> 
> Descriptions of that are probably where I got the idea (hurrah for OS
> textbooks).

And long term background memory.  :-)

> It makes a fair amount of sense.

Yes.  You could also add a SCHED_IA in between SCHED_SYS and SCHED_OTHER 
(a la Solaris) for interactive tasks.  The only problem is how to get a 
task into SCHED_IA without root privileges.

> Not sure what the take on
> the specific precedent is. The only content here is expanding the
> priority range with ranges above and below for the exclusive use of
> ultra-privileged tasks, so it's really trivial. Actually it might be so
> trivial it should just be some permission checks in the SCHED_OTHER
> renicing code.

Perhaps.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: Renice X for cpu schedulers
  2007-04-20  5:25                                                   ` Bill Huey
@ 2007-04-20  7:12                                                     ` Michael K. Edwards
  2007-04-20  8:21                                                       ` Bill Huey
  0 siblings, 1 reply; 712+ messages in thread
From: Michael K. Edwards @ 2007-04-20  7:12 UTC (permalink / raw)
  To: hui Bill Huey
  Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar,
	Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall,
	William Lee Irwin III, Mike Galbraith, ck list, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On 4/19/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote:
> DSP operations, particularly with digital synthesis, tend to max out
> the CPU doing vector operations on as many processors as they can get
> hold of. In a live, performance-critical application, it's important
> to be able to deliver a protected amount of CPU to a thread doing that
> work, as well as to respond to external input such as controllers, etc.

Actual fractional CPU reservation is a bit different, and is probably
best handled with "container"-type infrastructure (not quite
virtualization, but not quite scheduling classes either).  SGI
pioneered this (in "open systems" space -- IBM probably had it first,
as usual) with GRIO in XFS.  (That was I/O throughput reservation of
course, not "CPU bandwidth" -- but IIRC IRIX had CPU reservation too).
 There's a more general class of techniques in which it's worth
spending idle cycles speculating along paths that might or might not
be taken depending on unpredictable I/O; I'd be surprised if you
couldn't approximate most of the sane balancing strategies in this
area within the "economic dispatch" scheduler model.  (Good JIT
bytecode engines more or less do this already if you let them, with a
cap on JIT cache size serving as a crude CPU throttle.)

> > In practice, you probably don't want to burden desktop Linux with
> > priority inheritance where you don't have to.  Priority queues with
> > algorithmically efficient decrease-key operations (Fibonacci heaps and
> > their ilk) are complicated to implement and have correspondingly high
> > constant factors.  (However, a sufficiently clever heuristic for
> > assigning quasi-static task priorities would usually short-circuit the
> > priority cascade; if you can keep N small in the
> > tasks-with-unpredictable-priority queue, you can probably use a
> > simpler flavor with O(log N) decrease-key.  Ask someone who knows more
> > about data structures than I do.)
>
> These are app issues and not really something that's mutable in the
> kernel per se with regard to the -rt patch.

I don't know where the -rt patch enters in.  But if you need agile
reprioritization with a deep runnable queue, either under direct
application control or as a side effect of priority inheritance or a
related OS-enforced protocol, then you need a kernel-level data
structure with a fancier interface than the classic
insert/find/delete-min priority queue.  From what I've read (this is
not my area of expertise and I don't have Knuth handy), the relatively
simple heap-based implementations of priority queues can't
reprioritize an entry any more quickly than find+delete+insert, which
pretty much rules them out as a basis for a scalable scheduler with
priority inheritance (let alone PCP emulation).
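
For comparison, here is a minimal sketch of a plain binary heap that
keeps a key-to-slot map, so an entry can be reprioritized in O(log N)
without a separate find step. This is an illustration only, not
anything from an existing scheduler; the class and method names are
made up, and it assumes each key is inserted at most once.

```python
class IndexedMinHeap:
    """Binary min-heap that tracks each key's slot so an entry can be
    reprioritized in O(log N) without a linear search."""

    def __init__(self):
        self._heap = []   # list of [priority, key]
        self._pos = {}    # key -> index into self._heap

    def __len__(self):
        return len(self._heap)

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] < self._heap[parent][0]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] < self._heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def insert(self, key, priority):
        self._heap.append([priority, key])
        self._pos[key] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def reprioritize(self, key, priority):
        """Change a key's priority in place; one sift restores order."""
        i = self._pos[key]
        old = self._heap[i][0]
        self._heap[i][0] = priority
        if priority < old:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def pop_min(self):
        """Remove and return (key, priority) of the most urgent entry."""
        self._swap(0, len(self._heap) - 1)
        priority, key = self._heap.pop()
        del self._pos[key]
        if self._heap:
            self._sift_down(0)
        return key, priority
```

The cost is the extra map update on every swap, which is part of the
constant-factor trade-off discussed above; whether that is acceptable
in a scheduler hot path is a separate question.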

> I have Solaris style adaptive locks in my tree with my lockstat patch
> under -rt. I've also modified my lockstat patch to track readers
> correctly now with rwsem and the like to see where the single reader
> limitation in the rtmutex blows it.

Ooh, that's neat.  The next time I can cook up an excuse to run a
kernel that won't load this damn WiFi driver, I'll try it out.  Some
of the people I work with are real engineers and respect in-system
instrumentation.

> So far I've seen less than 10 percent of in-kernel contention events
> actually worth spinning on and the rest of the stats imply that the
> mutex owner in question is either preempted or blocked on something
> else.

That's a good thing; it implies that in-kernel algorithms don't take
locks needlessly as a matter of cargo-cult habit.  Attempting to take
a lock (other than an uncontended futex, which is practically free)
should almost always transfer control to the thread that has the power
to deliver the information (or the free slot) that you're looking for
-- or in the case of an external data source/sink, should send you
into low-power mode until time and tide give you something new to do.
Think of it as a just-in-time inventory system; if you keep too much
product in stock (or free warehouse space), you're wasting space and
harming responsiveness to a shift in demand.  Once in a while you have
to play Sokoban in order to service a request promptly; that's exactly
the case that priority inheritance is meant to help with.

The fiddly part, on a non-real-time-no-matter-what-the-label-says
system with an opaque cache architecture and mysterious hidden costs
of context switching, is to minimize the lossage resulting from brutal
timer- or priority-inheritance-driven preemption.  Given the way
people code these days -- OK, it was probably just as bad back in the
day -- the only thing worse than letting the axe fall at random is to
steal the CPU away the moment a contended lock is released, because
the next 20 lines of code probably poke one last time at all the data
structures the task had in cache right before entering the critical
section.  That doesn't hurt so bad on RTOS-friendly hardware -- an
MMU-less system with either zero or near-infinite cache -- but it's
got to make this year's Intel/AMD/whatever kotatsu stall left and
right when that task gets rescheduled.

> I've been trying to get folks to try this on a larger machine than my
> 2x AMD64 box so that there is more data regarding Linux contention
> and overscheduling in -rt.

Dave Miller, maybe?  He seems to be one of the few people around here
with the skills and the institutional motivation to push for
scalability under the typical half-assed middle-tier application
workload.  Which, make no mistake, stands to benefit a lot more from a
properly engineered SCHED_OTHER scheduler than any "real-time" media
gadget that sells by the forklift-load at Fry's.  (NUMA boxen might
also benefit, but AFAICT their target applications are different
enough not to count.)  And if anyone from Cavium is lurking, now would
be a real good time to show up and shoulder some of the load.  Heck,
even Intel ought to pitch in -- the Yonah team may have saved your ass
for now, but you'll be playing the 32-thread throughput-per-erg stakes
soon enough.

Cheers,
- Michael


* Re: Renice X for cpu schedulers
  2007-04-20  7:12                                                     ` Michael K. Edwards
@ 2007-04-20  8:21                                                       ` Bill Huey
  0 siblings, 0 replies; 712+ messages in thread
From: Bill Huey @ 2007-04-20  8:21 UTC (permalink / raw)
  To: Michael K. Edwards
  Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar,
	Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall,
	William Lee Irwin III, Mike Galbraith, ck list, linux-kernel,
	Arjan van de Ven, Thomas Gleixner, Steven Rostedt,
	Bill Huey (hui)

On Fri, Apr 20, 2007 at 12:12:29AM -0700, Michael K. Edwards wrote:
> Actual fractional CPU reservation is a bit different, and is probably
> best handled with "container"-type infrastructure (not quite
> virtualization, but not quite scheduling classes either).  SGI
> pioneered this (in "open systems" space -- IBM probably had it first,
> as usual) with GRIO in XFS.  (That was I/O throughput reservation of

I'm very aware of this, having grown up on those systems and seen what
30k USD of hardware can do for you with the right kernel facilities. It
would be a mind blower to get OpenGL and friends back to that level of
performance with regard to React/Pro's rt abilities; frame drop would
just be gone and we'd own gaming. No joke.

We have a number of former SGI XFS engineers here at NetApp and I should
ask them about the GRIO implementation.

> course, not "CPU bandwidth" -- but IIRC IRIX had CPU reservation too).
> There's a more general class of techniques in which it's worth
> spending idle cycles speculating along paths that might or might not
> be taken depending on unpredictable I/O; I'd be surprised if you
> couldn't approximate most of the sane balancing strategies in this
> area within the "economic dispatch" scheduler model.  (Good JIT

What is that? Never heard of it before.
 
> I don't know where the -rt patch enters in.  But if you need agile
> reprioritization with a deep runnable queue, either under direct
> application control or as a side effect of priority inheritance or a
> related OS-enforced protocol, then you need a kernel-level data
> structure with a fancier interface than the classic
> insert/find/delete-min priority queue.  From what I've read (this is
> not my area of expertise and I don't have Knuth handy), the relatively
> simple heap-based implementations of priority queues can't
> reprioritize an entry any more quickly than find+delete+insert, which
> pretty much rules them out as a basis for a scalable scheduler with
> priority inheritance (let alone PCP emulation).

The -rt patch has turnstile-esque infrastructure that's stack allocated.
Linux's lock hierarchy is relatively shallow (compensated with a heavy
use of per-CPU methods and RCU-ified algorithms in place of rwlocks) so
I've encountered nothing close to this that would demand such an overly
sophisticated mechanism. I'm aware of PCP and preemption thresholds.
I created the lockstat infrastructure as a means of precisely measuring
contention in -rt in anticipation of experimenting with these techniques.

I mention -rt because it's the most likely place to encounter what you're
talking about, not an app.
 
> >I have Solaris style adaptive locks in my tree with my lockstat patch
> >under -rt. I've also modified my lockstat patch to track readers

...

> Ooh, that's neat.  The next time I can cook up an excuse to run a
> kernel that won't load this damn WiFi driver, I'll try it out.  Some
> of the people I work with are real engineers and respect in-system
> instrumentation.

It's not publicly released yet since I'm still stuck in .20-rc6 land
and the soft lockup detector triggers. I need to forward port it and
my lockstat changes to the most recent -rt patch.

I've been stalled on a revision control problem that I'm trying to solve
with monotone for at least a month (of my own spare time).

> That's a good thing; it implies that in-kernel algorithms don't take
> locks needlessly as a matter of cargo-cult habit.  Attempting to take

The jury is still out on this until I can record what state the rtmutex
owner is in. No further conclusion can be made until then. I think
this is a very interesting pursuit/investigation.

> a lock (other than an uncontended futex, which is practically free)
> should almost always transfer control to the thread that has the power
> to deliver the information (or the free slot) that you're looking for
> -- or in the case of an external data source/sink, should send you
> into low-power mode until time and tide give you something new to do.

> Think of it as a just-in-time inventory system; if you keep too much
> product in stock (or free warehouse space), you're wasting space and
> harming responsiveness to a shift in demand.  Once in a while you have
> to play Sokoban in order to service a request promptly; that's exactly
> the case that priority inheritance is meant to help with.

What did you mean by this? Victor Yodaiken's stuff?

> The fiddly part, on a non-real-time-no-matter-what-the-label-says
> system with an opaque cache architecture and mysterious hidden costs
> of context switching, is to minimize the lossage resulting from brutal
> timer- or priority-inheritance-driven preemption.  Given the way
> people code these days -- OK, it was probably just as bad back in the
> day -- the only thing worse than letting the axe fall at random is to
> steal the CPU away the moment a contended lock is released, because

My adaptive spin stuff in front of an rtmutex is designed to complement
Steve Rostedt's owner stealing code, also in that path, and prevent this
from happening. I record stealing events as well.

> the next 20 lines of code probably poke one last time at all the data
> structures the task had in cache right before entering the critical
> section.  That doesn't hurt so bad on RTOS-friendly hardware -- an
> MMU-less system with either zero or near-infinite cache -- but it's
> got to make this year's Intel/AMD/whatever kotatsu stall left and
> right when that task gets rescheduled.

Caches and TLBs are a bitch.

> >I've been trying to get folks to try this on a larger machine than my
> >2x AMD64 box so that there is more data regarding Linux contention
> >and overscheduling in -rt.
> 
> Dave Miller, maybe?  He seems to be one of the few people around here

Don't know. I do know that somebody is going to try -rt on large hardware
because they've got some crazy app that needs both tons of CPU and rt
abilities, like IRIX. I wouldn't mind the help and access to an 8x machine.

bill



* Re: Kaffeine problem with CFS
  2007-04-18 21:08                       ` S.Çağlar Onur
  2007-04-18 21:12                         ` Ingo Molnar
@ 2007-04-20 19:31                         ` Bill Davidsen
  2007-04-21  8:36                           ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Bill Davidsen @ 2007-04-20 19:31 UTC (permalink / raw)
  To: caglar
  Cc: Ingo Molnar, Christoph Pfister, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

S.Çağlar Onur wrote:
> On Wed, 18 Apr 2007, Ingo Molnar wrote:
>> * S.Çağlar Onur <caglar@pardus.org.tr> wrote:
>>> -       schedule();
>>> +       msleep(1);
>>>
>>> which Ingo sends me to try also has the same effect on me. I cannot
>>> reproduce hangs anymore with that patch applied top of CFS while one
>>> console checks out SVN repos and other one compiles a small test
>>> software.
>> great! Could you please unapply the hack above and try the proper fix
>> below, does this one solve the hangs too?
> 
> Instead of that one, i tried CFSv3 and i cannot reproduce the hang anymore, 
> Thanks!...
> 
And that explains why CFS-v3 on 21-rc7-git3 wouldn't show me the hang. 
As a matter of fact, nothing I did showed any bad behavior! Note that I 
was doing actual badly behaved things which do sometimes glitch the 
standard scheduler, not running benchmarks.

This scheduler is boring, everything works. I am going to try some tests 
on a uniprocessor, though, I have been running everything on either SMP 
or HT CPUs. But so far it looks fine.


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  4:01           ` Mike Galbraith
  2007-04-17  3:43             ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
  2007-04-17  4:14             ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
@ 2007-04-20 20:36             ` Bill Davidsen
  2 siblings, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-20 20:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list,
	Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Thomas Gleixner

Mike Galbraith wrote:
> On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote:
>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote:
>  
>>> Yup, and progress _is_ happening now, quite rapidly.
>> Progress as in progress on Ingo's scheduler. I still don't know how we'd
>> decide when to replace the mainline scheduler or with what.
>>
>> I don't think we can say Ingo's is better than the alternatives, can we?
> 
> No, that would require massive performance testing of all alternatives.
> 
>> If there is some kind of bakeoff, then I'd like one of Con's designs to
>> be involved, and mine, and Peter's...
> 
> The trouble with a bakeoff is that it's pretty darn hard to get people
> to test in the first place, and then comes weighting the subjective and
> hard performance numbers.  If they're close in numbers, do you go with
> the one which starts the least flamewars or what?
> 
Here we disagree... I picked a scheduler not by running benchmarks, but 
by running loads which piss me off with the mainline scheduler. And then 
I ran the other schedulers for a while to find the things, normal things 
I do, which resulted in bad behavior. And when I found one which had (so 
far) no such cases I called it my winner, but I haven't tested it under 
server load, so I can't begin to say it's "the best."

What we need is for lots of people to run every scheduler in real life, 
and do "worst case analysis" by finding the cases which cause bad 
behavior. And if there were a way to easily choose another scheduler, 
call it pluggable, modular, or Russian Roulette, people who found a worst 
case would report it (aka bitch about it) and try another. But the 
average user is better able to boot with an option like "sched=cfs" (or 
sc, or nick, or ...) than to patch and build a kernel. So if we don't 
get easily switched schedulers people will not test nearly as well.
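
To make the boot-option idea concrete, here is a small sketch of how a
"sched=" choice could be picked out of a kernel-style command line.
The option name comes from the paragraph above and is not an existing
kernel parameter as far as this thread shows; pick_scheduler and the
scheduler names passed to it are hypothetical.

```python
def pick_scheduler(cmdline, available, default):
    """Return the scheduler named by a sched= token on the command
    line, or `default` if none is given or the name is unknown.
    If sched= appears more than once, the last valid one wins."""
    choice = default
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name == "sched" and sep and value in available:
            choice = value
    return choice
```

For example, pick_scheduler("root=/dev/sda1 sched=nick quiet",
{"cfs", "sd", "nick"}, "cfs") selects "nick", and an unrecognized name
falls back to the default rather than failing the boot.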

The best scheduler isn't the one 2% faster than the rest, it's the one 
with the fewest jackpot cases where it sucks. And if the mainline had 
multiple schedulers this testing would get done, authors would get more 
reports and have a better chance of fixing corner cases.

Note that we really need multiple schedulers to make people happy, 
because fairness is not the most desirable behavior on all machines, and 
adding knobs probably isn't the answer. I want a server to degrade 
gently, I want my desktop to show my movie and echo my typing, and if 
that's hard on compiles or the file transfer, so be it. Con doesn't want 
to compromise his goals, I agree but want to have an option if I don't 
share them.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: [Announce] [patch] Modular Scheduler Core and Completely   Fair Scheduler [CFS]
  2007-04-17  9:51               ` Ingo Molnar
  2007-04-17 13:44                 ` Peter Williams
@ 2007-04-20 20:47                 ` Bill Davidsen
  2007-04-21  7:39                   ` Nick Piggin
  2007-04-21  8:33                   ` Ingo Molnar
  1 sibling, 2 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-20 20:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams,
	linux-kernel, ck list, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

Ingo Molnar wrote:

> ( Lets be cautious though: the jury is still out whether people actually 
>   like this more than the current approach. While CFS feedback looks 
>   promising after a whopping 3 days of it being released [ ;-) ], the 
>   test coverage of all 'fairness centric' schedulers, even considering 
>   years of availability is less than 1% i'm afraid, and that < 1% was 
>   mostly self-selecting. )
> 
All of my testing has been on desktop machines, although in most cases 
they were really loaded desktops which had load avg 10..100 from time to 
time, and none were low memory machines. Up to CFS v3 I thought 
nicksched was my winner, now CFSv3 looks better, by not having stumbles 
under stupid loads.

I have not tested:
   1 - server loads, nntp, smtp, etc
   2 - low memory machines
   3 - uniprocessor systems

I think this should be done before drawing conclusions. Or if someone 
has tried this, perhaps they would report what they saw. People are 
talking about smoothness, but not how many pages per second come out of 
their overloaded web server.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: [Announce] [patch] Modular Scheduler Core and Completely   Fair Scheduler [CFS]
  2007-04-20 20:47                 ` Bill Davidsen
@ 2007-04-21  7:39                   ` Nick Piggin
  2007-04-21  8:33                   ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Nick Piggin @ 2007-04-21  7:39 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Ingo Molnar, Bill Huey, Mike Galbraith, Peter Williams,
	linux-kernel, ck list, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

On Fri, Apr 20, 2007 at 04:47:27PM -0400, Bill Davidsen wrote:
> Ingo Molnar wrote:
> 
> >( Lets be cautious though: the jury is still out whether people actually 
> >  like this more than the current approach. While CFS feedback looks 
> >  promising after a whopping 3 days of it being released [ ;-) ], the 
> >  test coverage of all 'fairness centric' schedulers, even considering 
> >  years of availability is less than 1% i'm afraid, and that < 1% was 
> >  mostly self-selecting. )
> >
> All of my testing has been on desktop machines, although in most cases 
> they were really loaded desktops which had load avg 10..100 from time to 
> time, and none were low memory machines. Up to CFS v3 I thought 
> nicksched was my winner, now CFSv3 looks better, by not having stumbles 
> under stupid loads.

What base_timeslice were you using for nicksched, and what HZ?


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-20 20:47                 ` Bill Davidsen
  2007-04-21  7:39                   ` Nick Piggin
@ 2007-04-21  8:33                   ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-21  8:33 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams,
	linux-kernel, ck list, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven


* Bill Davidsen <davidsen@tmr.com> wrote:

> All of my testing has been on desktop machines, although in most cases 
> they were really loaded desktops which had load avg 10..100 from time 
> to time, and none were low memory machines. Up to CFS v3 I thought 
> nicksched was my winner, now CFSv3 looks better, by not having 
> stumbles under stupid loads.

nice! I hope CFSv4 kept that good tradition too ;)

> I have not tested:
>   1 - server loads, nntp, smtp, etc
>   2 - low memory machines
>   3 - uniprocessor systems
> 
> I think this should be done before drawing conclusions. Or if someone 
> has tried this, perhaps they would report what they saw. People are 
> talking about smoothness, but not how many pages per second come out 
> of their overloaded web server.

i tested heavily swapping systems. (make -j50 workloads easily trigger 
that) I also tested UP systems and a handful of SMP systems. I have also 
tested massive_intr.c which i believe is an indicator of how fairly CPU 
time is distributed between partly sleeping partly running server 
threads. But i very much agree that diverse feedback is sought and 
welcome, both from those who are happy with the current scheduler and 
those who are unhappy about it.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Kaffeine problem with CFS
  2007-04-20 19:31                         ` Bill Davidsen
@ 2007-04-21  8:36                           ` Ingo Molnar
  0 siblings, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-21  8:36 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: caglar, Christoph Pfister, linux-kernel, Michael Lothian,
	Christophe Thommeret, Jurgen Kofler, Ulrich Drepper


* Bill Davidsen <davidsen@tmr.com> wrote:

> > Instead of that one, i tried CFSv3 and i cannot reproduce the hang 
> > anymore, Thanks!...
>
> And that explains why CFS-v3 on 21-rc7-git3 wouldn't show me the hang. 
> As a matter of fact, nothing I did showed any bad behavior! Note that 
> I was doing actual badly behaved things which do sometimes glitch the 
> standard scheduler, not running benchmarks.
> 
> This scheduler is boring, everything works. [...]

hehe :) Having a 'boring' scheduler in the end is the main goal :)

> [...] I am going to try some tests on a uniprocessor, though, I have 
> been running everything on either SMP or HT CPUs. But so far it looks 
> fine.

yeah, please do - while i do test UP frequently, most of my CFS testing 
was on SMP.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19 18:45                     ` Willy Tarreau
@ 2007-04-21 10:31                       ` Ingo Molnar
  2007-04-21 10:38                         ` Ingo Molnar
  2007-04-21 10:45                         ` Ingo Molnar
  0 siblings, 2 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-21 10:31 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel


* Willy Tarreau <w@1wt.eu> wrote:

> I hacked it a bit to make it accept two parameters :
>   -R <run_time_in_microsecond> : time spent burning CPU cycles at each round
>   -S <sleep_time_in_microsecond> : time spent getting a rest
> 
> It now advances what it thinks is a second at each iteration, so that 
> it makes it easy to compare its progress with other instances (there 
> are seconds, minutes and hours, so it's easy to visually count up to 
> around 43200).
> 
> The modified code is here :
> 
>   http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz
> 
> What is interesting to note is that it's easy to make X work a lot 
> (99%) by using 0 as the sleeping time, and it's easy to make the 
> process work a lot by using large values for the running time 
> associated with very low values (or 0) for the sleep time.
> 
> Ah, and it supports -geometry ;-)
> 
> It could become a useful scheduler benchmark !

i just tried ocbench-0.3, and it is indeed very nice!

Would it make sense perhaps to (optionally?) also log some sort of 
periodic text feedback to stdout, about the quality of scheduling? Maybe 
even a 'run this many seconds' option plus a summary text output at the 
end (which would output measured runtime, observed longest/smallest 
latency and standard deviation of latencies maybe)? That would make it 
directly usable both as a 'consistency of X app scheduling' visual test 
and as an easily shareable benchmark with an objective numeric result as 
well.
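
A minimal sketch of the kind of end-of-run summary suggested above, assuming
the benchmark has already collected one latency sample per round into an
array. The struct and function names are hypothetical and are not taken from
ocbench; the small Newton-iteration square root is there only so that the
sketch builds without -lm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary record; these names are not taken from ocbench. */
struct lat_summary {
	double min_us;     /* smallest observed latency */
	double max_us;     /* longest observed latency */
	double mean_us;    /* arithmetic mean */
	double stddev_us;  /* population standard deviation */
};

/* Newton iteration for sqrt(x); adequate for latency-sized values and
 * used only to avoid a dependency on libm. */
static double sqrt_newton(double x)
{
	double r;
	int i;

	if (x <= 0.0)
		return 0.0;
	r = x > 1.0 ? x : 1.0;
	for (i = 0; i < 60; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* Summarise n > 0 latency samples given in microseconds. */
static struct lat_summary summarize_latencies(const double *lat_us, size_t n)
{
	struct lat_summary s;
	double sum = 0.0, sq = 0.0;
	size_t i;

	assert(lat_us != NULL && n > 0);
	s.min_us = s.max_us = lat_us[0];
	for (i = 0; i < n; i++) {
		if (lat_us[i] < s.min_us)
			s.min_us = lat_us[i];
		if (lat_us[i] > s.max_us)
			s.max_us = lat_us[i];
		sum += lat_us[i];
	}
	s.mean_us = sum / (double)n;
	for (i = 0; i < n; i++) {
		double d = lat_us[i] - s.mean_us;
		sq += d * d;
	}
	s.stddev_us = sqrt_newton(sq / (double)n);
	return s;
}
```

A caller would print the four fields once at exit, next to the measured
runtime, which would give the shareable numeric result asked for here.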

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-21 10:31                       ` Ingo Molnar
@ 2007-04-21 10:38                         ` Ingo Molnar
  2007-04-21 10:45                         ` Ingo Molnar
  1 sibling, 0 replies; 712+ messages in thread
From: Ingo Molnar @ 2007-04-21 10:38 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > The modified code is here :
> > 
> >   http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz
> > 
> > What is interesting to note is that it's easy to make X work a lot 
> > (99%) by using 0 as the sleeping time, and it's easy to make the 
> > process work a lot by using large values for the running time 
> > associated with very low values (or 0) for the sleep time.
> > 
> > Ah, and it supports -geometry ;-)
> > 
> > It could become a useful scheduler benchmark !
> 
> i just tried ocbench-0.3, and it is indeed very nice!

another thing i just noticed: when starting up lots of ocbench tasks 
(say -x 6 -y 6) then they (naturally) get started up with an already 
visible offset. It's nice to observe the startup behavior, but after 
that it would be useful if it were possible to 'resync' all those 
ocbench tasks so that they start at the same offset. [ Maybe a "killall 
-SIGUSR1 ocbench" could serve this purpose, without having to 
synchronize the tasks explicitly? ]
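
One way such a signal-driven resync could look, as a sketch only: each task
installs a SIGUSR1 handler that sets a flag, and the main loop resets its
counter the next time it sees the flag. The names resync_requested,
install_resync_handler and next_tick are hypothetical, and this is not
necessarily how ocbench implements it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Set from the signal handler, consumed by the main loop. */
static volatile sig_atomic_t resync_requested;

static void on_usr1(int sig)
{
	(void)sig;
	resync_requested = 1;  /* only async-signal-safe work in here */
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on error. */
static int install_resync_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_usr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	return sigaction(SIGUSR1, &sa, NULL);
}

/* Called once per round: returns the tick to continue from.  A pending
 * resync request restarts the count at 0, otherwise it advances by 1. */
static long next_tick(long tick)
{
	assert(tick >= 0);
	if (resync_requested) {
		resync_requested = 0;
		return 0;
	}
	return tick + 1;
}
```

With this shape, "killall -SIGUSR1 ocbench" makes every instance restart its
count on its next round. The tasks are then aligned to within one round of
each other, with no explicit synchronization between them.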

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-21 10:31                       ` Ingo Molnar
  2007-04-21 10:38                         ` Ingo Molnar
@ 2007-04-21 10:45                         ` Ingo Molnar
  2007-04-21 11:07                           ` Willy Tarreau
  1 sibling, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-21 10:45 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > It could become a useful scheduler benchmark !
> 
> i just tried ocbench-0.3, and it is indeed very nice!

another thing i noticed: when using a -y larger than 1, the window 
title (at least on Metacity) overlaps and thus the ocbench tasks have 
different X overhead and get scheduled a bit asymmetrically as well. Is 
there any way to start them up title-less perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-21 10:45                         ` Ingo Molnar
@ 2007-04-21 11:07                           ` Willy Tarreau
  2007-04-21 11:29                             ` Björn Steinbrink
  0 siblings, 1 reply; 712+ messages in thread
From: Willy Tarreau @ 2007-04-21 11:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Hi Ingo,

I'm replying to your 3 mails at once.

On Sat, Apr 21, 2007 at 12:45:22PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > > It could become a useful scheduler benchmark !
> > 
> > i just tried ocbench-0.3, and it is indeed very nice!

So as you've noticed just one minute after I put it there, I've updated
the tool and renamed it ocbench. For others, it's here :

    http://linux.1wt.eu/sched/

The useful new features are proper positioning, automatic forking, and more
visible progress with smaller windows, which eat fewer X resources.

Now about your idea of making it report information on stdout, I don't
know if it would be that useful. There are many other command line tools
for this purpose. This one's goal is to eat CPU with a visual control of
CPU distribution only.

Concerning your idea of using a signal to resync every process, I agree
with you. Running at 8x8 shows a noticeable offset. I've just uploaded
v0.4 which supports your idea of sending USR1.

> another thing i noticed: when using a -y larger than 1, the window 
> title (at least on Metacity) overlaps and thus the ocbench tasks have 
> different X overhead and get scheduled a bit asymmetrically as well. Is 
> there any way to start them up title-less perhaps?

It has annoyed me a bit too, but I'm no X developer at all, so I don't
know at all if it's possible nor how to do this. I know that my window
manager even adds title bars to xeyes, so I'm not sure we can do this.

Right now, I've added a "-B <border size>" argument so that you can
skip the size of your title bar. It's dirty but it's not my main job :-)

Thanks for your feedback
Willy


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-21 11:07                           ` Willy Tarreau
@ 2007-04-21 11:29                             ` Björn Steinbrink
  2007-04-21 11:51                               ` Willy Tarreau
  0 siblings, 1 reply; 712+ messages in thread
From: Björn Steinbrink @ 2007-04-21 11:29 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Ingo Molnar, linux-kernel

Hi,

On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote:
> > another thing i noticed: when using a -y larger than 1, the window 
> > title (at least on Metacity) overlaps and thus the ocbench tasks have 
> > different X overhead and get scheduled a bit asymmetrically as well. Is 
> > there any way to start them up title-less perhaps?
> 
> It has annoyed me a bit too, but I'm no X developer at all, so I don't
> know at all if it's possible nor how to do this. I know that my window
> manager even adds title bars to xeyes, so I'm not sure we can do this.
> 
> Right now, I've added a "-B <border size>" argument so that you can
> skip the size of your title bar. It's dirty but it's not my main job :-)

Here's a small patch that makes the windows unmanaged, which also causes
ocbench to start up quite a bit faster on my box with larger number of
windows, so it probably avoids some window manager overhead, which is a
nice side-effect.

Björn

--

diff -u ocbench-0.4/ocbench.c ocbench-0.4.1/ocbench.c
--- ocbench-0.4/ocbench.c	2007-04-21 13:05:55.000000000 +0200
+++ ocbench-0.4.1/ocbench.c	2007-04-21 13:24:01.000000000 +0200
@@ -213,6 +213,7 @@
 int main(int argc, char *argv[]) {
   Window root;
   XGCValues gc_setup;
+  XSetWindowAttributes swa;
   int c, index, proc_x, proc_y, pid;
   int *pcount[] = {&HOUR, &MIN, &SEC};
   char *p, *q;
@@ -342,8 +343,11 @@
   alloc_color(fg, &orange);
   alloc_color(fg2, &blue);
 
-  win = XCreateSimpleWindow(dpy, root, X, Y, width, height, 0, 
-  			    black.pixel, black.pixel);
+  swa.override_redirect = 1;
+
+  win = XCreateWindow(dpy, root, X, Y, width, height, 0,
+  			    CopyFromParent, InputOutput, CopyFromParent,
+  			    CWOverrideRedirect, &swa);
   XStoreName(dpy, win, "ocbench");
 
   XSelectInput(dpy, win, ExposureMask | StructureNotifyMask);
Only in ocbench-0.4.1/: .README.swp

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-21 11:29                             ` Björn Steinbrink
@ 2007-04-21 11:51                               ` Willy Tarreau
  0 siblings, 0 replies; 712+ messages in thread
From: Willy Tarreau @ 2007-04-21 11:51 UTC (permalink / raw)
  To: Björn Steinbrink, Ingo Molnar, linux-kernel

Hi Björn,

On Sat, Apr 21, 2007 at 01:29:41PM +0200, Björn Steinbrink wrote:
> Hi,
> 
> On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote:
> > > another thing i noticed: when using a -y larger than 1, the window 
> > > title (at least on Metacity) overlaps and thus the ocbench tasks have 
> > > different X overhead and get scheduled a bit asymmetrically as well. Is 
> > > there any way to start them up title-less perhaps?
> > 
> > It has annoyed me a bit too, but I'm no X developer at all, so I don't
> > know at all if it's possible nor how to do this. I know that my window
> > manager even adds title bars to xeyes, so I'm not sure we can do this.
> > 
> > Right now, I've added a "-B <border size>" argument so that you can
> > skip the size of your title bar. It's dirty but it's not my main job :-)
> 
> Here's a small patch that makes the windows unmanaged, which also causes
> ocbench to start up quite a bit faster on my box with larger number of
> windows, so it probably avoids some window manager overhead, which is a
> nice side-effect.

Excellent ! I've just merged it but made it conditional on a "-u" argument
so that we can keep the previous behaviour (moving the windows is useful
especially when there are few of them).

So the new version 0.5 is available there :

  http://linux.1wt.eu/sched/

I believe it's the last one for today as I'm late on some work.

Thanks !
Willy


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18  6:55                                   ` Matt Mackall
  2007-04-18  7:24                                     ` Nick Piggin
@ 2007-04-21 13:33                                     ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-21 13:33 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Thomas Gleixner

Matt Mackall wrote:
> On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote:

>>> [2] It's trivial to construct two or more perfectly reasonable and
>>> desirable definitions of fairness that are mutually incompatible.
>> Probably not if you use common sense, and in the context of a replacement
>> for the 2.6 scheduler.
> 
> Ok, trivial example. You cannot allocate equal CPU time to
> processes/tasks and simultaneously allocate equal time to thread
> groups. Is it common sense that a heavily-threaded app should be able
> to get hugely more CPU than a well-written app? No. I don't want Joe's
> stupid Java app to make my compile crawl.
> 
> On the other hand, if my heavily threaded app is, say, a voicemail
> server serving 30 customers, I probably want it to get 30x the CPU of
> my gzip job.
> 
Matt, you tickled a thought... on one hand we have a single user running 
a threaded application, and it ideally should get the same total CPU as 
a user running a single thread process. On the other hand we have a 
threaded application, call it sendmail, nnrpd, httpd, bind, whatever. In 
that case each thread is really providing service for an independent 
user, and should get an appropriate share of the CPU.

Perhaps the solution is to add a means for identifying server processes, 
by capability, or by membership in a "server" group, or by having the 
initiating process set some flag at exec() time. That doesn't 
necessarily solve problems, but it may provide more information to allow 
them to be soluble.
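
To make the two cases concrete, here is a small sketch of how such a
"server" marker could feed into CPU shares. Everything in it is an
assumption made for illustration: the is_server flag, the struct and the
weighting rule (a server group counts once per thread, any other group
counts once in total) are hypothetical and do not describe any existing
kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a thread group competing for CPU. */
struct task_group {
	int nthreads;   /* runnable threads in the group */
	int is_server;  /* nonzero: each thread serves an independent user */
};

/* A server group is weighted per thread, any other group counts once. */
static int group_weight(const struct task_group *g)
{
	assert(g != NULL && g->nthreads > 0);
	return g->is_server ? g->nthreads : 1;
}

/* Fraction of the CPU that group idx would receive among n groups. */
static double group_share(const struct task_group *groups, size_t n,
			  size_t idx)
{
	int total = 0;
	size_t i;

	assert(groups != NULL && idx < n);
	for (i = 0; i < n; i++)
		total += group_weight(&groups[i]);
	return (double)group_weight(&groups[idx]) / (double)total;
}
```

With a 30-thread voicemail server flagged as a server next to a single gzip
job, the server gets 30/31 of the CPU. With the flag cleared (the
single-user threaded application case), both groups get 1/2.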

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-18 14:48                                 ` Linus Torvalds
  2007-04-18 15:23                                   ` Matt Mackall
  2007-04-19  3:18                                   ` Nick Piggin
@ 2007-04-21 13:40                                   ` Bill Davidsen
  2 siblings, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-21 13:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams,
	Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey,
	linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner

Linus Torvalds wrote:
> 
> On Wed, 18 Apr 2007, Matt Mackall wrote:
>> Why is X special? Because it does work on behalf of other processes?
>> Lots of things do this. Perhaps a scheduler should focus entirely on
>> the implicit and directed wakeup matrix and optimizing that
>> instead[1].
> 
> I 100% agree - the perfect scheduler would indeed take into account where 
> the wakeups come from, and try to "weigh" processes that help other 
> processes make progress more. That would naturally give server processes 
> more CPU power, because they help others
> 
> I don't believe for a second that "fairness" means "give everybody the 
> same amount of CPU". That's a totally illogical measure of fairness. All 
> processes are _not_ created equal.
> 
> That said, even trying to do "fairness by effective user ID" would 
> probably already do a lot. In a desktop environment, X would get as much 
> CPU time as the user processes, simply because it's in a different 
> protection domain (and that's really what "effective user ID" means: it's 
> not about "users", it's really about "protection domains").
> 
> And "fairness by euid" is probably a hell of a lot easier to do than 
> trying to figure out the wakeup matrix.
> 
You probably want to consider the controlling terminal as well...  do 
you want to have people starting 'at' jobs competing on equal footing 
with people typing at a terminal? I'm not offering an answer, just 
raising the question.

And for some database applications, everyone in a group may connect with 
the same login-id, then do sub authorization to the database 
application. euid may be an issue there as well.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-19  8:00                                                   ` Ingo Molnar
  2007-04-19 15:43                                                     ` Davide Libenzi
@ 2007-04-21 14:09                                                     ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-21 14:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin,
	William Lee Irwin III, Peter Williams, Mike Galbraith,
	Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> * Davide Libenzi <davidel@xmailserver.org> wrote:

>> The same user nicing two different multi-threaded processes would 
>> expect a predictable CPU distribution too. [...]
> 
> i disagree that the user 'would expect' this. Some users might. Others 
> would say: 'my 10-thread rendering engine is more important than a 
> 1-thread job because it's using 10 threads for a reason'. And the CFS 
> feedback so far strengthens this point: the default behavior of treating 
> the thread as a single scheduling (and CPU time accounting) unit works 
> pretty well on the desktop.
> 
If by desktop you mean "one and only one interactive user," that's true. 
On a shared machine it's very hard to preserve any semblance of fairness 
when one user gets far more than another, based not on the value of what 
they're doing but the tools they use to do it.

> think about it in another, 'kernel policy' way as well: we'd like to 
> _encourage_ more parallel user applications. Hurting them by accounting 
> all threads together sends the exact opposite message.
> 
Why is that? There are lots of things which are intrinsically single 
threaded, how are we hurting multi-threaded applications by 
refusing to give them more CPU than an application running on behalf of 
another user? By accounting all threads together we encourage writing an 
application in the most logical way. Threads are a solution, not a goal 
in themselves.

>> [...] Doing that efficiently (the old per-cpu run-queue is pretty nice 
>> from many POVs) is the real challenge.
> 
> yeah.
> 
> 	Ingo


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-20  3:57                                             ` Nick Piggin
@ 2007-04-21 14:55                                               ` Mark Lord
  2007-04-22 12:54                                                 ` Mark Lord
  0 siblings, 1 reply; 712+ messages in thread
From: Mark Lord @ 2007-04-21 14:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Nick Piggin wrote:
> On Thu, Apr 19, 2007 at 09:17:25AM -0400, Mark Lord wrote:
>> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity
>> on my 2GHz P-M single-core non-HT machine with SD.
> 
> Is this with or without X reniced?

That was with no manual jiggling, everything the same as with stock kernels,
except that stock kernels don't kill interactivity here.

>> But with the very first posted version of CFS by Ingo,
>> I can do "make -j2" no problem and still have a nicely interactive desktop.
> 
> How well does cfs run if you have the granularity set to something
> like 30ms (30000000)?

Dunno, I've put this stuff aside for now until things settle down.
With four schedulers, and lots of patches / revisions / tuning-knobs,
there's just no way to keep up with it all here.

Cheers

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-21 14:55                                               ` Mark Lord
@ 2007-04-22 12:54                                                 ` Mark Lord
  2007-04-22 12:58                                                   ` Con Kolivas
  0 siblings, 1 reply; 712+ messages in thread
From: Mark Lord @ 2007-04-22 12:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Just to throw another possibly-overlooked variable into the mess:

My system here is using the on-demand cpufreq policy governor.
I wonder how that interacts with the various schedulers here?

I suppose for the "make" kernel case, after a couple of seconds
the cpufreq would hit max and stay there for the rest of the build,
so it shouldn't really be a factor for (non-)interactivity during the build.

Or should it?

Cheers

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-22 12:54                                                 ` Mark Lord
@ 2007-04-22 12:58                                                   ` Con Kolivas
  0 siblings, 0 replies; 712+ messages in thread
From: Con Kolivas @ 2007-04-22 12:58 UTC (permalink / raw)
  To: Mark Lord
  Cc: Nick Piggin, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Matt Mackall, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Sunday 22 April 2007 22:54, Mark Lord wrote:
> Just to throw another possibly-overlooked variable into the mess:
>
> My system here is using the on-demand cpufreq policy governor.
> I wonder how that interacts with the various schedulers here?
>
> I suppose for the "make" kernel case, after a couple of seconds
> the cpufreq would hit max and stay there for the rest of the build,
> so it shouldn't really be a factor for (non-)interactivity during the
> build.
>
> Or should it?

Short answer: shouldn't matter :)

-- 
-ck

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-20  4:09                                             ` Nick Piggin
@ 2007-04-24 15:50                                               ` Ray Lee
  2007-04-24 16:23                                                 ` Matt Mackall
  0 siblings, 1 reply; 712+ messages in thread
From: Ray Lee @ 2007-04-24 15:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: ray-gmail, Con Kolivas, Ingo Molnar, Andrew Morton,
	Linus Torvalds, Matt Mackall, William Lee Irwin III,
	Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

Nick Piggin wrote:
> On Thu, Apr 19, 2007 at 12:26:03PM -0700, Ray Lee wrote:
>> On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote:
>>> The one fly in the ointment for
>>> linux remains X. I am still, to this moment, completely and utterly stunned
>>> at why everyone is trying to find increasingly complex unique ways to 
>>> manage
>>> X when all it needs is more cpu[1].
>> [...and hence should be reniced]
>>
>> The problem is that X is not unique. There's postgresql, memcached,
>> mysql, db2, a little embedded app I wrote... all of these perform work
>> on behalf of another process. It's just most *noticeable* with X, as
>> pretty much everyone is running that.
> 
> But for most of those apps, we don't actually care if they do fairly
> degrade in performance as other loads on the system ramp up.

(Who's this 'we' kemosabe? I do. Desktop systems are increasingly using
databases for their day-to-day tasks. As they should, a db is not
something that should be reinvented poorly.)

> However
> the user prefers X to be given priority in these situations. Whether
> that is the design of X, x clients, or the human condition really
> doesn't matter two hoots to the scheduler.

Hmm, let's try this again. Anything that communicates out of process
as part of its normal usage for Getting Work Done gets impacted by the
scheduler. That means pipelines in the shell, d-bus on the desktop, and
lots of other things that follow the unix philosophy of lots of little
programs communicating.

>> If we had some way for the scheduler to decide to donate part of a
>> client process's time slice to the server it just spoke to (with an
>> exponential dampening factor -- take 50% from the client, give 25% to
>> the server, toss the rest on the floor), that -- from my naive point
>> of view -- would be a step toward fixing the underlying issue. Or I
>> might be spouting crap, who knows.
> 
> Firstly, lots of clients in your list are remote. X usually isn't.

They really aren't, unless you happen to work somewhere that can afford
to dedicate a box to a db, which suddenly makes the scheduler a dull
topic.

For example, I have a db and web server installed on my laptop, so
that the few times that I have to do web app programming (while wearing
a mustache and glasses so that I don't have to admit to it in polite
company), I can be functional with just one computer.

> However for X, a syscall or something to donate time might not be
> such a bad idea...

We have one already, it's called write(). We have another called
read(), too. Okay, so they have some data related side-effects other
than the scheduler hints, but I claim the scheduler hint is already
implicitly there.

> but given a couple of X clients and a server
> against a parallel make, this is probably just going to make the
> clients slow down as well without giving enough priority to the
> server.

Do you have data, or at least a theory to back up that hypothesis?

> X isn't special so much because it does work on behalf of others
> (as you said, lots of things do that). It is special simply because
> we _want_ rendering to have priority of the CPU

Really not. I'm trying to get across that this is a general problem
with interprocess communication, or any systems that rely on multiple
processes to make forward progress on a problem. Sure, let the clients
make forward progress until they can't any more. If they stop making
forward progress by blocking on a read or sleeping after a write to
another process, then there's a big hint there as to who should get
focus next.
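
The donation arithmetic quoted earlier in this message (take 50% from the
client, give 25% to the server, toss the rest on the floor) can be written
down as a sketch. It assumes the 25% is measured against the client's
original slice, which is one reading of that sentence, and the names
slice_pair and donate_slice are hypothetical.

```c
#include <assert.h>

/* Remaining timeslices, in nanoseconds, after a donation. */
struct slice_pair {
	long client_ns;
	long server_ns;
};

/* When a client blocks on a server it just spoke to, take half of the
 * client's remaining slice, hand half of that to the server and discard
 * the rest.  The discarded part is the damping: a chain of donations
 * shrinks geometrically instead of circulating forever. */
static struct slice_pair donate_slice(long client_ns, long server_ns)
{
	struct slice_pair out;
	long taken, given;

	assert(client_ns >= 0 && server_ns >= 0);
	taken = client_ns / 2;  /* 50% leaves the client */
	given = taken / 2;      /* 25% of the original reaches the server */
	out.client_ns = client_ns - taken;
	out.server_ns = server_ns + given;
	return out;
}
```

For a client with 100 units left and a server with 40, this leaves the
client with 50 and the server with 65, and 25 units are dropped.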

> (if you shifted CPU
> intensive rendering to the clients, you'd most likely want to give
> them priority too); nice, right?

They'd have it automatically, if they were spending their time computing
rather than rendering.

Ray

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Renice X for cpu schedulers
  2007-04-24 15:50                                               ` Ray Lee
@ 2007-04-24 16:23                                                 ` Matt Mackall
  0 siblings, 0 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-24 16:23 UTC (permalink / raw)
  To: Ray Lee
  Cc: Nick Piggin, ray-gmail, Con Kolivas, Ingo Molnar, Andrew Morton,
	Linus Torvalds, William Lee Irwin III, Peter Williams,
	Mike Galbraith, ck list, Bill Huey, linux-kernel,
	Arjan van de Ven, Thomas Gleixner

On Tue, Apr 24, 2007 at 08:50:20AM -0700, Ray Lee wrote:
> > Firstly, lots of clients in your list are remote. X usually isn't.
> 
> They really aren't, unless you happen to work somewhere that can afford
> to dedicate a box to a db, which suddenly makes the scheduler a dull
> topic.
> 
> For example, I have a db and web server installed on my laptop, so
> that the few times that I have to do web app programming (while wearing
> a mustache and glasses so that I don't have to admit to it in polite
> company), I can be functional with just one computer.

Indeed. The vast majority of people doing "LAMP" web services are
doing it on a single machine. Or VM for that matter.

It seems that this is a lot like the priority inheritance problem. If
a nice -19 process blocks on the db running at nice 0, the db ought to
get a boost until it wakes the original process up. The same should
apply at the level of dynamic priorities at the same nice level.
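
A sketch of that boost rule, reading "nice -19" as a nice value of -19 (that
is, high priority) and using the usual convention that a lower nice value
means higher priority. The function name and the idea of passing the blocked
waiters as an array are assumptions made for illustration, not an existing
kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Effective nice level of a server while nwaiters tasks are blocked on
 * it: the most favourable (lowest) nice value among the server itself
 * and its waiters.  Once a waiter has been woken it drops out of the
 * array and the boost goes away with it. */
static int effective_nice(int own_nice, const int *waiter_nice,
			  size_t nwaiters)
{
	int eff = own_nice;
	size_t i;

	assert(nwaiters == 0 || waiter_nice != NULL);
	for (i = 0; i < nwaiters; i++)
		if (waiter_nice[i] < eff)
			eff = waiter_nice[i];
	return eff;
}
```

A db at nice 0 with a nice -19 client blocked on it would run at -19 until it
wakes that client up, then fall back to 0.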

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-19  7:04               ` Ingo Molnar
  2007-04-19  9:05                 ` Nigel Cunningham
@ 2007-04-24 20:23                 ` Pavel Machek
  2007-04-24 20:41                   ` Linus Torvalds
  1 sibling, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-24 20:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Linus Torvalds, Thomas Gleixner, Arjan van de Ven

Hi!

> > From subsequent emails, I think you already got your answer, but just 
> > in case...
> > 
> > Yes, if you enabled "Replace swsusp by default" and you already had it 
> > set up for getting swsusp to resume. If not, and you're using an 
> > initrd/ramfs, you'll need to modify it to
> > "echo > /sys/power/suspend2/do_resume" after /sys and /proc are
> > mounted but prior to mounting / and so on.
> 
> yeah, went with the default suggested by your patch:
> 
>    CONFIG_SUSPEND2_REPLACE_SWSUSP=y
> 
> and it was pretty easy to set things up. I used
> "echo disk > /sys/power/state" to trigger it.
> 
> In hindsight it was all pretty straightforward and suspend2 worked 
> beautifully on an UP and on an SMP system i tried. So in exchange for 
> suspend2 folks debugging a bug in CFS here's some suspend2 review 
> feedback ;) Any plans about moving suspend2 to the upstream kernel? It 
> should be pretty easy for it to co-exist with the current swsuspend 
> code.

Well, current uswsusp code can do most of stuff suspend2 can do, with
20% (or so) of kernel code. 

The "major feature" that is missing is the ability to save 100% of
memory if it is all pagecache. I think that is not that important; we
have a 200-line patch to do that, but no one was able to verify it is
correct.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 20:23                 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
@ 2007-04-24 20:41                   ` Linus Torvalds
  2007-04-24 20:51                     ` Hua Zhong
                                       ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-24 20:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven



On Tue, 24 Apr 2007, Pavel Machek wrote:
> 
> Well, current uswsusp code can do most of stuff suspend2 can do, with
> 20% (or so) of kernel code. 

Btw, this is a totally inane argument.

If the code just moved somewhere else, it's not "less code".

You compare complete subsystems against complete subsystems, OR YOU DON'T 
COMPARE THEM AT ALL!

This whole notion that "kernel lines of code" is somehow different is a 
stupid and idiotic _disease_ that is spread by microkernel people and 
people who have been brainwashed by them.

Code is code, and sometimes it's better in the kernel, sometimes it's 
better in user space, and you cannot say "we only have 10 lines of kernel 
code", if that is then combined with a million lines of user space code 
that actually is the only reason for the 10 lines of code in the first 
place.

Separation of code often makes things *harder* to understand and debug. A 
few prime examples of this f*cking idiotic stupid disease of discounting 
user level code because it somehow "doesn't matter" is:

 - the old 16-bit pcmcia layer that did all the "policy" in user space, 
   and only the "device access" in kernel space, and as a result _neither_ 
   actually knew what the hell they were actually doing, and debugging was 
   a nightmare.

   We've become a *lot* better off with a device layer that actually knows 
   and understands what it is doing, and having the code in one place, 
   rather than having two broken pieces.

   And we became better exactly by doing *more* in the kernel, and having 
   a *higher* level of abstraction. This is a BIG ISSUE. Abstraction is 
   good, but abstraction is good only if it is at a high enough level to 
   make sense and matter, and give the abstraction level a choice in how 
   to implement the lower layers.

 - the old module loader was also split into user/kernel space, and yes, 
   we made the kernel part "larger" by moving some parts into the kernel, 
   but in doing so, we actually made the *combined* code smaller, and a 
   hell of a lot more maintainable.

   It also automatically (again, because of a higher level of abstraction) 
   meant that the new module loader infrastructure was not only more 
   maintainable, but could actually *do* more. Suddenly you can do things
   like check for cryptographic signatures etc, because you know what 
   you're actually doing, as opposed to getting a ready-made "binary blob" 
   that you don't know anything about, that has been pre-linked etc.

So stop blathering about "less kernel code". That's the *least* of any 
worries. The only thing that matters is the end result, and trying to say 
that magically only one part counts is just dishonest and stupid.

In general, the kernel should be self-sufficient and *understand* what it 
is doing. If the kernel cannot understand the bigger picture, nobody can 
ever maintain the kernel, because the kernel is just a broken piece 
bobbing around in a mindless churning sea during a thunderstorm. You 
cannot have purpose, and you cannot improve yourself if you don't actually 
understand your lot in life. That's as true of kernels as it is of people.

User-space should set high-level policy, but if the kernel doesn't know 
what it's all about, the kernel can never do anything smarter and can 
never *fix* itself. That was the case in both PCMCIA and in kernel module 
loaders.

I have not a frigging clue whether that is the case in suspend2 vs 
uswsusp, but I object to this idiotic argument of counting "kernel code". 
That's simply not a valid argument. It never was.

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* RE: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 20:41                   ` Linus Torvalds
@ 2007-04-24 20:51                     ` Hua Zhong
  2007-04-24 20:54                     ` Ingo Molnar
  2007-04-24 21:24                     ` Pavel Machek
  2 siblings, 0 replies; 712+ messages in thread
From: Hua Zhong @ 2007-04-24 20:51 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Pavel Machek'
  Cc: 'Ingo Molnar', 'Nigel Cunningham',
	'Christian Hesse', 'Nick Piggin',
	'Mike Galbraith', linux-kernel, 'Con Kolivas',
	suspend2-devel, 'Andrew Morton',
	'Thomas Gleixner', 'Arjan van de Ven'

> This whole notion that "kernel lines of code" is somehow different is a
> stupid and idiotic _disease_ that is spread by microkernel people and
> people who have been brainwashed by them.

I think a lot of people are tired of this argument, but I am glad you speak
up (as you did last year wrt s2ram).

> The only thing that matters is the end result

Amen to that. The end result is not just code size, but quality and whether
it actually *works reliably*.

Cheers,

Hua


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 20:41                   ` Linus Torvalds
  2007-04-24 20:51                     ` Hua Zhong
@ 2007-04-24 20:54                     ` Ingo Molnar
  2007-04-24 21:29                       ` Pavel Machek
  2007-04-24 21:24                     ` Pavel Machek
  2 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-24 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I have not a frigging clue whether that is the case in suspend2 vs 
> uswsusp, but I object to this idiotic argument of counting "kernel 
> code". That's simply not a valid argument. It never was.

the raw linecount appears to be the following:

 suspend2-2.2.9.12-for-2.6.21-rc6.patch

   89 files changed, 16452 insertions(+), 69 deletions(-)

 $ suspend-0.5> countlines
 32260

so, while it's probably apples to oranges, uswsusp seems to be larger, 
while there's at least one feature that it is missing.

also, from the structure of the suspend2 patch it seemed to me that they 
could peacefully coexist in the kernel without stepping on each other's 
toes - why not do that? Users will then pick the winner.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 20:41                   ` Linus Torvalds
  2007-04-24 20:51                     ` Hua Zhong
  2007-04-24 20:54                     ` Ingo Molnar
@ 2007-04-24 21:24                     ` Pavel Machek
  2007-04-24 23:41                       ` Linus Torvalds
  2007-04-26 10:17                       ` Johannes Berg
  2 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-24 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > Well, current uswsusp code can do most of stuff suspend2 can do, with
> > 20% (or so) of kernel code. 
> 
> Btw, this is a totally inane argument.
> 
> If the code just moved somewhere else, it's not "less code".

It is not "just moved". It is in userspace, where we can use liblzf /
gcrypt / ( and vbetool for s2ram/s2both) as libraries. We have about
7000 LoC of userland code (that is not libraries).

> You compare complete subsystems against complete subsystems, OR YOU DON'T 
> COMPARE THEM AT ALL!

Ok, I do not know how big suspend2 user code is, but kernel uswsusp
(4 kLoC) + userland support (7 kLoC) is still smaller than suspend2
kernel code (+ ? kLoC suspend2 userland support).

> This whole notion that "kernel lines of code" is somehow different is a 
> stupid and idiotic _disease_ that is spread by microkernel people and 
> people who have been brainwashed by them.

Yep, sorry about that.

> Separation of code often makes things *harder* to understand and debug. A 
> few prime examples of this f*cking idiotic stupid disease of discounting 
> user level code because it somehow "doesn't matter" is:

I believe uswsusp user/kernel separation is clean enough. Kernel
provides "snapshot image" and "resume image". (Thanks go to Rafael for
very clean interface).

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 20:54                     ` Ingo Molnar
@ 2007-04-24 21:29                       ` Pavel Machek
  2007-04-24 22:24                         ` Ray Lee
  2007-04-25 21:41                         ` Matt Mackall
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-24 21:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > I have not a frigging clue whether that is the case in suspend2 vs 
> > uswsusp, but I object to this idiotic argument of counting "kernel 
> > code". That's simply not a valid argument. It never was.
> 
> the raw linecount appears to be the following:
> 
>  suspend2-2.2.9.12-for-2.6.21-rc6.patch
> 
>    89 files changed, 16452 insertions(+), 69 deletions(-)
> 
>  $ suspend-0.5> countlines
>  32260

(I'm getting very different numbers here. Userland part:)

pavel@amd:~/sf/suspend$ wc -l *.c
   125 bootsplash.c
    12 breakit.c
   119 config.c
   136 delme.c
   207 dmidecode.c
    92 encrypt.c
   222 keygen.c
   447 md5.c
   286 radeontool.c
   870 resume.c
   434 s2ram.c
    78 splash.c
    73 splashy_funcs.c
  1481 suspend.c
   117 swap-offset.c
    11 vfork_test.c
   123 vt.c
   413 whitelist.c
   136 whitelist2.c
  5382 total
pavel@amd:~/sf/suspend$ wc -l *.h
    23 bootsplash.h
    26 config.h
    62 encrypt.h
   106 md5.h
  1764 radeon_reg.h
    20 s2ram.h
    26 splash.h
    25 splashy_funcs.h
   217 swsusp.h
    10 vt.h
  2279 total
pavel@amd:~/sf/suspend$
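
For reference, the two wc totals above can be combined with the other
figures quoted in this thread. A minimal sketch: the 4000-line kernel
number is the rough "4 kLoC" estimate given elsewhere in the thread, not
a measurement, and the variable names are illustrative only.

```python
# Totals reported by the two "wc -l" runs above.
userland_c_lines = 5382
userland_h_lines = 2279

# Rough estimate quoted elsewhere in this thread ("4 kLoC"), not measured.
kernel_uswsusp_estimate = 4000

# Insertions reported for suspend2-2.2.9.12-for-2.6.21-rc6.patch.
suspend2_patch_insertions = 16452

userland_total = userland_c_lines + userland_h_lines
uswsusp_total = userland_total + kernel_uswsusp_estimate

print("uswsusp userland:", userland_total)
print("uswsusp kernel + userland (estimate):", uswsusp_total)
print("suspend2 kernel patch:", suspend2_patch_insertions)
```

The suspend2 userland tools are not counted here because no figure for
them is given, so the comparison is still not subsystem against
subsystem.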

> so, while it's probably apples to oranges, uswsusp seems to be larger, 
> while there's at least one feature that it is missing.

(We are talking "save 100% memory" here).

As I said, that one feature is doable in uswsusp, too. It is 200
lines. It also makes mm <-> swsusp interaction _way_ more complex, and
no one was able to review it. It will corrupt memory if we got it
wrong.

(Suspend2 has the same problem. It includes that same feature, and
no one is able to review it. It has a few more problems.)

> also, from the structure of the suspend2 patch it seemed to me that they 
> could peacefully coexist in the kernel without stepping on each other's 
> toes - why not do that? Users will then pick the winner.

We do not want to fragment the testing base, and suspend2 does not
really have any interesting features over uswsusp.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 21:29                       ` Pavel Machek
@ 2007-04-24 22:24                         ` Ray Lee
  2007-04-25 21:41                         ` Matt Mackall
  1 sibling, 0 replies; 712+ messages in thread
From: Ray Lee @ 2007-04-24 22:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On 4/24/07, Pavel Machek <pavel@ucw.cz> wrote:
> > so, while it's probably apples to oranges, uswsusp seems to be larger,
> > while there's at least one feature that it is missing.
>
> (We are talking "save 100% memory" here).
>
> As I said, that one feature is doable in uswsusp, too. It is 200
> lines. It also makes mm <-> swsusp interaction _way_ more complex, and

Sounds like the perfect reason to put that in kernel space.

Ray

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 21:24                     ` Pavel Machek
@ 2007-04-24 23:41                       ` Linus Torvalds
  2007-04-25  1:06                         ` Olivier Galibert
                                           ` (2 more replies)
  2007-04-26 10:17                       ` Johannes Berg
  1 sibling, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-24 23:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven



On Tue, 24 Apr 2007, Pavel Machek wrote:
> > 
> > If the code just moved somewhere else, it's not "less code".
> 
> It is not "just moved". It is in userspace, where we can use liblzf /
> gcrypt / ( and vbetool for s2ram/s2both) as libraries. We have about
> 7000 LoC of userland code (that is not libraries).

If it's in user land, we also have 

 - communication difficulties between two parts, and all the *crap* that 
   tends to entail (ie legacy interfaces forever, and upgrading one 
   without the other etc)

 - people who work on the kernel part are working "blind" (ie they are at 
   the mercy of whatever userland does, and it's not a "contained" 
   subsystem). This just ends up becoming worse when you then interact 
   with ten different versions of the user-land stuff, thanks to small 
   tweaks by five different vendors, and a hundred random people.

And don't tell me that doesn't happen. Maybe it doesn't happen _now_, 
because people who use it all get the patches from one place, but the 
moment we start talking about integration into the standard kernel, that 
means that the kernel needs to work regardless of whether somebody uses 
SuSE, RH, Fedora, Ubuntu or cooked his own distro entirely using some 
development version of the suspend user-space tools.

This is why I don't believe in the whole kernel-line-counting thing. I'm 
personally 100% convinced that it's better to have ten times as many lines 
in the kernel, if it means that you can just forget about version skew and 
bad user-space interfaces etc.

So if you want to enumerate "good" points, you'd damn well also face the 
_problems_.

This is why there's a lot to be said for

	echo mem > /sys/power/state

and being able to follow the path through _one_ object (the kernel) over 
trying to figure out the interaction between many different parts with 
different versions.
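
The single entry point above is just a write to one file. A minimal
sketch of the same request from Python, assuming the usual behaviour
that reading /sys/power/state lists the supported states; the
request_sleep() helper is a made-up name, and the demo deliberately runs
against a scratch file, since writing to the real file needs root and
really does suspend the machine.

```python
import os
import tempfile


def request_sleep(state: str, path: str = "/sys/power/state") -> None:
    """Ask for a sleep state by writing its name to the state file."""
    # The file is expected to list supported states, e.g. "standby mem disk".
    with open(path) as f:
        supported = f.read().split()
    if state not in supported:
        raise ValueError(f"{state!r} not offered by {path}: {supported}")
    with open(path, "w") as f:
        f.write(state)


# Demo against a stand-in file, so nothing is actually suspended.
with tempfile.TemporaryDirectory() as scratch:
    fake_state = os.path.join(scratch, "state")
    with open(fake_state, "w") as f:
        f.write("standby mem disk\n")

    request_sleep("mem", path=fake_state)

    with open(fake_state) as f:
        written = f.read()
```

Everything after the write happens inside the kernel, which is the
property being argued for: there is no second component whose version
has to match.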

> I believe uswsusp user/kernel separation is clean enough. Kernel
> provides "snapshot image" and "resume image". (Thanks go to Rafael for
> very clean interface).

Now, *that* is the kind of argument that matters.

Quite frankly, if you want to convince me, it's not by "lines of kernel 
code", but by talking about easy-to-understand interfaces that actually 
do one thing and do it well (and by "one thing", I mean "one _whole_ 
thing"). Because I care a lot less about lines of code than about 
"maintainable interfaces that people can think about and debug".

I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
whole thing. I think they've _all_ caused problems for the "true" suspend 
(suspend-to-ram), and the last thing I want to see is three or four 
different suspend-to-disk implementations.  So unlike Ingo, I don't think 
"let's just integrate them all side-by-side and maintain them and look who 
wins" is really a good idea.

How many different magic ioctl's does the thing introduce? Is it really 
just *two* entry-points (and how simple are they, interface-wise), and 
nothing else?

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 23:41                       ` Linus Torvalds
@ 2007-04-25  1:06                         ` Olivier Galibert
  2007-04-25  6:41                         ` Ingo Molnar
  2007-04-25  7:23                         ` Pavel Machek
  2 siblings, 0 replies; 712+ messages in thread
From: Olivier Galibert @ 2007-04-25  1:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Tue, Apr 24, 2007 at 04:41:58PM -0700, Linus Torvalds wrote:
> How many different magic ioctl's does the thing introduce? Is it really 
> just *two* entry-points (and how simple are they, interface-wise), and 
> nothing else?

Aren't you a little late to the party here?  The userland version is
the one that currently is in the kernel, after all the people who said
"doing it in userland is not necessarily a good idea" got happily
ignored.  Suspend2, which is the continuation of the fully-in-kernel
one, is the one that has been constantly rejected by Pavel, lately by
saying "it should be done in userspace", and hence never merged.

Incidentally, it's 13 ioctls, and it's documented in
Documentation/power/userland-swsusp.txt in a hard drive near you.  I
especially like the "get the available swap space in bytes" one that
can only handle 32 bits.
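
To put a number on the 32-bit limit: a byte count held in 32 bits tops
out just under 4 GiB. A small sketch, assuming an unsigned 32-bit value
that wraps around on overflow; whether the real ioctl wraps, clamps or
fails is not stated here, so the wrap is purely an assumption.

```python
GIB = 2**30
U32_MAX = 2**32 - 1   # largest byte count an unsigned 32-bit field can hold


def as_u32(nbytes: int) -> int:
    """Keep only the low 32 bits, as a wrapping 32-bit counter would."""
    return nbytes & U32_MAX


limit_gib = (U32_MAX + 1) / GIB   # 4.0

free_swap = 6 * GIB               # hypothetical machine with 6 GiB of free swap
reported = as_u32(free_swap)      # what would come back under the wrap assumption

print(f"limit: {limit_gib} GiB, reported: {reported / GIB} GiB")
```

Under that assumption a machine with 6 GiB of free swap would appear to
have 2 GiB.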

  OG.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 23:41                       ` Linus Torvalds
  2007-04-25  1:06                         ` Olivier Galibert
@ 2007-04-25  6:41                         ` Ingo Molnar
  2007-04-25  7:29                           ` Pavel Machek
  2007-04-25  7:23                         ` Pavel Machek
  2 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-25  6:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I absolutely detest all suspend-to-disk crap. Quite frankly, I hate 
> the whole thing. I think they've _all_ caused problems for the "true" 
> suspend (suspend-to-ram), and the last thing I want to see is three or 
> four different suspend-to-disk implementations.  So unlike Ingo, I 
> don't think "let's just integrate them all side-by-side and maintain 
> them and look who wins" is really a good idea.
>
> How many different magic ioctl's does the thing introduce? Is it 
> really just *two* entry-points (and how simple are they, 
> interface-wise), and nothing else?

userspace-driven-suspend is already in the kernel, today. So it's not 
really "two versions side by side doing the same thing", but more of:

           A B C + D E F G H

where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 
(uswsusp of course redoes 'DEFGH' in user-space its own way, and there 
is the inevitable "+" glue code as well, but it's at least not two 
parallel versions of the same thing in the kernel, which would be 
gross.)

My original mail was about the following thing: i tried the suspend2 
patch (which just makes "echo disk > /sys/power/state" work as expected, 
as long as you give the booting up kernel image an idea about where the 
swap partition we suspended to is, via a single boot option) and that it 
was pretty straightforward and worked well, and that i think its way of 
reusing the existing suspend infrastructure and doing the add-ons 
cleanly while keeping the existing user-hibernate code intact looked 
viable to me.

I.e. to me it looked like while there are apparent conflicts of 
personalities suspend2 did not really seem to be a hostile 
reimplementation of 'A B C', but that it tries to build upon 'A B C' and 
just has a different technical opinion about whether 'DEFGH' should be 
in the kernel or outside of it.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 23:41                       ` Linus Torvalds
  2007-04-25  1:06                         ` Olivier Galibert
  2007-04-25  6:41                         ` Ingo Molnar
@ 2007-04-25  7:23                         ` Pavel Machek
  2007-04-25  8:48                           ` Xavier Bestel
                                             ` (4 more replies)
  2 siblings, 5 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25  7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> This is why there's a lot to be said for
> 
> 	echo mem > /sys/power/state
> 
> and being able to follow the path through _one_ object (the kernel) over 
> trying to figure out the interaction between many different parts with 
> different versions.

The 'promise' is 'if you can get echo disk > /sys/power/state working,
uswsusp will work, too'. IOW it should be OK to debug the in-kernel
parts only.

Even I am running in-kernel swsusp, but my managers were pretty clear
they want a graphical progress bar hiding all the 'ugly' swsusp
messages... and in the end the same uswsusp enables compression, too.

> I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> whole thing. I think they've _all_ caused problems for the "true" suspend 
> (suspend-to-ram), and the last thing I want to see is three or four 

Well, it is a bit more complex than that.

suspend-to-disk is a workaround for

	'suspend-to-ram eats too much power' (plus some details like
	being able to replace battery).

suspend-to-ram is a workaround for

	'idle machine takes way too much power' (plus some details
	like don't spin the disk so that machine is safe to carry).

I'm starting to think that we should fix the idle power consumption
problem. Cell phones do it right. They pretend to be ready/idle all
the time, yet they have _days_ of standby.

OLPC can do something like that, too: it is capable of entering
suspend-to-ram with screen on and input devices ready to wake the
system up.

And with the right network card (and the right userland) ... I think
normal PCs could enter suspend-to-ram during screensaver, too. When you
are about to turn off the screen, the machine should enable WOL on any
packet, arm RTC wakeup for the next packet, and s2ram happily.

(Obviously we are far away from that on PC.)
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  6:41                         ` Ingo Molnar
@ 2007-04-25  7:29                           ` Pavel Machek
  2007-04-25  7:48                             ` Dumitru Ciobarcianu
                                               ` (3 more replies)
  0 siblings, 4 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25  7:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate 
> > the whole thing. I think they've _all_ caused problems for the "true" 
> > suspend (suspend-to-ram), and the last thing I want to see is three or 
> > four different suspend-to-disk implementations.  So unlike Ingo, I 
> > don't think "let's just integrate them all side-by-side and maintain 
> > them and look who wins" is really a good idea.
> >
> > How many different magic ioctl's does the thing introduce? Is it 
> > really just *two* entry-points (and how simple are they, 
> > interface-wise), and nothing else?
> 
> userspace-driven-suspend is already in the kernel, today. So it's not 
> really "two versions side by side doing the same thing", but more of:
> 
>            A B C + D E F G H
> 
> where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
> suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 

Actually, we have 'D H' in kernel, today. It is called swsusp...
(Encryption, swapFile support and Graphical progress are missing from
today's kernel.)

> My original mail was about the following thing: i tried the suspend2 
> patch (which just makes "echo disk > /sys/power/state" work as expected, 
> as long as you give the booting up kernel image an idea about where the 

..and it means that 'echo disk > ...' should work w/o suspend2 patch,
too. (Just try it). You'll miss compression part, but that provides
only small speedup.
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:29                           ` Pavel Machek
@ 2007-04-25  7:48                             ` Dumitru Ciobarcianu
  2007-04-25  8:10                               ` Pavel Machek
  2007-04-25  8:48                             ` Nigel Cunningham
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 712+ messages in thread
From: Dumitru Ciobarcianu @ 2007-04-25  7:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 07:29 +0000, Pavel Machek wrote:
> > userspace-driven-suspend is already in the kernel, today. So it's not 
> > really "two versions side by side doing the same thing", but more of:
> > 
> >            A B C + D E F G H
> > 
> > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
> > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 
> 
> Actually, we have 'D H' in kernel, today. It is called swsusp...
> (Encryption, swapFile support and Graphical progress are missing from
> today's kernel.)

Please stop using FUD. 
Graphical progress is not in the kernel, even with suspend2.

> 
> > My original mail was about the following thing: i tried the suspend2 
> > patch (which just makes "echo disk > /sys/power/state" work as expected, 
> > as long as you give the booting up kernel image an idea about where the 
> 
> ..and it means that 'echo disk > ...' should work w/o suspend2 patch,
> too. (Just try it). You'll miss compression part, but that provides
> only small speedup.

I beg to differ:
  Compressed 904687616 bytes into 418828687 (53 percent compression).

Almost 500 MB less to write (did I mention it writes the full image?).
Now imagine the time it takes to write that with those pesky 4200 rpm
laptop HDDs.
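
The arithmetic behind those numbers, as a quick sketch. The two byte
counts are the ones from the log line above; the 20 MB/s sustained write
speed is only an assumed figure for a slow laptop disk, not a
measurement, and CPU time spent compressing is ignored.

```python
uncompressed = 904687616   # bytes, from the log line above
compressed = 418828687     # bytes, from the log line above

saved = uncompressed - compressed
reduction = saved / uncompressed          # fraction of the image not written

ASSUMED_MB_PER_S = 20                     # assumption: slow 4200 rpm laptop disk


def write_seconds(nbytes: int, mb_per_s: float = ASSUMED_MB_PER_S) -> float:
    """Time to write nbytes at a constant rate given in MB/s (10**6 bytes)."""
    return nbytes / (mb_per_s * 1_000_000)


print(f"saved: {saved} bytes ({reduction:.1%} smaller)")
print(f"write time without compression: {write_seconds(uncompressed):.1f} s")
print(f"write time with compression:    {write_seconds(compressed):.1f} s")
```

At that assumed rate the compressed image takes roughly half as long to
write; a faster or slower disk scales both times equally, so the ratio
stays the same.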

-- 
Cioby

"Mr Linus, how do you debug the kernel, what tools do you use?"
"Ever heard of prinf ?"
(From a presentation at the "Politehnica" University of Bucharest, 1995)


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:48                             ` Dumitru Ciobarcianu
@ 2007-04-25  8:10                               ` Pavel Machek
  2007-04-25  8:22                                 ` Dumitru Ciobarcianu
  2007-04-26 11:12                                 ` Pekka Enberg
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25  8:10 UTC (permalink / raw)
  To: Dumitru Ciobarcianu
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > > userspace-driven-suspend is already in the kernel, today. So it's not 
> > > really "two versions side by side doing the same thing", but more of:
> > > 
> > >            A B C + D E F G H
> > > 
> > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
> > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 
> > 
> > Actually, we have 'D H' in kernel, today. It is called swsusp...
> > (Encryption, swapFile support and Graphical progress are missing from
> > today's kernel.)
> 
> Please stop using FUD. 
> Graphical progress is not in the kernel, even with suspend2.

It was ascii-art, but still 'graphical', last time I checked.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  8:10                               ` Pavel Machek
@ 2007-04-25  8:22                                 ` Dumitru Ciobarcianu
  2007-04-26 11:12                                 ` Pekka Enberg
  1 sibling, 0 replies; 712+ messages in thread
From: Dumitru Ciobarcianu @ 2007-04-25  8:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 08:10 +0000, Pavel Machek wrote:
> Hi!
> 
> > > > userspace-driven-suspend is already in the kernel, today. So it's not 
> > > > really "two versions side by side doing the same thing", but more of:
> > > > 
> > > >            A B C + D E F G H
> > > > 
> > > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
> > > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 
> > > 
> > > Actually, we have 'D H' in kernel, today. It is called swsusp...
> > > (Encryption, swapFile support and Graphical progress are missing from
> > > today's kernel.)
> > 
> > Please stop using FUD. 
> > Graphical progress it's not in the kernel, even with suspend2.
> 
> It was ascii-art, but still 'graphical', last time I checked.

That would be suspend2ui_text, a userspace app.
It also works without it.

-- 
Cioby



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:23                         ` Pavel Machek
@ 2007-04-25  8:48                           ` Xavier Bestel
  2007-04-25  8:50                             ` Nigel Cunningham
  2007-04-25  9:02                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti
                                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 712+ messages in thread
From: Xavier Bestel @ 2007-04-25  8:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> > whole thing. I think they've _all_ caused problems for the "true" suspend 
> > (suspend-to-ram), and the last thing I want to see is three or four 
> 
> Well, it is a bit more complex than that.
> 
> suspend-to-disk is a workaround for
> 
>         'suspend-to-ram eats too much power' (plus some details like
>         being able to replace battery).
> 
> suspend-to-ram is a workaround for
> 
>         'idle machine takes way too much power' (plus some details
>         like don't spin the disk so that machine is safe to carry).

I think it depends on who you ask. I personally think that suspend-to-
$youchoose is a workaround for the slowness of system startup. I never
turn off my laptop, I just suspend it.

(And guess what, it uses APM and suspend is really faster and way more
reliable than each kernel implementation I could try).

	Xav



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:29                           ` Pavel Machek
  2007-04-25  7:48                             ` Dumitru Ciobarcianu
@ 2007-04-25  8:48                             ` Nigel Cunningham
  2007-04-25 13:07                             ` Federico Heinz
  2007-04-25 19:38                             ` Kenneth Crudup
  3 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-25  8:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1911 bytes --]

Hi.

On Wed, 2007-04-25 at 07:29 +0000, Pavel Machek wrote:
> Hi!
> 
> > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate 
> > > the whole thing. I think they've _all_ caused problems for the "true" 
> > > suspend (suspend-to-ram), and the last thing I want to see is three or 
> > > four different suspend-to-disk implementations.  So unlike Ingo, I 
> > > don't think "let's just integrate them all side-by-side and maintain 
> > > them and look who wins" is really a good idea.
> > >
> > > How many different magic ioctl's does the thing introduce? Is it 
> > > really just *two* entry-points (and how simple are they, 
> > > interface-wise), and nothing else?
> > 
> > userspace-driven-suspend is already in the kernel, today. So it's not 
> > really "two versions side by side doing the same thing", but more of:
> > 
> >            A B C + D E F G H
> > 
> > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by 
> > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". 
> 
> Actually, we have 'D H' in kernel, today. It is called swsusp...
> (Encryption, swapFile support and Graphical progress are missing from
> today's kernel.)

Along with a lot of other things (see my "Reasons to merge Suspend2"
email from earlier in the day).

> > My original mail was about the following thing: i tried the suspend2 
> > patch (which just makes "echo disk > /sys/power/state" work as expected, 
> > as long as you give the booting up kernel image an idea about where the 
> 
> ..and it means that 'echo disk > ...' should work w/o suspend2 patch,
> too. (Just try it). You'll miss compression part, but that provides
> only small speedup.

Please don't spread misinformation to support your case. LZF compression
(which is what all Suspend2 users use AFAIK) generally doubles the speed
of your cycle.
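
A rough model of why that can hold, assuming the cycle is limited by
disk throughput rather than CPU and that compression shrinks the image
about 2:1. Both are assumptions for illustration, not measurements from
this thread, and write_seconds() is a hypothetical helper:

```c
/*
 * Estimated seconds to write (or read back) a hibernation image when
 * the disk is the bottleneck.
 *
 * image_mb       size of the uncompressed image in MB
 * disk_mb_per_s  sustained disk throughput in MB/s
 * ratio          uncompressed size / compressed size (1.0 = none)
 *
 * Assumption: compression itself costs no wall-clock time, i.e. the
 * CPU compresses faster than the disk can write.
 */
static double write_seconds(double image_mb, double disk_mb_per_s,
                            double ratio)
{
	return image_mb / ratio / disk_mb_per_s;
}
```

Under this model a 2:1 ratio halves the I/O time; if the CPU cannot keep
up with the disk, the gain is smaller.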

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  8:48                           ` Xavier Bestel
@ 2007-04-25  8:50                             ` Nigel Cunningham
  2007-04-25  9:07                               ` Xavier Bestel
  0 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-25  8:50 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1366 bytes --]

Hi.

On Wed, 2007-04-25 at 10:48 +0200, Xavier Bestel wrote:
> On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> > > whole thing. I think they've _all_ caused problems for the "true" suspend 
> > > (suspend-to-ram), and the last thing I want to see is three or four 
> > 
> > Well, it is a bit more complex than that.
> > 
> > suspend-to-disk is a workaround for
> > 
> >         'suspend-to-ram eats too much power' (plus some details like
> >         being able to replace battery).
> > 
> > suspend-to-ram is a workaround for
> > 
> >         'idle machine takes way too much power' (plus some details
> >         like don't spin the disk so that machine is safe to carry).
> 
> I think it depends on who you ask. I personally think that suspend-to-
> $youchoose is a workaround for the slowness of system startup. I never
> turn off my laptop, I just suspend it.
> 
> (And guess what, it uses APM and suspend is really faster and way more
> reliable than each kernel implementation I could try).

If you tried Suspend2 and had problems with reliability, please send me
logs. I'll do all I can to help. (I have to qualify it a bit, because
I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and
I'll fix it).

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and  suspend2:hang in atomic copy)
  2007-04-25  7:23                         ` Pavel Machek
  2007-04-25  8:48                           ` Xavier Bestel
@ 2007-04-25  9:02                           ` Romano Giannetti
  2007-04-25 19:16                             ` suspend2 merge Martin Steigerwald
  2007-04-25 15:18                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk
                                             ` (2 subsequent siblings)
  4 siblings, 1 reply; 712+ messages in thread
From: Romano Giannetti @ 2007-04-25  9:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> suspend-to-disk is a workaround for
>
>         'suspend-to-ram eats too much power' (plus some details like
>         being able to replace battery).
>

...and let me add 'suspend-to-disk' is a workaround for when s2ram does
not work for a gazillion interacting reasons (ACPI, vga bios, drm/dri,
you name it).
I am quite happy with s2ram now on my AMD-based vaio, but it started to
work with 2.6.17 kernels (Ubuntu Edgy, really), and the three years
before that suspend-to-disk (sometimes Pavel's, sometimes Nigel's)
saved the day (yes, it's much faster to use suspend-to-disk than to
shut down, reboot, and re-open all the applications).

So, please do not dismiss suspend-to-disk as "crap". It has its place
under the sun.

Romano

--
Romano Giannetti --- romano.giannetti@gmail.com
Sorry for the following disclaimer, it's attached by our outgoing server
and I cannot shut it up.



--
This communication contains confidential information. It is for the exclusive use of the intended addressee. If you are not the intended addressee, please note that any form of distribution, copying or use of this communication or the information in it is strictly prohibited by law. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy this message. Thank you for your cooperation.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  8:50                             ` Nigel Cunningham
@ 2007-04-25  9:07                               ` Xavier Bestel
  2007-04-25  9:19                                 ` Nigel Cunningham
  2007-04-26 18:18                                 ` Bill Davidsen
  0 siblings, 2 replies; 712+ messages in thread
From: Xavier Bestel @ 2007-04-25  9:07 UTC (permalink / raw)
  To: nigel
  Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote:
> > (And guess what, it uses APM and suspend is really faster and way more
> > reliable than each kernel implementation I could try).
> 
> If you tried Suspend2 and had problems with reliability, please send me
> logs. I'll do all I can to help. (I have to qualify it a bit, because
> I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and
> I'll fix it).

Does suspend2 work with APM ? After much trying, I think now the ACPI
implementation of my laptop (a vintage Compaq Armada 1700) is busted,
only APM works.

AFAIR the problem with suspend2 was that it didn't poweroff some parts
of the laptop (the led of the wifi pcmcia card was on, and the lcd light
was on too), but that was last year. Kernel's suspend kind of worked but
didn't resume (no reaction on button press). As I tried all this last
year, I may have forgotten some things.
Honestly, I like this laptop when it works flawlessly, so I don't see
many reasons to try *susp* again. I'll do it when I'm bored, just not
today.

Thanks,
	Xav



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  9:07                               ` Xavier Bestel
@ 2007-04-25  9:19                                 ` Nigel Cunningham
  2007-04-26 18:18                                 ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-25  9:19 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1755 bytes --]

Hi.

On Wed, 2007-04-25 at 11:07 +0200, Xavier Bestel wrote:
> On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote:
> > > (And guess what, it uses APM and suspend is really faster and way more
> > > reliable than each kernel implementation I could try).
> > 
> > If you tried Suspend2 and had problems with reliability, please send me
> > logs. I'll do all I can to help. (I have to qualify it a bit, because
> > I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and
> > I'll fix it).
> 
> Does suspend2 work with APM ? After much trying, I think now the ACPI
> implementation of my laptop (a vintage Compaq Armada 1700) is busted,
> only APM works.

It should do. If you set the powerdown method to 0, it will use
machine_power_off() instead of trying to use acpi, fall back to
machine_halt() if that fails and lastly (should not be needed) a
while(1) cpu_relax() loop.
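
A minimal userspace sketch of that fallback order, not the actual
Suspend2 code: the real machine_power_off() and machine_halt() do not
return on success, so the function pointers, the bool return values and
the name powerdown_method_0() below are assumptions made so the chain
can be exercised outside the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for machine_power_off() and machine_halt().
 * Each returns true if the machine "went down" and false if the
 * attempt failed and control came back to the caller.
 */
typedef bool (*power_fn)(void);

enum powerdown_result { POWERED_OFF, HALTED, SPINNING };

/* Powerdown method 0: power off, else halt, else spin forever. */
static enum powerdown_result powerdown_method_0(power_fn power_off,
                                                power_fn halt)
{
	if (power_off())
		return POWERED_OFF;
	if (halt())
		return HALTED;
	/* The kernel would sit here in: while (1) cpu_relax(); */
	return SPINNING;
}

/* Stub implementations for trying out the chain. */
static bool always_fails(void) { return false; }
static bool always_works(void) { return true; }
```

With these stubs, powerdown_method_0(always_fails, always_works) takes
the second branch and reports HALTED.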

> AFAIR the problem with suspend2 was that it didn't poweroff some parts
> of the laptop (the led of the wifi pcmcia card was on, and the lcd light
> was on too), but that was last year. Kernel's suspend kind of worked but
> didn't resume (no reaction on button press). As I tried all this last
> year, I may have forgotten some things.

The code to poweroff those parts will be dependent on the drivers
(assuming I'm making the right calls). If it's something where swsusp
works and suspend2 doesn't, it will be because I'm doing something
wrong. If they both don't do the right thing, then it's probably the
driver.

> Honestly, I like this laptop when it works flawlessly, so I don't see
> many reasons to try *susp* again. I'll do it when I'm bored, just not
> today.

Okay :) Just let me know if I can help.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:29                           ` Pavel Machek
  2007-04-25  7:48                             ` Dumitru Ciobarcianu
  2007-04-25  8:48                             ` Nigel Cunningham
@ 2007-04-25 13:07                             ` Federico Heinz
  2007-04-25 19:38                             ` Kenneth Crudup
  3 siblings, 0 replies; 712+ messages in thread
From: Federico Heinz @ 2007-04-25 13:07 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

Pavel Machek wrote:
> ..and it means that 'echo disk > ...' should work w/o suspend2 patch,
> too. (Just try it). You'll miss compression part, but that provides
> only small speedup.
>   

In my experience, the speedup is significant, both in hibernating and in 
waking up, and since the full image is written to disk, the system wakes 
up *usable*. It takes forever for a system that wakes up from uswsusp to 
be usable again, it keeps tripping over page faults for *minutes*.

    Fede


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:23                         ` Pavel Machek
  2007-04-25  8:48                           ` Xavier Bestel
  2007-04-25  9:02                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti
@ 2007-04-25 15:18                           ` Adrian Bunk
  2007-04-25 17:34                             ` Pavel Machek
  2007-04-25 19:43                           ` Kenneth Crudup
  2007-05-26 17:37                           ` Martin Steigerwald
  4 siblings, 1 reply; 712+ messages in thread
From: Adrian Bunk @ 2007-04-25 15:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, Apr 25, 2007 at 07:23:50AM +0000, Pavel Machek wrote:
> Hi!
> 
> > This is why there's a lot to be said for
> > 
> > 	echo mem > /sys/power/state
> > 
> > and being able to follow the path through _one_ object (the kernel) over 
> > trying to figure out the interaction between many different parts with 
> > different versions.
> 
> The 'promise' is 'if you can get echo disk > /sys/power/state working,
> uswsusp will work. too'. IOW it should be ok to debug the in-kernel
> parts, only.
> 
> Even I am running in-kernel swsusp, but my managers were pretty clear
> they want graphical progress bar hiding all the 'ugly' swsusp
> messages... and in the end the same uswsusp enables compression, too.
> 
> > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> > whole thing. I think they've _all_ caused problems for the "true" suspend 
> > (suspend-to-ram), and the last thing I want to see is three or four 
> 
> Well, it is a bit more complex than that.
> 
> suspend-to-disk is a workaround for
> 
> 	'suspend-to-ram eats too much power' (plus some details like
> 	being able to replace battery).
>...

Why does everyone think suspend-to-disk was a laptop-only thing?

My personal usage of suspend-to-disk is for turning the computer off in 
the evening and getting the complete FVWM with all programs running, 
open browser tabs,... back the next morning.

All I need for suspending is:
- echo "disk" > /sys/power/state

All I need for getting a running and usable system back is:
- turn on the power at the socket my computer is connected at
- swapoff -a; swapon -a       [1]
- wait a bit until the above commands finished

That's much more convenient than a cold boot.

And it's working with a plain 2.6.16 kernel and zero userspace support.

> 							Pavel

cu
Adrian

[1] required step: working with 1 or 2 GB swapped out is horrible

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 15:18                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk
@ 2007-04-25 17:34                             ` Pavel Machek
  2007-04-25 18:39                               ` Adrian Bunk
                                                 ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 17:34 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > Even I am running in-kernel swsusp, but my managers were pretty clear
> > they want graphical progress bar hiding all the 'ugly' swsusp
> > messages... and in the end the same uswsusp enables compression, too.
> > 
> > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> > > whole thing. I think they've _all_ caused problems for the "true" suspend 
> > > (suspend-to-ram), and the last thing I want to see is three or four 
> > 
> > Well, it is a bit more complex than that.
> > 
> > suspend-to-disk is a workaround for
> > 
> > 	'suspend-to-ram eats too much power' (plus some details like
> > 	being able to replace battery).
> >...
> 
> Why does everyone think suspend-to-disk was a laptop-only thing?
> 
> My personal usage of suspend-to-disk is for turning the computer off in 
> the evening and getting the complete FVWM with all programs running, 
> open browser tabs,... back the next morning.

Ok ok ok, suspend-to-disk has some other uses, too.

But ... you are really using suspend-to-disk as a workaround for "my
desktop takes too much power when idle". Imagine pressing "lock
screensaver" combination, and your machine going to low power mode
(3W?), immediately. (Quiet, too; you can't generate much noise for
3W). In the morning, you'd just press any key, machine would power up,
immediately... ok, you'd have to ifconfig eth0 down, so that spurious
packets on the local net would not wake your machine, with all its fans
etc.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 17:34                             ` Pavel Machek
@ 2007-04-25 18:39                               ` Adrian Bunk
  2007-04-25 18:50                                 ` Linus Torvalds
  2007-04-25 18:52                               ` Alon Bar-Lev
  2007-04-25 22:11                               ` Kenneth Crudup
  2 siblings, 1 reply; 712+ messages in thread
From: Adrian Bunk @ 2007-04-25 18:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, Apr 25, 2007 at 07:34:05PM +0200, Pavel Machek wrote:
> Hi!
> 
> > > Even I am running in-kernel swsusp, but my managers were pretty clear
> > > they want graphical progress bar hiding all the 'ugly' swsusp
> > > messages... and in the end the same uswsusp enables compression, too.
> > > 
> > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the 
> > > > whole thing. I think they've _all_ caused problems for the "true" suspend 
> > > > (suspend-to-ram), and the last thing I want to see is three or four 
> > > 
> > > Well, it is a bit more complex than that.
> > > 
> > > suspend-to-disk is a workaround for
> > > 
> > > 	'suspend-to-ram eats too much power' (plus some details like
> > > 	being able to replace battery).
> > >...
> > 
> > Why does everyone think suspend-to-disk was a laptop-only thing?
> > 
> > My personal usage of suspend-to-disk is for turning the computer off in 
> > the evening and getting the complete FVWM with all programs running, 
> > open browser tabs,... back the next morning.
> 
> Ok ok ok, suspend-to-disk has some other uses, too.
> 
> But ... you are really using suspend-to-disk as a workaround for "my
> desktop takes too much power when idle". Imagine pressing "lock
> screensaver" combination, and your machine going to low power mode
> (3W?), immediately. (Quiet, too; you can't generate much noise for
> 3W). In the morning, you'd just press any key, machine would power up,
> immediately... ok, you'd have to ifconfig eth0 down, so that spurious
> packets on the local net would wake your machine, with all its fans
> etc.

3W for the complete system? In CPU state S1? [1]
And even 3W would still be a waste of energy.

And what would be the advantage? The socket my computer is connected at 
is located below my bed so I can turn the power on while still lying in 
bed (the computer is not reachable from my bed). OK, I could create an 
external power button for the computer using longer cables connected to 
the motherboard, but I still haven't understood why this should be 
better for my use case than suspend-to-disk.

> 								Pavel

cu
Adrian

[1] unless I'm misunderstanding [2], page 9, that's the highest state
    my processor supports
[2] http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24309.pdf

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 18:39                               ` Adrian Bunk
@ 2007-04-25 18:50                                 ` Linus Torvalds
  2007-04-25 19:02                                   ` Hua Zhong
                                                     ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 18:50 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven



On Wed, 25 Apr 2007, Adrian Bunk wrote:
> 
> 3W for the complete system? In CPU state S1? [1]

In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) 
the devices are off, but the motherboard and memory are powered.

> And even 3W would still be a waste of energy.

.. but if the alternative is a feature that just isn't worth it, and 
likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
I believe STD is both of those. There's a reason it's called "STD". Go 
to google and type "STD" and press "I'm feeling lucky". Google is God).

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 17:34                             ` Pavel Machek
  2007-04-25 18:39                               ` Adrian Bunk
@ 2007-04-25 18:52                               ` Alon Bar-Lev
  2007-04-25 22:11                               ` Kenneth Crudup
  2 siblings, 0 replies; 712+ messages in thread
From: Alon Bar-Lev @ 2007-04-25 18:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Adrian Bunk, Linus Torvalds, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote:
> Ok ok ok, suspend-to-disk has some other uses, too.
>
> But ... you are really using suspend-to-disk as a workaround for "my
> desktop takes too much power when idle". Imagine pressing "lock
> screensaver" combination, and your machine going to low power mode
> (3W?), immediately. (Quiet, too; you can't generate much noise for
> 3W). In the morning, you'd just press any key, machine would power up,
> immediately... ok, you'd have to ifconfig eth0 down, so that spurious
> packets on the local net would wake your machine, with all its fans
> etc.
>                                                                 Pavel

You are assuming that:
1. You have battery backup, or external power never fails.
2. You don't disconnect the filesystem from the device.
3. The security level of a turned-on device equals that of a turned-off one.
4. You turn on the same device that was turned off.
5. You do not wish to boot another OS on this machine.

None of the above is always true... but why assume?
Just make this work... If Nigel wishes to maintain this, please let him;
you can be in charge of the s2ram.

Best Regards,
Alon Bar-Lev.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* RE: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 18:50                                 ` Linus Torvalds
@ 2007-04-25 19:02                                   ` Hua Zhong
  2007-04-25 19:25                                   ` Adrian Bunk
  2007-04-25 23:33                                   ` Olivier Galibert
  2 siblings, 0 replies; 712+ messages in thread
From: Hua Zhong @ 2007-04-25 19:02 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Adrian Bunk'
  Cc: 'Pavel Machek', 'Ingo Molnar',
	'Nigel Cunningham', 'Christian Hesse',
	'Nick Piggin', 'Mike Galbraith',
	linux-kernel, 'Con Kolivas',
	suspend2-devel, 'Andrew Morton',
	'Thomas Gleixner', 'Arjan van de Ven'

> In STR, 3W is quite realistic. The CPU is off, all (or most - up to you)
> the devices are off, but the motherboard and memory is powered.
> 
> > And even 3W would still be a waste of energy.
> 
> .. but if the alternative is a feature that just isn't worth it, and
> likely to not only have its own bugs, but cause bugs elsewhere? (And
> yes,
> I believe STD is both of those. There's a reason it's called "STD". Go
> to google and type "STD" and press "I'm feeling lucky". Google is God).

Linus, the fact that you personally don't use S2D does not mean it's not
useful for other people. I've been using solely a laptop for six years and
until recently (my commute is now only two miles) I'd always used
hibernate when going to or leaving from work. And even now if I take my
laptop somewhere farther away (like on a vacation) I need hibernation.

This is one area where Windows has been doing great for many years, and it's
not like Linux has not had a mature implementation for many years either. So
don't you think your comments are a bit odd at this point?

Hua



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge
  2007-04-25  9:02                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti
@ 2007-04-25 19:16                             ` Martin Steigerwald
  0 siblings, 0 replies; 712+ messages in thread
From: Martin Steigerwald @ 2007-04-25 19:16 UTC (permalink / raw)
  To: suspend2-devel
  Cc: Romano Giannetti, Pavel Machek, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar,
	Linus Torvalds, Andrew Morton, Arjan van de Ven

On Wednesday 25 April 2007, Romano Giannetti wrote:
> On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> > suspend-to-disk is a workaround for
> >
> >         'suspend-to-ram eats too much power' (plus some details like
> >         being able to replace battery).
>
> ...and let me add 'suspend-to-disk' is a workaround for when s2ram does
> not work for a gazillion interacting reasons (ACPI, vga bios, drm/dri,
> you name it).

Hello Romano,

for me not only. I usually do not put the batteries into my laptops if not 
needed cause I read and experienced that I can extend battery life by 
that while making sure they are always charged more than 50%. Suspend to 
RAM thus wouldn't work at all for me.

I have used suspend2 since 2.6.14 because I never managed to get userspace
software suspend working on my ThinkPad T23 and T42, not even with
standard Debian kernel packages. I didn't try the latest ones, however;
AFAIR my last try was with a 2.6.18 package. I also use it because it is
faster, and the machine is more responsive after resuming than with swsusp,
which I used up to kernel 2.6.13.

With the 1.5 GB of RAM on my T42, suspending to disk with suspend2 takes
quite some time, and so does resuming, but I didn't optimize it, so it
writes out almost all of that memory:

martin@shambala:~> free -m
             total       used       free     shared    buffers     cached
Mem:          1518       1219        298          0          0        831
-/+ buffers/cache:        388       1130
Swap:         1908          0       1908

Probably should limit that.

I would like to see suspend2 merged! It's proven technology and just
works, while I couldn't get userspace software suspend to work for me.
Maybe I made a mistake while setting it up, but I think the initial setup
shouldn't be as complicated as I found it to be.

I use suspend2 for our workstations at work, too, and my workstation has an
uptime of more than 43 days with more than 17 successful suspend and
resume cycles.

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 18:50                                 ` Linus Torvalds
  2007-04-25 19:02                                   ` Hua Zhong
@ 2007-04-25 19:25                                   ` Adrian Bunk
  2007-04-25 19:38                                     ` Linus Torvalds
                                                       ` (4 more replies)
  2007-04-25 23:33                                   ` Olivier Galibert
  2 siblings, 5 replies; 712+ messages in thread
From: Adrian Bunk @ 2007-04-25 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2007, Adrian Bunk wrote:
> > 
> > 3W for the complete system? In CPU state S1? [1]
> 
> In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) 
> the devices are off, but the motherboard and memory is powered.

As far as I understand it, the CPU isn't off in S1.

> > And even 3W would still be a waste of energy.
> 
> .. but if the alternative is a feature that just isn't worth it, and 
> likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> I believe STD is both of those. There's a reason it's called "STD". Go 
> to google and type "STD" and press "I'm feeling lucky". Google is God).

Is there really no use case for STD?

No worries if power is completely lost.
Some people might boot Windows between suspending and resuming.
...

> 		Linus

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:29                           ` Pavel Machek
                                               ` (2 preceding siblings ...)
  2007-04-25 13:07                             ` Federico Heinz
@ 2007-04-25 19:38                             ` Kenneth Crudup
  3 siblings, 0 replies; 712+ messages in thread
From: Kenneth Crudup @ 2007-04-25 19:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven


On Wed, 25 Apr 2007, Pavel Machek wrote:

> You'll miss compression part, but that provides only small speedup.

Not here:

----
fgrep -h Compressed /var/log/rawlog*
Apr 22 13:41:34 vaio kernel:   Compressed 85655552 bytes into 46779248 (45 percent compression).
Apr 22 16:09:13 vaio kernel:   Compressed 1380552704 bytes into 435656971 (68 percent compression).
Apr 22 17:06:11 vaio kernel:   Compressed 1488437248 bytes into 437400026 (70 percent compression).
Apr 22 22:55:41 vaio kernel:   Compressed 1875677184 bytes into 623450953 (66 percent compression).
Apr 23 12:30:33 vaio kernel:   Compressed 1731796992 bytes into 528194347 (69 percent compression).
Apr 23 18:13:32 vaio kernel:   Compressed 1883869184 bytes into 691016832 (63 percent compression).
Apr 24 11:55:07 vaio kernel:   Compressed 1795903488 bytes into 703370960 (60 percent compression).
<snip>
----
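
For what it's worth, the percentages in these log lines are consistent with
"bytes saved", truncated to a whole number. The following is only a small
sketch to recompute them and to show how much of the image still has to be
written; the helper names (parse_line, saved_percent, write_fraction) are made
up for illustration, and only the log format is taken from the lines above.

```python
import re

# Pattern for the "Compressed A bytes into B (N percent compression)." lines above.
LINE_RE = re.compile(
    r"Compressed (\d+) bytes into (\d+) \((\d+) percent compression\)"
)


def parse_line(line):
    """Return (original, compressed, reported_percent), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def saved_percent(original, compressed):
    """Share of bytes saved, as a whole percentage (truncated, not rounded)."""
    return (original - compressed) * 100 // original


def write_fraction(original, compressed):
    """Fraction of the original image size that still has to be written."""
    return compressed / original


if __name__ == "__main__":
    sample = ("Apr 22 16:09:13 vaio kernel:   Compressed 1380552704 bytes "
              "into 435656971 (68 percent compression).")
    original, compressed, reported = parse_line(sample)
    print("recomputed:", saved_percent(original, compressed), "reported:", reported)
    print("fraction written:", round(write_fraction(original, compressed), 2))
```

On the larger images roughly a third of the data ends up on disk, so whether
that is a "small speedup" presumably depends on whether the disk or the CPU is
the bottleneck on a given machine.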

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:25                                   ` Adrian Bunk
@ 2007-04-25 19:38                                     ` Linus Torvalds
  2007-04-25 20:08                                       ` Pavel Machek
                                                         ` (3 more replies)
  2007-04-25 19:41                                     ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Andrew Morton
                                                       ` (3 subsequent siblings)
  4 siblings, 4 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 19:38 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven



On Wed, 25 Apr 2007, Adrian Bunk wrote:
> > 
> > .. but if the alternative is a feature that just isn't worth it, and 
> > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > I believe STD is both of those. There's a reason it's called "STD". Go 
> > to google and type "STD" and press "I'm feeling lucky". Google is God).
> 
> Is there really no use case for STD?

People seem to have reading comprehension problems.

The STD code is buggy, and has introduced bugs in STR too, largely thanks 
to bad design. Some of them have happily gotten fixed. Others did not, and 
now we have three totally different versions (two of which share some 
infrastructure), all of which are broken (ie the "suspend2" people will 
swear up-and-down that swsusp doesn't work for them, but anybody who 
thinks that "suspend2" will work for everybody is just being a total 
idiot, and I have a bridge to sell to them).

I'd actually be happier *removing* STD support in the sense it is now: 
it's way too closely integrated with STR, even though it has absolutely 
nothing in common with it. When you STD, you're actually much closer to a 
*shutdown* than to STR, yet the STD code continually seems to want to be 
in the "suspend" path, as shown even by its name.

So my objections to STD have nothing to do with saving state and shutting 
down. They have everything to do with the fact that it is not - and will 
never be - a "suspend", and it shouldn't affect suspend.

And that's a *fundamental* problem. If the STD people cannot even realize 
that they have less to do with "suspend" than to "reboot", how do you ever 
expect them to get anything to work, and not affect other things 
negatively?

Yeah, I'm down on it. I'm down on it because every person involved with 
the whole STD thing seems to have basically zero taste, and a total 
inability to work with anybody else. 

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:25                                   ` Adrian Bunk
  2007-04-25 19:38                                     ` Linus Torvalds
@ 2007-04-25 19:41                                     ` Andrew Morton
  2007-04-25 19:55                                     ` Pavel Machek
                                                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 712+ messages in thread
From: Andrew Morton @ 2007-04-25 19:41 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Pavel Machek, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Thomas Gleixner, Arjan van de Ven

On Wed, 25 Apr 2007 21:25:12 +0200 Adrian Bunk <bunk@stusta.de> wrote:

> > > And even 3W would still be a waste of energy.
> > 
> > .. but if the alternative is a feature that just isn't worth it, and 
> > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > I believe STD is both of those. There's a reason it's called "STD". Go 
> > to google and type "STD" and press "I'm feeling lucky". Google is God).
> 
> Is there really no use case for STD?

I use it all the time.  The batteries only seem to last a day or so in STR.

Plus one is supposed to power off all electrical equipment during takeoff
and landing.
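
The battery point can be checked with back-of-the-envelope arithmetic. This is
only a sketch: the 3 W draw is the figure quoted earlier in the thread, while
the battery capacities below are assumptions picked for illustration, not
measurements of any particular laptop.

```python
def standby_hours(capacity_wh, draw_w):
    """Hours a battery of capacity_wh watt-hours lasts at a constant draw_w watts."""
    if draw_w <= 0:
        raise ValueError("draw_w must be positive")
    return capacity_wh / draw_w


if __name__ == "__main__":
    # Hypothetical pack sizes; 3.0 W is the STR figure discussed in the thread.
    for capacity_wh in (50.0, 72.0):
        hours = standby_hours(capacity_wh, 3.0)
        print(f"{capacity_wh:.0f} Wh at 3 W: {hours:.1f} h")
```

Under these assumptions a pack would need to hold about 72 Wh to last a full
24 hours at 3 W, so "a day or so" is in the same ballpark as that figure.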

> No worries if power is completely lost.

To change batteries.

> Some people might boot Windows between suspending and resuming.

I use that often too.  (But I won't when I get around to upgrading the X
driver to get the VaioOfDeath's external video output working under
Linux)

I don't think I need a fancy splash screen tho.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:23                         ` Pavel Machek
                                             ` (2 preceding siblings ...)
  2007-04-25 15:18                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk
@ 2007-04-25 19:43                           ` Kenneth Crudup
  2007-04-25 20:08                             ` Linus Torvalds
  2007-05-26 17:37                           ` Martin Steigerwald
  4 siblings, 1 reply; 712+ messages in thread
From: Kenneth Crudup @ 2007-04-25 19:43 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven


On Wed, 25 Apr 2007, Pavel Machek wrote:

> I'm starting to think that we should fix the idle power consumption
> problem. Cell phones do it right. They pretend to be ready/idle all
> the time, yet they have _days_ of standby.

My laptop goes nearly everywhere I do; I DO NOT want it on when I'm
travelling around to clients or between home and office or on a plane,
and I lose a lot of productivity the times I have to restart from a
cold boot as when I'm working I tend to have up ~10 xterms and while
my browsers have "restart", that's not infallible.

Any working suspend-to-disk method takes care of that for me.  (I'm
really not sure why Linus hates S2D so much, though. Back in the day
there was a lot more BIOS support, but that's been years now.)

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:25                                   ` Adrian Bunk
  2007-04-25 19:38                                     ` Linus Torvalds
  2007-04-25 19:41                                     ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Andrew Morton
@ 2007-04-25 19:55                                     ` Pavel Machek
  2007-04-25 22:13                                     ` Kenneth Crudup
  2007-04-26  1:25                                     ` Antonino A. Daplas
  4 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 19:55 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > > 3W for the complete system? In CPU state S1? [1]
> > 
> > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) 
> > the devices are off, but the motherboard and memory is powered.
> 
> As far as I understand it, the CPU isn't off in S1.
> 
> > > And even 3W would still be a waste of energy.
> > 
> > .. but if the alternative is a feature that just isn't worth it, and 
> > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > I believe STD is both of those. There's a reason it's called "STD". Go 
> > to google and type "STD" and press "I'm feeling lucky". Google is God).
> 
> Is there really no use case for STD?

Of course there are use cases for STD... lots of them... that's why
I'm maintaining it.

It has some "interesting" use cases, like suspend on one machine,
transfer image to identical one, resume there, dual-boot to windows;
there are "normal" use cases, like machines not capable of S3.

I hope we are not dropping STD just now...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:43                           ` Kenneth Crudup
@ 2007-04-25 20:08                             ` Linus Torvalds
  2007-04-25 20:27                               ` Pavel Machek
  2007-04-26  0:41                               ` Thomas Orgis
  0 siblings, 2 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 20:08 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Kenneth Crudup wrote:
> 
> Any working suspend-to-disk method takes care of that for me.  (I'm
> really not sure why Linus hates S2D so much, though. Back in the day
> there was a lot more BIOS support, but that's been years now.)

The really sad part is that APM actually did this better.. 

It's not often you can say that, and APM didn't do diddly-squat for 
run-time power management, but when it comes to suspend-to-disk, APM 
actually did ok.

I think you could do STD right too, but:

 - if you think it's about suspending devices, you are immediately 
   disqualified. If you call the device driver "suspend" or "resume" 
   functions, you are doing something wrong.

 - "suspend" is: snapshot memory, and anything you do *after* the snapshot 
   is totally irrelevant. You MUST NOT suspend devices before, since 
   devices are what that snapshot should be written out to, and you MUST 
   NOT suspend devices afterwards either, because that shows that you are 
   a moron who didn't understand the "machine will be turned off" part.

 - "resume" is basically: get image into memory, turn *off* every device, 
   put image into its proper location, and call the "startup" function. If 
   you call a device "resume()" function, you again show that you are a 
   moron, because you're not resuming anything at all, you're resetting 
   the device from scratch. You _reinitialize_ the device. You don't 
   resume it, and somebody may have (and indeed, *WILL HAVE*) used the 
   device in between. There should be absolutely zero shared code, and 
   the *last* thing you should do is to call the device with the same 
   function, and give it a flag to tell it to do one thing or the other.

The important thing to take away from this is that it has nothing to do 
with "suspend" or "resume" at any level what-so-ever. Not at a device 
level, not at a system level, and not even when it comes to hardware. But 
for completely idiotic and wrong reasons, it is currently intimately 
involved in suspend/resume, and calls the same device management functions 
as a suspend/resume thing does, and shares a lot of the code.
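
The ordering argued for in the list above can be written down as a toy model.
This is a sketch of the position stated in this message only, not of any
existing kernel code: the Device class, its method names and the log strings
are all hypothetical. The point it illustrates is that the snapshot/restore
path never calls suspend() or resume(), and uses its own entry points instead
of a shared function with a flag.

```python
class Device:
    """Toy device; the method names are illustrative, not kernel callbacks."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    # Suspend-to-RAM path: the device keeps its context and is resumed later.
    def suspend(self):
        self.log.append(f"{self.name}: suspend")

    def resume(self):
        self.log.append(f"{self.name}: resume")

    # Snapshot/restore path: separate entry points, no shared flag argument.
    def power_off(self):
        self.log.append(f"{self.name}: power_off")

    def reinitialize(self):
        self.log.append(f"{self.name}: reinitialize")


def snapshot_and_shutdown(devices, log):
    """Snapshot memory, write it out, turn the machine off."""
    log.append("snapshot memory")
    # Devices stay up here: they are what the image gets written out to.
    log.append("write image")
    log.append("machine off")


def restore(devices, log):
    """Load the image, turn every device off, then bring them up from scratch."""
    log.append("load image")
    for dev in devices:
        dev.power_off()
    log.append("place image")
    for dev in devices:
        dev.reinitialize()


if __name__ == "__main__":
    log = []
    devices = [Device("disk", log), Device("nic", log)]
    snapshot_and_shutdown(devices, log)
    restore(devices, log)
    print("\n".join(log))
```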

And THAT is why I hate the kernel STD. It is fundamentally confused. In 
ways that APM was not, I'd like to point out.

I'd love to get it fixed. But the first fix is to not call it "suspend", 
because language *does* matter, and using that term is what I'm convinced 
has confused so many people.

If it had been called "snapshot + restore", I suspect a lot of people 
wouldn't have been so confused about what it does and how it needs to do 
it, and wouldn't have tried to shoehorn it into the same corner of the 
kernel as "suspend-to-ram" (where you really *can* do things like 
"suspend" devices, and while they might certainly lose power in between, 
they also really might not, and they certainly won't be *doing* things in 
between).

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:38                                     ` Linus Torvalds
@ 2007-04-25 20:08                                       ` Pavel Machek
  2007-04-25 20:33                                         ` Rafael J. Wysocki
  2007-04-25 22:36                                         ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Manu Abraham
  2007-04-25 20:20                                       ` Rafael J. Wysocki
                                                         ` (2 subsequent siblings)
  3 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 20:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > > .. but if the alternative is a feature that just isn't worth it, and 
> > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > > I believe STD is both of those. There's a reason it's called "STD". Go 
> > > to google and type "STD" and press "I'm feeling lucky". Google is God).
> > 
> > Is there really no use case for STD?
> 
> People seem to have reading comprehension problems.
> 
> The STD code is buggy, and has introduced bugs in STR too, largely thanks 
> to bad design. Some of them have happily gotten fixed. Others did not, and 
> now we have three totally different versions (two of which share some 
> infrastructure), all of which are broken (ie the "suspend2" people will 
> swear up-and-down that swsusp doesn't work for them, but anybody who 
> thinks that "suspend2" will work for everybody is just being a total 
> idiot, and I have a bridge to sell to them).

Well, let's give some credit to STD... it worked before STR, and it
allowed debugging basic driver infrastructure. 

> So my objections to STD have nothing to do with saving state and shutting 
> down. They have everything to do with the fact that it is not - and will 
> never be - a "suspend", and it shouldn't affect suspend.

STD needs to snapshot system, and then it needs devices to be
suspended so that snapshot is consistent.

> And that's a *fundamental* problem. If the STD people cannot even realize 
> that they have less to do with "suspend" than to "reboot", how do you ever 
> expect them to get anything to work, and not affect other things 
> negatively?

STD worked first ;-). Yes, these days it has little to do with
"suspend", it was mostly separated to "snapshot" and "restore".

We still keep swsusp in kernel for compatibility (and because it makes
debugging very easy).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:38                                     ` Linus Torvalds
  2007-04-25 20:08                                       ` Pavel Machek
@ 2007-04-25 20:20                                       ` Rafael J. Wysocki
  2007-04-25 20:24                                         ` Linus Torvalds
  2007-04-25 20:23                                       ` Adrian Bunk
  2007-04-27 12:36                                       ` suspend2 merge Martin Steigerwald
  3 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-25 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On Wednesday, 25 April 2007 21:38, Linus Torvalds wrote:
> 
> On Wed, 25 Apr 2007, Adrian Bunk wrote:

Well, I told Pavel that I wouldn't take part in this thread, but since you're
making some rude and unfounded personal remarks, I feel I have to speak.

[--snip--]
> And that's a *fundamental* problem. If the STD people cannot even realize 
> that they have less to do with "suspend" than to "reboot", how do you ever 
> expect them to get anything to work, and not affect other things 
> negatively?

That's not true.

> Yeah, I'm down on it. I'm down on it because every person involved with 
> the whole STD thing seems to have basically zero taste, and a total 
> inability to work with anybody else.

Please ask anyone who's worked with me if he's had any problem with that.
If anyone says I'm unable to work with anybody else, I'd say you're right.  Till
then, I feel offended.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:38                                     ` Linus Torvalds
  2007-04-25 20:08                                       ` Pavel Machek
  2007-04-25 20:20                                       ` Rafael J. Wysocki
@ 2007-04-25 20:23                                       ` Adrian Bunk
  2007-04-25 22:19                                         ` Kenneth Crudup
  2007-04-27 12:36                                       ` suspend2 merge Martin Steigerwald
  3 siblings, 1 reply; 712+ messages in thread
From: Adrian Bunk @ 2007-04-25 20:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, Apr 25, 2007 at 12:38:47PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2007, Adrian Bunk wrote:
> > > 
> > > .. but if the alternative is a feature that just isn't worth it, and 
> > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > > I believe STD is both of those. There's a reason it's called "STD". Go 
> > > to google and type "STD" and press "I'm feeling lucky". Google is God).
> > 
> > Is there really no use case for STD?
>...
> I'd actually be happier *removing* STD support in the sense it is now: 
> it's way too closely integrated with STR, even though it has absolutely 
> nothing in common with it. When you STD, you're actually much closer to a 
> *shutdown* than to STR, yet the STD code continually seems to want to be 
> in the "suspend" path, as shown even by its name.
> 
> So my objections to STD have nothing to do with saving state and shutting 
> down. They have everything to do with the fact that it is not - and will 
> never be - a "suspend", and it shouldn't affect suspend.
>...

There are two completely different points:
- I say that the feature STD has use cases where STR is not a replacement
- you say you dislike the current implementation of STD

For me it would be a serious regression if STD were removed without any 
replacement.

If someone replaced the STD implementation with what you want it to 
be, I wouldn't care and you would be happy.

> 		Linus

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:20                                       ` Rafael J. Wysocki
@ 2007-04-25 20:24                                         ` Linus Torvalds
  2007-04-25 21:30                                           ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 20:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven



On Wed, 25 Apr 2007, Rafael J. Wysocki wrote:
> 
> Please ask anyone who's worked with me if he's had any problem with that.
> If anyone says I'm unable to work with anybody else, I'd say you're right.  Till
> then, I feel offended.

I'll apologise (and virtually kiss your hairy feet) if you could actually 
show me a single implementation that people can agree on.

But until then, I claim that the suspend-to-disk people cannot work with 
each other.

And no, "three different implementations" doesn't cut it. Even _two_ is 
too much. We need to get *rid* of something, not add more.

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:08                             ` Linus Torvalds
@ 2007-04-25 20:27                               ` Pavel Machek
  2007-04-25 20:44                                 ` Linus Torvalds
  2007-04-26  0:41                               ` Thomas Orgis
  1 sibling, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

Hi!

_Can we get a suspend-to-RAM maintainer_?

No one cares about s2ram these days. I do care a little, seife
maintains whitelist, you care for mac mini, Len/Andrew/Intel acpi team
helps sometimes... But I feel we should have someone listed in the
MAINTAINERS file. Patrick was close, but... 

> > Any working suspend-to-disk method takes care of that for me.  (I'm
> > really not sure why Linus hates S2D so much, though. Back in the day
> > there was a lot more BIOS support, but that's been years now.)
> 
> The really sad part is that APM actually did this better.. 

I agree that APM STR worked better than current ACPI STR. I think
swsusp already works better than APM STD, but...

> I think you could do STD right too, but:
> 
>  - if you think it's about suspending devices, you are immediately 
>    disqualified. If you call the device driver "suspend" or "resume" 
>    functions, you are doing something wrong.
> 
>  - "suspend" is: snapshot memory, and anything you do *after* the snapshot 
>    is totally irrelevant. You MUST NOT suspend devices before, since 
>    devices are what that snapshot should be written out to, and you MUST 
>    NOT suspend devices afterwards either, because that shows that you are 
>    a moron who didn't understand the "machine will be turned off" part.

Can I get you on IRC somewhere? No, I do not think I'm a moron, and
yes, I need to suspend^Wsnapshot the devices before, so I have that in
the snapshot. Of course, I'll need to resume^Wrestore the devices
before writing snapshot. That's okay, it does not take long.

>  - "resume" is basically: get image into memory, turn *off* every device, 

Exactly. I need to turn devices *off* before restoring image, and I
need them *off* before saving image, too -- DMAs are dangerous.

I currently do that using "suspend" and "resume" hooks, because they
turn DMAs / IRQs off as a side effect.
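
Why DMA is dangerous during the copy can be shown with a toy model. This is
only a sketch with made-up names (copy_pages, make_dma_writer): memory is a
list of "pages", the copy goes page by page, and a hypothetical device updates
two related pages in one transfer while the copy is in progress.

```python
def copy_pages(memory, dma_hook=None):
    """Copy memory page by page; dma_hook(i) runs right after page i was copied."""
    snapshot = []
    for index, page in enumerate(memory):
        snapshot.append(page)
        if dma_hook is not None:
            dma_hook(index)
    return snapshot


def make_dma_writer(memory):
    """Hypothetical device that rewrites two related pages in a single transfer."""
    def dma(index):
        if index == 0:  # fires after page 0 was copied, before page 1 is read
            memory[0] = "header v2"
            memory[1] = "payload v2"
    return dma


if __name__ == "__main__":
    # Device still active: the snapshot mixes old and new data.
    memory = ["header v1", "payload v1"]
    print(copy_pages(memory, make_dma_writer(memory)))

    # Device quiesced first: the snapshot matches one point in time.
    memory = ["header v1", "payload v1"]
    print(copy_pages(memory))
```

With the device active, the image pairs the old header with the new payload,
a state that never existed in memory; with the device quiesced beforehand the
image is consistent.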

>    put image into its proper location, and call the "startup" function. If 
>    you call a device "resume()" function, you again show that you are a 
>    moron, because you're not resuming anything at all, you're resetting 
>    the device from scratch. You _reinitialize_ the device. You don't 
>    resume it, and somebody may have (and indeed, *WILL HAVE*) used the 
>    device in between. There should be absolutely zero shared code, and 
>    the *last* thing you should do is to call the device with the same 
>    function, and give it a flag to tell it to do one thing or the other.

Well, the "startup" function, or whatever you want to call it, has to deal
with an uninitialized device (s2disk with the driver as a module, s2ram with
hw powered off) and with an initialized device (s2disk with the driver built
into the kernel, s2ram where the device stayed powered). 

I fear that "resume"/"restore" functions need to be pretty robust,
anyway... 

> And THAT is why I hate the kernel STD. It is fundamentally confused. In 
> ways that APM was not, I'd like to point out.

Ok, yes, it is confused/confusing.

> I'd love to get it fixed. But the first fix is to not call it "suspend", 
> because language *does* matter, and using that term is what I'm convinced 
> has confused so many people.

> If it had been called "snapshot + restore", I suspect a lot of
> people 

snapshot/restore sounds okay to me.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:33                                         ` Rafael J. Wysocki
@ 2007-04-25 20:31                                           ` Pavel Machek
  2007-04-27 10:21                                             ` driver power operations (was Re: suspend2 merge) Johannes Berg
  2007-04-27 10:21                                             ` Johannes Berg
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 20:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Hi!

> > > > > .. but if the alternative is a feature that just isn't worth it, and 
> > > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > > > > I believe STD is both of those. There's a reason it's called "STD". Go 
> > > > > to google and type "STD" and press "I'm feeling lucky". Google is God).
> > > > 
> > > > Is there really no use case for STD?
> 
> [--snip--]
> > > So my objections to STD have nothing to do with saving state and shutting 
> > > down. They have everything to do with the fact that it is not - and will 
> > > never be - a "suspend", and it shouldn't affect suspend.
> > 
> > STD needs to snapshot system, and then it needs devices to be
> > suspended so that snapshot is consistent.
> 
> Not suspended.  _Frozen_.  In fact we don't want any DMA transfers or interrupts
> to take place when we're creating the image.  That's all and that's what we're
> doing (or rather, trying to do).

Yep, _frozen_. That's the right word.

> So, the names "suspend" and "resume" for the functions being called for that are
> wrong, but then we call them with PMSG_FREEZE. ;-)  Still, we could add
> .freeze() and .thaw() callbacks for hibernation just fine.  This wouldn't even
> be that difficult ...

It would be a big, ugly patch, I'm afraid.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:08                                       ` Pavel Machek
@ 2007-04-25 20:33                                         ` Rafael J. Wysocki
  2007-04-25 20:31                                           ` Pavel Machek
  2007-04-25 22:36                                         ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Manu Abraham
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-25 20:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On Wednesday, 25 April 2007 22:08, Pavel Machek wrote:
> Hi!
> 
> > > > .. but if the alternative is a feature that just isn't worth it, and 
> > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > > > I believe STD is both of those. There's a reason it's called "STD". Go 
> > > > to google and type "STD" and press "I'm feeling lucky". Google is God).
> > > 
> > > Is there really no use case for STD?

[--snip--]
> > So my objections to STD have nothing to do with saving state and shutting 
> > down. They have everything to do with the fact that it is not - and will 
> > never be - a "suspend", and it shouldn't affect suspend.
> 
> STD needs to snapshot system, and then it needs devices to be
> suspended so that snapshot is consistent.

Not suspended.  _Frozen_.  In fact we don't want any DMA transfers or interrupts
to take place when we're creating the image.  That's all and that's what we're
doing (or rather, trying to do).

So, the "suspend" and "resume" for the functions being called for that are
wrong, but then we call them with PMSG_FREEZE. ;-)  Still, we could add
.freeze() and .thaw() callbacks for hibernation just fine.  This wouldn't even
be that difficult ...

Greetings,
Rafael

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:27                               ` Pavel Machek
@ 2007-04-25 20:44                                 ` Linus Torvalds
  2007-04-25 21:07                                   ` Rafael J. Wysocki
  2007-04-25 21:44                                   ` Pavel Machek
  0 siblings, 2 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 20:44 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Pavel Machek wrote:
> 
> Can I get you on IRC somewhere? No, I do not think I'm a moron, and
> yes, I need to suspend^Wsnapshot the devices before, so I have that in
> the snapshot. Of course, I'll need to resume^Wrestore the devices
> before writing snapshot. That's okay, it does not take long.

You do NOT need to "suspend" the devices, and that's the whole point.

You may want to save the device info somewhere, BUT THAT IS SOMETHING 
TOTALLY DIFFERENT!

This is *exactly* the confusion I'm talking about. The STD and STR 
codepaths try to use the same function for two TOTALLY DIFFERENT things.

STR actually wants to "suspend".

STD actually wants to "atomic snapshot", and it must not allow allocations 
or anything like that, because the whole snapshot image should be done 
atomically as one event. But it should *not* suspend, because that device 
may actually be needed afterwards. 

So not the same thing at all.

So here's what "suspend()" wants:
 - suspend() - preparatory work, can error out, can delay, can park the 
   disk, etc etc.
 - suspend_late() - called late, with interrupts disabled, should actually
   suspend if the early suspend didn't do it already

And here is what "snapshot()" wants:
 - prepare_to_snapshot() (for memory allocation)
 - snapshot() - called late, with interrupts disabled, save state.

and there is absolutely _zero_ overlap between them. There just isn't 
anything in common. Yes, both are two-phase (for the simple reason that 
both want an "atomic" part), but there's really no real overlap.

Just trying to *make* them be the same operations is just going to 
introduce flags that then cause them to be totally different *and* 
confusing and generate bugs. It also means that people do one of them, and 
"it works" for that case, and the other case is totally broken, but it's 
not obvious, because doing one means that the system _thinks_ that you did 
both!

In the very unlikely case that some driver actually *wants* to use the 
same function for snapshots and suspending, that driver could just go 
ahead and _use_ the same function pointer. But now, as things are set up, 
we force a total confusion on drivers by calling them through the same 
interface for two totally different things.

			Linus
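
[Editorial sketch, not part of the original message.] The split described
above can be pictured as two unrelated callback tables, one per operation.
This is a minimal user-space C sketch under stated assumptions, not kernel
code: struct demo_device, struct demo_suspend_ops, struct demo_snapshot_ops
and every demo_* function are hypothetical names invented for illustration.
Only the four callback names (suspend, suspend_late, prepare_to_snapshot,
snapshot) come from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device model, for illustration only. */
struct demo_device {
	const char *name;
	int powered;       /* 1 while the device is powered up */
	int buffer_ready;  /* 1 once prepare_to_snapshot() reserved space */
	int state_saved;   /* 1 once snapshot() recorded the device state */
};

/* Suspend-to-RAM path: the device really goes down. */
struct demo_suspend_ops {
	int (*suspend)(struct demo_device *dev);       /* may fail, may delay */
	int (*suspend_late)(struct demo_device *dev);  /* interrupts disabled */
};

/* Snapshot path: the device stays usable afterwards. */
struct demo_snapshot_ops {
	int (*prepare_to_snapshot)(struct demo_device *dev); /* may allocate */
	int (*snapshot)(struct demo_device *dev);            /* interrupts disabled */
};

static int demo_suspend(struct demo_device *dev)
{
	(void)dev;	/* preparatory work would go here; nothing to model */
	return 0;
}

static int demo_suspend_late(struct demo_device *dev)
{
	dev->powered = 0;	/* actually power the device down */
	return 0;
}

static int demo_prepare_to_snapshot(struct demo_device *dev)
{
	dev->buffer_ready = 1;	/* all allocation happens in this phase */
	return 0;
}

static int demo_snapshot(struct demo_device *dev)
{
	if (!dev->buffer_ready)
		return -1;	/* no allocation allowed in the atomic phase */
	dev->state_saved = 1;	/* save state, leave the device powered */
	return 0;
}

static const struct demo_suspend_ops demo_disk_suspend_ops = {
	.suspend = demo_suspend,
	.suspend_late = demo_suspend_late,
};

static const struct demo_snapshot_ops demo_disk_snapshot_ops = {
	.prepare_to_snapshot = demo_prepare_to_snapshot,
	.snapshot = demo_snapshot,
};

/* Run both suspend phases in order, stopping at the first error. */
static int demo_run_suspend(struct demo_device *dev,
			    const struct demo_suspend_ops *ops)
{
	int err = ops->suspend(dev);

	return err ? err : ops->suspend_late(dev);
}

/* Run both snapshot phases in order, stopping at the first error. */
static int demo_run_snapshot(struct demo_device *dev,
			     const struct demo_snapshot_ops *ops)
{
	int err = ops->prepare_to_snapshot(dev);

	return err ? err : ops->snapshot(dev);
}
```

In this sketch, running demo_run_snapshot() leaves the device powered with
its state saved, while demo_run_suspend() powers it down. A driver that
genuinely wants one function for both cases could point callbacks in both
tables at it.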

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:44                                 ` Linus Torvalds
@ 2007-04-25 21:07                                   ` Rafael J. Wysocki
  2007-04-25 21:44                                   ` Pavel Machek
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-25 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

On Wednesday, 25 April 2007 22:44, Linus Torvalds wrote:
> 
> On Wed, 25 Apr 2007, Pavel Machek wrote:
> > 
> > Can I get you on IRC somewhere? No, I do not think I'm a moron, and
> > yes, I need to suspend^Wsnapshot the devices before, so I have that in
> > the snapshot. Of course, I'll need to resume^Wrestore the devices
> > before writing snapshot. That's okay, it does not take long.
> 
> You do NOT need to "suspend" the devices, and that's the whole point.
> 
> You may want to save the device info somewhere, BUT THAT IS SOMETHING 
> TOTALLY DIFFERENT!
> 
> This is *exactly* the confusion I'm talking about. The STD and STR 
> codepaths try to use the same function for two TOTALLY DIFFERENT things.
> 
> STR actually wants to "suspend".
> 
> STD actually wants to "atomic snapshot", and it must not allow allocations 
> or anything like that, because the whole snapshot image should be done 
> atomically as one event. But it should *not* suspend, because that device 
> may actually be needed afterwards. 
> 
> So not the same thing at all.
> 
> So here's what "suspend()" wants:
>  - suspend() - preparatory work, can error out, can delay, can park the 
>    disk, etc etc.
>  - suspend_late() - called late, with interrupts disabled, should actually
>    suspend if the early suspend didn't do it already
> 
> And here is what "snapshot()" wants:
>  - prepare_to_snapshot() (for memory allocation)
>  - snapshot() - called late, with interrupts disabled, save state.
> 
> and there is absolutely _zero_ overlap between them. There just isn't 
> anything in common. Yes, both are two-phase (for the simple reason that 
> both want an "atomic" part), but there's really no real overlap.
> 
> Just trying to *make* them be the same operations is just going to 
> introduce flags that then cause them to be totally different *and* 
> confusing and generate bugs. It also means that people do one of them, and 
> "it works" for that case, and the other case is totally broken, but it's 
> not obvious, because doing one means that the system _thinks_ that you did 
> both!
> 
> In the very unlikely case that some driver actually *wants* to use the 
> same function for snapshots and suspending, that driver could just go 
> ahead and _use_ the same function pointer. But now, as things are set up, 
> we force a total confusion on drivers by calling them through the same 
> interface for two totally different things.

I agree, except there are surprisingly many drivers like that.

You're right, we should be doing all of it in a different way, but this means
a lot of changes and we can't do them overnight.

As I wrote in the reply to Pavel, I think we can introduce .freeze(), .thaw()
(and .prethaw() for that matter) callbacks for hibernation and make drivers
use them, but that will be a long series of patches.  Still, I think it's
doable.

Greetings,
Rafael


* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:24                                         ` Linus Torvalds
@ 2007-04-25 21:30                                           ` Pavel Machek
  2007-04-25 21:40                                             ` Rafael J. Wysocki
  2007-04-25 22:22                                             ` Nigel Cunningham
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Hi!

> > Please ask anyone who's worked with me if he's had any problem with that.
> > If anyone says I'm unable to work with anybody else, I'd say you're right.  Till
> > then, I feel offended.
> 
> I'll apologise (and virtually kiss your hairy feet) if you could actually 
> show me a single implementation that people can agree on.
> 
> But until then, I claim that the suspend-to-disk people cannot work with 
> each other.

It is not Rafael's fault. Actually it is quite hard to work with
Nigel, because he implements every feature someone asks for, and wants
to merge them all :-(. I don't expect to ever agree with Nigel on
anything important, sorry.

> And no, "three different implementations" doesn't cut it. Even _two_ is 
> too much. We need to get *rid* of something, not add more.

swsusp can be dropped. It is nice -- self contained, extremely easy to
setup, Andrew likes it. uswsusp has all the features, and a pretty
elegant design. With klibc (or some way to ship userland code with
kernel, and put it into initramfs or something) we can reasonably drop
swsusp.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:30                                           ` Pavel Machek
@ 2007-04-25 21:40                                             ` Rafael J. Wysocki
  2007-04-25 21:46                                               ` Pavel Machek
  2007-04-25 22:22                                             ` Nigel Cunningham
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-25 21:40 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On Wednesday, 25 April 2007 23:30, Pavel Machek wrote:
> Hi!
> 
> > > Please ask anyone who's worked with me if he's had any problem with that.
> > > If anyone says I'm unable to work with anybody else, I'd say you're right.  Till
> > > then, I feel offended.
> > 
> > I'll apologise (and virtually kiss your hairy feet) if you could actually 
> > show me a single implementation that people can agree on.
> > 
> > But until then, I claim that the suspend-to-disk people cannot work with 
> > each other.
> 
> It is not Rafael's fault. Actually it is quite hard to work with
> Nigel, because he implements every feature someone asks for, and wants
> to merge them all :-(. I don't expect to ever agree with Nigel on
> anything important, sorry.
> 
> > And no, "three different implementations" doesn't cut it. Even _two_ is 
> > too much. We need to get *rid* of something, not add more.
> 
> swsusp can be dropped. It is nice -- self contained, extremely easy to
> setup, Andrew likes it. uswsusp has all the features, and pretty
> elegant design. With klibc (or some way to ship userland code with
> kernel, and put it into initramfs or something) we can reasonably drop
> swsusp.

Well, I think we still need it and will need it in the future, at least for
debugging.  Moreover, I think there are many users of it.

Let's not drop things that are helping us. :-)

Greetings,
Rafael

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 21:29                       ` Pavel Machek
  2007-04-24 22:24                         ` Ray Lee
@ 2007-04-25 21:41                         ` Matt Mackall
  2007-04-26 11:27                           ` Pavel Machek
  2007-04-26 19:04                           ` Bill Davidsen
  1 sibling, 2 replies; 712+ messages in thread
From: Matt Mackall @ 2007-04-25 21:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Tue, Apr 24, 2007 at 11:29:56PM +0200, Pavel Machek wrote:
> We do not want to fragment the testing base, and suspend2 does not
> really have any interesting features over uswsusp.

The testing base is already fragmented!

What the current situation means is that you simply never hear from
the people who get fed up with suspend but who manage to get suspend2
working.

-- 
Mathematics is the supreme nostalgia of our time.

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:44                                 ` Linus Torvalds
  2007-04-25 21:07                                   ` Rafael J. Wysocki
@ 2007-04-25 21:44                                   ` Pavel Machek
  2007-04-25 22:18                                     ` Linus Torvalds
  1 sibling, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 21:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

Hi!

> > Can I get you on IRC somewhere? No, I do not think I'm a moron, and
> > yes, I need to suspend^Wsnapshot the devices before, so I have that in
> > the snapshot. Of course, I'll need to resume^Wrestore the devices
> > before writing snapshot. That's okay, it does not take long.
> 
> You do NOT need to "suspend" the devices, and that's the whole point.
> 
> You may want to save the device info somewhere, BUT THAT IS SOMETHING 
> TOTALLY DIFFERENT!
> 
> This is *exactly* the confusion I'm talking about. The STD and STR 
> codepaths try to use the same function for two TOTALLY DIFFERENT things.
> 
> STR actually wants to "suspend".
> 
> STD actually wants to "atomic snapshot", and it must not allow allocations 
> or anything like that, because the whole snapshot image should be done 
> atomically as one event. But it should *not* suspend, because that device 
> may actually be needed afterwards. 
> 
> So not the same thing at all.

Not the same... but they are still related. "freeze" (for atomic
snapshot) is actually subset of "suspend"... freeze needs DMAs off and
saved state, and you need DMAs off and saved state for "suspend".

So it is actually correct to do "suspend" when you want "freeze"; it
is just slow. That's why they only differ in parameter these days.

> So here's what "suspend()" wants:
>  - suspend() - preparatory work, can error out, can delay, can park the 
>    disk, etc etc.
>  - suspend_late() - called late, with interrupts disabled, should actually
>    suspend if the early suspend didn't do it already
> 
> And here is what "snapshot()" wants:
>  - prepare_to_snapshot() (for memory allocation)

Let's call this "freeze"?

>  - snapshot() - called late, with interrupts disabled, save state.

> and there is absolutely _zero_ overlap between them. There just isn't 
> anything in common. Yes, both are two-phase (for the simple reason that 
> both want an "atomic" part), but there's really no real overlap.

As I tried to explain, you can use suspend() to stop DMAs and save
state, and that's enough to get sane snapshot. (You do cli() before
doing snapshot, that helps with irqs).

> Just trying to *make* them be the same operations is just going to 
> introduce flags that then cause them to be totally different *and* 
> confusing and generate bugs. It also means that people do one of them, and 
> "it works" for that case, and the other case is totally broken, but it's 
> not obvious, because doing one means that the system _thinks_ that you did 
> both!
> 
> In the very unlikely case that some driver actually *wants* to use the 
> same function for snapshots and suspending, that driver could just go 
> ahead and _use_ the same function pointer. But now, as things are set up, 
> we force a total confusion on drivers by calling them through the same 
> interface for two totally different things.

Ok ok, we can do

suspend(PMSG_SUSPEND) -> suspend()
suspend(PMSG_FREEZE) -> freeze()

. We'll need to do big search&replace over the kernel etc. But if you
think it helps with confusion...

I'd still like to keep people using same method for both. It means
suspend path gets more testing, even when some stuff it does is not
strictly necessary for freeze.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
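
[Editorial sketch, not part of the original message.] The mapping proposed
above, where the single suspend() entry point is split by message type, can
be sketched as a small dispatcher. This is a self-contained user-space C
illustration under stated assumptions, not kernel code: enum sketch_pm_event
and its values are hypothetical stand-ins for PMSG_SUSPEND and PMSG_FREEZE,
and struct sketch_dev, struct sketch_pm_ops and the sketch_* functions are
names invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PMSG_SUSPEND / PMSG_FREEZE. */
enum sketch_pm_event {
	SKETCH_PM_SUSPEND,
	SKETCH_PM_FREEZE,
};

/* Hypothetical device model, for illustration only. */
struct sketch_dev {
	int dma_active;	/* 1 while the device may start DMA */
	int powered;	/* 1 while the device is powered up */
};

/* One callback per operation instead of one callback plus a flag. */
struct sketch_pm_ops {
	int (*suspend)(struct sketch_dev *dev);
	int (*freeze)(struct sketch_dev *dev);
};

/* freeze: quiesce the device so a snapshot is consistent; keep it powered. */
static int sketch_freeze(struct sketch_dev *dev)
{
	dev->dma_active = 0;
	return 0;
}

/* suspend: quiesce the device and power it down. */
static int sketch_suspend(struct sketch_dev *dev)
{
	dev->dma_active = 0;
	dev->powered = 0;
	return 0;
}

static const struct sketch_pm_ops sketch_ops = {
	.suspend = sketch_suspend,
	.freeze = sketch_freeze,
};

/*
 * Legacy-style entry point: route the old "suspend with a message"
 * call to the dedicated callback. A missing callback is treated as
 * "nothing to do" in this sketch.
 */
static int sketch_legacy_suspend(struct sketch_dev *dev,
				 const struct sketch_pm_ops *ops,
				 enum sketch_pm_event event)
{
	switch (event) {
	case SKETCH_PM_SUSPEND:
		return ops->suspend ? ops->suspend(dev) : 0;
	case SKETCH_PM_FREEZE:
		return ops->freeze ? ops->freeze(dev) : 0;
	}
	return -1;	/* unknown event */
}
```

With separate callbacks, a driver that wants identical behaviour for both
events can still assign the same function to .suspend and .freeze, which is
the option the quoted text leaves open.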

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:40                                             ` Rafael J. Wysocki
@ 2007-04-25 21:46                                               ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 21:46 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Hi!

> > > And no, "three different implementations" doesn't cut it. Even _two_ is 
> > > too much. We need to get *rid* of something, not add more.
> > 
> > swsusp can be dropped. It is nice -- self contained, extremely easy to
> > setup, Andrew likes it. uswsusp has all the features, and pretty
> > elegant design. With klibc (or some way to ship userland code with
> > kernel, and put it into initramfs or something) we can reasonably drop
> > swsusp.
> 
> Well, I think we still need it and will need it in the future, at least for
> debugging.  Moreover, I think there are many users of it.
> 
> Let's not drop things that are helping us. :-)

Yes, it is very nice for debugging. But if I _had_ to choose, I'd
rather remove swsusp than uswsusp.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 17:34                             ` Pavel Machek
  2007-04-25 18:39                               ` Adrian Bunk
  2007-04-25 18:52                               ` Alon Bar-Lev
@ 2007-04-25 22:11                               ` Kenneth Crudup
  2 siblings, 0 replies; 712+ messages in thread
From: Kenneth Crudup @ 2007-04-25 22:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Adrian Bunk, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Linus Torvalds, Andrew Morton, Arjan van de Ven


On Wed, 25 Apr 2007, Pavel Machek wrote:

> But ... you are really using suspend-to-disk as a workaround for "my
> desktop takes too much power when idle".

While rare is the day admittedly, that my machine isn't on, there are
days I take a break from loooong days and won't work for 2-5 days at
a time.

My main revenue machine is a laptop with a fast, but last-generation
mobile processor and 2GB of DDR2 SDRAM.

I think it's ridiculous to expect that I could resume off battery (and
this thing is a behemoth, with a 17" screen and backlight and a lot of
little juice-eating peripherals that'll go thru a 4.4A/Hr battery in a
little over 90 mins, even with conservative power settings) after that
kind of delay. I don't even like "suspend to RAM, then suspend to disk
on battery low" 'cause that means when I turn it on again I have a low
battery for an hour and a half.

The only acceptable power usage when (completely) idle, IMO, is *zero*.

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:25                                   ` Adrian Bunk
                                                       ` (2 preceding siblings ...)
  2007-04-25 19:55                                     ` Pavel Machek
@ 2007-04-25 22:13                                     ` Kenneth Crudup
  2007-04-26  1:25                                     ` Antonino A. Daplas
  4 siblings, 0 replies; 712+ messages in thread
From: Kenneth Crudup @ 2007-04-25 22:13 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Nick Piggin, suspend2-devel, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek,
	Ingo Molnar, Andrew Morton, Arjan van de Ven


On Wed, 25 Apr 2007, Adrian Bunk wrote:

> Some people might boot Windows between suspending and resuming.

Oh yeah- that, too. Since iTunes doesn't work well with VMWare, I do this
all the time.

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:44                                   ` Pavel Machek
@ 2007-04-25 22:18                                     ` Linus Torvalds
  2007-04-25 22:27                                       ` Nigel Cunningham
                                                         ` (4 more replies)
  0 siblings, 5 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 22:18 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Pavel Machek wrote:
> 
> Not the same... but they are still related. "freeze" (for atomic
> snapshot) is actually subset of "suspend"... freeze needs DMAs off and
> saved state, and you need DMAs off and saved state for "suspend".

THEY HAVE ABSOLUTELY NOTHING IN COMMON!

Nobody in their right mind thinks that "disable DMA" and "suspend" are 
similar operations. 

> So it is actually correct to do "suspend" when you want "freeze"; it
> is just slow. That's why they only differ in parameter these days.

It is *not* correct to "suspend" when you want "freeze".

I don't understand how you can even *claim* something like that.

Here's a trivial example:
 - SCSI disk

Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?

And name *one* thing that they have in common.

I'll tell you: Nada. Zero. Zilch. Nothing.

"Freeze" for a disk is a total no-op. There is no DMA, there is no 
nothing. In contrast, "suspend" for a disk is a totally valid operation.

Anybody who claims that these two operations are "related" is a moron.

I'm sorry Pavel, but that's exactly how it is.

		Linus

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:23                                       ` Adrian Bunk
@ 2007-04-25 22:19                                         ` Kenneth Crudup
  0 siblings, 0 replies; 712+ messages in thread
From: Kenneth Crudup @ 2007-04-25 22:19 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Nick Piggin, suspend2-devel, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek,
	Ingo Molnar, Andrew Morton, Arjan van de Ven


On Wed, 25 Apr 2007, Adrian Bunk wrote:

> For me it was a serious regression if STD was removed without any
> replacement.

Amen. I have even made material donations to the SS2 effort to give the
developer what he'd needed to fix an issue with a certain configuration
and will do so again if need be, as that expense would be minor compared
to the productivity (== billing) loss that arises from having to start
over from scratch from each of the 3+ times per day I have to shutdown
and physically relocate my machine from place-to-place or client-to-client.

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:30                                           ` Pavel Machek
  2007-04-25 21:40                                             ` Rafael J. Wysocki
@ 2007-04-25 22:22                                             ` Nigel Cunningham
  1 sibling, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-25 22:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Rafael J. Wysocki, Adrian Bunk, Ingo Molnar,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Hello.

On Wed, 2007-04-25 at 23:30 +0200, Pavel Machek wrote:
> Hi!
> 
> > > Please ask anyone who's worked with me if he's had any problem with that.
> > > If anyone says I'm unable to work with anybody else, I'd say you're right.  Till
> > > then, I feel offended.
> > 
> > I'll apologise (and virtually kiss your hairy feet) if you could actually 
> > show me a single implementation that people can agree on.
> > 
> > But until then, I claim that the suspend-to-disk people cannot work with 
> > each other.
> 
> It is not Rafael's fault. Actually it is quite hard to work with
> Nigel, because he implements every feature someone asks for, and wants
> to merge them all :-(. I don't expect to ever agree with Nigel on
> anything important, sorry.

I'm sorry that you feel that way, Pavel.

I can agree that I implement features that people ask for, but I think
saying "every feature someone asks for" is going a bit far (I won't ask
you to prove that). My desire is to provide Linux with hibernation
support that does more than just the bare minimum. Different people have
different usage scenarios, and this has led to me implementing more and
different features.

As to wanting to merge them all, this is true. No one wants to put time
into something only to have it left out. But I don't see why you think
this is a bad thing. Many kernel guys claim the thing follows an
evolutionary model. Well, here's software that has been developed out of
tree - evolved if you like - and which many people would consider more
mature ('evolved'?) than [u]swsusp. If evolutionary theory is to be
followed, let the fittest survive!

Nigel

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:18                                     ` Linus Torvalds
@ 2007-04-25 22:27                                       ` Nigel Cunningham
  2007-04-25 22:55                                         ` Linus Torvalds
  2007-04-25 22:42                                       ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
                                                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-25 22:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi.

On Wed, 2007-04-25 at 15:18 -0700, Linus Torvalds wrote:
> 
> On Wed, 25 Apr 2007, Pavel Machek wrote:
> > 
> > Not the same... but they are still related. "freeze" (for atomic
> > snapshot) is actually subset of "suspend"... freeze needs DMAs off and
> > saved state, and you need DMAs off and saved state for "suspend".
> 
> THEY HAVE ABSOLUTELY NOTHING IN COMMON!
> 
> Nobody in their right mind thinks that "disable DMA" and "suspend" are 
> similar operations. 
> 
> > So it is actually correct to do "suspend" when you want "freeze"; it
> > is just slow. That's why they only differ in parameter these days.
> 
> It is *not* correct to "suspend" when you want "freeze".
> 
> I don't understand how you can even *claim* something like that.
> 
> Here's a trivial example:
>  - SCSI disk
> 
> Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?
> 
> And name *one* thing that they have in common.

Set/reset the scsi transaction id thingy? Hibernation didn't work with
SCSI for a long time precisely because that support was missing.

Don't get me wrong, I agree on the whole - Suspend2 worked fine on the
whole under 2.4 without a driver model. But they do have a bit in
common.

Regards,

Nigel

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:08                                       ` Pavel Machek
  2007-04-25 20:33                                         ` Rafael J. Wysocki
@ 2007-04-25 22:36                                         ` Manu Abraham
  1 sibling, 0 replies; 712+ messages in thread
From: Manu Abraham @ 2007-04-25 22:36 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Pavel Machek wrote:

> STD needs to snapshot system, and then it needs devices to be
> suspended so that snapshot is consistent.


One question though, there are devices that can be suspended (broken
suspend) and restore in such a case wouldn't work at all. The only
possible way would be then to reinitialize the device instead of restore ?

Manu


* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:18                                     ` Linus Torvalds
  2007-04-25 22:27                                       ` Nigel Cunningham
@ 2007-04-25 22:42                                       ` Pavel Machek
  2007-04-25 22:58                                         ` Linus Torvalds
  2007-04-25 22:43                                       ` Chuck Ebbert
                                                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

Hi!

> > Not the same... but they are still related. "freeze" (for atomic
> > snapshot) is actually subset of "suspend"... freeze needs DMAs off and
> > saved state, and you need DMAs off and saved state for "suspend".
> 
> THEY HAVE ABSOLUTELY NOTHING IN COMMON!
> 
> Nobody in their right mind thinks that "disable DMA" and "suspend" are 
> similar operations. 
> 
> > So it is actually correct to do "suspend" when you want "freeze"; it
> > is just slow. That's why they only differ in parameter these days.
> 
> It is *not* correct to "suspend" when you want "freeze".

Example?

> I don't understand how you can even *claim* something like that.
> 
> Here's a trivial example:
>  - SCSI disk
> 
> Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?

Suspend syncs caches/spins down. Freeze does not do anything.

That's okay, I keep claiming "freeze" is a subset of "suspend". Can you
name a device where that is not true?

Remember we do

suspend(PMSG_FREEZE)
atomic snapshot
resume()
write snapshot.

So if we do spin the SCSI disk down, nothing really bad happens, we'll
just spin it up again. (So the SCSI disk is not the example I want.
Spinning down a SCSI disk on freeze is slow and stupid, but it is not
incorrect.)

Yes, if I had known then what I know now, drivers would have
suspend/freeze/thaw/resume methods. We probably still can make that
change. Unfortunately, it needs driver authors to understand 4 hooks
(not 2) and do the right thing.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
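
The difference between the combined call sequence above and the
four-hook split can be sketched in plain user-space C. This is only an
illustration: struct toy_disk, struct pm_hooks_sketch and the
snapshot_*() helpers are hypothetical names invented here, not the
kernel's driver-model API. The sketch counts how often the disk has to
be spun back up around one atomic snapshot.

```c
#include <stdbool.h>

/* Hypothetical model of a disk; not a real kernel structure. */
struct toy_disk {
    bool spinning;      /* platters are turning */
    bool cache_dirty;   /* write cache holds unsynced data */
    int  spin_ups;      /* how often we had to spin the disk back up */
};

/* Hypothetical four-hook interface (4 hooks instead of 2). */
struct pm_hooks_sketch {
    void (*suspend)(struct toy_disk *);
    void (*resume)(struct toy_disk *);
    void (*freeze)(struct toy_disk *);
    void (*thaw)(struct toy_disk *);
};

static void disk_suspend(struct toy_disk *d)
{
    d->cache_dirty = false;   /* sync caches */
    d->spinning = false;      /* spin down */
}

static void disk_resume(struct toy_disk *d)
{
    if (!d->spinning) {
        d->spinning = true;   /* spin back up */
        d->spin_ups++;
    }
}

/* For this toy disk, freeze and thaw have nothing to do. */
static void disk_freeze(struct toy_disk *d) { (void)d; }
static void disk_thaw(struct toy_disk *d)   { (void)d; }

static const struct pm_hooks_sketch disk_hooks = {
    .suspend = disk_suspend,
    .resume  = disk_resume,
    .freeze  = disk_freeze,
    .thaw    = disk_thaw,
};

/* Combined design: suspend, atomic snapshot, resume. */
int snapshot_combined(struct toy_disk *d)
{
    disk_hooks.suspend(d);
    /* the atomic snapshot would be taken here */
    disk_hooks.resume(d);
    return d->spin_ups;
}

/* Split design: freeze, atomic snapshot, thaw. */
int snapshot_split(struct toy_disk *d)
{
    disk_hooks.freeze(d);
    /* the atomic snapshot would be taken here */
    disk_hooks.thaw(d);
    return d->spin_ups;
}
```

With a disk that starts out spinning, snapshot_combined() reports one
extra spin-up and snapshot_split() reports none; both leave the disk
usable for writing the image.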

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:18                                     ` Linus Torvalds
  2007-04-25 22:27                                       ` Nigel Cunningham
  2007-04-25 22:42                                       ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
@ 2007-04-25 22:43                                       ` Chuck Ebbert
  2007-04-25 23:00                                         ` Linus Torvalds
  2007-04-25 22:49                                       ` Pavel Machek
  2007-04-25 22:57                                       ` Alan Cox
  4 siblings, 1 reply; 712+ messages in thread
From: Chuck Ebbert @ 2007-04-25 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Linus Torvalds wrote:
> Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?
> 
> And name *one* thing that they have in common.
> 
> I'll tell you: Nada. Zero. Zilch. Nothing.
> 
> "Freeze" for a disk is a total no-op. There is no DMA, there is no 
> nothing. In contrast, "suspend" for a disk is a totally valid operation.
> 

Freeze is a subset of suspend, isn't it? (It might be an empty subset
in some cases.)


* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:18                                     ` Linus Torvalds
                                                         ` (2 preceding siblings ...)
  2007-04-25 22:43                                       ` Chuck Ebbert
@ 2007-04-25 22:49                                       ` Pavel Machek
  2007-04-25 23:10                                         ` Linus Torvalds
  2007-04-25 22:57                                       ` Alan Cox
  4 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 22:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

Hi!

> > Not the same... but they are still related. "freeze" (for atomic
> > snapshot) is actually subset of "suspend"... freeze needs DMAs off and
> > saved state, and you need DMAs off and saved state for "suspend".
> 
> THEY HAVE ABSOLUTELY NOTHING IN COMMON!
> 
> Nobody in their right mind thinks that "disable DMA" and "suspend" are 
> similar operations. 
> 
> > So it is actually correct to do "suspend" when you want "freeze"; it
> > is just slow. That's why they only differ in parameter these days.
> 
> It is *not* correct to "suspend" when you want "freeze".
> 
> I don't understand how you can even *claim* something like that.

BTW most problems are in the thaw/resume functions. Both suspend and
freeze are pretty simple, and they both need to save device state. For a
SCSI disk, it would be nice to save options set by sdparm. And both
thaw and resume need to be able to restore the device from both
"powered down" and "some state preserved".
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:27                                       ` Nigel Cunningham
@ 2007-04-25 22:55                                         ` Linus Torvalds
  2007-04-25 23:13                                           ` Pavel Machek
                                                             ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 22:55 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> > 
> > And name *one* thing that they have in common.
> 
> Set/reset the scsi transaction id thingy? Hibernation didn't work with
> SCSI for a long time precisely because that support was missing.

And by "hibernation", you mean what? You mean "snapshot + shutdown", 
right?

Think about it for five seconds, and then ask yourself: at which point in 
the "snapshot + shutdown" sequence would you actually tell a disk to shut 
down?

If you said "snapshot", then you'd be *wrong*. 

That's my _point_. The snapshot() function should not (and MUST NOT) tell 
disks to shut down, because unlike suspend(), we're still going to _use_ 
those disks afterwards (why? To write out the snapshot image!).

In other words, the act of creating a snapshot has *nothing* to do with 
suspend.

Now, after you've created (and written out) the snapshot, what do you 
actually end up doing?

That's right - you end up _shutting down_ the machine, and yes, as part 
of the _shutdown_ sequence you may actually end up doing a lot of the 
things that a suspend would do. But that's long *after* you've actually 
done the "snapshot" part, and has absolutely nothing to do with it.

That's where I started: the whole "suspend to disk" thing actually has _more_ 
to do with "shutdown" than with "suspend". 

			Linus
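
The ordering argued for here (snapshot, then write the image, and only
then shut down) can be sketched as a toy state machine in user-space C.
struct toy_machine and every function below are hypothetical names made
up for this illustration; nothing here is kernel code.

```c
#include <stdbool.h>

/* Hypothetical model of the machine during "snapshot + shutdown". */
struct toy_machine {
    bool disk_usable;      /* disk can still accept writes */
    bool image_in_memory;  /* snapshot has been taken */
    bool image_on_disk;    /* snapshot has been written out */
    bool powered;
};

/* Taking the snapshot leaves the disk alone: it is needed afterwards. */
bool do_snapshot(struct toy_machine *m)
{
    m->image_in_memory = true;
    return true;
}

/* Writing the image only works while the disk is still usable. */
bool write_image(struct toy_machine *m)
{
    if (!m->disk_usable || !m->image_in_memory)
        return false;
    m->image_on_disk = true;
    return true;
}

/* Only the shutdown step takes the disk down. */
void do_shutdown(struct toy_machine *m)
{
    m->disk_usable = false;
    m->powered = false;
}

/* snapshot -> write image -> shutdown, in that order. */
bool hibernate_sketch(struct toy_machine *m)
{
    if (!do_snapshot(m))
        return false;
    if (!write_image(m))
        return false;
    do_shutdown(m);
    return true;
}
```

If the disk is taken down before write_image() runs, write_image()
fails in this model, which is the reason the snapshot step must not
shut the disk down.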

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:18                                     ` Linus Torvalds
                                                         ` (3 preceding siblings ...)
  2007-04-25 22:49                                       ` Pavel Machek
@ 2007-04-25 22:57                                       ` Alan Cox
  2007-04-25 23:20                                         ` Linus Torvalds
  4 siblings, 1 reply; 712+ messages in thread
From: Alan Cox @ 2007-04-25 22:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

> Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?
> 
> And name *one* thing that they have in common.

Both of them have to ensure you can make a consistent snapshot. Doing
that means you've got to be able to define a single "point" at which the
snapshot is made and is internally self-consistent. That in both cases
tends to mean you've got to ensure nothing occurs which pees on the image
while you are making that snapshot (such as outstanding O_DIRECT I/O to
user pages).

Alan

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:42                                       ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
@ 2007-04-25 22:58                                         ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 22:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> Suspend syncs caches/spins down. Freeze does not do anything.
> 
> That's okay, I keep claiming "freeze" is a subset of "suspend". Can you
> name a device where that is not true?

Sure. Like just about any PCI device that doesn't do things on its own.

A "freeze" does nothing at all, or perhaps shuts down the reader side 
(for something like a network controller).

A "suspend" does "write D3 to the suspend register". Absolutely zero in 
common.

> Remember we do
> 
> suspend(PMSG_FREEZE)
> atomic snapshot
> resume()
> write snapshot.

AND THAT IS STUPID. It mixes up "suspend()" and creating a snapshot in 
ways that are totally idiotic. There is nothing in common!

		Linus

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:43                                       ` Chuck Ebbert
@ 2007-04-25 23:00                                         ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 23:00 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Chuck Ebbert wrote:
> 
> Freeze is a subset of suspend, isn't it? (It might be an empty subset
> in some cases.)

NO IT IS NOT!

Yes, you are parroting Pavel, but he can say it a million times, and it's 
*still* not true.

That's like saying "read() is a subset of write(), isn't it?" On many 
devices, they share some of the setup, like writing the same sector 
registers with the same values.

Does that make them subsets of each other?

Or does it mean that they *may* use some of the same common helper 
functions for some devices?

			Linus
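
The read()/write() analogy can be made concrete with a toy block device
in user-space C. The device, its "sector register" and all names below
(struct toy_dev, select_sector(), toy_read(), toy_write()) are
hypothetical and exist only for this sketch. Both operations call the
same helper for setup, yet neither is a subset of the other.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 16
#define NUM_SECTORS 4

/* Hypothetical toy device with one sector-select register. */
struct toy_dev {
    uint32_t sector_reg;
    uint8_t  media[NUM_SECTORS][SECTOR_SIZE];
};

/* Common helper: both read and write program the sector register. */
static int select_sector(struct toy_dev *dev, uint32_t sector)
{
    if (sector >= NUM_SECTORS)
        return -1;
    dev->sector_reg = sector;
    return 0;
}

/* Read: shared setup, then data flows device -> buffer. */
int toy_read(struct toy_dev *dev, uint32_t sector, uint8_t *buf)
{
    if (select_sector(dev, sector) != 0)
        return -1;
    memcpy(buf, dev->media[dev->sector_reg], SECTOR_SIZE);
    return 0;
}

/* Write: same setup, but data flows buffer -> device. */
int toy_write(struct toy_dev *dev, uint32_t sector, const uint8_t *buf)
{
    if (select_sector(dev, sector) != 0)
        return -1;
    memcpy(dev->media[dev->sector_reg], buf, SECTOR_SIZE);
    return 0;
}
```

The shared part lives in a helper that each operation chooses to call;
the operations themselves stay separate entry points.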

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:49                                       ` Pavel Machek
@ 2007-04-25 23:10                                         ` Linus Torvalds
  2007-04-25 23:28                                           ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 23:10 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> >
> > I don't understand how you can even *claim* something like that.
> 
> BTW most problems are in thaw/resume functions.

And do you realize that the thaw/resume functions are totally different 
too?

Or rather, they *would* be, if you allowed them to.

For example, for "snapshot + thaw", the _sane_ thing is to actually make 
the snapshot just throw away all the DMA tables etc, and let the thawing 
just do a full initialization (as it did on boot). It basically needs to 
do that anyway, and it simplifies the whole thing (ie you don't even 
*want* to save things like the DMA command queues etc - the ones that will 
quite often be stepped on by the final "write snapshot to disk" stuff 
anyway).

For suspend to ram, in contrast, since you *know* that nobody will be 
touching the hardware, and since the timings are very different anyway 
(you'd hope that you can resume in a second or two), you'd generally want 
to keep the DMA engine tables right where they are, and just literally 
suspend the PCI chip itself.

See? Again, *nothing* in common.

You think they have things in common just because your whole (incorrect) 
mindset has _forced_ them to have things in common, because your setup 
stupidly thinks that "resume" is the same as "thaw", the same way you 
think "freeze" is the same as "suspend".

NEITHER is true. You've _made_ them true in your mind, but there's 
absolutely zero reason that they *should* be true.

			Linus
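
The contrast drawn above between "snapshot + thaw" and suspend-to-RAM
can be sketched for a toy PCI-style device in user-space C. struct
toy_nic, its descriptor ring and the nic_*() functions are hypothetical
names for this illustration only; the power_state values 0 and 3 merely
stand in for D0 and D3.

```c
#include <stdbool.h>
#include <string.h>

#define RING_SIZE 8

/* Hypothetical device with a DMA descriptor ring. */
struct toy_nic {
    int      power_state;       /* 0 stands in for D0, 3 for D3 */
    unsigned ring[RING_SIZE];   /* DMA descriptor table */
    unsigned head;              /* next descriptor to use */
    bool     initialized;
};

/* Full initialization, as done at boot. */
void nic_boot_init(struct toy_nic *n)
{
    memset(n->ring, 0, sizeof n->ring);
    n->head = 0;
    n->power_state = 0;
    n->initialized = true;
}

/* Snapshot path: do not try to preserve the DMA tables at all. */
void nic_freeze(struct toy_nic *n)
{
    n->initialized = false;     /* tables are treated as garbage from here */
}

/* Thaw path: reuse the boot-time initialization. */
void nic_thaw(struct toy_nic *n)
{
    nic_boot_init(n);
}

/* Suspend-to-RAM path: leave the tables in memory, power the chip down. */
void nic_suspend(struct toy_nic *n)
{
    n->power_state = 3;
}

/* Resume path: power the chip up; the tables are still where they were. */
void nic_resume(struct toy_nic *n)
{
    n->power_state = 0;
}
```

After nic_suspend()/nic_resume() the ring contents survive untouched;
after nic_freeze()/nic_thaw() the ring is rebuilt from scratch.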

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:55                                         ` Linus Torvalds
@ 2007-04-25 23:13                                           ` Pavel Machek
  2007-04-25 23:29                                             ` Linus Torvalds
  2007-04-26  1:40                                           ` Nigel Cunningham
  2007-04-26 10:39                                           ` Johannes Berg
  2 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi!

> > > And name *one* thing that they have in common.
> > 
> > Set/reset the scsi transaction id thingy? Hibernation didn't work with
> > SCSI for a long time precisely because that support was missing.
> 
> And by "hibernation", you mean what? You mean "snapshot + shutdown", 
> right?
> 
> Think about it for five seconds, and then ask yourself: at which point in 
> the "snapshot + shutdown" sequence would you actually tell a disk to shut 
> down?

Current design is:

Twice. Once during snapshot (then we spin it up when the snapshot is
done), and once during shutdown.

Yep, we optimize away spindown, because it takes too long, so SCSI
disks are actually very bad example.

> If you said "snapshot", then you'd be *wrong*. 
> 
> That's my _point_. The snapshot() function should not (and MUST NOT) tell 
> disks to shut down, because unlike suspend(), we're still going to _use_ 
> those disks afterwards (why? To write out the snapshot image!).

No, I'd like you to understand that we actually CAN tell the disks to
spin down, because we'll call resume and spin them back again before
writing the image. We used to do it. We still can do it, but it is
slow.

Yes, it is quite confusing.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:57                                       ` Alan Cox
@ 2007-04-25 23:20                                         ` Linus Torvalds
  2007-04-25 23:52                                           ` Pavel Machek
  2007-04-26  0:24                                           ` Alan Cox
  0 siblings, 2 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 23:20 UTC (permalink / raw)
  To: Alan Cox
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Alan Cox wrote:
> 
> Both of them have to ensure you can make a consistent snapshot.

Bzzt. Wrong again. Very much so.

STR does not need to "ensure that you have a consistent snapshot".

Why? Because there is no _room_ for inconsistency. There's nothing to be 
"inconsistent with", since any changes to memory (by things like DMA or 
other setup that happens while the suspend process is going on) are by 
_definition_ consistent with the resume image (because there is no 
separate image).

> Doing that means you've got to be able to define a single "point" at 
> which the snapshot is made and is internally self-consistent. That in 
> both cases tends to mean you've got to ensure nothing occurs which pees 
> on the image while you are making that snapshot (such as outstanding 
> O_DIRECT I/O to user pages).

Get off the drugs, Alan. There *is* no snapshot with suspend-to-ram.

Which is the whole point I'm trying to make! A _lot_ of people are 
confused about this.

With suspend-to-ram, you don't need to do a damn thing to the chip, except 
suspend it and resume it. There are _zero_ consistency issues. There is no 
need to freeze anything at any point. You can suspend each device totally 
independently of all other devices (taking into account things like bus 
topology, of course), and there is no "atomic" snapshot that needs to ever 
exist.

That's TOTALLY DIFFERENT from "suspend to disk". In suspend to disk, you 
need a completely different kind of mindset, namely you need a single 
consistent image, where the image is consistent not only with memory, but 
with all the devices.

For example, the whole myth that "freeze" needs to shut off DMA is a total 
and utter *myth*. It needs nothing of the sort at all. Rather than shut 
off DMA and try to make the hardware be wevy wevy quiet while it's hunting 
wabbits, it's a lot easier to just do nothing at all on "freeze", and just 
make sure that "thaw" will re-initialize the DMA tables entirely! All 
drivers have code to do that anyway, since that's what you need to do at 
boot.

Notice?  Totally different. Absolutely NOTHING in common. Not on a 
practical plane, and not even conceptually.  The current (broken!) 
implementation has forced a totally idiotic model on things, where instead 
of snapshotting doing the sane and simple thing, it ends up doing extra 
work that is totally unnecessary, but *becomes* necessary just because it 
*also* shares the "resume" path (which should _not_ be the same either!)

			Linus

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:10                                         ` Linus Torvalds
@ 2007-04-25 23:28                                           ` Pavel Machek
  2007-04-25 23:57                                             ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 23:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

Hi!

> > > I don't understand how you can even *claim* something like that.
> > 
> > BTW most problems are in thaw/resume functions.
> 
> And do you realize that the thaw/resume functions are totally different 
> too?
> 
> Or rather, they *would* be, if you allowed them to.
> 
> For example, for "snapshot + thaw", the _sane_ thing is to actually make 
> the snapshot just throw away all the DMA tables etc, and let the thawing 
> just do a full initialization (as it did on boot). It basically needs to 
> do that anyway, and it simplifies the whole thing (ie you don't even 
> *want* to save things like the DMA command queues etc - the ones that will 
> quite often be stepped on by the final "write snapshot to disk" stuff 
> anyway).

I'd prefer thaw to be similar to module insert, yes.

> For suspend to ram, in contrast, since you *know* that nobody will be 
> touching the hardware, and since the timings are very different anyway 
> (you'd hope that you can resume in a second or two), you'd generally want 
> to keep the DMA engine tables right where they are, and just literally 
> suspend the PCI chip itself.

I'd actually prefer resume to be similar to module insert, too... Do
you think that resume is _that_ time critical?

> You think they have things in common just because your whole (incorrect) 
> mindset has _forced_ them to have things in common, because your setup 
> stupidly thinks that "resume" is the same as "thaw", the same way you 
> think "freeze" is the same as "suspend".
> 
> NEITHER is true. You've _made_ them true in your mind, but there's 
> absolutely zero reason that they *should* be true.

[I'd like you to drop me a line saying you understand current design
and that it works -- even if it is very inelegant]

Now, we can separate suspend/freeze and resume/thaw (with some common
helpers). It will speed the code up by avoiding unnecessary
operations. It also needs attention from driver writers (ouch).

Do we want to do that?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:13                                           ` Pavel Machek
@ 2007-04-25 23:29                                             ` Linus Torvalds
  2007-04-25 23:45                                               ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 23:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> Current design is:

Broken. Yes. I've tried to tell you.

> Twice. Once during snapshot (then we spin it up when the snapshot is
> done), and once during shutdown.

And nobody can possibly say that is "sane". But it's a direct result of 
the incorrect thinking that "suspend()" and "snapshot()" have anything 
what-so-ever to do with each other.

> Yep, we optimize away spindown, because it takes too long, so SCSI
> disks are actually very bad example.

No. SCSI disks are a *good* example. It's an example of how you 
(incorrectly) call the same function for two totally different things, and 
then that function is smart enough that it *understands* that they are 
totally different.

But the *confusion* remains. It remains in your head, and you've poisoned 
people like Alan too, that usually are not confused. And THAT is the main 
problem (although there are also indirect problems like "fixing one may 
break the other", but I actually think that the fundamental problem is the 
confusion it creates, which in turn causes bugs to happen because people 
are confused and think that they should do the same thing for suspend and 
for snapshot).

> No, I'd like you to understand that we actually CAN tell the disks to
> spin down, because we'll call resume and spin them back again before
> writing the image. We used to do it. We still can do it, but it is
> slow.
> 
> Yes, it is quite confusing.

It's worse than just confusing, it's *idiotic*.

It _can_ work in practice, but
 - we have pretty damn solid evidence that it doesn't work all that often 
   in practice
 - the fact that something *can* be done the stupid way is in no way an 
   argument that it *should* be done the stupid way.

I claim that the current STD is *stupid*. Yes, it can work. But that 
doesn't make it less stupid.

What's your argument? Your argument seems to be that it's not stupid, 
because it can work. Can't you see that that simply isn't an argument at 
all? "stupid and wrong" doesn't mean "cannot work in theory". But it 
*does* mean that people get confused, and it *does* mean that there are 
likely more bugs, because confused people do not tend to write very good 
code.

I'm not claiming that the current code cannot work. It clearly *does* 
work for a lot of people. But I'm claiming that it's STUPID.

So don't argue that "it works". Windows works, kind of. That doesn't make 
it less stupid and badly designed!

		Linus

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 18:50                                 ` Linus Torvalds
  2007-04-25 19:02                                   ` Hua Zhong
  2007-04-25 19:25                                   ` Adrian Bunk
@ 2007-04-25 23:33                                   ` Olivier Galibert
  2007-04-26  1:56                                     ` Nigel Cunningham
  2 siblings, 1 reply; 712+ messages in thread
From: Olivier Galibert @ 2007-04-25 23:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
> .. but if the alternative is a feature that just isn't worth it, and 
> likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> I believe STD is both of those. There's a reason it's called "STD". Go 
> to google and type "STD" and press "I'm feeling lucky". Google is God).

If it was correctly designed, it would be possible to change the
hardware or even the kernel through a STD cycle.  And that would be
damn interesting on servers.

In any case, if I could trust it, I'd use it when I need to move
servers around and I don't want to lose what is running.  Riding power
cuts that way would be nice.

  OG.


* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:29                                             ` Linus Torvalds
@ 2007-04-25 23:45                                               ` Pavel Machek
  2007-04-26  1:48                                                 ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi!

> > Current design is:
> 
> Broken. Yes. I've tried to tell you.

Ok.

...

> It's worse than just confusing, it's *idiotic*.
> 
> It _can_ work in practice, but
>  - we have pretty damn solid evidence that it doesn't work all that often 
>    in practice
>  - the fact that something *can* be done the stupid way is in no way an 
>    argument that it *should* be done the stupid way.
> 
> I claim that the current STD is *stupid*. Yes, it can work. But that 
> doesn't make it less stupid.

Good. So you understand how it works.

> What's your argument? Your argument seems to be that it's not stupid, 
> because it can work. Can't you see that that simply isn't an
> argument at 

I tried keeping module_init/thaw/resume similar code, so that driver
authors can debug suspend-to-disk, cross their fingers, and have
suspend-to-ram work, too.

Now, perhaps enough people do STD/STR these days that this is not
important any longer... let's hope so.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:14                                               ` Pavel Machek
@ 2007-04-25 23:51                                                 ` David Lang
  2007-04-26  0:38                                                 ` Linus Torvalds
  1 sibling, 0 replies; 712+ messages in thread
From: David Lang @ 2007-04-25 23:51 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Alan Cox, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

On Thu, 26 Apr 2007, Pavel Machek wrote:

>>> Now, if the old kernel left DMAs running, it could be overwriting
>>> the data we are copying in.
>>
>> The *thaw* needs to happen with devices quiescent.
>>
>> But that sure doesn't have anything to do with the "snapshot()" path. In
>> fact, you'll have rebooted the machine in between.
>
> Only the fact that we are currently using same device call during
> snapshot() and during restore(). We obviously could do _5_ device
> calls
>
> (suspend/resume/freeze/quiesce_disable_dma/thaw)
>
> ...but that looks like too many calls to me.
>
>> So what does that have to do with "snapshotting"?
>
> I'm not comfortable with memory I'm copying changing under my hands
> because of some DMA. It just looks like asking for trouble. I _think_
> we can get away with DMA running during snapshot, because driver may
> not assume anything about the DMA result before it got completion
> interrupt, but...

the key is that with STR you don't need to copy the memory (it's staying where 
it is)

for STD you need to copy the memory, and there you halt DMA because you need to 
make an atomic snapshot.

David Lang

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:20                                         ` Linus Torvalds
@ 2007-04-25 23:52                                           ` Pavel Machek
  2007-04-26  0:05                                             ` Linus Torvalds
  2007-04-26  0:24                                           ` Alan Cox
  1 sibling, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-25 23:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi!

> > Both of them have to ensure you can make a consistent snapshot.
> 
> Bzzt. Wrong again. Very much so.
> 
> STR does not need to "ensure that you have a consistent snapshot".
> 
> Why? Because there is no _room_ for inconsistency. There's nothing to be 
> "inconsistent with", since any changes to memory (by things like DMA or 
> other setup that happens while the suspend process is going on) are by 
> _definition_ consistent with the resume image (because there is no 
> separate image).

Do you propose to keep DMAs running while suspending-to-RAM? That
sounds really unsafe; we are shutting down our PCI controllers at that
time; doing that while DMAs are running sounds bad.

> That's TOTALLY DIFFERENT from "suspend to disk". In suspend to disk, you 
> need a completely different kind of mindset, namely you need a single 
> consistent image, where the image is consistent not only with memory, but 
> with all the devices.
> 
> For example, the whole myth that "freeze" needs to shut off DMA is a total 
> and utter *myth*. It needs nothing of the sort at all. Rather than shut 
> off DMA and try to make the hardware be wevy wevy quiet while it's hunting 
> wabbits, it's a lot easier to just do nothing at all on "freeze",

No. Sorry, you are wrong here. 

Remember that during resume we run

freeze()
copy old data into memory
thaw()

Now, if the old kernel left DMAs running, it could be overwriting
the data we are copying in. It is not about DMA tables. While
resuming, the CPU needs to be alone, without interference from DMA engines
(or other CPUs), because copying back the old image means writing to
memory that was not properly allocated.

(Now, we could add one more hook, turn_off_dmas_for_copyback(), but
that looks like way too many hooks to me. And I'm not comfortable with
DMA engines running while I'm trying to copy image. They may be
overwriting data I'm trying to copy...) 

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:28                                           ` Pavel Machek
@ 2007-04-25 23:57                                             ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-25 23:57 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> > For suspend to ram, in contrast, since you *know* that nobody will be 
> > touching the hardware, and since the timings are very different anyway 
> > (you'd hope that you can resume in a second or two), you'd generally want 
> > to keep the DMA engine tables right where they are, and just literally 
> > suspend the PCI chip itself.
> 
> I'd actually prefer resume to be similar to module insert, too... Do
> you think that resume is _that_ time critical?

I think it probably depends on the device, and it should depend on the 
driver writer how he wants to do it.

My _point_ is that there is absolutely zero reason to think that the two 
events are the same. We *know* that for snapshot+shutdown, we need to 
actually keep the DMA tables intact *over* the snapshot (because writing 
out the snapshot may _need_ them). But exactly because we keep them 
intact, a driver writer may sanely say "I didn't even bother shutting them 
down, so on thaw, I cannot trust them, so I'll just re-initialize them 
entirely".

In contrast, over suspend-to-ram, it's entirely reasonable to just leave 
them in memory, and just keep them. There's no *reason* not to.

And that's my whole point in this argument: the two paths are 
fundamentally totally different. You *claim* that "snapshot()" needs to 
stop DMA etc, but that's simply not true.

So I claim:
 - for a lot of devices, it's actually a *lot* easier to just have
   snapshot not do anything at all, and re-initialize on thaw
 - for those same devices, for s2ram, since the tables are *safe* and 
   don't get touched by anything else, it's probably easier to just let 
   them be.

See? The "it's easier to do X" is a _different_ X for the two cases. 

So the whole "suspend is a superset of freeze" is simply not true.
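
A minimal sketch of the two claims above, in C. This is not kernel code:
the toy_dev structure and the toy_* functions are hypothetical names made
up for illustration. It assumes a device whose only interesting state is a
small DMA descriptor ring: the snapshot path does nothing and rebuilds the
ring on thaw, while the s2ram path only powers the chip down and leaves
the ring in memory.

```c
#include <assert.h>

#define RING_SIZE 4

/* Hypothetical device: a DMA descriptor ring plus a power flag. */
struct toy_dev {
	unsigned long ring[RING_SIZE];	/* stands in for the DMA tables */
	int powered;
	int reinit_count;		/* how often the ring was rebuilt */
};

static void toy_init_ring(struct toy_dev *dev)
{
	for (int i = 0; i < RING_SIZE; i++)
		dev->ring[i] = 0x1000UL * (unsigned long)(i + 1);
	dev->reinit_count++;
}

/* Snapshot path: the device must stay usable, so do nothing at all. */
static int toy_snapshot(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}

/* Nothing was shut down at snapshot time, so thaw rebuilds the ring. */
static int toy_thaw(struct toy_dev *dev)
{
	toy_init_ring(dev);
	return 0;
}

/* s2ram path: memory is preserved, so only power the chip down. */
static int toy_suspend(struct toy_dev *dev)
{
	assert(dev->powered);	/* suspending twice would be a caller bug */
	dev->powered = 0;
	return 0;
}

/* Power the chip back up with the ring exactly where it was. */
static int toy_resume(struct toy_dev *dev)
{
	assert(!dev->powered);
	dev->powered = 1;
	return 0;
}
```

The "easy" implementation differs per path: toy_thaw() throws the ring
away, toy_resume() never touches it.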

> [I'd like you to drop me a line saying you understand current design
> and that it works -- even if it is very inelegant]

I _do_ understand the current design. I just think that it's totally 
and seriously broken. I've ranted against it before. I think it's stupid 
to play like you're "suspending" something just to save some state, 
especially since most users probably don't even *want* to suspend the 
state, and would quite happily re-initialize the chip instead.

And I think it's horrible to have a dynamic flag to tell the difference 
between two or more state changes that the devices should statically be 
able to determine. _If_ some driver really does have the same routine, 
just use the same routine. There are no downsides to splitting them up.

> Now, we can separate suspend/freeze and resume/thaw (with some common
> helpers). It will speed the code up by avoiding unnecessary
> operations. It also needs attention from driver writers (ouch).
> 
> Do we want to do that?

I'd personally certainly want to do that. But I want to split up the 
callers too. Right now we mix those a lot as well. I suspect that would 
automatically be fixed by just forcing them to separate out (since they 
now call different functions of the devices), but I'm not 100% sure. There 
might be other issues.

Just as an example: one of the most painful things there is in the suspend 
sequence is that we shut off the console (because the console device will 
be suspended in hw, and it's thus not safe to use it over a suspend/resume 
sequence). That should just go away entirely for "snapshot()", because 
there is *never* any excuse for actually turning off the console during a 
snapshot: even a network console should be entirely functional.  Things 
like that - things that sound like small issues, but that really aren't.

(Right now you can enable the "don't disable the console" config option, 
but since network drivers will actually shut down etc, it just means that 
you'll have oopses etc if you do, and you have netconsole enabled)

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:52                                           ` Pavel Machek
@ 2007-04-26  0:05                                             ` Linus Torvalds
  2007-04-26  0:14                                               ` Pavel Machek
  2007-04-26  0:34                                               ` Linus Torvalds
  0 siblings, 2 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  0:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> > 
> > Why? Because there is no _room_ for inconsistency. There's nothing to be
> > "inconsistent with", since any changes to memory (by things like DMA or
> > other setup that happens while the suspend process is going on) is by
> > _definition_ consistent with the resume image (because there is no
> > separate image).
> 
> Do you propose to keep DMAs running while suspending-to-RAM?

What part of "suspend a chip" do you have trouble with?

DMA obviously does *not* happen with a suspended device. There's no need 
to turn DMA even off - it just doesn't happen!

> > For example, the whole myth that "freeze" needs to shut off DMA is a total 
> > and utter *myth*. It needs nothing of the sort at all. Rather than shut 
> > off DMA and try to make the hardware be wevy wevy quiet while it's hunting 
> > wabbits, it's a lot easier to just do nothing at all on "freeze",
> 
> No. Sorry, you are wrong here. 
> 
> Remember that during resume we run
> 
> freeze()
> copy old data into memory
> thaw()
> 
> Now, if the old kernel left DMAs running, it could be overwriting
> the data we are copying in.

The *thaw* needs to happen with devices quiescent. 

But that sure doesn't have anything to do with the "snapshot()" path. In
fact, you'll have rebooted the machine in between.

So what does that have to do with "snapshotting"?

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:05                                             ` Linus Torvalds
@ 2007-04-26  0:14                                               ` Pavel Machek
  2007-04-25 23:51                                                 ` David Lang
  2007-04-26  0:38                                                 ` Linus Torvalds
  2007-04-26  0:34                                               ` Linus Torvalds
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26  0:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi!

> > > Why? Because there is no _room_ for inconsistency. There's nothing to be
> > > "inconsistent with", since any changes to memory (by things like DMA or
> > > other setup that happens while the suspend process is going on) is by
> > > _definition_ consistent with the resume image (because there is no
> > > separate image).
> > 
> > Do you propose to keep DMAs running while suspending-to-RAM?
> 
> What part of "suspend a chip" do you have trouble with?
> 
> DMA obviously does *not* happen with a suspended device. There's no need 
> to turn DMA even off - it just doesn't happen!

Ok, I guess I'll have nightmares of DMA controllers doing DMAs from
chips that are no longer there tonight.

> > > For example, the whole myth that "freeze" needs to shut off DMA is a total 
> > > and utter *myth*. It needs nothing of the sort at all. Rather than shut 
> > > off DMA and try to make the hardware be wevy wevy quiet while it's hunting 
> > > wabbits, it's a lot easier to just do nothing at all on "freeze",
> > 
> > No. Sorry, you are wrong here. 
> > 
> > Remember that during resume we run
> > 
> > freeze()
> > copy old data into memory
> > thaw()
> > 
> > Now, if the old kernel left DMAs running, it could be overwriting
> > the data we are copying in.
> 
> The *thaw* needs to happen with devices quiescent. 
> 
> > But that sure doesn't have anything to do with the "snapshot()" path. In
> fact, you'll have rebooted the machine in between.

Only the fact that we are currently using same device call during
snapshot() and during restore(). We obviously could do _5_ device
calls

(suspend/resume/freeze/quiesce_disable_dma/thaw)

...but that looks like too many calls to me.

> So what does that have to do with "snapshotting"?

I'm not comfortable with memory I'm copying changing under my hands
because of some DMA. It just looks like asking for trouble. I _think_
we can get away with DMA running during snapshot, because driver may
not assume anything about the DMA result before it got completion
interrupt, but... 

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:20                                         ` Linus Torvalds
  2007-04-25 23:52                                           ` Pavel Machek
@ 2007-04-26  0:24                                           ` Alan Cox
  2007-04-26  1:10                                             ` Linus Torvalds
  2007-04-26  7:08                                             ` Andy Grover
  1 sibling, 2 replies; 712+ messages in thread
From: Alan Cox @ 2007-04-26  0:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

> STR does not need to "ensure that you have a consistent snapshot".

Linus I think someone's been spiking your guinness again...

> Why? Because there is no _room_ for inconsistency. There's nothing to be
> "inconsistent with", since any changes to memory (by things like DMA or
> other setup that happens while the suspend process is going on) is by
> _definition_ consistent with the resume image (because there is no
> separate image).

You bet there is. We need to know if data arrived or not, because there
is no guarantee that the data retrieved if we inadvertently re-execute a
command will be the same. The hardware state itself isn't the problem,
it's the combination of hardware state and internal state which need to
match in some cases.

> off DMA and try to make the hardware be wevy wevy quiet while it's hunting 
> wabbits, it's a lot easier to just do nothing at all on "freeze", and just 
> make sure that "thaw" will re-initialize the DMA tables entirely! All

Who cares about DMA mapping tables, those are easy to deal with, not even
that bad with an IOMMU to restore. More problematic is the user's data
because if we have a device where re-executing a command is not
repeatable (eg O_DIRECT SCSI on a shared bus) then we risk being
inconsistent in our S2RAM.  If we re-run the command on resume having
allowed it to prattle on while doing S2anything then we'll get the wrong
answer.

Now there are lots of devices we don't care about as they don't have
state in the form that causes problems - network cards, TV capture etc,
but there are cases where it matters that every operation is either
finished or not started and there is no doubt about them getting done
during the S2RAM/S2DISK.

S2DISK/S2RAM both need to synchronize the state of a device so it can
build a valid snapshot. That bit is a shared requirement just like you
said didn't exist. Doesn't even need to involve turning DMA off, just
getting a consistent state.
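
A minimal sketch of that shared requirement, in C, under stated
assumptions: the command queue, its three states and the toy_* helpers are
hypothetical, and waiting for a completion interrupt is reduced to a state
change. The point it illustrates is only that every command ends up either
finished or not started, without DMA being switched off.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CMDS 8

enum cmd_state { CMD_QUEUED, CMD_IN_FLIGHT, CMD_DONE };

struct toy_cmd {
	enum cmd_state state;
};

struct toy_queue {
	struct toy_cmd cmds[TOY_MAX_CMDS];
	size_t count;
};

/* Stand-in for waiting on the completion interrupt of one command. */
static void toy_wait_for_completion(struct toy_cmd *cmd)
{
	cmd->state = CMD_DONE;
}

/*
 * Synchronize driver state with hardware state: commands already handed
 * to the hardware are allowed to finish, commands not yet issued stay
 * queued. DMA itself is never disabled here.
 */
static void toy_sync_state(struct toy_queue *q)
{
	assert(q->count <= TOY_MAX_CMDS);
	for (size_t i = 0; i < q->count; i++)
		if (q->cmds[i].state == CMD_IN_FLIGHT)
			toy_wait_for_completion(&q->cmds[i]);
}

/* True when no command is in the ambiguous "maybe it ran" state. */
static int toy_state_is_consistent(const struct toy_queue *q)
{
	for (size_t i = 0; i < q->count; i++)
		if (q->cmds[i].state == CMD_IN_FLIGHT)
			return 0;
	return 1;
}
```

After toy_sync_state() nothing needs to be re-executed on resume: a
command is either CMD_DONE or still CMD_QUEUED.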

The rest can be quite different.

Mind you some laptops think S2RAM is just a transition state on the way
to disk, leave them in ACPI S2RAM and the firmware will magically turn it
into a save to disk and back to ram if the battery runs low or you leave
it idle too long.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:05                                             ` Linus Torvalds
  2007-04-26  0:14                                               ` Pavel Machek
@ 2007-04-26  0:34                                               ` Linus Torvalds
  2007-04-26 20:12                                                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  0:34 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, Linus Torvalds wrote:
> 
> The *thaw* needs to happen with devices quiescent. 

Btw, I sure as hell hope you didn't use "suspend()" for that. You're 
(again) much better off having a totally separate function that just 
freezes stuff.

So in the "snapshot+shutdown" path, you should have:

 - prepare_to_snapshot() - allocate memory, and possibly return errors

   We can skip this, if we just make the rule be that any devices that 
   want to support snapshotting must always have the memory required for 
   snapshotting pre-allocated. Most devices really do allocate memory for 
   their state anyway, and the only real reason for the "prepare" stage 
   here is because the final snapshot has to happen with interrupts off,
   obviously. So *if* we don't need to allocate any memory, and if we 
   don't expect to want to accept some early error case, this is likely 
   useless.

 - snapshot() - actually save device state that is consistent with the 
   memory image at the time. Called with interrupts off, but the device 
   has to be usable both before and afterwards!

And I would seriously suggest that "snapshot()" be documented to not rely 
on any DMA memory, exactly because the device has to be accessible both 
before and after (before - because we're running and allocating memory, 
and after - because we'll be writing things out). But see later:

For the "resume snapshot" path, I would suggest having 

 - freeze(): quiesce the device. This literally just does the absolute 
   minimum to make sure that the device doesn't do anything surprising (no 
   interrupts, no DMA, no nothing). For many devices, it's a no-op, even 
   if they can do DMA (eg most disk controllers will do DMA, but only as 
   an actual result of a request, and upper layers will be quiescent 
   anyway, so they do *not* need to disable DMA)

   NOTE! The "freeze()" gets called from the *old* kernel just _before_ a
   snapshot unpacking!!

 - restart_snapshot() - actually restart the snapshot (and usually this 
   would involve re-setting the device, not so much trying to restore all 
   the saved state. IOW, it's easier to just re-initialize the DMA command 
   queues than to try to make them "atomic" in the snapshot).

   NOTE! This gets called by the *new* kernel _after_ the snapshot resume!

And if you *want* to, I can see that you might want to actually do a
"unfreeze()" thing too, and make the actual snapshotting be:

	/* We may not even need this.. */
	for_each_device() {
		err = prepare_to_snapshot();
		if (err)
			return err;
	}

	/* This is the real work for snapshotting */
	cli();
	for_each_device()
		freeze(dev);
	for_each_device()
		snapshot(dev);
	.. snapshot current memory image ..
	for_each_device_depth_first()
		unfreeze(dev);
	sti();

and maybe it's worth it, but I would almost suggest that you just make the 
rule be that any DMA etc just *has* to be re-initialized by 
"restart_snapshot()", in which case it's not even necessary to 
freeze/unfreeze over the device, and "snapshot()" itself only needs to 
make sure any non-DMA data is safe.

But adding the freeze/unfreeze (which is a no-op for most hardware anyway) 
might make things easier to think about, so I would certainly not *object* 
to it, even if I suspect it's not necessary.

Anyway, the restore_snapshot() sequence should be:

	/* Old kernel.. Normal boot, load snapshot image */
	cli();
	for_each_device()
		freeze(dev);
	restore_snapshot_image();
	restore_regs_and_jump_to_image();
	/* noreturn */


	/* New kernel, gets called at the snapshot restore address
	 * with interrupts off and devices frozen, and memory image
	 * consistent with what it was at "snapshot()" time
	 */
	for_each_dev_depth_first()
		restore_snapshot(dev);
	/* And if you want to, just to be "symmetric"

		for_each_dev_depth_first()
			unfreeze(dev)

	   although I think you could just make "restore_snapshot()" 
	   implicitly unfreeze it too..
	 */
	sti();
	/* We're up */

and notice how *different* this is from what happens for s2ram. There 
really isn't anything in common here. Exactly because s2ram simply doesn't 
_have_ any of the issues with atomic memory images.

So s2ram is just

	for_each_dev()
		suspend(dev);
	cli();
	for_each_dev()
		late_suspend(dev);
	.. go to sleep ..
	for_each_dev_depth_first()
		early_resume(dev);
	sti();
	for_each_dev_depth_first()
		resume(dev);

and has none of the "freeze" issues at all.

Doesn't that seem a lot more straightforward? Yes, it's more functions, 
but each function is a lot more obvious. This follows the unix rule of "do 
one thing, and do that thing well", instead of trying to make one function 
do many very different things depending on what you actually want done..
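
The two call sequences can be put side by side in a small runnable C
sketch. It is only a trace generator under stated assumptions: there are
two devices numbered 0 and 1, the "depth first" walks are modelled as
simply visiting the devices in reverse order, and step(), s2ram_cycle()
and snapshot_cycle() are hypothetical helpers that record the order of
calls instead of touching hardware.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NDEV 2

static char trace[256];

/* Append "name<dev> " to the trace, or "name " for global steps (dev < 0). */
static void step(const char *name, int dev)
{
	size_t used = strlen(trace);
	size_t room = sizeof(trace) - used;
	int n = dev < 0 ? snprintf(trace + used, room, "%s ", name)
			: snprintf(trace + used, room, "%s%d ", name, dev);

	assert(n > 0 && (size_t)n < room);
}

/* s2ram: per-device suspend/resume, no freeze and no memory copy. */
static const char *s2ram_cycle(void)
{
	trace[0] = '\0';
	for (int d = 0; d < NDEV; d++)
		step("suspend", d);
	step("cli", -1);
	for (int d = 0; d < NDEV; d++)
		step("late_suspend", d);
	step("sleep", -1);
	for (int d = NDEV - 1; d >= 0; d--)
		step("early_resume", d);
	step("sti", -1);
	for (int d = NDEV - 1; d >= 0; d--)
		step("resume", d);
	return trace;
}

/* Snapshot with the optional freeze/unfreeze pair around the copy. */
static const char *snapshot_cycle(void)
{
	trace[0] = '\0';
	step("cli", -1);
	for (int d = 0; d < NDEV; d++)
		step("freeze", d);
	for (int d = 0; d < NDEV; d++)
		step("snapshot", d);
	step("copy_memory", -1);
	for (int d = NDEV - 1; d >= 0; d--)
		step("unfreeze", d);
	step("sti", -1);
	return trace;
}
```

Printing both traces shows that no callback name appears in both
sequences; only the interrupt disable/enable pair is shared.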

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:14                                               ` Pavel Machek
  2007-04-25 23:51                                                 ` David Lang
@ 2007-04-26  0:38                                                 ` Linus Torvalds
  2007-04-26  2:04                                                   ` H. Peter Anvin
  1 sibling, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  0:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> Ok, I guess I'll have nightmares of DMA controllers doing DMAs from
> chips that are no longer there tonight.

Umm. Welcome to the 21st century: we don't do that "separate DMA 
controller" thing any more. All devices do their own DMA.

> Only the fact that we are currently using same device call during
> snapshot() and during restore(). We obviously could do _5_ device
> calls
> 
> (suspend/resume/freeze/quiesce_disable_dma/thaw)
> 
> ...but that looks like too many calls to me.

I'd much rather have five or even more functions that each do *one* 
obvious thing. 

Think like a device driver writer: would you prefer to just implement five 
functions that do something very specific that you know trivially how to 
do ("I know how to disable interrupts and DMA") or would you want to do 
some high-level operation that you don't even know why the caller asks you
to suspend? What does "suspend()" even mean when the caller is just going
to wake you up immediately again? Is it performance-critical? Should I tear
down all my DMA's? I dunno!

In other words, splitting things up actually makes things simpler. That's 
*doubly* true if you can then give each specific function some really 
clear goals.
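
As a rough sketch of what that could look like for a driver writer, here
is a toy callback table in C using the five call names quoted above. It is
an assumption-laden illustration, not a kernel interface: struct
toy_pm_ops, toy_call() and the nic_* functions are hypothetical, and a
missing callback is taken to mean "nothing to do for this device".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state: just two enable bits. */
struct toy_dev {
	int irq_enabled;
	int dma_enabled;
};

/* One narrow callback per transition, each with a single obvious job. */
struct toy_pm_ops {
	int (*suspend)(struct toy_dev *);
	int (*resume)(struct toy_dev *);
	int (*freeze)(struct toy_dev *);
	int (*quiesce_disable_dma)(struct toy_dev *);
	int (*thaw)(struct toy_dev *);
};

/* A NULL callback is treated as a successful no-op. */
static int toy_call(int (*cb)(struct toy_dev *), struct toy_dev *dev)
{
	assert(dev != NULL);
	return cb ? cb(dev) : 0;
}

/* "I know how to disable interrupts and DMA" */
static int nic_quiesce_disable_dma(struct toy_dev *dev)
{
	dev->irq_enabled = 0;
	dev->dma_enabled = 0;
	return 0;
}

/* Re-enable both once the caller is done. */
static int nic_thaw(struct toy_dev *dev)
{
	dev->irq_enabled = 1;
	dev->dma_enabled = 1;
	return 0;
}

/* This toy driver only fills in the two callbacks it cares about. */
static const struct toy_pm_ops nic_ops = {
	.quiesce_disable_dma = nic_quiesce_disable_dma,
	.thaw = nic_thaw,
};
```

A caller would invoke, for example, toy_call(nic_ops.freeze, &dev) and get
a no-op, while toy_call(nic_ops.quiesce_disable_dma, &dev) does exactly
one thing.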

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 20:08                             ` Linus Torvalds
  2007-04-25 20:27                               ` Pavel Machek
@ 2007-04-26  0:41                               ` Thomas Orgis
  1 sibling, 0 replies; 712+ messages in thread
From: Thomas Orgis @ 2007-04-26  0:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kenneth Crudup, Nick Piggin, suspend2-devel, Mike Galbraith,
	linux-kernel, Con Kolivas, Andrew Morton, Pavel Machek,
	Thomas Gleixner, Ingo Molnar, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 4979 bytes --]

Sort of my 2-many-cents story on why I need "snapshot/restore"...

On Wed, 25 Apr 2007 13:08:09 -0700 (PDT),
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Wed, 25 Apr 2007, Kenneth Crudup wrote:
> > 
> > Any working suspend-to-disk method takes care of that for me.  (I'm
> > really not sure why Linus hates S2D so much, though. Back in the day
> > there was a lot more BIOS support, but that's been years now.)
> 
> The really sad part is that APM actually did this better.. 

This really triggers a nerve in me. My laptops (always used models from
some years ago, even) didn't necessarily get easier with respect to power
management (suspend) over time.

My first laptop (Siemens Scenic Mobile 710, 200MHz Pentium, maxed to
192MB RAM) worked just fine with APM, be it s2ram or s2disk.
Everything handled by the BIOS.
Admittedly, S2disk was quite slow as it stored all ram and didn't write
to the disk as fast as possible, but it worked.
S2ram was also a viable option because I was even able to easily swap
batteries because the thing had two bays to put batteries in.

The next one was a Toshiba Portege 7020 CT (366MHz Pentium2 with dynamic
clock, 192MB), supporting both APM and ACPI.
Installing Linux was not that easy, I think I remember that APM in kernel
froze the box (early 2.6 kernel), while ACPI needed some headache to set
up (compiling a fixed DSDT into the kernel, for example)... I needed
experimental toshiba_acpi to get functions and the acpi_pm_timer to
get something like continuous system clock (special cpu throttling has
funny effects).
Well, I got it together after some time.
Used suspend2 for "snapshot/restore" and actually was able to use ACPI
S3 with the glitch of having to unload/load psmouse driver ... until I
realized that it only resumed in about 80% of cases (BIOS ....).
So suspend2 was a badly needed "hack" around the hardware/BIOS to get
some sane workflow.
I remember dealing with swsusp / pmdisk before... but I really ended
up with suspend2 as the thing that works (and I wouldn't have bothered
finding this patch if the in-kernel stuff worked for me).
Of course this was a long time ago and recently I have seen that
in-kernel swsusp works ok, just this unresponsiveness after "restore"
due to missing page cache...

Now I have an IBM ThinkPad X31 (600MHz-1.4GHz Pentium M, 512MB).
ACPI. SpeedStep.
The machine generally works fine, hardware config via ACPI seems to
be fine.
But doing S3/STR? Well... this machine has the odd idea that turning the
system off but the screen backlight back on after a second is a good idea.
Of course just now S3 worked fine... you cannot even depend on the
malfunction -- could have something to do with changing bootup video
from LCD to VGA output for some other reason recently.
Hm. Perhaps it even may work (after tricking the BIOS!?). But I doubt
I'll suddenly develop trust in that.
I _had_ trust in APM STR and STD.
I am quite confident in suspend2 being able to correctly resume (restore)
after a successful suspend (snapshot/restore).

And then, STR doesn't help me on the road when I need to exchange the
battery (I'd need this special extra battery to put under the ThinkPad
for that).
Another thing is that the old Siemens has a nice auxiliary monochrome
LCD that shows the charge status of the batteries in 5 levels, so you
have some means to predict the time you have in STR. The Thinkpad has
a green LED for "battery level OK" and red for "battery level low".
Well, but the Linux kernel won't change that...

Perhaps at some time ACPI implementations in BIOS get to something
reliable (hm, should I get a PowerBook instead?) and can be a good partner
for Linux which struggles for many years now to get into the post-APM era.
Remember reading desktop PC test reports in the c't magazine in the last
years, S3 usually did _not_ work; with Windows, even.
Well, there must be a reason Microsoft chose to implement the "hibernate"
(it _is_ in software, right?).

The APM->ACPI transition made me use the software STD
(snapshot/restore...;-) and I think I will stay with it for the
foreseeable future, if only because I can do fancy things like image
encryption.
ACPI S3 / STR is a nice addition when it works, for the smaller pauses
(changing a train at the station, leaving office for half an hour...),
but I consider STD really to be the more important feature that enables
me to _never_ close my applications unless I want to do a kernel update.

I really must say that some sort of STD is a total must for a laptop for me.
On the other hand I once had a Psion 5MX, which basically was on STR all
the (non-working) time -- and enabled well over 20h of working time on two AAs.
When laptops enter that range of battery life, I guess I could make do with
just doing STR and won't have to worry about changing batteries without AC
connection;-)


Alrighty then,

Thomas.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:24                                           ` Alan Cox
@ 2007-04-26  1:10                                             ` Linus Torvalds
  2007-04-26 14:04                                               ` Mark Lord
  2007-04-26  7:08                                             ` Andy Grover
  1 sibling, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  1:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Alan Cox wrote:
> 
> You bet there is. We need to know if data arrived or not, because there
> is no guarantee that the data retrieved if we inadvertently re-execute a
> command will be the same. The hardware state itself isn't the problem,
> its the combination of hardware state and internal state which need to
> match in some cases.

... which is why "suspend()" suspends the hardware.

Is that so hard to understand?

Once the hardware is suspended, it's not doing anything.

But STR doesn't have any need for atomicity guarantees _between_devices_.

That's a really *fundamental* difference. 

The reason s2ram is *so* different from snapshot-to-disk is exactly the 
fact that s2ram can (and does) work on one device at a time. 

In contrast, snapshot-to-disk needs to snapshot all the devices 
*together*, since it has a separate disk image.

See? Two *totally* different cases. They have *nothing* in common. Not the 
call sequence, not the logic, not *anything*.

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 19:25                                   ` Adrian Bunk
                                                       ` (3 preceding siblings ...)
  2007-04-25 22:13                                     ` Kenneth Crudup
@ 2007-04-26  1:25                                     ` Antonino A. Daplas
  4 siblings, 0 replies; 712+ messages in thread
From: Antonino A. Daplas @ 2007-04-26  1:25 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Pavel Machek, Ingo Molnar, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

On Wed, 2007-04-25 at 21:25 +0200, Adrian Bunk wrote:
> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 25 Apr 2007, Adrian Bunk wrote:
> > > 
> > > 3W for the complete system? In CPU state S1? [1]
> > 
> > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) 
> > the devices are off, but the motherboard and memory is powered.
> 
> As far as I understand it, the CPU isn't off in S1.
> 
> > > And even 3W would still be a waste of energy.

It is, especially if you're living in a place where power infrastructure
is unreliable (such as where I live). Currently, because of the summer
heat, power demand exceeds power supply so we experience practically
daily rotating 4-hour power interruption. 

That 3W saved multiplied by the total number of computers is a lot.
In this perspective, S2D (or shutdown) is preferred over S2RAM.

Tony



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:55                                         ` Linus Torvalds
  2007-04-25 23:13                                           ` Pavel Machek
@ 2007-04-26  1:40                                           ` Nigel Cunningham
  2007-04-26  2:04                                             ` Linus Torvalds
  2007-04-26 10:39                                           ` Johannes Berg
  2 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  1:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2410 bytes --]

Hi.

On Wed, 2007-04-25 at 15:55 -0700, Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> > > 
> > > And name *one* thing that have in common.
> > 
> > Set/reset the scsi transaction id thingy? Hibernation didn't work with
> > SCSI for a long time precisely because that support was missing.
> 
> And by "hibernation", you mean what? You mean "snapshot + shutdown", 
> right?
> 
> Think about it for five seconds, and then ask yourself: at which point in 
> the "snapshot + shutdown" sequence would you actually tell a disk to shut 
> down?
> 
> If you said "snapshot", then you'd be *wrong*. 

No, I didn't. I agree with you that they should be separate and
distinct. I'm just pointing out that you're overstretching your argument
a little. There are some similarities in that in both cases we want to
get the driver into some quiet state and out of it again. The difference
comes from the fact that the quiet states don't have to be the same and
shouldn't be the same. I won't insult your intelligence by describing
the differences in more detail.

> That's my _point_. The snapshot() function should not (and MUST NOT) tell 
> disks to shut down, because unlike suspend(), we're still going to _use_ 
> those disks afterwards (why? To write out the snapshot image!).
> 
> In other words, the act of creating a snapshot has *nothing* to do with 
> suspend.

Absolutely. It's about getting the data we need to restore it to the
same state post-hibernation-cycle (or, more correctly,
post-atomic-restore).

> Now, after you've created (and written out) the snapshot, what do you 
> actually end up doing?
> 
> That's right - you end up _shutting down_ the machine, and yes, as part 
> of the _shutdown_ sequence you may actually end up doing a lot of the 
> things that a suspend would do. But that's long *after* you've actually 
> done the "snapshot" part, and has absolutely nothing to do with it.
> 
> That's where I started: the whole "suspend to disk" thing actually has _more_ 
> to do with "shutdown" than with "suspend". 

That's where I think you're overstretching the argument. Like suspend
(to ram), we're concerned at the snapshot point with getting the
hardware in the same state at a later stage. Unlike suspend, we don't
necessarily want it to enter a low power-usage state as part of that
state preservation.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:45                                               ` Pavel Machek
@ 2007-04-26  1:48                                                 ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  1:48 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

Hi.

On Thu, 2007-04-26 at 01:45 +0200, Pavel Machek wrote:
> > What's your argument? Your argument seems to be that it's not stupid, 
> > because it can work. Can't you see that that simply isn't an
> > argument at 
> 
> I tried keeping module_init/thaw/resume similar code, so that driver
> authors can debug suspend-to-disk, cross their fingers, and have
> suspend-to-ram work, too.



> Now, perhaps enough people do std/str these days so this is not
> important any longer... let's hope so.

Noooo! It's important and getting more important. More and more, people
are going to be demanding better power saving (climate change and all
that stuff). The best power saving is to have the thing completely off,
so STD is more important. The second best power saving is STR, so that's
important too. But even more important is good power saving all the
time.

For that reason, I agree completely with Linus. The current model is far
too limited. It shouldn't be so suspend-to-ram/disk centric, and should
instead focus on run-time power management, with suspend to ram and disk
as particular instances of run-time power management. It should make
appropriate differentiation between snapshotting and suspending to ram.

I do disagree that the current suspend-to-disk algorithm is broken. We
do need a point at which we say "Ok, drivers, record your state." - the
current device_suspend and device_resume calls. But that doesn't mean
they need to be called device_suspend/resume or do what they do now.

I'd love to help make this happen, but I'm afraid I just don't have the
time.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 23:33                                   ` Olivier Galibert
@ 2007-04-26  1:56                                     ` Nigel Cunningham
  2007-04-26  7:27                                       ` David Lang
  0 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  1:56 UTC (permalink / raw)
  To: Olivier Galibert
  Cc: Linus Torvalds, Adrian Bunk, Pavel Machek, Ingo Molnar,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 1246 bytes --]

Hi.

On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote:
> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
> > .. but if the alternative is a feature that just isn't worth it, and 
> > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, 
> > I believe STD is both of those. There's a reason it's called "STD". Go 
> > to google and type "STD" and press "I'm feeling lucky". Google is God).
> 
> If it was correctly designed, it would be possible to change the
> hardware or even the kernel through a STD cycle.  And that would be
> damn interesting on servers.

Those are different issues - hardware hot/cold plugging for the first.

Changing the kernel through a cycle - that's not a design fault. The
problem there is that the kernel and its associated data structures are
part of the state. Changing the kernel and keeping the image would
require exact correspondence in data structures, memory map and so on.
That's why the same kernel is required.

> In any case, if I could trust it, I'd use it when I need to move
> servers around and I don't want to lose what is running.  Riding power
> cuts that way would be nice.

That's what Rafael and I are working on.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:38                                                 ` Linus Torvalds
@ 2007-04-26  2:04                                                   ` H. Peter Anvin
  2007-04-26  2:32                                                     ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: H. Peter Anvin @ 2007-04-26  2:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Pavel Machek wrote:
>> Ok, I guess I'll have nightmares of DMA controllers doing DMAs from
>> chips that are no longer there tonight.
> 
> Umm. Welcome to the 21st century: we don't do that "separate DMA 
> controller" thing any more. All devices do their own DMA.
> 

That was the 1990s.  On a brand new server system:

00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
Engine (rev b1)

For better or worse, slave DMA seems to be making a comeback of sorts.
Not to mention all kinds of embedded crap^Whardware with optimized DMA
engines which look nothing like PCI at all.

	-hpa

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  1:40                                           ` Nigel Cunningham
@ 2007-04-26  2:04                                             ` Linus Torvalds
  2007-04-26  2:13                                               ` Nigel Cunningham
  2007-04-26  2:31                                               ` Nigel Cunningham
  0 siblings, 2 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  2:04 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> That's where I think you're overstretching the argument. Like suspend 
>(to ram), we're concerned at the snapshot point with getting the hardware 
>in the same state at a later stage.

Really, no.

"suspend to ram" doesn't _have_ a "snapshot point".

I've tried to explain this multiple times, I don't know why it's not 
apparently sinking in. This is much more fundamental than the fact that 
you don't want to stop disks for snapshotting, although it really boils 
down to all the same issues: the operations are simply not at all the 
same!

I agree 100% that "snapshot to disk" is a "snapshot event". You have to 
create a single point in time when everything is stable. And I'd much 
rather call it "snapshot to disk" than "suspend to disk" to make it clear 
that it's something _totally_ different from "suspend".

Because the thing is, "suspend to ram" is *not* a snapshot event. At no 
point do you actually need to "snapshot" the system at all. You can just 
gradually shut more and more things down, and equally gradually bring them 
back up. There simply is *never* any "snapshot" time from a device 
standpoint, because you can just shut down devices in the right order AND 
YOU ARE DONE.

Really. 

[ Obviously s2ram does have one "magic moment", namely the time when the 
  CPU does the magic read from the northbridge that actually turns off 
  power for the CPU. But that's really a total non-event from a device 
  standpoint, so while it's undoubtedly a very interesting moment in the 
  suspend sequence, it's not really relevant in any way for device 
  drivers in general. Not at all like the "snapshot moment" that requires 
  the whole system to be totally quiescent in a "snapshot to disk"! ]

And the reason s2ram doesn't have that "snapshot" moment is exactly that 
the RAM contents are just always there, so there's no need to have a 
"synchronization event" when ram and devices match. The RAM will *always* 
match whatever any particular device has done to it, and the proper way to 
handle things is to just do a simple per-device "save-and-suspend" event.

And yes, the _individual_ "save-and-suspend" events obviously need to be 
"atomic", but it's purely about that particular individual device, so 
there's never any cross-device issues about that.

For example, if you're a USB hub controller, which is just about the most 
complex issue you can have, you obviously want to "save the state" with 
the controller in a STOPPED state, but that should just go without saying: 
if the controller isn't stopped, you simply *cannot* save the state, since 
the state is changing under you. 

The difference is, that the USB driver needs to just "stop, save, and 
suspend" as one simple operation for s2ram. In contrast, when doing 
snapshot to disk, it cannot do that, because while it does want to do the 
"stop" part, it needs to do so _separately_ from the "save" part because 
you need to stop everything else *too* before you "save" anything at all.
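
The difference in ordering can be sketched in a few lines of plain C.
This is only a user-space mock under stated assumptions: struct
mock_dev, dev_stop(), dev_save(), dev_suspend(), s2ram_all() and
snapshot_all() are hypothetical names made up for illustration, not
existing kernel interfaces.

```c
#include <stddef.h>

/* Hypothetical per-device phases; none of these names are real kernel APIs. */
enum dev_phase { DEV_RUNNING, DEV_STOPPED, DEV_SAVED, DEV_SUSPENDED };

struct mock_dev {
	const char *name;
	enum dev_phase phase;
};

static void dev_stop(struct mock_dev *d)    { d->phase = DEV_STOPPED; }
static void dev_save(struct mock_dev *d)    { d->phase = DEV_SAVED; }
static void dev_suspend(struct mock_dev *d) { d->phase = DEV_SUSPENDED; }

/* s2ram: each device does stop + save + suspend as one step, device by device. */
static void s2ram_all(struct mock_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		dev_stop(&devs[i]);
		dev_save(&devs[i]);
		dev_suspend(&devs[i]);
	}
}

/*
 * snapshot: stop *everything* first, and only then save anything.
 * Returns 0 on success, -1 if some device is still running when the
 * save pass is about to start. Devices end up saved but not powered
 * down, since they are still needed afterwards.
 */
static int snapshot_all(struct mock_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dev_stop(&devs[i]);

	for (size_t i = 0; i < n; i++)
		if (devs[i].phase == DEV_RUNNING)
			return -1;

	for (size_t i = 0; i < n; i++)
		dev_save(&devs[i]);
	return 0;
}
```

The point of the two-pass version is that no device's state is recorded
while any other device could still be changing memory under it.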

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  2:04                                             ` Linus Torvalds
@ 2007-04-26  2:13                                               ` Nigel Cunningham
  2007-04-26  3:03                                                 ` Linus Torvalds
  2007-04-26  2:31                                               ` Nigel Cunningham
  1 sibling, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  2:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 3837 bytes --]

Hi.

On Wed, 2007-04-25 at 19:04 -0700, Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> >
> > That's where I think you're overstretching the argument. Like suspend 
> >(to ram), we're concerned at the snapshot point with getting the hardware 
> >in the same state at a later stage.
> 
> Really, no.
> 
> "suspend to ram" doesn't _have_ a "snapshot point".

Sorry. I wasn't clear. I wasn't saying that suspend to ram has a
snapshot point. I was trying to say it has a point where you're seeking
to save information (PCI state / SCSI transaction number or whatever)
that you'll need to get the hardware into the same state at a later
stage. That (saving information) is the point of similarity.

> I've tried to explain this multiple times, I don't know why it's not 
> apparently sinking in. This is much more fundamental than the fact that 
> you don't want to stop disks for snapshotting, although it really boils 
> down to all the same issues: the operations are simply not at all the 
> same!

Miscommunication, I think. Does the above help?

> I agree 100% that "snapshot to disk" is a "snapshot event". You have to 
> create a single point in time when everything is stable. And I'd much 
> rather call it "snapshot to disk" than "suspend to disk" to make it clear 
> that it's something _totally_ different from "suspend".
> 
> Because the thing is, "suspend to ram" is *not* a snapshot event. At no 
> point do you actually need to "snapshot" the system at all. You can just 
> gradually shut more and more things down, and equally gradually bring them 
> back up. There simply is *never* any "snapshot" time from a device 
> standpoint, because you can just shut down devices in the right order AND 
> YOU ARE DONE.
> 
> Really. 

I suppose that's another point of similarity - for snapshotting, the
same ordering is probably needed?

> [ Obviously s2ram does have one "magic moment", namely the time when the 
>   CPU does the magic read from the northbridge that actually turns off 
>   power for the CPU. But that's really a total non-event from a device 
>   standpoint, so while it's undoubtedly a very interesting moment in the 
>   suspend sequence, it's not really relevant in any way for device 
>   drivers in general. Not at all like the "snapshot moment" that requires 
>   the whole system to be totally quiescent in a "snapshot to disk"! ]
> 
> And the reason s2ram doesn't have that "snapshot" moment is exactly that 
> the RAM contents are just always there, so there's no need to have a 
> "synchronization event" when ram and devices match. The RAM will *always* 
> match whatever any particular device has done to it, and the proper way to 
> handle things is to just do a simple per-device "save-and-suspend" event.

Yeah.

> And yes, the _individual_ "save-and-suspend" events obviously need to be 
> "atomic", but it's purely about that particular individual device, so 
> there's never any cross-device issues about that.

No interdependencies? I'm not sure.

> For example, if you're a USB hub controller, which is just about the most 
> complex issue you can have, you obviously want to "save the state" with 
> the controller in a STOPPED state, but that should just go without saying: 
> if the controller isn't stopped, you simply *cannot* save the state, since 
> the state is changing under you. 
> 
> The difference is, that the USB driver needs to just "stop, save, and 
> suspend" as one simple operation for s2ram. In contrast, when doing 
> snapshot to disk, it cannot do that, because while it does want to do the 
> "stop" part, it needs to do so _separately_ from the "save" part because 
> you need to stop everything else *too* before you "save" anything at all.

Agree.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  2:04                                             ` Linus Torvalds
  2007-04-26  2:13                                               ` Nigel Cunningham
@ 2007-04-26  2:31                                               ` Nigel Cunningham
  1 sibling, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  2:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 539 bytes --]

Hi.

Hmm. Perhaps I should have added to that last reply that recognising
that they store similar information doesn't mean I think they need the
same high-level routine for both state transitions.

I'd really like to see each driver have some sort of state machine
controlling its power management, into which these calls were just
another input (an important one, but just another alongside information
about policy, whether we're on battery (UPS or laptop) or AC, whether
the device is actually being used and so on).
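
To make that a bit more concrete, here is a rough sketch of what such a
per-driver state machine could look like. Everything in it is an
assumption for illustration only: the state and event names, struct
pm_inputs and pm_next() are invented here and don't correspond to any
existing kernel interface.

```c
#include <stdbool.h>

enum pm_state { PM_FULL_ON, PM_LOW_POWER, PM_SNAPSHOT_QUIET, PM_SUSPENDED };

enum pm_event {
	EV_IDLE_TIMEOUT,	/* device has not been used for a while */
	EV_ACTIVITY,		/* device is being used again */
	EV_SNAPSHOT,		/* system wants a consistent image */
	EV_SUSPEND,		/* system is going to suspend-to-RAM */
	EV_RESUME		/* system is coming back */
};

/* Extra inputs besides the system-wide calls (hypothetical policy data). */
struct pm_inputs {
	bool on_battery;	/* laptop battery or UPS, as opposed to AC */
	bool in_use;		/* device is actually being used */
};

/* Pure transition function: current state + event + inputs -> next state. */
static enum pm_state pm_next(enum pm_state s, enum pm_event ev,
			     const struct pm_inputs *in)
{
	switch (ev) {
	case EV_SNAPSHOT:
		return PM_SNAPSHOT_QUIET;	/* quiet, but not powered down */
	case EV_SUSPEND:
		return PM_SUSPENDED;
	case EV_RESUME:
		return PM_FULL_ON;
	case EV_ACTIVITY:
		return (s == PM_LOW_POWER) ? PM_FULL_ON : s;
	case EV_IDLE_TIMEOUT:
		/* example policy: only drop to low power on battery and unused */
		if (s == PM_FULL_ON && in->on_battery && !in->in_use)
			return PM_LOW_POWER;
		return s;
	}
	return s;
}
```

Snapshot and suspend are just two of the events here, and they land in
different states, which is the separation being argued for.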

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  2:04                                                   ` H. Peter Anvin
@ 2007-04-26  2:32                                                     ` Linus Torvalds
  2007-04-26 13:14                                                       ` Alan Cox
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  2:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven



On Wed, 25 Apr 2007, H. Peter Anvin wrote:
> 
> That was the 1990s.  On a brand new server system:
> 
> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
> Engine (rev b1)
> 
> For better or worse, slave DMA seems to be making a comeback of sorts.
> Not to mention all kinds of embedded crap^Whardware with optimized DMA
> engines which look nothing like PCI at all.

Well, the solution to that tends to be to just leave them be, and hold 
them on until the very end - and just ignore them (and just make-believe 
that it's actually the device itself that does the DMA transfer).

The PCI spec for controlling DMA is really pretty nasty. You can disable 
it in the PCI config word, of course, but that usually just messes up the 
device entirely.

So in practice, the way to shut up DMA (regardless of whether it's an 
internal DMA engine or an external one) is that you just tell the device 
not to listen any more (for example, for a network controller, the way to 
make sure it doesn't do DMA is just to make sure that you're not sending 
any frames, but also that it's not listening to any either)!

So whether it's internal to the device, or some "system DMA controller", 
the sequence for shutting down DMA always ends up being the same:

 - make sure the host itself doesn't generate any new traffic (eg shut 
   down the send-queue). This is generally a higher-level thing anyway, ie 
   not really a driver decision.
 - the driver needs to tell the hardware to stop listening (ie "stop 
   scanning the command mailboxes" or "stop walking USB command 
   structures" or "stop receiving data")
 - the driver then needs to wait for the controller to say "ok, I'm idle".

because regardless of whether it's the system DMA controller or some 
on-chip DMA controller, you generally can *not* just say "stop 
transferring DMA data", because that will generally just lock the chip up 
or cause other major unhappiness.
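
As a minimal sketch of that three-step sequence (a user-space mock, not
driver code): struct mock_ctrl, mock_hw_tick() and mock_quiesce_dma()
are hypothetical names, and the "hardware" simply retires one in-flight
transfer per poll.

```c
#include <stdbool.h>

/* Mock controller; the fields are illustrative, not a real register layout. */
struct mock_ctrl {
	bool tx_queue_open;	/* host may still queue new work */
	bool rx_enabled;	/* hardware is still listening */
	int  inflight;		/* transfers the hardware has not finished yet */
};

/* Stand-in for the hardware making progress: retire one transfer per poll. */
static void mock_hw_tick(struct mock_ctrl *c)
{
	if (c->inflight > 0)
		c->inflight--;
}

/*
 * Quiesce in the order described above. Returns 0 once the controller is
 * idle, -1 if it is still busy after max_polls polls.
 */
static int mock_quiesce_dma(struct mock_ctrl *c, int max_polls)
{
	c->tx_queue_open = false;	/* 1. no new traffic from the host */
	c->rx_enabled = false;		/* 2. tell the hardware to stop listening */

	/* 3. wait for "ok, I'm idle"; never cut a transfer off half-way */
	for (int i = 0; i < max_polls; i++) {
		if (c->inflight == 0)
			return 0;
		mock_hw_tick(c);
	}
	return c->inflight == 0 ? 0 : -1;
}
```

Note that nothing in the sketch ever aborts a transfer; on timeout it
just reports failure and leaves the controller alone.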

So I don't think an external DMA controller (like the i8237, ugh!) really 
_changes_ anything. Except, of course, for the horrible pain of 
serializing access to them for programming, and similar horrible 
resource-handling issues (but that's not specific to suspend/resume).

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  2:13                                               ` Nigel Cunningham
@ 2007-04-26  3:03                                                 ` Linus Torvalds
  2007-04-26  3:34                                                   ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26  3:03 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> 
> Sorry. I wasn't clear. I wasn't saying that suspend to ram has a
> snapshot point. I was trying to say it has a point where you're seeking
> to save information (PCI state / SCSI transaction number or whatever)
> that you'll need to get the hardware into the same state at a later
> stage. That (saving information) is the point of similarity.

Yes, they do both save information, but I'm not actually convinced they 
would necessarily even save the *same* information.

Let's just take an example of USB, and to make things more interesting, 
say that the disk you want to suspend to is itself over USB (not 
necessarily something you _want_ to do, but I think we can all agree that 
it's something that should potentially work, no?)

Now, USB devices actually have per-connection state (at a minimum, the 
"toggle" bit or whatever), and that's obviously something that will 
inevitably *change* as a result of the device being used after 
snapshotting (and even if not used, by the rediscovery by the first kernel 
to boot), and we fundamentally cannot put the final toggle state in the 
snapshot.

So in the snapshot-to-disk scenario, there are some pieces of data that 
simply fundamentally *cannot* be snapshotted, because they are not 
controller state, they are "connection" state.

So in that case, you basically know that you *have* to rebuild the 
connection when you do the "snapshot_resume()" thing. So there's no point 
in even keeping these kinds of connection states (the same is true of 
keyboards, mice, anything else - it's how USB works).

In contrast, in suspend-to-RAM, USB connections might just be things you 
actually want to keep open and active, and you *can* do so, in ways you 
simply cannot do with "snapshot to disk". In fact, if you are something 
like an OLPC and actually go to s2ram very aggressively, you might well 
want to keep the connection established, because it's conceivable that you 
might otherwise lose keypresses and hit similar issues.

See? There are real *technical* reasons to believe that the two "save 
state" operations are really fundamentally different. There are reasons to 
believe that a s2ram can actually happen while keeping some connections 
open that cannot be kept open over a disk snapshot.

Do they *have* to be different? Of course not. For many devices the "save" 
and "freeze" operations will likely all be no-ops, and there would be 
absolutely no difference between suspending and snapshotting, because the 
driver state already natively contains all the information needed to get 
the device going again.

Equally, I don't doubt that in many drivers you'll have very similar "save 
state" logic, but in fact I believe that in many cases that "save state" 
logic will often just be a simple

	pci_save_state(dev);

call, so it's literally the case that they will not be just shared between 
the "suspend" and "snapshot" case, they'll be shared across all simple PCI 
devices too!

But that doesn't mean that the functions to do so should be the same. You 
might have

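	/* suspend: save config space, then put the device into a low-power state */
	static int mypcidevice_suspend(struct pci_dev *dev)
	{
		pci_save_state(dev);
		pci_set_power_state(dev, PCI_D3hot);
		return 0;
	}

	/* snapshot: save config space only; the device stays powered and usable */
	static int mypcidevice_snapshot(struct pci_dev *dev)
	{
		pci_save_state(dev);
		return 0;
	}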

and who cares if they both have that same call to a shared "save state" 
function? They're still totally different operations, and the fact that 
*some* devices may save the same things doesn't make them any more 
similar! See above why some devices might save totally *different* things 
for a "snapshot" vs a "suspend" event.

> I suppose that's another point of similarity - for snapshotting, the
> same ordering is probably needed?

I agree that you're likely to walk the device list in the same order. The 
whole "shut down leaf devices first", "start up root devices first" is 
pretty fundamental.

But that's true of reboot and device discovery too. Should that ordering 
mean that we should use the "discovery()" function and pass it a flag and 
say "you shouldn't discover, you should snapshot or suspend now"? No. 
Everybody agrees that device discovery is something different from device 
suspend. The fact that it's done in a topological order and thus they bear 
some kind of inverse relationship to each other doesn't make them "the 
same".

> > And yes, the _individual_ "save-and-suspend" events obviously need to be 
> > "atomic", but it's purely about that particular individual device, so 
> > there's never any cross-device issues about that.
> 
> No interdependencies? I'm not sure.

Well, we pretty much count on it, since we will *suspend* the devices at 
the same time. So if they had interdependencies that aren't described by 
the ordering we enforce, they are pretty much screwed anyway ;)

So yes, the device list needs to be topologically sorted (and you need to 
walk it in the right direction), but apart from that we'd *better* not 
have any interdependencies, or we simply cannot suspend at all.
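
A small mock of that walk, assuming the list is already topologically
sorted so that every device comes after its parent. struct mock_node,
suspend_tree() and resume_tree() are hypothetical names for this sketch
only, not the driver core's real data structures.

```c
#include <stddef.h>

/*
 * Mock device list, assumed already topologically sorted: every device
 * appears after its parent (parent index < own index, -1 for a root).
 */
struct mock_node {
	const char *name;
	int parent;
	int powered;
};

/* Suspend leaves first: walk the sorted list backwards. */
static int suspend_tree(struct mock_node *devs, size_t n, int *order)
{
	size_t k = 0;

	for (size_t i = n; i-- > 0; ) {
		/* a child that is still powered means the ordering is broken */
		for (size_t j = i + 1; j < n; j++)
			if (devs[j].parent == (int)i && devs[j].powered)
				return -1;
		devs[i].powered = 0;
		order[k++] = (int)i;
	}
	return 0;
}

/* Resume roots first: walk the sorted list forwards. */
static int resume_tree(struct mock_node *devs, size_t n, int *order)
{
	size_t k = 0;

	for (size_t i = 0; i < n; i++) {
		int p = devs[i].parent;

		if (p >= 0 && !devs[p].powered)
			return -1;	/* parent must be up before the child */
		devs[i].powered = 1;
		order[k++] = (int)i;
	}
	return 0;
}
```

The only dependency either walk knows about is the parent link; anything
not expressed through the ordering is invisible to it, which is the
"we'd better not have any" part.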

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  3:03                                                 ` Linus Torvalds
@ 2007-04-26  3:34                                                   ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  3:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 5779 bytes --]

Hi.

On Wed, 2007-04-25 at 20:03 -0700, Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> > 
> > Sorry. I wasn't clear. I wasn't saying that suspend to ram has a
> > snapshot point. I was trying to say it has a point where you're seeking
> > to save information (PCI state / SCSI transaction number or whatever)
> > that you'll need to get the hardware into the same state at a later
> > stage. That (saving information) is the point of similarity.
> 
> Yes, they do both save information, but I'm not actually convinced they 
> would necessarily even save the *same* information.
> 
> Let's just take an example of USB, and to make things more interesting, 
> say that the disk you want to suspend to is itself over USB (not 
> necessarily something you _want_ to do, but I think we can all agree that 
> it's something that should potentially work, no?)

Agreed - it would be nice.

> Now, USB devices actually have per-connection state (at a minimum, the 
> "toggle" bit or whatever), and that's obviously something that will 
> inevitably *change* as a result of the device being used after 
> snapshotting (and even if not used, by the rediscovery by the first kernel 
> to boot), and we fundamentally cannot put the final toggle state in the 
> snapshot.
> 
> So in the snapshot-to-disk scenario, there are some pieces of data that 
> simply fundamentally *cannot* be snapshotted, because they are not 
> controller state, they are "connection" state.
> 
> So in that case, you basically know that you *have* to rebuild the 
> connection when you do the "snapshot_resume()" thing. So there's no point 
> in even keeping these kinds of connection states (the same is true of 
> keyboards, mice, anything else - it's how USB works).

Sort of agree - you might want to record some serial number that might
let you recognise it as the same thing at resume time when everything is
re-hotplugged (assuming it's even there then). Nevertheless, I don't
think that diminishes what you're saying.

> In contrast, in suspend-to-RAM, USB connections might just be things you 
> actually want to keep open and active, and you *can* do so, in ways you 
> simply cannot do with "snapshot to disk". In fact, if you are something 
> like an OLPC and actually go to s2ram very aggressively, you might well 
> want to keep the connection established, because it's conceivable that you 
> might otherwise lose keypresses and hit similar issues.
> 
> See? There are real *technical* reasons to believe that the two "save 
> state" operations are really fundamentally different. There are reasons to 
> believe that a s2ram can actually happen while keeping some connections 
> open that cannot be kept open over a disk snapshot.
> 
> Do they *have* to be different? Of course not. For many devices the "save" 
> and "freeze" operations will likely all be no-ops, and there would be 
> absolutely no difference between suspending and snapshotting, because the 
> driver state already natively contains all the information needed to get 
> the device going again.
> 
> Equally, I don't doubt that in many drivers you'll have very similar "save 
> state" logic, but in fact I believe that in many cases that "save state" 
> logic will often just be a simple
> 
> 	pci_save_state(dev);
> 
> call, so it's literally the case that they will not be just shared between 
> the "suspend" and "snapshot" case, they'll be shared across all simple PCI 
> devices too!
> 
> But that doesn't mean that the functions to do so should be the same. You 
> might have
> 
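> 	/* suspend: save config space, then put the device into a low-power state */
> 	static int mypcidevice_suspend(struct pci_dev *dev)
> 	{
> 		pci_save_state(dev);
> 		pci_set_power_state(dev, PCI_D3hot);
> 		return 0;
> 	}
> 
> 	/* snapshot: save config space only; the device stays powered and usable */
> 	static int mypcidevice_snapshot(struct pci_dev *dev)
> 	{
> 		pci_save_state(dev);
> 		return 0;
> 	}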
> 
> and who cares if they both have that same call to a shared "save state" 
> function? They're still totally different operations, and the fact that 
> *some* devices may save the same things doesn't make them any more 
> similar! See above why some devices might save totally *different* things 
> for a "snapshot" vs a "suspend" event.

No disagreement here.

> > I suppose that's another point of similarity - for snapshotting, the
> > same ordering is probably needed?
> 
> I agree that you're likely to walk the device list in the same order. The 
> whole "shut down leaf devices first", "start up root devices first" is 
> pretty fundamental.
> 
> But that's true of reboot and device discovery too. Should that ordering 
> mean that we should use the "discovery()" function and pass it a flag and 
> say "you shouldn't discover, you should snapshot or suspend now"? No. 
> Everybody agrees that device discovery is something different from device 
> suspend. The fact that it's done in a topological order and thus they bear 
> some kind of inverse relationship to each other doesn't make them "the 
> same".
> 
> > > And yes, the _individual_ "save-and-suspend" events obviously needs to be 
> > > "atomic", but it's purely about that particular individual device, so 
> > > there's never any cross-device issues about that.
> > 
> > No interdependencies? I'm not sure.
> 
> Well, we pretty much count on it, since we will *suspend* the devices at 
> the same time. So if they had interdependencies that aren't described by 
> the ordering we enforce, they are pretty much screwed anyway ;)
> 
> So yes, the device list needs to be topologically sorted (and you need to 
> walk it in the right direction), but apart from that we'd *better* not 
> have any interdependencies, or we simply cannot suspend at all.

Thanks for your reply.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
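
The distinction drawn in the quoted text (separate "suspend" and
"snapshot" entry points that may share a "save state" helper, walked
over a topologically sorted device list) can be sketched roughly as
below. This is only an illustration under stated assumptions: struct
demo_dev, the demo_* functions and the seq counter are hypothetical and
are not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical device model; none of these names are kernel APIs. */
struct demo_dev {
	const char *name;
	int saved;	/* 1 once the device state has been saved */
	int powered;	/* 1 while the device is still powered */
	int seq;	/* order in which the device was last visited */
	int (*suspend)(struct demo_dev *dev);	/* save state, then power down */
	int (*snapshot)(struct demo_dev *dev);	/* save state only */
};

static int demo_visit_counter;

/* Shared helper, playing the role pci_save_state() plays in the example. */
static void demo_save_state(struct demo_dev *dev)
{
	dev->saved = 1;
	dev->seq = ++demo_visit_counter;
}

static int demo_suspend(struct demo_dev *dev)
{
	demo_save_state(dev);
	dev->powered = 0;
	return 0;
}

static int demo_snapshot(struct demo_dev *dev)
{
	demo_save_state(dev);	/* same helper, but the device stays up */
	return 0;
}

static void demo_dev_init(struct demo_dev *dev, const char *name)
{
	dev->name = name;
	dev->saved = 0;
	dev->powered = 1;
	dev->seq = 0;
	dev->suspend = demo_suspend;
	dev->snapshot = demo_snapshot;
}

/* devs[] is assumed sorted root first, so walk it backwards: leaves first. */
static int demo_suspend_all(struct demo_dev *devs, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		int err = devs[i].suspend(&devs[i]);
		if (err)
			return err;
	}
	return 0;
}

/* Same walk order, different per-device operation. */
static int demo_snapshot_all(struct demo_dev *devs, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		int err = devs[i].snapshot(&devs[i]);
		if (err)
			return err;
	}
	return 0;
}
```

With an array of { root, bridge, leaf }, both walks visit the leaf
first; only demo_suspend_all() leaves the devices powered down.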

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:24                                           ` Alan Cox
  2007-04-26  1:10                                             ` Linus Torvalds
@ 2007-04-26  7:08                                             ` Andy Grover
  1 sibling, 0 replies; 712+ messages in thread
From: Andy Grover @ 2007-04-26  7:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: suspend2-devel

Alan Cox wrote:

> Mind you some laptops think S2RAM is just a transition state on the way
> to disk, leave them in ACPI S2RAM and the firmware will magically turn it
> into a save to disk and back to ram if the battery runs low or you leave
> it idle too long.

The OS does this (or at least it's supposed to). STR with battery low,
it comes back on fully via a battery wake event, and then STD (aka
snapshot/poweroff). The ACPI state machine always goes through S0, FWIW.

OK, now back to the "should we have 2 function pointers or 4" debate...

-- Andy


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  1:56                                     ` Nigel Cunningham
@ 2007-04-26  7:27                                       ` David Lang
  2007-04-26  9:45                                         ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: David Lang @ 2007-04-26  7:27 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Olivier Galibert, Linus Torvalds, Adrian Bunk, Pavel Machek,
	Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

On Thu, 26 Apr 2007, Nigel Cunningham wrote:

> Hi.
>
> On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote:
>> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
>>> .. but if the alternative is a feature that just isn't worth it, and
>>> likely to not only have its own bugs, but cause bugs elsewhere? (And yes,
>>> I believe STD is both of those. There's a reason it's called "STD". Go
>>> to google and type "STD" and press "I'm feeling lucky". Google is God).
>>
>> If it was correctly designed, it would be possible to change the
>> hardware or even the kernel through a STD cycle.  And that would be
>> damn interesting on servers.
>
> Those are different issues - hardware hot/cold plugging for the first.
>
> Changing the kernel through a cycle - that's not a design fault. The
> problem there is that the kernel and its associated data structures are
> part of the state. Changing the kernel and keeping the image would
> require exact correspondence in data structures, memory map and so on.
> That's why the same kernel is required.

that depends on exactly what you save in your snapshot.

one approach is to try to save absolutely everything in ram (this is the current
approach)

if you do this then you do need to use the same kernel for the reasons that you
list.

however, you could also decide to only save the information about processes on
the system (i.e. what you absolutely have to) and let the kernel re-initialize
itself (along with its devices); then you could use a different kernel safely.
doing this should also save you a significant amount of storage when making
your snapshot

David Lang


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  7:27                                       ` David Lang
@ 2007-04-26  9:45                                         ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26  9:45 UTC (permalink / raw)
  To: David Lang
  Cc: Olivier Galibert, Linus Torvalds, Adrian Bunk, Pavel Machek,
	Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2028 bytes --]

Hi.

On Thu, 2007-04-26 at 00:27 -0700, David Lang wrote:
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> 
> > Hi.
> >
> > On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote:
> >> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote:
> >>> .. but if the alternative is a feature that just isn't worth it, and
> >>> likely to not only have its own bugs, but cause bugs elsewhere? (And yes,
> >>> I believe STD is both of those. There's a reason it's called "STD". Go
> >>> to google and type "STD" and press "I'm feeling lucky". Google is God).
> >>
> >> If it was correctly designed, it would be possible to change the
> >> hardware or even the kernel through a STD cycle.  And that would be
> >> damn interesting on servers.
> >
> > Those are different issues - hardware hot/cold plugging for the first.
> >
> > Changing the kernel through a cycle - that's not a design fault. The
> > problem there is that the kernel and its associated data structures are
> > part of the state. Changing the kernel and keeping the image would
> > require exact correspondence in data structures, memory map and so on.
> > That's why the same kernel is required.
> 
> that depends on exactly what you save in your snapshot.
> 
> one approach is to try to save absolutely everything in ram (this is the current
> approach)
> 
> if you do this then you do need to use the same kernel for the reasons that you
> list.
> 
> however, you could also decide to only save the information about processes on
> the system (i.e. what you absolutely have to) and let the kernel re-initialize
> itself (along with its devices); then you could use a different kernel safely.
> doing this should also save you a significant amount of storage when making
> your snapshot

Well, there is cryopid for individual processes. I suppose you could
potentially try doing a mass cryopiding. That would make things a lot
more complicated though. I'm not saying it's not doable.

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 21:24                     ` Pavel Machek
  2007-04-24 23:41                       ` Linus Torvalds
@ 2007-04-26 10:17                       ` Johannes Berg
  2007-04-26 10:30                         ` Pavel Machek
  2007-04-26 11:35                         ` Christoph Hellwig
  1 sibling, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 10:17 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 290 bytes --]

On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote:

> I believe uswsusp user/kernel separation is clean enough. Kernel
> provides "snapshot image" and "resume image". (Thanks go to Rafael for
> very clean interface).

The interface isn't even 64/32-bit compatible...

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:17                       ` Johannes Berg
@ 2007-04-26 10:30                         ` Pavel Machek
  2007-04-26 10:40                           ` Pavel Machek
  2007-04-26 11:04                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg
  2007-04-26 11:35                         ` Christoph Hellwig
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 10:30 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

On Thu 2007-04-26 12:17:12, Johannes Berg wrote:
> On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote:
> 
> > I believe uswsusp user/kernel separation is clean enough. Kernel
> > provides "snapshot image" and "resume image". (Thanks go to Rafael for
> > very clean interface).
> 
> The interface isn't even 64/32-bit compatible...

Which parts?

read/write on /dev/snapshot looks ok.

ioctl(SNAPSHOT_FREEZE, UNFREEZE, ATOMIC_RESTORE, FREE, FREE_SWAP_PAGE,
	SNAPSHOT_S2RAM,
	is okay, because it does not pass any data.

ioctl(ATOMIC_SNAPSHOT,
	returns 0/1 through pointer. Should be ok. (Maybe we should do 

		if (!error)
			error = put_user(in_suspend, (u32 __user *)arg);

	...instead, to make it very explicit?

ioctl(SET_IMAGE_SIZE,
	is okay, because it just uses arg directly.

ioctl(PMOPS,
	is okay, because it just uses arg directly... and it is in
	range 0-3 or something.

ioctl(AVAIL_SWAP,
	...hmm, is this the one you are complaining about? It returns
	loff_t through a pointer.  Maybe there's another interface
	that can return available swap, and we should use that, instead?

ioctl(GET_SWAP_PAGE,
	returns sector_t through a pointer. Not sure if that's a good
	idea, either.

ioctl(SET_SWAP_FILE,
	does old_decode_dev(arg). Is that ok?

ioctl(SET_SWAP_AREA,
	shares struct resume_swap_area between user and kernel. I
	guess that's bad..?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
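
On the old_decode_dev(arg) question: assuming the old 16-bit device
encoding keeps the major in the high byte and the minor in the low byte,
a userspace sketch of the round trip looks like this. The demo_* names
are hypothetical stand-ins, not the kernel helpers themselves.

```c
#include <stdint.h>

/* Pack (major, minor) into 16 bits: 8 bits each, higher bits are dropped. */
static inline uint16_t demo_old_encode_dev(unsigned int major, unsigned int minor)
{
	return (uint16_t)(((major & 0xffu) << 8) | (minor & 0xffu));
}

static inline unsigned int demo_old_major(uint16_t val)
{
	return (val >> 8) & 0xffu;
}

static inline unsigned int demo_old_minor(uint16_t val)
{
	return val & 0xffu;
}

/* Returns 1 if (major, minor) survives the encode/decode round trip. */
static inline int demo_old_dev_fits(unsigned int major, unsigned int minor)
{
	uint16_t val = demo_old_encode_dev(major, minor);

	return demo_old_major(val) == major && demo_old_minor(val) == minor;
}
```

Under that assumption a device such as 8:3 round-trips fine, while any
major or minor above 255 is silently truncated.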

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 22:55                                         ` Linus Torvalds
  2007-04-25 23:13                                           ` Pavel Machek
  2007-04-26  1:40                                           ` Nigel Cunningham
@ 2007-04-26 10:39                                           ` Johannes Berg
  2007-04-26 11:30                                             ` Pavel Machek
  2 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 10:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith,
	linux-kernel, Con Kolivas, Andrew Morton, Pavel Machek,
	Thomas Gleixner, Ingo Molnar, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 544 bytes --]

On Wed, 2007-04-25 at 15:55 -0700, Linus Torvalds wrote:

> That's where I started: whole "suspend to disk" thing actually has _more_ 
> to do with "shutdown" than with "suspend". 

From looking at pm_ops which I was recently working with a lot, it seems
that it was designed by somebody who was reading the ACPI documentation
and was otherwise pretty clueless, even at that level std tries to look
like suspend. IMHO that is one of the first things that should be ripped
out, no pm_ops for STD, it's a pain to work with.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:30                         ` Pavel Machek
@ 2007-04-26 10:40                           ` Pavel Machek
  2007-04-26 11:11                             ` Johannes Berg
  2007-04-26 13:45                             ` Johannes Berg
  2007-04-26 11:04                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 10:40 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > The interface isn't even 64/32-bit compatible...
> 
> Which parts?
> 
> ioctl(AVAIL_SWAP,
> 	...hmm, is this the one you are complaining about? It returns
> 	loff_t through a pointer.  Maybe there's another interface
> 	that can return available swap, and we should use that,
>	 instead?

loff_t is 64-bit on i386, so I do not see an immediate problem here, but
maybe we should just explicitly pass u64?

> ioctl(GET_SWAP_PAGE,
> 	returns sector_t through a pointer. Not sure if that's a good
> 	idea, either.

Ok, that's a very bad idea, because sector_t can be 32-bit or 64-bit,
depending on CONFIG_LBD. We need to use u64 here.

> ioctl(SET_SWAP_FILE,
> 	does old_decode_dev(arg). Is that ok?
> 
> ioctl(SET_SWAP_AREA,
> 	shares struct resume_swap_area between user and kernel. I
> 	guess that's bad..?

struct resume_swap_area {
        loff_t offset;
        u_int32_t dev;
} __attribute__((packed));

...I guess we should change loff_t -> u64 and problem is solved?

old_decode_dev takes u16. That sucks for majors/minors > 255, but
fortunately those are not common.

Does this seem to help?
									Pavel

diff --git a/kernel/power/power.h b/kernel/power/power.h
index eb461b8..dc13af5 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -114,7 +114,7 @@ extern int snapshot_image_loaded(struct 
  * SNAPSHOT_SET_SWAP_AREA ioctl
  */
 struct resume_swap_area {
-	loff_t offset;
+	u_int64_t offset;
 	u_int32_t dev;
 } __attribute__((packed));
 
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 558e18e..d0730c1 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -215,8 +215,7 @@ static int snapshot_ioctl(struct inode *
 {
 	int error = 0;
 	struct snapshot_data *data;
-	loff_t avail;
-	sector_t offset;
+	u64 avail, offset;
 
 	if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC)
 		return -ENOTTY;
@@ -286,7 +285,7 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_AVAIL_SWAP:
 		avail = count_swap_pages(data->swap, 1);
 		avail <<= PAGE_SHIFT;
-		error = put_user(avail, (loff_t __user *)arg);
+		error = put_user(avail, (u64 __user *)arg);
 		break;
 
 	case SNAPSHOT_GET_SWAP_PAGE:
@@ -304,7 +303,7 @@ static int snapshot_ioctl(struct inode *
 		offset = alloc_swapdev_block(data->swap, data->bitmap);
 		if (offset) {
 			offset <<= PAGE_SHIFT;
-			error = put_user(offset, (sector_t __user *)arg);
+			error = put_user(offset, (u64 __user *)arg);
 		} else {
 			error = -ENOSPC;
 		}

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
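
The point of moving struct resume_swap_area to fixed-width members can
be checked from userspace with a small layout sketch. It assumes a GCC
or Clang compatible compiler for __attribute__((packed)); the demo_*
struct and function names are hypothetical copies made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width, packed: the layout no longer depends on the ABI. */
struct demo_swap_area_packed {
	uint64_t offset;
	uint32_t dev;
} __attribute__((packed));

/* Same members without packing: tail padding is up to the ABI. */
struct demo_swap_area_natural {
	uint64_t offset;
	uint32_t dev;
};

/* Returns 1 if the packed layout is the expected 12 bytes with dev at 8. */
static int demo_packed_layout_ok(void)
{
	return sizeof(struct demo_swap_area_packed) == 12 &&
	       offsetof(struct demo_swap_area_packed, offset) == 0 &&
	       offsetof(struct demo_swap_area_packed, dev) == 8;
}

/* Number of padding bytes the unpacked variant adds on this ABI. */
static size_t demo_natural_padding(void)
{
	return sizeof(struct demo_swap_area_natural) -
	       sizeof(struct demo_swap_area_packed);
}
```

demo_natural_padding() is typically 4 on x86-64 and 0 on i386, which is
the kind of 32/64-bit difference the packed attribute plus fixed-width
types are meant to remove.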

^ permalink raw reply related	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:30                         ` Pavel Machek
  2007-04-26 10:40                           ` Pavel Machek
@ 2007-04-26 11:04                           ` Johannes Berg
  2007-04-26 11:09                             ` Pavel Machek
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 11:04 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

On Thu, 2007-04-26 at 12:30 +0200, Pavel Machek wrote:
> On Thu 2007-04-26 12:17:12, Johannes Berg wrote:
> > On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote:
> > 
> > > I believe uswsusp user/kernel separation is clean enough. Kernel
> > > provides "snapshot image" and "resume image". (Thanks go to Rafael for
> > > very clean interface).
> > 
> > The interface isn't even 64/32-bit compatible...
> 
> Which parts?

ioctl numbers last time I talked about it with Rafael. No effort was
made to fix it.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:04                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg
@ 2007-04-26 11:09                             ` Pavel Machek
  2007-04-26 15:53                               ` Linus Torvalds
  2007-04-26 18:21                               ` Olivier Galibert
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:09 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > > > I believe uswsusp user/kernel separation is clean enough. Kernel
> > > > provides "snapshot image" and "resume image". (Thanks go to Rafael for
> > > > very clean interface).
> > > 
> > > The interface isn't even 64/32-bit compatible...
> > 
> > Which parts?
> 
> ioctl numbers last time I talked about it with Rafael. No effort was
> made to fix it.

#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
#define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)
#define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, void *)
#define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, void *)
#define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int)
#define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int)

Are these a problem? Do we need to just use u32 as an argument to keep
ioctl numbers the same between 32-bit and 64-bit versions?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
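
Why the argument type matters for the numbers: _IOW() and _IOR() fold
sizeof() of their third argument into the command value, so a "void *"
or "unsigned long" argument yields one number when compiled for 32-bit
and another for 64-bit. A minimal sketch, using uint32_t and uint64_t as
stand-ins for the two pointer widths; the DEMO_* macros and the helper
function are hypothetical, only the <linux/ioctl.h> macros are real.

```c
#include <stdint.h>
#include <linux/ioctl.h>	/* _IOW, _IOC_SIZE, _IOC_NR, _IOC_TYPE */

#define DEMO_IOC_MAGIC	'3'

/* What "_IOW(magic, 3, void *)" becomes with 4-byte pointers. */
#define DEMO_SNAPSHOT_AS_32BIT	_IOW(DEMO_IOC_MAGIC, 3, uint32_t)
/* What the same definition becomes with 8-byte pointers. */
#define DEMO_SNAPSHOT_AS_64BIT	_IOW(DEMO_IOC_MAGIC, 3, uint64_t)
/* What this particular build actually gets. */
#define DEMO_SNAPSHOT_NATIVE	_IOW(DEMO_IOC_MAGIC, 3, void *)

/* Returns 1 if two commands differ only in the encoded argument size. */
static int demo_differs_only_in_size(unsigned int a, unsigned int b)
{
	return _IOC_TYPE(a) == _IOC_TYPE(b) &&
	       _IOC_NR(a) == _IOC_NR(b) &&
	       _IOC_DIR(a) == _IOC_DIR(b) &&
	       _IOC_SIZE(a) != _IOC_SIZE(b);
}
```

demo_differs_only_in_size(DEMO_SNAPSHOT_AS_32BIT, DEMO_SNAPSHOT_AS_64BIT)
is true: same magic, same nr, different command value, so a 32-bit
binary and a 64-bit kernel do not agree on the number unless the type
has a fixed width.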

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:40                           ` Pavel Machek
@ 2007-04-26 11:11                             ` Johannes Berg
  2007-04-26 11:16                               ` Pavel Machek
  2007-04-26 13:45                             ` Johannes Berg
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 11:11 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 335 bytes --]

On Thu, 2007-04-26 at 12:40 +0200, Pavel Machek wrote:

> Does this seem to help?

No idea, I haven't actually tried it yet, last time I tried uswsusp on
my 32/32 machine it didn't work due to endian problems that were
supposed to be resolved but I haven't had a chance to pick all the bits
together that you need.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  8:10                               ` Pavel Machek
  2007-04-25  8:22                                 ` Dumitru Ciobarcianu
@ 2007-04-26 11:12                                 ` Pekka Enberg
  2007-04-26 14:48                                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Pekka Enberg @ 2007-04-26 11:12 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds,
	Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven, rjw

On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote:
> > Please stop using FUD.
> > Graphical progress it's not in the kernel, even with suspend2.
>
> It was ascii-art, but still 'graphical', last time I checked.

Suspend2 talks to a userspace client via netlink. While I find the
name of the message ("redraw UI") rather appalling, there's nothing
wrong in principle with userspace starting the suspend process and the
kernel feeding back progress information ("I froze all processes
now") so it can display a graphical progress bar.

The real question here is what to do with compression and encryption.
However, if you settle for one compression algorithm (such as LZF in
the case of suspend2) and use the _existing_ in-kernel crypto API for
encryption, suddenly the benefits of userspace suspend are not clear.

As you and Rafael seem to be mostly interested in uswsusp, why don't
we replace the old in-kernel implementation with suspend2?

                                         Pekka

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:11                             ` Johannes Berg
@ 2007-04-26 11:16                               ` Pavel Machek
  2007-04-26 11:27                                 ` Johannes Berg
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:16 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > Does this seem to help?
> 
> No idea, I haven't actually tried it yet, last time I tried uswsusp on
> my 32/32 machine it didn't work due to endian problems that were
> supposed to be resolved but I haven't had a chance to pick all the bits
> together that you need.

This one should prevent ioctl numbers changing, too.

diff --git a/kernel/power/power.h b/kernel/power/power.h
index eb461b8..a18b85a 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -114,23 +114,23 @@ extern int snapshot_image_loaded(struct 
  * SNAPSHOT_SET_SWAP_AREA ioctl
  */
 struct resume_swap_area {
-	loff_t offset;
+	u_int64_t offset;
 	u_int32_t dev;
 } __attribute__((packed));
 
 #define SNAPSHOT_IOC_MAGIC	'3'
 #define SNAPSHOT_FREEZE			_IO(SNAPSHOT_IOC_MAGIC, 1)
 #define SNAPSHOT_UNFREEZE		_IO(SNAPSHOT_IOC_MAGIC, 2)
-#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
+#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
 #define SNAPSHOT_ATOMIC_RESTORE		_IO(SNAPSHOT_IOC_MAGIC, 4)
 #define SNAPSHOT_FREE			_IO(SNAPSHOT_IOC_MAGIC, 5)
-#define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)
-#define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, void *)
-#define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, void *)
+#define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
+#define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
+#define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
 #define SNAPSHOT_FREE_SWAP_PAGES	_IO(SNAPSHOT_IOC_MAGIC, 9)
-#define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int)
+#define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
 #define SNAPSHOT_S2RAM			_IO(SNAPSHOT_IOC_MAGIC, 11)
-#define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int)
+#define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
 #define SNAPSHOT_SET_SWAP_AREA		_IOW(SNAPSHOT_IOC_MAGIC, 13, \
 							struct resume_swap_area)
 #define SNAPSHOT_IOC_MAXNR	13
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 558e18e..d0730c1 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -215,8 +215,7 @@ static int snapshot_ioctl(struct inode *
 {
 	int error = 0;
 	struct snapshot_data *data;
-	loff_t avail;
-	sector_t offset;
+	u64 avail, offset;
 
 	if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC)
 		return -ENOTTY;
@@ -286,7 +285,7 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_AVAIL_SWAP:
 		avail = count_swap_pages(data->swap, 1);
 		avail <<= PAGE_SHIFT;
-		error = put_user(avail, (loff_t __user *)arg);
+		error = put_user(avail, (u64 __user *)arg);
 		break;
 
 	case SNAPSHOT_GET_SWAP_PAGE:
@@ -304,7 +303,7 @@ static int snapshot_ioctl(struct inode *
 		offset = alloc_swapdev_block(data->swap, data->bitmap);
 		if (offset) {
 			offset <<= PAGE_SHIFT;
-			error = put_user(offset, (sector_t __user *)arg);
+			error = put_user(offset, (u64 __user *)arg);
 		} else {
 			error = -ENOSPC;
 		}




-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:27                                 ` Johannes Berg
@ 2007-04-26 11:26                                   ` Pavel Machek
  2007-04-26 11:35                                     ` Johannes Berg
  2007-04-26 15:56                                     ` Linus Torvalds
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:26 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > This one should prevent ioctl numbers changing, too.
> 
> > -#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
> > +#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
> 
> Afaict that'll actually change ioctl numbers breaking existing 64-bit
> userspace.

Yes, probably will. The other option is to break existing 32-bit
userspace, which is a bit more common AFAICT.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:16                               ` Pavel Machek
@ 2007-04-26 11:27                                 ` Johannes Berg
  2007-04-26 11:26                                   ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 11:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 369 bytes --]

On Thu, 2007-04-26 at 13:16 +0200, Pavel Machek wrote:

> This one should prevent ioctl numbers changing, too.

> -#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
> +#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */

Afaict that'll actually change ioctl numbers breaking existing 64-bit
userspace.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:41                         ` Matt Mackall
@ 2007-04-26 11:27                           ` Pavel Machek
  2007-04-26 19:04                           ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:27 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse,
	Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas,
	suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> > We do not want to fragment the testing base, and suspend2 does not
> > really have any interesting features over uswsusp.
> 
> The testing base is already fragmented!
> 
> What the current situation means is that you simply never hear from
> the people who get fed up with suspend but who manage to get suspend2
> working.

Well, and what can I do?

I can wait for suspend2 to slowly disappear on its own. Merging
suspend2 would just make the testing base _more_ fragmented than it is today.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:39                                           ` Johannes Berg
@ 2007-04-26 11:30                                             ` Pavel Machek
  2007-04-26 11:41                                               ` Johannes Berg
  2007-04-26 16:31                                                 ` Johannes Berg
  0 siblings, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:30 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel,
	Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Arjan van de Ven

Hi!

> > That's where I started: whole "suspend to disk" thing actually has _more_ 
> > to do with "shutdown" than with "suspend". 
> 
> From looking at pm_ops which I was recently working with a lot, it seems
> that it was designed by somebody who was reading the ACPI documentation
> and was otherwise pretty clueless, even at that level std tries to look
> like suspend. IMHO that is one of the first things that should be ripped
> out, no pm_ops for STD, it's a pain to work with.

That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
low-level enter is pretty similar).

Patches would be welcome, as would be "suspend-to-ram maintainer".
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:35                                     ` Johannes Berg
@ 2007-04-26 11:33                                       ` Pavel Machek
  2007-04-26 16:14                                       ` Chris Friesen
  1 sibling, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 11:33 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > Yes, probably will. The other option is to break existing 32-bit
> > userspace, which is a bit more common AFAICT.
> 
> Judging from experience with the wext 32/64 bit fiasco it seems to be
> rather uncommon to use 32-bit userspace on 64-bit machines.

Well, it would break 32-bit userspace on 32-bit kernel, which is the
most common version, AFAICT.

> Rafael hinted that we could just add these numbers, keep the existing
> ones and then phase them out over time, but I haven't really given it
> much thought.

We could probably do that... but it is slightly ugly.



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:26                                   ` Pavel Machek
@ 2007-04-26 11:35                                     ` Johannes Berg
  2007-04-26 11:33                                       ` Pavel Machek
  2007-04-26 16:14                                       ` Chris Friesen
  2007-04-26 15:56                                     ` Linus Torvalds
  1 sibling, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 11:35 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

On Thu, 2007-04-26 at 13:26 +0200, Pavel Machek wrote:

> Yes, probably will. The other option is to break existing 32-bit
> userspace, which is a bit more common AFAICT.

Judging from experience with the wext 32/64 bit fiasco it seems to be
rather uncommon to use 32-bit userspace on 64-bit machines.

Rafael hinted that we could just add these numbers, keep the existing
ones and then phase them out over time, but I haven't really given it
much thought.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:17                       ` Johannes Berg
  2007-04-26 10:30                         ` Pavel Machek
@ 2007-04-26 11:35                         ` Christoph Hellwig
  2007-04-26 12:15                           ` Ingo Molnar
  1 sibling, 1 reply; 712+ messages in thread
From: Christoph Hellwig @ 2007-04-26 11:35 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

On Thu, Apr 26, 2007 at 12:17:12PM +0200, Johannes Berg wrote:
> On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote:
> 
> > I believe uswsusp user/kernel separation is clean enough. Kernel
> > provides "snapshot image" and "resume image". (Thanks go to Rafael for
> > very clean interface).
> 
> The interface isn't even 64/32-bit compatible...

It's not.  And it's one of the worst interfaces I've seen lately.
Did anyone actually review this crap before it went in?  I completely
agree with Linus that these kinds of boundaries that lead to horribly
complex ioctl interfaces are totally wrong.

Now suspend2 wasn't exactly nice either when I last reviewed it,
but we should probably give it another attempt if we can sort out
a proper incremental merge.


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:30                                             ` Pavel Machek
@ 2007-04-26 11:41                                               ` Johannes Berg
  2007-04-26 16:31                                                 ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 11:41 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel,
	Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 428 bytes --]

On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:

> That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> low-level enter is pretty similar).

But that doesn't excuse abusing the same interface, IMHO.

> Patches would be welcome

:) I'll see what I can do. Shouldn't be too hard to add an interface
just for ACPI here and get platform disk-mode into there from a
different angle.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:35                         ` Christoph Hellwig
@ 2007-04-26 12:15                           ` Ingo Molnar
  2007-04-26 12:41                             ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Ingo Molnar @ 2007-04-26 12:15 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Berg, Pavel Machek, Linus Torvalds,
	Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner,
	Con Kolivas, suspend2-devel, Andrew Morton, Arjan van de Ven


* Christoph Hellwig <hch@infradead.org> wrote:

> > The interface isn't even 64/32-bit compatible...
> 
> It's not.  And it's one of the worst interfaces I've seen lately. Did
> anyone actually review this crap before it went in?  I completely
> agree with Linus that these kinds of boundaries that lead to horribly
> complex ioctl interfaces are totally wrong.

it's a bit hard to see the point of it anyway: the resume binary (much 
of the focus of the ioctls) fundamentally lives as an 'initrd binary' - 
and most of the stuff that wants to execute in an initrd is 
fundamentally tied to the kernel anyway.

Perhaps we should allow "in-kernel userspace" that would be allowed to 
grow ad-hoc interfaces and linking that would only be compatible with 
the kernel they are embedded into: e.g. the klibc stuff in linux/usr/* 
could link to the kernel (via whatever method) and just be in essence 
another type of kernel code - but happening to execute in user-space, 
having access to the normal user-space facilities and being able to link 
to (GPL) user-space libraries. Perhaps this would bridge the "i want to 
tinker in user-space because it's technically easier/cleaner there" and 
"fine but that needs formalized ABIs for your connection to 
kernel-space" gap.

> Now suspend2 wasn't exactly nice either when I last reviewed it, but 
> we should probably give it another attempt if we can sort out a proper 
> incremental merge.

yeah, it still has quite a bit of work left, but it looked fundamentally 
split-uppable.

	Ingo

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 12:15                           ` Ingo Molnar
@ 2007-04-26 12:41                             ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 12:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Johannes Berg, Linus Torvalds, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Andrew Morton, Arjan van de Ven

Hi!

> > > The interface isn't even 64/32-bit compatible...
> > 
> > It's not.  And it's one of the worst interfaces I've seen lately. Did
> > anyone actually review this crap before it went in?  I completely
> > agree with Linus that these kinds of boundaries that lead to horribly
> > complex ioctl interfaces are totally wrong.
> 
> it's a bit hard to see the point of it anyway: the resume binary (much 
> of the focus of the ioctls) fundamentally lives as an 'initrd binary' - 
> and most of the stuff that wants to execute in an initrd is 
> fundamentally tied to the kernel anyway.

Typically... yes, it needs to be in initrd.

And yes, klibc would help here.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  2:32                                                     ` Linus Torvalds
@ 2007-04-26 13:14                                                       ` Alan Cox
  2007-04-26 16:02                                                         ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Cox @ 2007-04-26 13:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Pavel Machek, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

> The PCI spec for controlling DMA is really pretty nasty. You can disable 
> it in the PCI config word, of course, but that usually just messes up the 
> device entirely.

And some devices ignore it. Some of the older Cyrix stuff I have appears
not to care how the master bit is set.
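
For reference, a minimal sketch of what "the master bit" means here. It
works on a simulated configuration space, not on real hardware, and all
demo_* names are hypothetical. The register offset and bit value are the
standard ones from the PCI configuration header. Clearing the bit like
this is only a request; whether a given device honors it is the open
question in this subthread.

```c
#include <assert.h>
#include <stdint.h>

/* Offset and bit from the standard PCI configuration header. */
#define PCI_COMMAND		0x04	/* 16-bit command register */
#define PCI_COMMAND_MASTER	0x4	/* bus master enable bit */

/* Simulated 256-byte configuration space of one hypothetical device. */
struct demo_cfg {
	uint8_t bytes[256];
};

/* Read a 16-bit register; configuration space is little-endian. */
uint16_t demo_cfg_read16(const struct demo_cfg *cfg, unsigned int off)
{
	assert(off + 1 < sizeof(cfg->bytes));
	return (uint16_t)(cfg->bytes[off] | (cfg->bytes[off + 1] << 8));
}

/* Write a 16-bit register, low byte first. */
void demo_cfg_write16(struct demo_cfg *cfg, unsigned int off, uint16_t val)
{
	assert(off + 1 < sizeof(cfg->bytes));
	cfg->bytes[off] = (uint8_t)(val & 0xff);
	cfg->bytes[off + 1] = (uint8_t)(val >> 8);
}

/* Clear the bus master bit and leave the other command bits alone. */
void demo_disable_bus_master(struct demo_cfg *cfg)
{
	uint16_t cmd = demo_cfg_read16(cfg, PCI_COMMAND);

	cmd &= (uint16_t)~PCI_COMMAND_MASTER;
	demo_cfg_write16(cfg, PCI_COMMAND, cmd);
}

/* Report whether the bus master bit is currently set. */
int demo_bus_master_enabled(const struct demo_cfg *cfg)
{
	return (demo_cfg_read16(cfg, PCI_COMMAND) & PCI_COMMAND_MASTER) != 0;
}
```

The read-modify-write keeps the I/O and memory decode bits intact, so
only bus mastering is affected in the simulated register.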


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 10:40                           ` Pavel Machek
  2007-04-26 11:11                             ` Johannes Berg
@ 2007-04-26 13:45                             ` Johannes Berg
  2007-06-29 22:44                               ` [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 13:45 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 324 bytes --]

By the way.

> diff --git a/kernel/power/power.h b/kernel/power/power.h
> index eb461b8..dc13af5 100644
> --- a/kernel/power/power.h
> +++ b/kernel/power/power.h
        ^^^^^^^^^^^^^^^^^^^^

Don't these definitions need to be exported to userspace? That
definitely is not a header file for userspace.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  1:10                                             ` Linus Torvalds
@ 2007-04-26 14:04                                               ` Mark Lord
  2007-04-26 16:10                                                 ` Linus Torvalds
  0 siblings, 1 reply; 712+ messages in thread
From: Mark Lord @ 2007-04-26 14:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Pavel Machek, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

Linus Torvalds wrote:
> 
> See? Two *totally* different cases. They have *nothing* in common. Not the 
> call sequence, not the logic, not *anything*.

Except that both methods cannot rely upon hot-pluggable devices
still being present on resume/restore.  It is exceptionally common
to unplug all USB/firewire cables, mouse, keyboard, docking cables etc.
after a machine is in S2R state.

Cheers

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:12                                 ` Pekka Enberg
@ 2007-04-26 14:48                                   ` Rafael J. Wysocki
  2007-04-26 16:10                                     ` Pekka Enberg
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 14:48 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds,
	Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

On Thursday, 26 April 2007 13:12, Pekka Enberg wrote:
> On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote:
> > > Please stop using FUD.
> > > Graphical progress it's not in the kernel, even with suspend2.
> >
> > It was ascii-art, but still 'graphical', last time I checked.
> 
> Suspend2 talks to a userspace client via netlink. While I find the
> name of the message ("redraw UI") rather appalling, there's nothing
> wrong in principle that userspace starts the suspend process and the
> kernel keeps feeding back progress information ("I froze all processes
> now") so it can display a graphical progress bar.
> 
> The real question here is what to do with compression and encryption.
> However, if you settle for one compression algorithm (such as LZF in
> the case of suspend2) and use the _existing_ in-kernel crypto API for
> encryption, suddenly the benefits of userspace suspend are not clear.
> 
> As you and Rafael seem to be mostly interested in uswsusp, why don't
> we replace the old in-kernel implementation with suspend2?

It has a lot of common code with uswsusp.  Practically, the saving of the image
is the only part of it that could be removed, but this is simple and _really_
helps with debugging.

In principle, we could add suspend2 as an alternative (in analogy with the I/O
schedulers, for example), but I think for this purpose it should be reviewed
properly.

There also is a real problem with how it uses the LRU pages.  It _seems_ to
work, but at least to me it seems to be potentially dangerous.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:09                             ` Pavel Machek
@ 2007-04-26 15:53                               ` Linus Torvalds
  2007-04-26 18:21                               ` Olivier Galibert
  1 sibling, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26 15:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> #define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
> #define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)
> #define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, void *)
> #define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, void *)
> #define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int)
> #define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int)
> 
> Are these a problem? Do we need to just use u32 as an argument to keep
> ioctl numbers same between 32 and 64bit versions?

No, you need to use the *proper* type as an argument, and assuming that 
type has the same representation in both 32-bit and 64-bit world, the 
numbers will automatically match.

Using "void *" is totally bogus. It's supposed to be the actual argument 
you pass in, not the pointer to it. If your argument doesn't have a 
"struct xyz" kind of format, then you could use "int" (or u32 or 
something: but realistically int is 32-bit for the foreseeable future), but
it's always wrong to pass in "void *" or "unsigned long", since either of 
those are just a sign of the interface being either (a) misunderstood or 
(b) broken.
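
A minimal sketch of that point, assuming a Linux userspace toolchain
where <sys/ioctl.h> provides _IOW() and _IOC_SIZE(). The DEMO_* names,
the magic character and the struct are hypothetical and are not the real
SNAPSHOT_* definitions. The size of the type named in _IOW() is encoded
into the command number, so a type whose size differs between the 32-bit
and 64-bit ABIs gives two different numbers for the same command, while
a fixed-width type does not:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>	/* _IOW() and _IOC_SIZE() on Linux */

/* Hypothetical magic character, for illustration only. */
#define DEMO_IOC_MAGIC	'x'

/*
 * Fragile: sizeof(unsigned long) is 4 on 32-bit and 8 on 64-bit Linux
 * ABIs, so this command number depends on who compiles it.
 */
#define DEMO_SET_SIZE_FRAGILE	_IOW(DEMO_IOC_MAGIC, 6, unsigned long)

/* Stable: a fixed-width type has the same size in both worlds. */
#define DEMO_SET_SIZE_STABLE	_IOW(DEMO_IOC_MAGIC, 6, uint32_t)

/* Hypothetical argument struct built only from fixed-width members. */
struct demo_swap_area {
	uint64_t offset;
	uint32_t dev;
	uint32_t pad;	/* explicit padding keeps the layout identical */
};
#define DEMO_SET_SWAP_AREA	_IOW(DEMO_IOC_MAGIC, 13, struct demo_swap_area)

/* Return the argument size that is encoded in an ioctl command number. */
unsigned int demo_encoded_size(unsigned long cmd)
{
	return (unsigned int)_IOC_SIZE(cmd);
}

/* Check that the stable definitions encode the sizes we expect. */
void demo_check_stable(void)
{
	assert(demo_encoded_size(DEMO_SET_SIZE_STABLE) == 4);
	assert(demo_encoded_size(DEMO_SET_SWAP_AREA) == 16);
}
```

Calling demo_encoded_size(DEMO_SET_SIZE_FRAGILE) from a 32-bit and from
a 64-bit build of the same file is a quick way to see the mismatch.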

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:26                                   ` Pavel Machek
  2007-04-26 11:35                                     ` Johannes Berg
@ 2007-04-26 15:56                                     ` Linus Torvalds
  2007-04-26 21:06                                       ` Theodore Tso
  1 sibling, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26 15:56 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki



On Thu, 26 Apr 2007, Pavel Machek wrote:
> 
> Yes, probably will. The other option is to break existing 32-bit
> userspace, which is a bit more common AFAICT.

And *this* is why kernel/userspace things simply should not be done.

It's simply better to do things entirely in the kernel. Because you add 
bugs and complications otherwise, and doing it in the kernel allows you 
to just switch things around.

As it is, it appears that user-space suspend is just broken whichever way 
we turn.

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 13:14                                                       ` Alan Cox
@ 2007-04-26 16:02                                                         ` Linus Torvalds
  0 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26 16:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: H. Peter Anvin, Pavel Machek, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Alan Cox wrote:
>
> > The PCI spec for controlling DMA is really pretty nasty. You can disable 
> > it in the PCI config word, of course, but that usually just messes up the 
> > device entirely.
> 
> And some devices ignore it. Some of the older Cyrix stuff I have appears
> not to care how the master bit is set.

I'm not surprised. If the choice is between locking up the PCI bus by 
hanging the device in endless retries, or just ignoring the bit, I suspect 
"just ignore it" is actually the better choice.

Of course, in a perfect world you'd happily honor it, raise a PCI error, 
and all is good, but in practice the internal state machine of most 
non-trivial hardware is simply so complicated that the "abort gracefully" 
simply isn't an option.

The hw people have enough problems in getting things to work when 
everything is peachy and well, and a lot of companies end up releasing 
stuff with known errata for even the _normal_ cases, just because they 
expect software to work around them ("Doctor, doctor, it hurts when I do 
the documented access!" "You didn't read errata #317, did you? Don't do 
that, then!")

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 14:04                                               ` Mark Lord
@ 2007-04-26 16:10                                                 ` Linus Torvalds
  2007-04-26 21:00                                                   ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26 16:10 UTC (permalink / raw)
  To: Mark Lord
  Cc: Alan Cox, Pavel Machek, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven



On Thu, 26 Apr 2007, Mark Lord wrote:
> Linus Torvalds wrote:
> > 
> > See? Two *totally* different cases. They have *nothing* in common. Not the
> > call sequence, not the logic, not *anything*.
> 
> Except that both methods cannot rely upon hot-pluggable devices
> still being present on resume/restore.  It is exceptionally common
> to unplug all USB/firewire cables, mouse, keyboard, docking cables etc.
> after a machine is in S2R state.

Right, and that has nothing to do with suspend/resume. You'd better be 
able to handle unexpected hotplugs _regardless_.

For example, it's quite common that people just "remove" the 
pcmcia/cardbus card while the driver is active. And in fact, when that 
happens, it's also quite common that the hardware raises the irq for that 
(active) driver (in fact, it's more than common: since the "card removal" 
interrupt for the Cardbus controller is generally always the same as the 
"card interrupt" interrupt for the low-level card driver, you can pretty 
much *guarantee* that you get that interrupt).

So the end result is that the interrupt handler and all normal IO routines 
for a hotpluggable piece of hardware basically _have_ to be able to
gracefully handle the "oops, the hw simply isn't there any more" case!

The resume code isn't any different at all. It should run perfectly 
normally, but for hotpluggable devices, it has to follow all the same 
rules: handle the "oops, the hw is gone" case gracefully.  No different, 
and it's totally unrelated to suspend/resume: it's a *generic* issue.
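
A rough sketch of that rule, as a userspace simulation rather than real
driver code. Every demo_* name and the status bit are hypothetical. It
relies on one common convention: a register read from a PCI-style device
that is no longer there typically returns all ones. The interrupt path
and the resume path apply the same check:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_STATUS_RX	0x1u		/* hypothetical "packet received" bit */
#define DEMO_REG_GONE	0xffffffffu	/* all ones: device is not answering */

enum demo_irq_ret { DEMO_IRQ_NONE, DEMO_IRQ_HANDLED };

struct demo_dev {
	uint32_t (*read_status)(void *ctx);	/* register access backend */
	void *ctx;
	bool present;
	unsigned int rx_handled;
};

/* Interrupt path: bail out quietly if the hardware has vanished. */
enum demo_irq_ret demo_irq(struct demo_dev *dev)
{
	uint32_t status;

	assert(dev && dev->read_status);
	status = dev->read_status(dev->ctx);
	if (status == DEMO_REG_GONE) {
		dev->present = false;
		return DEMO_IRQ_NONE;
	}
	if (!(status & DEMO_STATUS_RX))
		return DEMO_IRQ_NONE;
	dev->rx_handled++;
	return DEMO_IRQ_HANDLED;
}

/* Resume path: same check, reported as -ENODEV instead of a crash. */
int demo_resume(struct demo_dev *dev)
{
	assert(dev && dev->read_status);
	if (dev->read_status(dev->ctx) == DEMO_REG_GONE) {
		dev->present = false;
		return -ENODEV;
	}
	dev->present = true;
	return 0;
}

/* Fake register backends: one device that answers, one that was removed. */
uint32_t demo_read_present(void *ctx) { (void)ctx; return DEMO_STATUS_RX; }
uint32_t demo_read_gone(void *ctx) { (void)ctx; return DEMO_REG_GONE; }
```

Swapping read_status from demo_read_present to demo_read_gone between
calls stands in for pulling the card while the driver is active.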

In fact, suspend/resume is better off than a lot of the other code is, 
simply because it's easier to test that case and know you hit that 
particular sequence! It's much harder to verify that the "send packet" 
case is safe, because how are you going to know to remove the card at the 
right point to trigger it?

			Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 14:48                                   ` Rafael J. Wysocki
@ 2007-04-26 16:10                                     ` Pekka Enberg
  2007-04-26 19:28                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Pekka Enberg @ 2007-04-26 16:10 UTC (permalink / raw)
  To: rjw
  Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds,
	Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven


On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> In principle, we could add suspend2 as an alternative (in analogy with the I/O
> schedulers, for example), but I think for this purpose it should be reviewed
> properly.

Yeah, this makes sense.

On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> There also is a real problem with how it uses the LRU pages.  It _seems_ to
> work, but at least to me it seems to be potentially dangerous.

I am new to suspend2 so can you please explain what exactly is dangerous
about it?

                                     Pekka

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:35                                     ` Johannes Berg
  2007-04-26 11:33                                       ` Pavel Machek
@ 2007-04-26 16:14                                       ` Chris Friesen
  2007-04-26 16:27                                         ` Linus Torvalds
  2007-04-26 17:11                                         ` Johannes Berg
  1 sibling, 2 replies; 712+ messages in thread
From: Chris Friesen @ 2007-04-26 16:14 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Johannes Berg wrote:

> Judging from experience with the wext 32/64 bit fiasco it seems to be
> rather uncommon to use 32-bit userspace on 64-bit machines.

I disagree...it's quite common.  I think it's the standard way of doing
things for ppc64, for instance.

Chris

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 16:14                                       ` Chris Friesen
@ 2007-04-26 16:27                                         ` Linus Torvalds
  2007-04-26 17:11                                         ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-26 16:27 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Johannes Berg, Pavel Machek, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki



On Thu, 26 Apr 2007, Chris Friesen wrote:
> 
> I disagree...it's quite common.  I think it's the standard way of doing things
> for ppc64, for instance.

It is, although most x86-64 installations seem to be 64-bit user space 
*if*you*install*from*scratch*.

Of course, at least some users (yeah, I've done it) started with a 32-bit 
CD they had lying around, and upgraded just the kernel. And I'm sure some 
distro out there just defaults to 32-bit binaries just because (in 
practice, you have to use a 32-bit firefox anyway if you want flash etc, 
so you need all the 32-bit libraries, so the argument might go that you 
might as well use 32-bit stuff for all the common stuff, and only 64-bit 
binaries when actually needed).

		Linus

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:30                                             ` Pavel Machek
@ 2007-04-26 16:31                                                 ` Johannes Berg
  2007-04-26 16:31                                                 ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 16:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel,
	Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm

On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:

> > From looking at pm_ops which I was recently working with a lot, it seems
> > that it was designed by somebody who was reading the ACPI documentation
> > and was otherwise pretty clueless. Even at that level std tries to look
> > like suspend. IMHO that is one of the first things that should be ripped
> > out, no pm_ops for STD, it's a pain to work with.
> 
> That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> low-level enter is pretty similar).
> 
> Patches would be welcome

That was easier than I thought. This applies on top of a patch that
makes kernel/power/user.c optional, since I had no idea how to fix it.
Problems I see with user.c:
 * it surfaces kernel implementation details about pm_ops and thus makes
   the whole thing very fragile
 * it has yet another interface (yuck) to determine whether to reboot,
   shut down etc, doesn't use /sys/power/disk
 * I generally had no idea wtf it is doing in some places

Anyway, this patch is only compile-tested. It
 * introduces include/linux/hibernate.h with hibernate_ops and
   a new hibernate() function to hibernate the system
 * rips apart a lot of the suspend code and puts it back together using
   the hibernate_ops
 * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
 * might apply/compile against -mm; I have all of my and some of Rafael's
   suspend/hibernate work in my tree.
 * breaks user suspend as I noted above
 * is incomplete, somewhere pm_suspend_disk() is still defined iirc

johannes
---
 Documentation/power/userland-swsusp.txt |   26 +++----
 drivers/acpi/sleep/main.c               |   89 ++++++++++++++++++++----
 drivers/acpi/sleep/proc.c               |    3 
 drivers/i2c/chips/tps65010.c            |    2 
 include/linux/hibernate.h               |   36 +++++++++
 include/linux/pm.h                      |   31 --------
 kernel/power/disk.c                     |  117 +++++++++++++++++++-------------
 kernel/power/main.c                     |   47 +++++-------
 kernel/power/power.h                    |   13 ---
 kernel/power/user.c                     |   28 +------
 kernel/sys.c                            |    3 
 11 files changed, 231 insertions(+), 164 deletions(-)

--- wireless-dev.orig/include/linux/pm.h	2007-04-26 18:15:00.440691185 +0200
+++ wireless-dev/include/linux/pm.h	2007-04-26 18:15:09.410691185 +0200
@@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_ON		((__force suspend_state_t) 0)
 #define PM_SUSPEND_STANDBY	((__force suspend_state_t) 1)
 #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
-#define PM_SUSPEND_DISK		((__force suspend_state_t) 4)
-#define PM_SUSPEND_MAX		((__force suspend_state_t) 5)
-
-typedef int __bitwise suspend_disk_method_t;
-
-/* invalid must be 0 so struct pm_ops initialisers can leave it out */
-#define PM_DISK_INVALID		((__force suspend_disk_method_t) 0)
-#define	PM_DISK_PLATFORM	((__force suspend_disk_method_t) 1)
-#define	PM_DISK_SHUTDOWN	((__force suspend_disk_method_t) 2)
-#define	PM_DISK_REBOOT		((__force suspend_disk_method_t) 3)
-#define	PM_DISK_TEST		((__force suspend_disk_method_t) 4)
-#define	PM_DISK_TESTPROC	((__force suspend_disk_method_t) 5)
-#define	PM_DISK_MAX		((__force suspend_disk_method_t) 6)
+#define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
 
 /**
  * struct pm_ops - Callbacks for managing platform dependent suspend states.
  * @valid: Callback to determine whether the given state can be entered.
- * 	If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
- *	always valid and never passed to this call. If not assigned,
- *	no suspend states are valid.
  *	Valid states are advertised in /sys/power/state but can still
  *	be rejected by prepare or enter if the conditions aren't right.
  *	There is a %pm_valid_only_mem function available that can be assigned
@@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
  *
  * @finish: Called when the system has left the given state and all devices
  *	are resumed. The return value is ignored.
- *
- * @pm_disk_mode: The generic code always allows one of the shutdown methods
- *	%PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
- *	%PM_DISK_TESTPROC. If this variable is set, the mode it is set
- *	to is allowed in addition to those modes and is also made default.
- *	When this mode is sent selected, the @prepare call will be called
- *	before suspending to disk (if present), the @enter call should be
- *	present and will be called after all state has been saved and the
- *	machine is ready to be powered off; the @finish callback is called
- *	after state has been restored. All these calls are called with
- *	%PM_SUSPEND_DISK as the state.
  */
 struct pm_ops {
 	int (*valid)(suspend_state_t state);
 	int (*prepare)(suspend_state_t state);
 	int (*enter)(suspend_state_t state);
 	int (*finish)(suspend_state_t state);
-	suspend_disk_method_t pm_disk_mode;
 };
 
 /**
@@ -276,8 +249,6 @@ extern void device_power_up(void);
 extern void device_resume(void);
 
 #ifdef CONFIG_PM
-extern suspend_disk_method_t pm_disk_mode;
-
 extern int device_suspend(pm_message_t state);
 extern int device_prepare_suspend(pm_message_t state);
 
--- wireless-dev.orig/kernel/power/main.c	2007-04-26 18:15:00.790691185 +0200
+++ wireless-dev/kernel/power/main.c	2007-04-26 18:15:09.410691185 +0200
@@ -21,6 +21,7 @@
 #include <linux/resume-trace.h>
 #include <linux/freezer.h>
 #include <linux/vmstat.h>
+#include <linux/hibernate.h>
 
 #include "power.h"
 
@@ -30,7 +31,6 @@
 DEFINE_MUTEX(pm_mutex);
 
 struct pm_ops *pm_ops;
-suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
 
 /**
  *	pm_set_ops - Set the global power method table. 
@@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops)
 {
 	mutex_lock(&pm_mutex);
 	pm_ops = ops;
-	if (ops && ops->pm_disk_mode != PM_DISK_INVALID) {
-		pm_disk_mode = ops->pm_disk_mode;
-	} else
-		pm_disk_mode = PM_DISK_SHUTDOWN;
 	mutex_unlock(&pm_mutex);
 }
 
@@ -184,24 +180,12 @@ static void suspend_finish(suspend_state
 static const char * const pm_states[PM_SUSPEND_MAX] = {
 	[PM_SUSPEND_STANDBY]	= "standby",
 	[PM_SUSPEND_MEM]	= "mem",
-	[PM_SUSPEND_DISK]	= "disk",
 };
 
 static inline int valid_state(suspend_state_t state)
 {
-	/* Suspend-to-disk does not really need low-level support.
-	 * It can work with shutdown/reboot if needed. If it isn't
-	 * configured, then it cannot be supported.
-	 */
-	if (state == PM_SUSPEND_DISK)
-#ifdef CONFIG_SOFTWARE_SUSPEND
-		return 1;
-#else
-		return 0;
-#endif
-
-	/* all other states need lowlevel support and need to be
-	 * valid to the lowlevel implementation, no valid callback
+	/* All states need lowlevel support and need to be valid
+	 * to the lowlevel implementation, no valid callback
 	 * implies that none are valid. */
 	if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state))
 		return 0;
@@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s
 	if (!mutex_trylock(&pm_mutex))
 		return -EBUSY;
 
-	if (state == PM_SUSPEND_DISK) {
-		error = pm_suspend_disk();
-		goto Unlock;
-	}
-
 	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
 	if ((error = suspend_prepare(state)))
 		goto Unlock;
@@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s
 
 /**
  *	pm_suspend - Externally visible function for suspending system.
- *	@state:		Enumarted value of state to enter.
+ *	@state:		Enumerated value of state to enter.
  *
  *	Determine whether or not value is within range, get state 
  *	structure, and enter (above).
@@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL);
 static ssize_t state_show(struct subsystem * subsys, char * buf)
 {
 	int i;
-	char * s = buf;
+	char *s = buf;
 
 	for (i = 0; i < PM_SUSPEND_MAX; i++) {
 		if (pm_states[i] && valid_state(i))
-			s += sprintf(s,"%s ", pm_states[i]);
+			s += sprintf(s, "%s ", pm_states[i]);
 	}
-	s += sprintf(s,"\n");
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	s += sprintf(s, "%s\n", "disk");
+#else
+	if (s != buf)
+		/* convert the last space to a newline */
+		*(s-1) = '\n';
+#endif
 	return (s - buf);
 }
 
@@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
+	/* first check hibernate */
+	if (!strncmp(buf, "disk", len)) {
+		error = hibernate();
+		return error ? error : n;
+	}
+
 	for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) {
 		if (*s && !strncmp(buf, *s, len))
 			break;
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ wireless-dev/include/linux/hibernate.h	2007-04-26 18:21:38.130691185 +0200
@@ -0,0 +1,36 @@
+#ifndef __LINUX_HIBERNATE
+#define __LINUX_HIBERNATE
+/*
+ * hibernate ('suspend to disk') functionality
+ */
+
+/**
+ * struct hibernate_ops - hibernate platform support
+ *
+ * The methods in this structure allow a platform to override what
+ * happens for shutting down the machine when going into hibernation.
+ *
+ * All three methods must be assigned.
+ *
+ * @prepare: prepare system for hibernation
+ * @enter: shut down system after state has been saved to disk
+ * @finish: finish/clean up after state has been reloaded
+ */
+struct hibernate_ops {
+	int (*prepare)(void);
+	int (*enter)(void);
+	void (*finish)(void);
+};
+
+/**
+ * hibernate_set_ops - set the global hibernate operations
+ * @ops: the hibernate operations to use from now on.
+ */
+void hibernate_set_ops(struct hibernate_ops *ops);
+
+/**
+ * hibernate - hibernate the system
+ */
+int hibernate(void);
+
+#endif /* __LINUX_HIBERNATE */
--- wireless-dev.orig/kernel/power/disk.c	2007-04-26 18:15:00.800691185 +0200
+++ wireless-dev/kernel/power/disk.c	2007-04-26 18:15:09.420691185 +0200
@@ -21,45 +21,72 @@
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/freezer.h>
+#include <linux/hibernate.h>
 
 #include "power.h"
 
 
-static int noresume = 0;
+static int noresume;
 char resume_file[256] = CONFIG_PM_STD_PARTITION;
 dev_t swsusp_resume_device;
 sector_t swsusp_resume_block;
 
+static struct hibernate_ops *hibernate_ops;
+static int pm_disk_mode;
+
+enum {
+	PM_DISK_INVALID,
+	PM_DISK_PLATFORM,
+	PM_DISK_TEST,
+	PM_DISK_TESTPROC,
+	PM_DISK_SHUTDOWN,
+	PM_DISK_REBOOT,
+	/* keep last */
+	__PM_DISK_AFTER_LAST
+};
+#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1)
+#define PM_DISK_FIRST (PM_DISK_INVALID + 1)
+
+void hibernate_set_ops(struct hibernate_ops *ops)
+{
+	BUG_ON(!ops->prepare);
+	BUG_ON(!ops->enter);
+	BUG_ON(!ops->finish);
+	mutex_lock(&pm_mutex);
+	hibernate_ops = ops;
+	mutex_unlock(&pm_mutex);
+}
+
+
 /**
- *	platform_prepare - prepare the machine for hibernation using the
- *	platform driver if so configured and return an error code if it fails
+ *	hibernate_platform_prepare - prepare the machine for hibernation using
+ *	the platform driver if so configured and return an error code if it
+ *	fails.
  */
 
-static inline int platform_prepare(void)
+int hibernate_platform_prepare(void)
 {
-	int error = 0;
-
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
 	case PM_DISK_TESTPROC:
 	case PM_DISK_SHUTDOWN:
 	case PM_DISK_REBOOT:
 		break;
-	default:
-		if (pm_ops && pm_ops->prepare)
-			error = pm_ops->prepare(PM_SUSPEND_DISK);
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops)
+			return hibernate_ops->prepare();
 	}
-	return error;
+	return 0;
 }
 
 /**
- *	power_down - Shut machine down for hibernate.
+ *	hibernate_power_down - Shut machine down for hibernate.
  *
  *	Use the platform driver, if configured so; otherwise try
  *	to power off or reboot.
  */
 
-static void power_down(void)
+static void hibernate_power_down(void)
 {
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
@@ -70,11 +97,10 @@ static void power_down(void)
 	case PM_DISK_REBOOT:
 		kernel_restart(NULL);
 		break;
-	default:
-		if (pm_ops && pm_ops->enter) {
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops) {
 			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-			pm_ops->enter(PM_SUSPEND_DISK);
-			break;
+			hibernate_ops->enter();
 		}
 	}
 
@@ -85,7 +111,7 @@ static void power_down(void)
 	while(1);
 }
 
-static inline void platform_finish(void)
+void hibernate_platform_finish(void)
 {
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
@@ -93,9 +119,9 @@ static inline void platform_finish(void)
 	case PM_DISK_SHUTDOWN:
 	case PM_DISK_REBOOT:
 		break;
-	default:
-		if (pm_ops && pm_ops->finish)
-			pm_ops->finish(PM_SUSPEND_DISK);
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops)
+			hibernate_ops->finish();
 	}
 }
 
@@ -118,13 +144,13 @@ static int prepare_processes(void)
 }
 
 /**
- *	pm_suspend_disk - The granpappy of hibernation power management.
+ *	hibernate - The granpappy of hibernation power management.
  *
  *	If not, then call swsusp to do its thing, then figure out how
  *	to power down the system.
  */
 
-int pm_suspend_disk(void)
+int hibernate(void)
 {
 	int error;
 
@@ -147,7 +173,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Finish;
 
-	error = platform_prepare();
+	error = hibernate_platform_prepare();
 	if (error)
 		goto Finish;
 
@@ -175,13 +201,13 @@ int pm_suspend_disk(void)
 
 	if (in_suspend) {
 		enable_nonboot_cpus();
-		platform_finish();
+		hibernate_platform_finish();
 		device_resume();
 		resume_console();
 		pr_debug("PM: writing image.\n");
 		error = swsusp_write();
 		if (!error)
-			power_down();
+			hibernate_power_down();
 		else {
 			swsusp_free();
 			goto Finish;
@@ -194,7 +220,7 @@ int pm_suspend_disk(void)
  Enable_cpus:
 	enable_nonboot_cpus();
  Resume_devices:
-	platform_finish();
+	hibernate_platform_finish();
 	device_resume();
 	resume_console();
  Finish:
@@ -211,7 +237,7 @@ int pm_suspend_disk(void)
  *	Called as a late_initcall (so all devices are discovered and
  *	initialized), we call swsusp to see if we have a saved image or not.
  *	If so, we quiesce devices, the restore the saved image. We will
- *	return above (in pm_suspend_disk() ) if everything goes well.
+ *	return above (in hibernate() ) if everything goes well.
  *	Otherwise, we fail gracefully and return to the normally
  *	scheduled program.
  *
@@ -311,12 +337,13 @@ static const char * const pm_disk_modes[
  *
  *	Suspend-to-disk can be handled in several ways. We have a few options
  *	for putting the system to sleep - using the platform driver (e.g. ACPI
- *	or other pm_ops), powering off the system or rebooting the system
- *	(for testing) as well as the two test modes.
+ *	or other hibernate_ops), powering off the system or rebooting the
+ *	system (for testing) as well as the two test modes.
  *
  *	The system can support 'platform', and that is known a priori (and
- *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
- *	as alternatives, as well as the test modes 'test' and 'testproc'.
+ *	encoded by the presence of hibernate_ops). However, the user may choose
+ *	'shutdown' or 'reboot' as alternatives, as well as the test modes 'test'
+ *	and 'testproc'.
  *
  *	show() will display what the mode is currently set to.
  *	store() will accept one of
@@ -328,7 +355,7 @@ static const char * const pm_disk_modes[
  *	'testproc'
  *
  *	It will only change to 'platform' if the system
- *	supports it (as determined from pm_ops->pm_disk_mode).
+ *	supports it (as determined by having hibernate_ops).
  */
 
 static ssize_t disk_show(struct subsystem * subsys, char * buf)
@@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste
 	int i;
 	char *start = buf;
 
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
+	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
 		if (!pm_disk_modes[i])
 			continue;
 		switch (i) {
@@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste
 		case PM_DISK_TEST:
 		case PM_DISK_TESTPROC:
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (i == pm_ops->pm_disk_mode))
+		case PM_DISK_PLATFORM:
+			if (hibernate_ops)
 				break;
 			/* not a valid mode, continue with loop */
 			continue;
@@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst
 	int i;
 	int len;
 	char *p;
-	suspend_disk_method_t mode = 0;
+	int mode = PM_DISK_INVALID;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
 	mutex_lock(&pm_mutex);
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
+	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
 		if (!strncmp(buf, pm_disk_modes[i], len)) {
 			mode = i;
 			break;
 		}
 	}
-	if (mode) {
+	if (mode != PM_DISK_INVALID) {
 		switch (mode) {
 		case PM_DISK_SHUTDOWN:
 		case PM_DISK_REBOOT:
@@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst
 		case PM_DISK_TESTPROC:
 			pm_disk_mode = mode;
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (mode == pm_ops->pm_disk_mode))
+		case PM_DISK_PLATFORM:
+			if (hibernate_ops)
 				pm_disk_mode = mode;
 			else
 				error = -EINVAL;
 		}
-	} else {
+	} else
 		error = -EINVAL;
-	}
 
-	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
-		 pm_disk_modes[mode]);
+	if (!error)
+		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
+			 pm_disk_modes[mode]);
 	mutex_unlock(&pm_mutex);
 	return error ? error : n;
 }
--- wireless-dev.orig/kernel/power/user.c	2007-04-26 18:15:01.130691185 +0200
+++ wireless-dev/kernel/power/user.c	2007-04-26 18:15:09.420691185 +0200
@@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil
 	return res;
 }
 
-static inline int platform_prepare(void)
-{
-	int error = 0;
-
-	if (pm_ops && pm_ops->prepare)
-		error = pm_ops->prepare(PM_SUSPEND_DISK);
-
-	return error;
-}
-
-static inline void platform_finish(void)
-{
-	if (pm_ops && pm_ops->finish)
-		pm_ops->finish(PM_SUSPEND_DISK);
-}
-
 static inline int snapshot_suspend(int platform_suspend)
 {
 	int error;
@@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p
 		goto Finish;
 
 	if (platform_suspend) {
-		error = platform_prepare();
+		error = hibernate_platform_prepare();
 		if (error)
 			goto Finish;
 	}
@@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p
 	enable_nonboot_cpus();
  Resume_devices:
 	if (platform_suspend)
-		platform_finish();
+		hibernate_platform_finish();
 
 	device_resume();
 	resume_console();
@@ -188,7 +172,7 @@ static inline int snapshot_restore(int p
 	mutex_lock(&pm_mutex);
 	pm_prepare_console();
 	if (platform_suspend) {
-		error = platform_prepare();
+		error = hibernate_platform_prepare();
 		if (error)
 			goto Finish;
 	}
@@ -204,7 +188,7 @@ static inline int snapshot_restore(int p
 	enable_nonboot_cpus();
  Resume_devices:
 	if (platform_suspend)
-		platform_finish();
+		hibernate_platform_finish();
 
 	device_resume();
 	resume_console();
@@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode *
 		case PMOPS_ENTER:
 			if (data->platform_suspend) {
 				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-				error = pm_ops->enter(PM_SUSPEND_DISK);
+				error = hibernate_ops->enter();
+				/* how can this possibly do the right thing? */
 				error = 0;
 			}
 			break;
 
 		case PMOPS_FINISH:
 			if (data->platform_suspend)
+				/* and why doesn't this invoke anything??? */
 				error = 0;
 
 			break;
--- wireless-dev.orig/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:02.120691185 +0200
+++ wireless-dev/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:09.440691185 +0200
@@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
 	to resume the system from RAM if there's enough battery power or restore
 	its state on the basis of the saved suspend image otherwise)
 
-SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
-	pmops->finish methods (the in-kernel swsusp knows these as the "platform
-	method") which are needed on many machines to (among others) speed up
-	the resume by letting the BIOS skip some steps or to let the system
-	recognise the correct state of the hardware after the resume (in
-	particular on many machines this ensures that unplugged AC
-	adapters get correctly detected and that kacpid does not run wild after
-	the resume).  The last ioctl() argument can take one of the three
-	values, defined in kernel/power/power.h:
+SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare,
+	hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel
+	swsusp knows these as the "platform method") which are needed on many
+	machines to (among others) speed up the resume by letting the BIOS skip
+	some steps or to let the system recognise the correct state of the
+	hardware after the resume (in particular on many machines this ensures
+	that unplugged AC adapters get correctly detected and that kacpid does
+	not run wild after the resume).  The last ioctl() argument can take one
+	of the three values, defined in kernel/power/power.h:
 	PMOPS_PREPARE - make the kernel carry out the
-		pm_ops->prepare(PM_SUSPEND_DISK) operation
+		hibernate_ops->prepare() operation
 	PMOPS_ENTER - make the kernel power off the system by calling
-		pm_ops->enter(PM_SUSPEND_DISK)
+		hibernate_ops->enter()
 	PMOPS_FINISH - make the kernel carry out the
-		pm_ops->finish(PM_SUSPEND_DISK) operation
+		hibernate_ops->finish() operation
+	Note that the actual constants are misnamed because they surface
+	internal kernel implementation details that have changed.
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
--- wireless-dev.orig/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:02.150691185 +0200
+++ wireless-dev/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:09.440691185 +0200
@@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
 			 * also needs to get error handling and probably
 			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
 			 */
-			pm_suspend(PM_SUSPEND_DISK);
+			hibernate();
 #endif
 			poll = 1;
 		}
--- wireless-dev.orig/kernel/sys.c	2007-04-26 18:15:01.310691185 +0200
+++ wireless-dev/kernel/sys.c	2007-04-26 18:15:09.450691185 +0200
@@ -25,6 +25,7 @@
 #include <linux/security.h>
 #include <linux/dcookies.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
@@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		{
-			int ret = pm_suspend(PM_SUSPEND_DISK);
+			int ret = hibernate();
 			unlock_kernel();
 			return ret;
 		}
--- wireless-dev.orig/drivers/acpi/sleep/main.c	2007-04-26 18:15:02.290691185 +0200
+++ wireless-dev/drivers/acpi/sleep/main.c	2007-04-26 18:15:09.630691185 +0200
@@ -15,6 +15,7 @@
 #include <linux/dmi.h>
 #include <linux/device.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <acpi/acpi_bus.h>
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
@@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = {
 	[PM_SUSPEND_ON] = ACPI_STATE_S0,
 	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
 	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
-	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
 	[PM_SUSPEND_MAX] = ACPI_STATE_S5
 };
 
@@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t
 		do_suspend_lowlevel();
 		break;
 
-	case PM_SUSPEND_DISK:
-		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
-			status = acpi_enter_sleep_state(acpi_state);
-		break;
-	case PM_SUSPEND_MAX:
-		acpi_power_off();
-		break;
-
 	default:
 		return -EINVAL;
 	}
@@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state)
 	suspend_state_t states[] = {
 		[1] = PM_SUSPEND_STANDBY,
 		[3] = PM_SUSPEND_MEM,
-		[4] = PM_SUSPEND_DISK,
 		[5] = PM_SUSPEND_MAX
 	};
 
 	if (acpi_state < 6 && states[acpi_state])
 		return pm_suspend(states[acpi_state]);
+	if (acpi_state == 4)
+		return hibernate();
 	return -EINVAL;
 }
 
@@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = {
 	.finish = acpi_pm_finish,
 };
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int acpi_hib_prepare(void)
+{
+	return acpi_sleep_prepare(ACPI_STATE_S4);
+}
+
+static int acpi_hib_enter(void)
+{
+	acpi_status status = AE_OK;
+	unsigned long flags = 0;
+	u32 acpi_state = ACPI_STATE_S4;
+
+	ACPI_FLUSH_CPU_CACHE();
+
+	/* Do arch specific saving of state. */
+	int error = acpi_save_state_mem();
+	if (error)
+		return error;
+
+	local_irq_save(flags);
+	acpi_enable_wakeup_device(acpi_state);
+	status = acpi_enter_sleep_state(acpi_state);
+
+	/* ACPI 3.0 specs (P62) says that it's the responsibility
+	 * of the OSPM to clear the status bit [ implying that the
+	 * POWER_BUTTON event should not reach userspace ]
+	 */
+	if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3))
+		acpi_clear_event(ACPI_EVENT_POWER_BUTTON);
+
+	local_irq_restore(flags);
+	printk(KERN_DEBUG "Back to C!\n");
+
+	/* restore processor state
+	 * We should only be here if we're coming back from STR or STD.
+	 * And, in the case of the latter, the memory image should have already
+	 * been loaded from disk.
+	 */
+	acpi_restore_state_mem();
+
+	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
+}
+
+static void acpi_hib_finish(void)
+{
+	acpi_leave_sleep_state(ACPI_STATE_S4);
+	acpi_disable_wakeup_device(ACPI_STATE_S4);
+
+	/* reset firmware waking vector */
+	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
+
+	if (init_8259A_after_S1) {
+		printk("Broken toshiba laptop -> kicking interrupts\n");
+		init_8259A(0);
+	}
+	return;
+}
+
+static struct hibernate_ops acpi_hib_ops = {
+	.prepare = acpi_hib_prepare,
+	.enter = acpi_hib_enter,
+	.finish = acpi_hib_finish,
+};
+#endif /* CONFIG_SOFTWARE_SUSPEND */
+
 /*
  * Toshiba fails to preserve interrupts over S1, reinitialization
  * of 8259 is needed after S1 resume.
@@ -227,13 +285,16 @@ int __init acpi_sleep_init(void)
 			sleep_states[i] = 1;
 			printk(" S%d", i);
 		}
-		if (i == ACPI_STATE_S4) {
-			if (sleep_states[i])
-				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
-		}
 	}
 	printk(")\n");
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	if (sleep_states[ACPI_STATE_S4])
+		hibernate_set_ops(&acpi_hib_ops);
+#else
+	sleep_states[ACPI_STATE_S4] = 0;
+#endif
+
 	pm_set_ops(&acpi_pm_ops);
 	return 0;
 }
--- wireless-dev.orig/kernel/power/power.h	2007-04-26 18:15:01.240691185 +0200
+++ wireless-dev/kernel/power/power.h	2007-04-26 18:15:09.630691185 +0200
@@ -13,16 +13,6 @@ struct swsusp_info {
 
 
 
-#ifdef CONFIG_SOFTWARE_SUSPEND
-extern int pm_suspend_disk(void);
-
-#else
-static inline int pm_suspend_disk(void)
-{
-	return -EPERM;
-}
-#endif
-
 extern struct mutex pm_mutex;
 
 #define power_attr(_name) \
@@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t
 struct timeval;
 extern void swsusp_show_speed(struct timeval *, struct timeval *,
 				unsigned int, char *);
+
+extern int hibernate_platform_prepare(void);
+extern void hibernate_platform_finish(void);
--- wireless-dev.orig/drivers/acpi/sleep/proc.c	2007-04-26 18:15:02.720691185 +0200
+++ wireless-dev/drivers/acpi/sleep/proc.c	2007-04-26 18:15:09.630691185 +0200
@@ -1,6 +1,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <linux/bcd.h>
 #include <asm/uaccess.h>
 
@@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil
 	state = simple_strtoul(str, NULL, 0);
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	if (state == 4) {
-		error = pm_suspend(PM_SUSPEND_DISK);
+		error = hibernate();
 		goto Done;
 	}
 #endif




* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
@ 2007-04-26 16:31                                                 ` Johannes Berg
From: Johannes Berg @ 2007-04-26 16:31 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, linux-pm,
	Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven

On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:

> > From looking at pm_ops which I was recently working with a lot, it seems
> > that it was designed by somebody who was reading the ACPI documentation
> > and was otherwise pretty clueless, even at that level std tries to look
> > like suspend. IMHO that is one of the first things that should be ripped
> > out, no pm_ops for STD, it's a pain to work with.
> 
> That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> low-level enter is pretty similar).
> 
> Patches would be welcome

That was easier than I thought. This applies on top of a patch that
makes kernel/power/user.c optional, since I had no idea how to fix that
file. The problems I see with it:
 * it surfaces kernel implementation details about pm_ops and thus makes
   the whole thing very fragile
 * it has yet another interface (yuck) to determine whether to reboot,
   shut down etc. and doesn't use /sys/power/disk
 * I generally had no idea wtf it is doing in some places

Anyway, this patch is only compile tested. It
 * introduces include/linux/hibernate.h with hibernate_ops and
   a new hibernate() function to hibernate the system
 * rips apart a lot of the suspend code and puts it back together using
   the hibernate_ops
 * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
 * might apply/compile against -mm; I have all my and some of Rafael's
   suspend/hibernate work in my tree
 * breaks user suspend as I noted above
 * is incomplete; pm_suspend_disk() is still defined somewhere, iirc
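
To make the hibernate_ops idea concrete, here is a rough userspace sketch
of how a platform would register its callbacks and how the "platform"
mode path would then call them. Only struct hibernate_ops matches the
patch; mock_hibernate_set_ops(), mock_hibernate_platform(), run_demo() and
the demo_* callbacks are made-up stand-ins. There is no pm_mutex, and an
incomplete table is rejected with -1 rather than BUG_ON(). Image writing,
device handling and the non-platform modes are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same layout as struct hibernate_ops in the patch's hibernate.h. */
struct hibernate_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

static struct hibernate_ops *hibernate_ops;

/*
 * Userspace stand-in for hibernate_set_ops(): no pm_mutex, and an
 * incomplete table is rejected with -1 instead of hitting BUG_ON().
 */
static int mock_hibernate_set_ops(struct hibernate_ops *ops)
{
	if (ops && !(ops->prepare && ops->enter && ops->finish))
		return -1;
	hibernate_ops = ops;
	return 0;
}

/* Hypothetical platform callbacks; they only count invocations. */
static int demo_calls;

static int demo_prepare(void) { demo_calls++; return 0; }
static int demo_enter(void) { demo_calls++; return 0; }
static void demo_finish(void) { demo_calls++; }

static struct hibernate_ops demo_ops = {
	.prepare = demo_prepare,
	.enter = demo_enter,
	.finish = demo_finish,
};

/*
 * Simplified "platform" mode path: prepare, enter, then finish.
 * A failing prepare skips both enter and finish.
 */
static int mock_hibernate_platform(void)
{
	int error;

	if (!hibernate_ops)
		return 0;	/* no platform support registered */

	error = hibernate_ops->prepare();
	if (error)
		return error;

	error = hibernate_ops->enter();
	hibernate_ops->finish();
	return error;
}

/* Registers demo_ops, runs the path once, returns the callback count. */
int run_demo(void)
{
	demo_calls = 0;
	if (mock_hibernate_set_ops(&demo_ops))
		return -1;
	assert(hibernate_ops == &demo_ops);
	if (mock_hibernate_platform())
		return -1;
	printf("callbacks invoked: %d\n", demo_calls);
	return demo_calls;
}
```

Calling run_demo() from a main() exercises all three callbacks once. The
point of the separate table is that hibernation no longer needs to pass
PM_SUSPEND_DISK through pm_ops: a platform without the table simply gets
the shutdown/reboot modes.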

johannes
---
 Documentation/power/userland-swsusp.txt |   26 +++----
 drivers/acpi/sleep/main.c               |   89 ++++++++++++++++++++----
 drivers/acpi/sleep/proc.c               |    3 
 drivers/i2c/chips/tps65010.c            |    2 
 include/linux/hibernate.h               |   36 +++++++++
 include/linux/pm.h                      |   31 --------
 kernel/power/disk.c                     |  117 +++++++++++++++++++-------------
 kernel/power/main.c                     |   47 +++++-------
 kernel/power/power.h                    |   13 ---
 kernel/power/user.c                     |   28 +------
 kernel/sys.c                            |    3 
 11 files changed, 231 insertions(+), 164 deletions(-)

--- wireless-dev.orig/include/linux/pm.h	2007-04-26 18:15:00.440691185 +0200
+++ wireless-dev/include/linux/pm.h	2007-04-26 18:15:09.410691185 +0200
@@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_ON		((__force suspend_state_t) 0)
 #define PM_SUSPEND_STANDBY	((__force suspend_state_t) 1)
 #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
-#define PM_SUSPEND_DISK		((__force suspend_state_t) 4)
-#define PM_SUSPEND_MAX		((__force suspend_state_t) 5)
-
-typedef int __bitwise suspend_disk_method_t;
-
-/* invalid must be 0 so struct pm_ops initialisers can leave it out */
-#define PM_DISK_INVALID		((__force suspend_disk_method_t) 0)
-#define	PM_DISK_PLATFORM	((__force suspend_disk_method_t) 1)
-#define	PM_DISK_SHUTDOWN	((__force suspend_disk_method_t) 2)
-#define	PM_DISK_REBOOT		((__force suspend_disk_method_t) 3)
-#define	PM_DISK_TEST		((__force suspend_disk_method_t) 4)
-#define	PM_DISK_TESTPROC	((__force suspend_disk_method_t) 5)
-#define	PM_DISK_MAX		((__force suspend_disk_method_t) 6)
+#define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
 
 /**
  * struct pm_ops - Callbacks for managing platform dependent suspend states.
  * @valid: Callback to determine whether the given state can be entered.
- * 	If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
- *	always valid and never passed to this call. If not assigned,
- *	no suspend states are valid.
  *	Valid states are advertised in /sys/power/state but can still
  *	be rejected by prepare or enter if the conditions aren't right.
  *	There is a %pm_valid_only_mem function available that can be assigned
@@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
  *
  * @finish: Called when the system has left the given state and all devices
  *	are resumed. The return value is ignored.
- *
- * @pm_disk_mode: The generic code always allows one of the shutdown methods
- *	%PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
- *	%PM_DISK_TESTPROC. If this variable is set, the mode it is set
- *	to is allowed in addition to those modes and is also made default.
- *	When this mode is sent selected, the @prepare call will be called
- *	before suspending to disk (if present), the @enter call should be
- *	present and will be called after all state has been saved and the
- *	machine is ready to be powered off; the @finish callback is called
- *	after state has been restored. All these calls are called with
- *	%PM_SUSPEND_DISK as the state.
  */
 struct pm_ops {
 	int (*valid)(suspend_state_t state);
 	int (*prepare)(suspend_state_t state);
 	int (*enter)(suspend_state_t state);
 	int (*finish)(suspend_state_t state);
-	suspend_disk_method_t pm_disk_mode;
 };
 
 /**
@@ -276,8 +249,6 @@ extern void device_power_up(void);
 extern void device_resume(void);
 
 #ifdef CONFIG_PM
-extern suspend_disk_method_t pm_disk_mode;
-
 extern int device_suspend(pm_message_t state);
 extern int device_prepare_suspend(pm_message_t state);
 
--- wireless-dev.orig/kernel/power/main.c	2007-04-26 18:15:00.790691185 +0200
+++ wireless-dev/kernel/power/main.c	2007-04-26 18:15:09.410691185 +0200
@@ -21,6 +21,7 @@
 #include <linux/resume-trace.h>
 #include <linux/freezer.h>
 #include <linux/vmstat.h>
+#include <linux/hibernate.h>
 
 #include "power.h"
 
@@ -30,7 +31,6 @@
 DEFINE_MUTEX(pm_mutex);
 
 struct pm_ops *pm_ops;
-suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
 
 /**
  *	pm_set_ops - Set the global power method table. 
@@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops)
 {
 	mutex_lock(&pm_mutex);
 	pm_ops = ops;
-	if (ops && ops->pm_disk_mode != PM_DISK_INVALID) {
-		pm_disk_mode = ops->pm_disk_mode;
-	} else
-		pm_disk_mode = PM_DISK_SHUTDOWN;
 	mutex_unlock(&pm_mutex);
 }
 
@@ -184,24 +180,12 @@ static void suspend_finish(suspend_state
 static const char * const pm_states[PM_SUSPEND_MAX] = {
 	[PM_SUSPEND_STANDBY]	= "standby",
 	[PM_SUSPEND_MEM]	= "mem",
-	[PM_SUSPEND_DISK]	= "disk",
 };
 
 static inline int valid_state(suspend_state_t state)
 {
-	/* Suspend-to-disk does not really need low-level support.
-	 * It can work with shutdown/reboot if needed. If it isn't
-	 * configured, then it cannot be supported.
-	 */
-	if (state == PM_SUSPEND_DISK)
-#ifdef CONFIG_SOFTWARE_SUSPEND
-		return 1;
-#else
-		return 0;
-#endif
-
-	/* all other states need lowlevel support and need to be
-	 * valid to the lowlevel implementation, no valid callback
+	/* All states need lowlevel support and need to be valid
+	 * to the lowlevel implementation, no valid callback
 	 * implies that none are valid. */
 	if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state))
 		return 0;
@@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s
 	if (!mutex_trylock(&pm_mutex))
 		return -EBUSY;
 
-	if (state == PM_SUSPEND_DISK) {
-		error = pm_suspend_disk();
-		goto Unlock;
-	}
-
 	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
 	if ((error = suspend_prepare(state)))
 		goto Unlock;
@@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s
 
 /**
  *	pm_suspend - Externally visible function for suspending system.
- *	@state:		Enumarted value of state to enter.
+ *	@state:		Enumerated value of state to enter.
  *
  *	Determine whether or not value is within range, get state 
  *	structure, and enter (above).
@@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL);
 static ssize_t state_show(struct subsystem * subsys, char * buf)
 {
 	int i;
-	char * s = buf;
+	char *s = buf;
 
 	for (i = 0; i < PM_SUSPEND_MAX; i++) {
 		if (pm_states[i] && valid_state(i))
-			s += sprintf(s,"%s ", pm_states[i]);
+			s += sprintf(s, "%s ", pm_states[i]);
 	}
-	s += sprintf(s,"\n");
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	s += sprintf(s, "%s\n", "disk");
+#else
+	if (s != buf)
+		/* convert the last space to a newline */
+		*(s-1) = '\n';
+#endif
 	return (s - buf);
 }
 
@@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
+	/* first check hibernate */
+	if (!strncmp(buf, "disk", len)) {
+		error = hibernate();
+		return error ? error : n;
+	}
+
 	for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) {
 		if (*s && !strncmp(buf, *s, len))
 			break;
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ wireless-dev/include/linux/hibernate.h	2007-04-26 18:21:38.130691185 +0200
@@ -0,0 +1,36 @@
+#ifndef __LINUX_HIBERNATE
+#define __LINUX_HIBERNATE
+/*
+ * hibernate ('suspend to disk') functionality
+ */
+
+/**
+ * struct hibernate_ops - hibernate platform support
+ *
+ * The methods in this structure allow a platform to override what
+ * happens for shutting down the machine when going into hibernation.
+ *
+ * All three methods must be assigned.
+ *
+ * @prepare: prepare system for hibernation
+ * @enter: shut down system after state has been saved to disk
+ * @finish: finish/clean up after state has been reloaded
+ */
+struct hibernate_ops {
+	int (*prepare)(void);
+	int (*enter)(void);
+	void (*finish)(void);
+};
+
+/**
+ * hibernate_set_ops - set the global hibernate operations
+ * @ops: the hibernate operations to use from now on.
+ */
+void hibernate_set_ops(struct hibernate_ops *ops);
+
+/**
+ * hibernate - hibernate the system
+ */
+int hibernate(void);
+
+#endif /* __LINUX_HIBERNATE */
--- wireless-dev.orig/kernel/power/disk.c	2007-04-26 18:15:00.800691185 +0200
+++ wireless-dev/kernel/power/disk.c	2007-04-26 18:15:09.420691185 +0200
@@ -21,45 +21,72 @@
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/freezer.h>
+#include <linux/hibernate.h>
 
 #include "power.h"
 
 
-static int noresume = 0;
+static int noresume;
 char resume_file[256] = CONFIG_PM_STD_PARTITION;
 dev_t swsusp_resume_device;
 sector_t swsusp_resume_block;
 
+static struct hibernate_ops *hibernate_ops;
+static int pm_disk_mode;
+
+enum {
+	PM_DISK_INVALID,
+	PM_DISK_PLATFORM,
+	PM_DISK_TEST,
+	PM_DISK_TESTPROC,
+	PM_DISK_SHUTDOWN,
+	PM_DISK_REBOOT,
+	/* keep last */
+	__PM_DISK_AFTER_LAST
+};
+#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1)
+#define PM_DISK_FIRST (PM_DISK_INVALID + 1)
+
+void hibernate_set_ops(struct hibernate_ops *ops)
+{
+	BUG_ON(!ops->prepare);
+	BUG_ON(!ops->enter);
+	BUG_ON(!ops->finish);
+	mutex_lock(&pm_mutex);
+	hibernate_ops = ops;
+	mutex_unlock(&pm_mutex);
+}
+
+
 /**
- *	platform_prepare - prepare the machine for hibernation using the
- *	platform driver if so configured and return an error code if it fails
+ *	hibernate_platform_prepare - prepare the machine for hibernation using
+ *	the platform driver if so configured and return an error code if it
+ *	fails.
  */
 
-static inline int platform_prepare(void)
+int hibernate_platform_prepare(void)
 {
-	int error = 0;
-
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
 	case PM_DISK_TESTPROC:
 	case PM_DISK_SHUTDOWN:
 	case PM_DISK_REBOOT:
 		break;
-	default:
-		if (pm_ops && pm_ops->prepare)
-			error = pm_ops->prepare(PM_SUSPEND_DISK);
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops)
+			return hibernate_ops->prepare();
 	}
-	return error;
+	return 0;
 }
 
 /**
- *	power_down - Shut machine down for hibernate.
+ *	hibernate_power_down - Shut machine down for hibernate.
  *
  *	Use the platform driver, if configured so; otherwise try
  *	to power off or reboot.
  */
 
-static void power_down(void)
+static void hibernate_power_down(void)
 {
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
@@ -70,11 +97,10 @@ static void power_down(void)
 	case PM_DISK_REBOOT:
 		kernel_restart(NULL);
 		break;
-	default:
-		if (pm_ops && pm_ops->enter) {
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops) {
 			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-			pm_ops->enter(PM_SUSPEND_DISK);
-			break;
+			hibernate_ops->enter();
 		}
 	}
 
@@ -85,7 +111,7 @@ static void power_down(void)
 	while(1);
 }
 
-static inline void platform_finish(void)
+void hibernate_platform_finish(void)
 {
 	switch (pm_disk_mode) {
 	case PM_DISK_TEST:
@@ -93,9 +119,9 @@ static inline void platform_finish(void)
 	case PM_DISK_SHUTDOWN:
 	case PM_DISK_REBOOT:
 		break;
-	default:
-		if (pm_ops && pm_ops->finish)
-			pm_ops->finish(PM_SUSPEND_DISK);
+	case PM_DISK_PLATFORM:
+		if (hibernate_ops)
+			hibernate_ops->finish();
 	}
 }
 
@@ -118,13 +144,13 @@ static int prepare_processes(void)
 }
 
 /**
- *	pm_suspend_disk - The granpappy of hibernation power management.
+ *	hibernate - The granpappy of hibernation power management.
  *
  *	If not, then call swsusp to do its thing, then figure out how
  *	to power down the system.
  */
 
-int pm_suspend_disk(void)
+int hibernate(void)
 {
 	int error;
 
@@ -147,7 +173,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Finish;
 
-	error = platform_prepare();
+	error = hibernate_platform_prepare();
 	if (error)
 		goto Finish;
 
@@ -175,13 +201,13 @@ int pm_suspend_disk(void)
 
 	if (in_suspend) {
 		enable_nonboot_cpus();
-		platform_finish();
+		hibernate_platform_finish();
 		device_resume();
 		resume_console();
 		pr_debug("PM: writing image.\n");
 		error = swsusp_write();
 		if (!error)
-			power_down();
+			hibernate_power_down();
 		else {
 			swsusp_free();
 			goto Finish;
@@ -194,7 +220,7 @@ int pm_suspend_disk(void)
  Enable_cpus:
 	enable_nonboot_cpus();
  Resume_devices:
-	platform_finish();
+	hibernate_platform_finish();
 	device_resume();
 	resume_console();
  Finish:
@@ -211,7 +237,7 @@ int pm_suspend_disk(void)
  *	Called as a late_initcall (so all devices are discovered and
  *	initialized), we call swsusp to see if we have a saved image or not.
  *	If so, we quiesce devices, the restore the saved image. We will
- *	return above (in pm_suspend_disk() ) if everything goes well.
+ *	return above (in hibernate() ) if everything goes well.
  *	Otherwise, we fail gracefully and return to the normally
  *	scheduled program.
  *
@@ -311,12 +337,13 @@ static const char * const pm_disk_modes[
  *
  *	Suspend-to-disk can be handled in several ways. We have a few options
  *	for putting the system to sleep - using the platform driver (e.g. ACPI
- *	or other pm_ops), powering off the system or rebooting the system
- *	(for testing) as well as the two test modes.
+ *	or other hibernate_ops), powering off the system or rebooting the
+ *	system (for testing) as well as the two test modes.
  *
  *	The system can support 'platform', and that is known a priori (and
- *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
- *	as alternatives, as well as the test modes 'test' and 'testproc'.
+ *	encoded by the presence of hibernate_ops). However, the user may choose
+ *	'shutdown' or 'reboot' as alternatives, as well as the test modes 'test'
+ *	and 'testproc'.
  *
  *	show() will display what the mode is currently set to.
  *	store() will accept one of
@@ -328,7 +355,7 @@ static const char * const pm_disk_modes[
  *	'testproc'
  *
  *	It will only change to 'platform' if the system
- *	supports it (as determined from pm_ops->pm_disk_mode).
+ *	supports it (as determined by having hibernate_ops).
  */
 
 static ssize_t disk_show(struct subsystem * subsys, char * buf)
@@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste
 	int i;
 	char *start = buf;
 
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
+	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
 		if (!pm_disk_modes[i])
 			continue;
 		switch (i) {
@@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste
 		case PM_DISK_TEST:
 		case PM_DISK_TESTPROC:
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (i == pm_ops->pm_disk_mode))
+		case PM_DISK_PLATFORM:
+			if (hibernate_ops)
 				break;
 			/* not a valid mode, continue with loop */
 			continue;
@@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst
 	int i;
 	int len;
 	char *p;
-	suspend_disk_method_t mode = 0;
+	int mode = PM_DISK_INVALID;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
 	mutex_lock(&pm_mutex);
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
+	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
 		if (!strncmp(buf, pm_disk_modes[i], len)) {
 			mode = i;
 			break;
 		}
 	}
-	if (mode) {
+	if (mode != PM_DISK_INVALID) {
 		switch (mode) {
 		case PM_DISK_SHUTDOWN:
 		case PM_DISK_REBOOT:
@@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst
 		case PM_DISK_TESTPROC:
 			pm_disk_mode = mode;
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (mode == pm_ops->pm_disk_mode))
+		case PM_DISK_PLATFORM:
+			if (hibernate_ops)
 				pm_disk_mode = mode;
 			else
 				error = -EINVAL;
 		}
-	} else {
+	} else
 		error = -EINVAL;
-	}
 
-	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
-		 pm_disk_modes[mode]);
+	if (!error)
+		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
+			 pm_disk_modes[mode]);
 	mutex_unlock(&pm_mutex);
 	return error ? error : n;
 }
--- wireless-dev.orig/kernel/power/user.c	2007-04-26 18:15:01.130691185 +0200
+++ wireless-dev/kernel/power/user.c	2007-04-26 18:15:09.420691185 +0200
@@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil
 	return res;
 }
 
-static inline int platform_prepare(void)
-{
-	int error = 0;
-
-	if (pm_ops && pm_ops->prepare)
-		error = pm_ops->prepare(PM_SUSPEND_DISK);
-
-	return error;
-}
-
-static inline void platform_finish(void)
-{
-	if (pm_ops && pm_ops->finish)
-		pm_ops->finish(PM_SUSPEND_DISK);
-}
-
 static inline int snapshot_suspend(int platform_suspend)
 {
 	int error;
@@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p
 		goto Finish;
 
 	if (platform_suspend) {
-		error = platform_prepare();
+		error = hibernate_platform_prepare();
 		if (error)
 			goto Finish;
 	}
@@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p
 	enable_nonboot_cpus();
  Resume_devices:
 	if (platform_suspend)
-		platform_finish();
+		hibernate_platform_finish();
 
 	device_resume();
 	resume_console();
@@ -188,7 +172,7 @@ static inline int snapshot_restore(int p
 	mutex_lock(&pm_mutex);
 	pm_prepare_console();
 	if (platform_suspend) {
-		error = platform_prepare();
+		error = hibernate_platform_prepare();
 		if (error)
 			goto Finish;
 	}
@@ -204,7 +188,7 @@ static inline int snapshot_restore(int p
 	enable_nonboot_cpus();
  Resume_devices:
 	if (platform_suspend)
-		platform_finish();
+		hibernate_platform_finish();
 
 	device_resume();
 	resume_console();
@@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode *
 		case PMOPS_ENTER:
 			if (data->platform_suspend) {
 				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-				error = pm_ops->enter(PM_SUSPEND_DISK);
+				error = hibernate_ops->enter();
+				/* how can this possibly do the right thing? */
 				error = 0;
 			}
 			break;
 
 		case PMOPS_FINISH:
 			if (data->platform_suspend)
+				/* and why doesn't this invoke anything??? */
 				error = 0;
 
 			break;
--- wireless-dev.orig/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:02.120691185 +0200
+++ wireless-dev/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:09.440691185 +0200
@@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
 	to resume the system from RAM if there's enough battery power or restore
 	its state on the basis of the saved suspend image otherwise)
 
-SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
-	pmops->finish methods (the in-kernel swsusp knows these as the "platform
-	method") which are needed on many machines to (among others) speed up
-	the resume by letting the BIOS skip some steps or to let the system
-	recognise the correct state of the hardware after the resume (in
-	particular on many machines this ensures that unplugged AC
-	adapters get correctly detected and that kacpid does not run wild after
-	the resume).  The last ioctl() argument can take one of the three
-	values, defined in kernel/power/power.h:
+SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare,
+	hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel
+	swsusp knows these as the "platform method") which are needed on many
+	machines to (among others) speed up the resume by letting the BIOS skip
+	some steps or to let the system recognise the correct state of the
+	hardware after the resume (in particular on many machines this ensures
+	that unplugged AC adapters get correctly detected and that kacpid does
+	not run wild after the resume).  The last ioctl() argument can take one
+	of the three values, defined in kernel/power/power.h:
 	PMOPS_PREPARE - make the kernel carry out the
-		pm_ops->prepare(PM_SUSPEND_DISK) operation
+		hibernate_ops->prepare() operation
 	PMOPS_ENTER - make the kernel power off the system by calling
-		pm_ops->enter(PM_SUSPEND_DISK)
+		hibernate_ops->enter()
 	PMOPS_FINISH - make the kernel carry out the
-		pm_ops->finish(PM_SUSPEND_DISK) operation
+		hibernate_ops->finish() operation
+	Note that the actual constants are misnamed because they surface
+	internal kernel implementation details that have changed.
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
--- wireless-dev.orig/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:02.150691185 +0200
+++ wireless-dev/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:09.440691185 +0200
@@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
 			 * also needs to get error handling and probably
 			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
 			 */
-			pm_suspend(PM_SUSPEND_DISK);
+			hibernate();
 #endif
 			poll = 1;
 		}
--- wireless-dev.orig/kernel/sys.c	2007-04-26 18:15:01.310691185 +0200
+++ wireless-dev/kernel/sys.c	2007-04-26 18:15:09.450691185 +0200
@@ -25,6 +25,7 @@
 #include <linux/security.h>
 #include <linux/dcookies.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
@@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		{
-			int ret = pm_suspend(PM_SUSPEND_DISK);
+			int ret = hibernate();
 			unlock_kernel();
 			return ret;
 		}
--- wireless-dev.orig/drivers/acpi/sleep/main.c	2007-04-26 18:15:02.290691185 +0200
+++ wireless-dev/drivers/acpi/sleep/main.c	2007-04-26 18:15:09.630691185 +0200
@@ -15,6 +15,7 @@
 #include <linux/dmi.h>
 #include <linux/device.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <acpi/acpi_bus.h>
 #include <acpi/acpi_drivers.h>
 #include "sleep.h"
@@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = {
 	[PM_SUSPEND_ON] = ACPI_STATE_S0,
 	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
 	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
-	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
 	[PM_SUSPEND_MAX] = ACPI_STATE_S5
 };
 
@@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t
 		do_suspend_lowlevel();
 		break;
 
-	case PM_SUSPEND_DISK:
-		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
-			status = acpi_enter_sleep_state(acpi_state);
-		break;
-	case PM_SUSPEND_MAX:
-		acpi_power_off();
-		break;
-
 	default:
 		return -EINVAL;
 	}
@@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state)
 	suspend_state_t states[] = {
 		[1] = PM_SUSPEND_STANDBY,
 		[3] = PM_SUSPEND_MEM,
-		[4] = PM_SUSPEND_DISK,
 		[5] = PM_SUSPEND_MAX
 	};
 
 	if (acpi_state < 6 && states[acpi_state])
 		return pm_suspend(states[acpi_state]);
+	if (acpi_state == 4)
+		return hibernate();
 	return -EINVAL;
 }
 
@@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = {
 	.finish = acpi_pm_finish,
 };
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int acpi_hib_prepare(void)
+{
+	return acpi_sleep_prepare(ACPI_STATE_S4);
+}
+
+static int acpi_hib_enter(void)
+{
+	acpi_status status = AE_OK;
+	unsigned long flags = 0;
+	u32 acpi_state = ACPI_STATE_S4;
+
+	ACPI_FLUSH_CPU_CACHE();
+
+	/* Do arch specific saving of state. */
+	int error = acpi_save_state_mem();
+	if (error)
+		return error;
+
+	local_irq_save(flags);
+	acpi_enable_wakeup_device(acpi_state);
+	status = acpi_enter_sleep_state(acpi_state);
+
+	/* ACPI 3.0 specs (P62) says that it's the responsibility
+	 * of the OSPM to clear the status bit [ implying that the
+	 * POWER_BUTTON event should not reach userspace ]
+	 */
+	if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3))
+		acpi_clear_event(ACPI_EVENT_POWER_BUTTON);
+
+	local_irq_restore(flags);
+	printk(KERN_DEBUG "Back to C!\n");
+
+	/* restore processor state
+	 * We should only be here if we're coming back from STR or STD.
+	 * And, in the case of the latter, the memory image should have already
+	 * been loaded from disk.
+	 */
+	acpi_restore_state_mem();
+
+	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
+}
+
+static void acpi_hib_finish(void)
+{
+	acpi_leave_sleep_state(ACPI_STATE_S4);
+	acpi_disable_wakeup_device(ACPI_STATE_S4);
+
+	/* reset firmware waking vector */
+	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
+
+	if (init_8259A_after_S1) {
+		printk("Broken toshiba laptop -> kicking interrupts\n");
+		init_8259A(0);
+	}
+	return;
+}
+
+static struct hibernate_ops acpi_hib_ops = {
+	.prepare = acpi_hib_prepare,
+	.enter = acpi_hib_enter,
+	.finish = acpi_hib_finish,
+};
+#endif /* CONFIG_SOFTWARE_SUSPEND */
+
 /*
  * Toshiba fails to preserve interrupts over S1, reinitialization
  * of 8259 is needed after S1 resume.
@@ -227,13 +285,16 @@ int __init acpi_sleep_init(void)
 			sleep_states[i] = 1;
 			printk(" S%d", i);
 		}
-		if (i == ACPI_STATE_S4) {
-			if (sleep_states[i])
-				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
-		}
 	}
 	printk(")\n");
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	if (sleep_states[ACPI_STATE_S4])
+		hibernate_set_ops(&acpi_hib_ops);
+#else
+	sleep_states[ACPI_STATE_S4] = 0;
+#endif
+
 	pm_set_ops(&acpi_pm_ops);
 	return 0;
 }
--- wireless-dev.orig/kernel/power/power.h	2007-04-26 18:15:01.240691185 +0200
+++ wireless-dev/kernel/power/power.h	2007-04-26 18:15:09.630691185 +0200
@@ -13,16 +13,6 @@ struct swsusp_info {
 
 
 
-#ifdef CONFIG_SOFTWARE_SUSPEND
-extern int pm_suspend_disk(void);
-
-#else
-static inline int pm_suspend_disk(void)
-{
-	return -EPERM;
-}
-#endif
-
 extern struct mutex pm_mutex;
 
 #define power_attr(_name) \
@@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t
 struct timeval;
 extern void swsusp_show_speed(struct timeval *, struct timeval *,
 				unsigned int, char *);
+
+extern int hibernate_platform_prepare(void);
+extern void hibernate_platform_finish(void);
--- wireless-dev.orig/drivers/acpi/sleep/proc.c	2007-04-26 18:15:02.720691185 +0200
+++ wireless-dev/drivers/acpi/sleep/proc.c	2007-04-26 18:15:09.630691185 +0200
@@ -1,6 +1,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/suspend.h>
+#include <linux/hibernate.h>
 #include <linux/bcd.h>
 #include <asm/uaccess.h>
 
@@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil
 	state = simple_strtoul(str, NULL, 0);
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	if (state == 4) {
-		error = pm_suspend(PM_SUSPEND_DISK);
+		error = hibernate();
 		goto Done;
 	}
 #endif
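
For readers following the hibernate_ops registration in the patch above, here
is a minimal userspace sketch of the same pattern. It is not the kernel code:
model_set_ops(), model_platform_prepare() and the demo_* hooks are hypothetical
names, and the sketch returns -1 where the patch would BUG_ON(). The point it
illustrates is that the checks have to look at the incoming argument, because
the global pointer is still NULL when the first platform registers.

```c
#include <assert.h>	/* only needed by callers that check results with assert() */
#include <stddef.h>

/* Userspace model of the three hooks in struct hibernate_ops. */
struct hibernate_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* NULL until a platform registers its hooks. */
static const struct hibernate_ops *model_ops;

/*
 * Check the argument, not the global: the global is still NULL on the
 * first call. Returns -1 where the kernel patch would BUG_ON().
 */
static int model_set_ops(const struct hibernate_ops *ops)
{
	if (!ops || !ops->prepare || !ops->enter || !ops->finish)
		return -1;
	model_ops = ops;
	return 0;
}

/* Mirrors the PM_DISK_PLATFORM case: without registered ops there is nothing to do. */
static int model_platform_prepare(void)
{
	return model_ops ? model_ops->prepare() : 0;
}

/* Hypothetical platform hooks that only count their invocations. */
static int demo_calls;
static int demo_prepare(void) { demo_calls++; return 0; }
static int demo_enter(void) { demo_calls++; return 0; }
static void demo_finish(void) { demo_calls++; }

static const struct hibernate_ops demo_ops = {
	.prepare = demo_prepare,
	.enter = demo_enter,
	.finish = demo_finish,
};

/* Incomplete on purpose: enter and finish are left NULL. */
static const struct hibernate_ops partial_ops = {
	.prepare = demo_prepare,
};
```

Registering partial_ops is rejected, registering demo_ops succeeds, and only
after that does model_platform_prepare() reach demo_prepare().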

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 16:14                                       ` Chris Friesen
  2007-04-26 16:27                                         ` Linus Torvalds
@ 2007-04-26 17:11                                         ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 17:11 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

On Thu, 2007-04-26 at 10:14 -0600, Chris Friesen wrote:
> Johannes Berg wrote:
> 
> > Judging from experience with the wext 32/64 bit fiasco it seems to be
> > rather uncommon to use 32-bit userspace on 64-bit machines.
> 
> I disagree...it's quite common.  I think it's the standard way of doing 
> things for ppc64, for instance.

I know. My only 64-bit machine is ppc64 :)

But still nobody noticed the wext 32/64 bit compat bug for like forever.
On the other hand, maybe that just means that most 64-bit machines are
desktop machines without wireless.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  9:07                               ` Xavier Bestel
  2007-04-25  9:19                                 ` Nigel Cunningham
@ 2007-04-26 18:18                                 ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-26 18:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: suspend2-devel

Xavier Bestel wrote:
> On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote:
>>> (And guess what, it uses APM and suspend is really faster and way more
>>> reliable than each kernel implementation I could try).
>> If you tried Suspend2 and had problems with reliability, please send me
>> logs. I'll do all I can to help. (I have to qualify it a bit, because
>> I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and
>> I'll fix it).
> 
> Does suspend2 work with APM ? After much trying, I think now the ACPI
> implementation of my laptop (a vintage Compaq Armada 1700) is busted,
> only APM works.
> 
> AFAIR the problem with suspend2 was that it didn't poweroff some parts
> of the laptop (the led of the wifi pcmcia card was on, and the lcd light
> was on too), but that was last year. Kernel's suspend kind of worked but
> didn't resume (no reaction on button press). As I tried all this last
> year, I may have forgotten some things.
> Honestly, I like this laptop when it works flawlessly, so I don't see
> many reasons to try *susp* again. I'll do it when I'm bored, just not
> today.
> 
Actually on some old laptops I just use the apm command, with -s (or -S, 
I forget by now), and that works.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 11:09                             ` Pavel Machek
  2007-04-26 15:53                               ` Linus Torvalds
@ 2007-04-26 18:21                               ` Olivier Galibert
  2007-04-26 21:30                                 ` Pavel Machek
  1 sibling, 1 reply; 712+ messages in thread
From: Olivier Galibert @ 2007-04-26 18:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Berg, Linus Torvalds, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

On Thu, Apr 26, 2007 at 01:09:53PM +0200, Pavel Machek wrote:
> #define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)

So I'm not supposed to be able to suspend the 32-bit servers with 16 GB
of RAM that I have here?

  OG.
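
The arithmetic behind that question can be sketched as below. This assumes the
ioctl argument is a size in bytes and that unsigned long is 32 bits wide on the
32-bit ABI in question; neither is stated in the thread. fits_in_width() is a
hypothetical helper, not anything from the kernel.

```c
#include <assert.h>	/* only needed by callers that check results with assert() */
#include <stdint.h>

/*
 * Returns 1 if `bytes` can be represented in an unsigned integer that is
 * `bits` wide, 0 otherwise. Widths of 64 and above are treated as 64.
 */
static int fits_in_width(uint64_t bytes, unsigned bits)
{
	uint64_t max = (bits >= 64) ? UINT64_MAX
				    : ((UINT64_C(1) << bits) - 1);

	return bytes <= max;
}
```

Under those assumptions a 32-bit argument tops out just below 4 GiB, so
fits_in_width(16ULL << 30, 32) is 0 while fits_in_width(16ULL << 30, 64) is 1.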

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 18:40                                                   ` Rafael J. Wysocki
  (?)
@ 2007-04-26 18:40                                                   ` Johannes Berg
  2007-04-26 19:02                                                     ` Rafael J. Wysocki
  2007-04-26 19:02                                                     ` Rafael J. Wysocki
  -1 siblings, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-26 18:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

[-- Attachment #1: Type: text/plain, Size: 1240 bytes --]

On Thu, 2007-04-26 at 20:40 +0200, Rafael J. Wysocki wrote:

> >  * it surfaces kernel implementation details about pm_ops and thus makes
> >    the whole thing very fragile
> 
> Can you elaborate?

Well it tells userspace about pm_ops->enter/prepare/finish etc.
Also, it seems that it needs a "release memory now" operation instead of
just releasing it when the fd is closed?

> >  * it has yet another interface (yuck) to determine whether to reboot,
> >    shut down etc, doesn't use /sys/power/disk
> 
> Yes.  In fact it was meant as a replacement for /sys/power/disk at one point.

Heh.

> >  * I generally had no idea wtf it is doing in some places
> 
> I could have told you if you had asked. :-)

I was offline ;)

> Do we need hibernate_ops at all?  There's only one user anyway and I'm not
> sure there will be more of them in the future.

I'm pretty sure there won't be, but there's no way to do it cleanly
without pm_ops since even acpi doesn't do this all the time but only
when some set of conditions is true. Hence, it needs to be able to
determine the availability of the platform mode at run time rather than
build time (build time => we could use weak symbols, arch hooks, ...)

johannes
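
A rough userspace sketch of the run-time availability check described above,
modelled on the mode parsing for /sys/power/disk: the MODE_* names and
parse_mode() are hypothetical, and the have_platform_ops flag stands in for
"hibernate_ops has been registered". Unlike the kernel code, this sketch also
compares lengths so that a prefix such as "re" does not match "reboot"; that
is a choice made for the example, not something the patch does.

```c
#include <assert.h>	/* only needed by callers that check results with assert() */
#include <string.h>

enum {
	MODE_INVALID,
	MODE_PLATFORM,
	MODE_TEST,
	MODE_TESTPROC,
	MODE_SHUTDOWN,
	MODE_REBOOT,
	/* keep last */
	__MODE_AFTER_LAST
};
#define MODE_MAX (__MODE_AFTER_LAST - 1)
#define MODE_FIRST (MODE_INVALID + 1)

static const char *const mode_names[] = {
	[MODE_PLATFORM] = "platform",
	[MODE_TEST] = "test",
	[MODE_TESTPROC] = "testproc",
	[MODE_SHUTDOWN] = "shutdown",
	[MODE_REBOOT] = "reboot",
};

/*
 * Parse a mode written by the user (optionally newline-terminated).
 * "platform" is only accepted when platform hooks exist at run time.
 */
static int parse_mode(const char *buf, size_t n, int have_platform_ops)
{
	const char *p = memchr(buf, '\n', n);
	size_t len = p ? (size_t)(p - buf) : n;
	int i;

	/* inclusive upper bound, so the last mode is reachable too */
	for (i = MODE_FIRST; i <= MODE_MAX; i++) {
		if (strlen(mode_names[i]) != len ||
		    strncmp(buf, mode_names[i], len))
			continue;
		if (i == MODE_PLATFORM && !have_platform_ops)
			return MODE_INVALID;
		return i;
	}
	return MODE_INVALID;
}
```

With no platform hooks, parse_mode("platform", 8, 0) yields MODE_INVALID while
parse_mode("reboot\n", 7, 0) still yields MODE_REBOOT.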

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 16:31                                                 ` Johannes Berg
@ 2007-04-26 18:40                                                   ` Rafael J. Wysocki
  -1 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 18:40 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

On Thursday, 26 April 2007 18:31, Johannes Berg wrote:
> On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:
> 
> > > From looking at pm_ops which I was recently working with a lot, it seems
> > > that it was designed by somebody who was reading the ACPI documentation
> > > and was otherwise pretty clueless, even at that level std tries to look
> > > like suspend. IMHO that is one of the first things that should be ripped
> > > out, no pm_ops for STD, it's a pain to work with.
> > 
> > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> > low-level enter is pretty similar).
> > 
> > Patches would be welcome
> 
> That was easier than I thought. This applies on top of a patch that
> makes kernel/power/user.c optional since I had no idea how to fix it,
> problems I see:
>  * it surfaces kernel implementation details about pm_ops and thus makes
>    the whole thing very fragile

Can you elaborate?

>  * it has yet another interface (yuck) to determine whether to reboot,
>    shut down etc, doesn't use /sys/power/disk

Yes.  In fact it was meant as a replacement for /sys/power/disk at one point.

>  * I generally had no idea wtf it is doing in some places

I could have told you if you had asked. :-)

> Anyway, this patch is only compile tested, it
>  * introduces include/linux/hibernate.h with hibernate_ops and
>    a new hibernate() function to hibernate the system

Do we need hibernate_ops at all?  There's only one user anyway and I'm not
sure there will be more of them in the future.

>  * rips apart a lot of the suspend code and puts it back together using
>    the hibernate_ops
>  * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
>  * might apply/compile against -mm, I have all my and some of Rafael's
>    suspend/hibernate work in my tree.
>  * breaks user suspend as I noted above
>  * is incomplete, somewhere pm_suspend_disk() is still defined iirc

I think I can fix it up, just give me some time.

The idea is good, I think we should do something like this.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 18:40                                                   ` Johannes Berg
@ 2007-04-26 19:02                                                     ` Rafael J. Wysocki
  2007-04-27  9:41                                                       ` Johannes Berg
  2007-04-27  9:41                                                       ` Johannes Berg
  2007-04-26 19:02                                                     ` Rafael J. Wysocki
  1 sibling, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 19:02 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

On Thursday, 26 April 2007 20:40, Johannes Berg wrote:
> On Thu, 2007-04-26 at 20:40 +0200, Rafael J. Wysocki wrote:
> 
> > >  * it surfaces kernel implementation details about pm_ops and thus makes
> > >    the whole thing very fragile
> > 
> > Can you elaborate?
> 
> Well it tells userspace about pm_ops->enter/prepare/finish etc.
> Also, it seems that it needs a "release memory now" operation instead of
> just releasing it when the fd is closed?

Yes.  That's because we want to be able to repeat creating the image
without closing the fd in some situations.

> > >  * it has yet another interface (yuck) to determine whether to reboot,
> > >    shut down etc, doesn't use /sys/power/disk
> > 
> > Yes.  In fact it was meant as a replacement for /sys/power/disk at one point.
> 
> Heh.
> 
> > >  * I generally had no idea wtf it is doing in some places
> > 
> > I could have told you if you had asked. :-)
> 
> I was offline ;)
> 
> > Do we need hibernate_ops at all?  There's only one user anyway and I'm not
> > sure there will be more of them in the future.
> 
> I'm pretty sure there won't be, but there's no way to do it cleanly
> without pm_ops since even acpi doesn't do this all the time but only
> when some set of conditions is true. Hence, it needs to be able to
> determine the availability of the platform mode at run time rather than
> build time (build time => we could use weak symbols, arch hooks, ...)

Still, we could use a global var 'platform_hibernation' or something like this,
I think.  Then, we can do

#define platform_hibernation	0

on the architectures that don't need it and make ACPI use it instead of this
"dynamic linking".
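
A minimal sketch of that idea, under stated assumptions: CONFIG_PLATFORM_HIBERNATION
and choose_hibernate_path() are hypothetical names used only for illustration, not
existing kernel symbols.  On a build without platform support the flag is the
constant 0, so the compiler can drop the platform branch entirely.

```c
/* Hypothetical config switch: defined only on builds that have a platform hook. */
#ifdef CONFIG_PLATFORM_HIBERNATION
/* Run-time flag: platform code (e.g. ACPI) would set it when its conditions hold. */
int platform_hibernation;
#else
/* Constant 0: the platform branch below becomes dead code on this build. */
#define platform_hibernation	0
#endif

enum hibernate_path { PATH_SHUTDOWN, PATH_PLATFORM };

/* Pick the power-down path; only the flag decides, no ops table is consulted. */
enum hibernate_path choose_hibernate_path(void)
{
	if (platform_hibernation)
		return PATH_PLATFORM;
	return PATH_SHUTDOWN;
}
```

With the switch undefined, choose_hibernate_path() always takes the plain
shutdown path, which is the behaviour wanted on architectures without a
platform mode.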

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25 21:41                         ` Matt Mackall
  2007-04-26 11:27                           ` Pavel Machek
@ 2007-04-26 19:04                           ` Bill Davidsen
  1 sibling, 0 replies; 712+ messages in thread
From: Bill Davidsen @ 2007-04-26 19:04 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Pavel Machek, Ingo Molnar, Linus Torvalds, Nigel Cunningham,
	Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel,
	Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner,
	Arjan van de Ven

Matt Mackall wrote:
> On Tue, Apr 24, 2007 at 11:29:56PM +0200, Pavel Machek wrote:
>> We do not want to fragment the testing base, and suspend2 does not
>> really have any interesting features over uswsusp.
> 
> The testing base is already fragmented!
> 
> What the current situation means is that you simply never hear from
> the people who get fed up with suspend but who manage to get suspend2
> working.
> 
I have to feel that having a *working resume* capability is "any 
interesting features" enough. What you say about "simply never hear 
from" is unfortunately true.

On 04/25/2007 05:30 PM EDT, Pavel Machek wrote:

 >It is not Rafael's fault. Actually it is quite hard to work with
 >Nigel, because he implements every feature someone asks for, and wants
 >to merge them all  :-( . I don't expect to ever agree with Nigel on
 >anything important, sorry.

The fact that Pavel thinks giving the users what they want is a problem 
certainly defines the difference between them, the populist "give them 
what they want" and the elitist "let them make do with what I think 
they should have."

I do respect Pavel for all the stuff he has done and is doing, I wish I 
could have found a nicer way to say that.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 16:10                                     ` Pekka Enberg
@ 2007-04-26 19:28                                       ` Rafael J. Wysocki
  2007-04-26 20:16                                         ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 19:28 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds,
	Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> 
> On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > schedulers, for example), but I think for this purpose it should be reviewed
> > properly.
> 
> Yeah, this makes sense.
> 
> On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > work, but at least to me it seems to be potentially dangerous.
> 
> I am new to suspend2 so can you please explain what exactly is dangerous
> about it?

After freezing tasks, it first saves the contents of the LRU pages, freezes
devices and then uses the LRU pages for storing the suspend image (if more
memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
guarantee that the LRU pages are not updated after we've saved their contents
(first potential problem here).

After the image has been created, we have to unfreeze devices and save the
image.  Now, we have no guarantee that no one will be writing to the LRU pages
that we have used to store the image, for whatever reason, so the
image can potentially get corrupted while it's being saved.

In principle, device drivers can do this and there are some kernel threads that
also can do this (we don't freeze them, because they're needed for the image
saving).

The design is conceptually really really complicated and it makes strong
assumptions about the behavior of different subsystems.  While these
assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
of them in the future if suspend2 were merged.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26  0:34                                               ` Linus Torvalds
@ 2007-04-26 20:12                                                 ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 20:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

On Thursday, 26 April 2007 02:34, Linus Torvalds wrote:
> 
> On Wed, 25 Apr 2007, Linus Torvalds wrote:
> > 
> > The *thaw* needs to happen with devices quiescent. 
> 
> Btw, I sure as hell hope you didn't use "suspend()" for that. You're 
> (again) much better off having a totally separate function that just 
> freezes stuff.
> 
> So in the "snapshot+shutdown" path, you should have:
> 
>  - prepare_to_snapshot() - allocate memory, and possibly return errors
> 
>    We can skip this, if we just make the rule be that any devices that 
>    want to support snapshotting must always have the memory required for 
>    snapshotting pre-allocated. Most devices really do allocate memory for 
>    their state anyway, and the only real reason for the "prepare" stage 
>    here is because the final snapshot has to happen with interrupts off, 
>    obviously. So *if* we don't need to allocate any memory, and if we 
>    don't expect to want to accept some early error case, this is likely 
>    useless.

I think we need this.  Apparently, some device drivers need as much as 30 meg
of RAM at later stages (I don't know why and what for).

>  - snapshot() - actually save device state that is consistent with the 
>    memory image at the time. Called with interrupts off, but the device 
>    has to be usable both before and afterwards!
> 
> And I would seriously suggest that "snapshot()" be documented to not rely 
> on any DMA memory, exactly because the device has to be accessible both 
> before and after (before - because we're running and allocating memory, 
> and after - because we'll be writing things out). But see later:

Please note that some drivers are compiled as modules and they may deal with
uninitialized hardware (or worse, with the hardware initialized by the BIOS in
a crazy way) after restart_snapshot().  It may be better for them to actually
quiesce the devices here too to avoid problems after restart_snapshot() .

> For the "resume snapshot" path, I would suggest having 
> 
>  - freeze(): quiesce the device. This literally just does the absolute 
>    minimum to make sure that the device doesn't do anything surprising (no 
>    interrupts, no DMA, no nothing). For many devices, it's a no-op, even 
>    if they can do DMA (eg most disk controllers will do DMA, but only as 
>    an actual result of a request, and upper layers will be quiescent 
>    anyway, so they do *not* need to disable DMA)
> 
>    NOTE! The "freeze()" gets called from the *old* kernel just _before_ a
>    snapshot unpacking!!

Yes, and usually the majority of modules is not loaded at that time.

>  - restart_snapshot() - actually restart the snapshot (and usually this 
>    would involve re-setting the device, not so much trying to restore all 
>    the saved state. IOW, it's easier to just re-initialize the DMA command 
>    queues than to try to make them "atomic" in the snapshot).
> 
>    NOTE! This gets called by the *new* kernel _after_ the snapshot resume!

I think devices _should_ be reset in restart_snapshot(), unless it's
possible to check if they have already been initialized by the "old" kernel -
but this information would have to be available from the device itself.

> And if you *want* to, I can see that you might want to actually do a 
> "unfreeze()" thing too, and make the actual snapshotting be:

What would unfreeze() be needed for in that case?

> 	/* We may not even need this.. */
> 	for_each_device() {
> 		err = prepare_to_snapshot();
> 		if (err)
> 			return err;
> 	}

We need to free as much memory as we'll need for the image creation at this
point.

> 	/* This is the real work for snapshotting */
> 	cli();
> 	for_each_device()
> 		freeze(dev);

You've added freeze() here, but it's not on your list above?

> 	for_each_device()
> 		snapshot(dev);
> 	.. snapshot current memory image ..
> 	for_each_device_depth_first()
> 		unfreeze(dev);
> 	sti();
> 
> and maybe it's worth it, but I would almost suggest that you just make the 
> rule be that any DMA etc just *has* to be re-initialized by 
> "restart_snapshot()", in which case it's not even necessary to 
> freeze/unfreeze over the device, and "snapshot()" itself only needs to 
> make sure any non-DMA data is safe.
>
> But adding the freeze/unfreeze (which is a no-op for most hardware anyway) 
> might make things easier to think about, so I would certainly not *object* 
> to it, even if I suspect it's not necessary.

I think it's not necessary.

> Anyway, the restore_snapshot() sequence should be:
> 
> 	/* Old kernel.. Normal boot, load snapshot image */
> 	cli()
> 	for_each_device()
> 		freeze(dev);
> 	restore_snapshot_image();
> 	restore_regs_and_jump_to_image();
> 	/* noreturn */
> 
> 
> 	/* New kernel, gets called at the snapshot restore address
> 	 * with interrupts off and devices frozen, and memory image
> 	 * consistent with what it was at "snapshot()" time
> 	 */
> 	for_each_dev_depth_first()
> 		restore_snapshot(dev);
> 	/* And if you want to, just to be "symmetric"
> 
> 		for_each_dev_depth_first()
> 			unfreeze(dev)
> 
> 	   although I think you could just make "restore_snapshot()" 
> 	   implicitly unfreeze it too..

Agreed.

> 	 */
> 	sti();
> 	/* We're up */
> 
> and notice how *different* this is from what happens for s2ram. There 
> really isn't anything in common here. Exactly because s2ram simply doesn't 
> _have_ any of the issues with atomic memory images.

Agreed again.

Moreover, in the s2ram case there are no problems with device drivers compiled
as modules.
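
The snapshot path agreed on above (prepare first, then snapshot with
interrupts off, no freeze/unfreeze) could be sketched in user space as
follows.  prepare_to_snapshot() and snapshot() are the names from the quoted
pseudocode; struct snap_dev and snapshot_all() are hypothetical, and
interrupt handling is only marked by comments.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device taking part in the snapshot. */
struct snap_dev {
	const char *name;
	int prepare_err;	/* value prepare_to_snapshot() returns (0 = ok) */
	int prepared;		/* set once memory has been reserved */
	int snapshotted;	/* set once device state has been saved */
};

/* May fail: this is the only stage that is allowed to allocate memory. */
static int prepare_to_snapshot(struct snap_dev *d)
{
	if (d->prepare_err)
		return d->prepare_err;
	d->prepared = 1;
	return 0;
}

/* Must not fail or allocate: it would run with interrupts off. */
static void snapshot(struct snap_dev *d)
{
	d->snapshotted = 1;
}

/* Returns 0, or the first prepare error; on error no device is snapshotted. */
int snapshot_all(struct snap_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = prepare_to_snapshot(&devs[i]);
		if (err)
			return err;
	}
	/* interrupts would be disabled here */
	for (size_t i = 0; i < n; i++)
		snapshot(&devs[i]);
	/* the memory image would be copied here, then interrupts re-enabled */
	return 0;
}
```

Keeping every failure in the prepare stage means the interrupts-off section
has no error path at all.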

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 19:28                                       ` Rafael J. Wysocki
@ 2007-04-26 20:16                                         ` Nigel Cunningham
  2007-04-26 20:37                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26 20:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar,
	Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

Hi.

On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote:
> On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> > 
> > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > > schedulers, for example), but I think for this purpose it should be reviewed
> > > properly.
> > 
> > Yeah, this makes sense.
> > 
> > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > > work, but at least to me it seems to be potentially dangerous.
> > 
> > I am new to suspend2 so can you please explain what exactly is dangerous
> > about it?
> 
> After freezing tasks, it first saves the contents of the LRU pages, freezes
> devices and then uses the LRU pages for storing the suspend image (if more
> memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
> warranty that the LRU pages are not updated after we've saved their contents
> (first potential problem here).
> 
> After the image has been created, we have to unfreeze devices and save the
> image.  Now, we have no warranty that no one will be writing to the LRU pages
> that we have used to store the image, for whatever reasons known to him, so the
> image can potentially get corrupted while it's being saved.
> 
> In principle, device drivers can do this and there are some kernel threads that
> also can do this (we don't freeze them, because they're needed for the image
> saving).
> 
> The design is conceptually really really complicated and it makes strong
> assumptions about the behavior of different subsystems.  While these
> assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
> of them in the future if suspend2 were merged.

That's a good description of the issue, although I think _may_ and
_seems_ are stating things a bit more pessimistically than is
necessary. 

You see, we need to remember that the pages which are saved separately
are LRU pages. Because userspace is frozen, their contents are going to
be static. The only possibilities for modifying them come from timer
routines, improperly frozen filesystems and device drivers.

We have code to check that the LRU isn't changing, and I've only seen
one report of modifications to about 20 LRU pages. I haven't had the
time yet to chase down the cause, but hope to do so soon.
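
One way such a check could work, sketched in user space: record a checksum
of each page when its contents are saved and compare again before the image
is written.  This is an illustration only, not the actual suspend2 code;
page_checksum(), record_checksums() and count_modified_pages() are
hypothetical names, and FNV-1a is used purely for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096

/* FNV-1a over one page; a stand-in for whatever checksum a real check uses. */
uint32_t page_checksum(const unsigned char *page)
{
	uint32_t hash = 2166136261u;

	for (size_t i = 0; i < PAGE_SIZE_BYTES; i++) {
		hash ^= page[i];
		hash *= 16777619u;
	}
	return hash;
}

/* Record a checksum for each page at the time its contents are saved. */
void record_checksums(unsigned char pages[][PAGE_SIZE_BYTES], size_t n,
		      uint32_t *sums)
{
	for (size_t i = 0; i < n; i++)
		sums[i] = page_checksum(pages[i]);
}

/* Count pages whose contents no longer match the recorded checksum. */
size_t count_modified_pages(unsigned char pages[][PAGE_SIZE_BYTES], size_t n,
			    const uint32_t *sums)
{
	size_t modified = 0;

	for (size_t i = 0; i < n; i++)
		if (page_checksum(pages[i]) != sums[i])
			modified++;
	return modified;
}
```

A non-zero count only says that something wrote to the pages; it does not
say who did it.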

The general scheme has been working for four or five years - if there
was a fundamental issue, we would have found it by now.

The scheme isn't complicated. The algo for figuring out whether to save
the page in an atomic copy just says: Iterate through all LRU pages. For
each page, ask: Is this used by the thread suspending, or by userui? No?
Save separately. Yes? Save in the atomic copy.... oh, and save
everything else that needs to be saved in the atomic copy.
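
That selection rule could be sketched roughly like this.  It is a user-space
illustration under assumptions: struct page_info and its fields are
hypothetical stand-ins rather than kernel structures, and "everything else"
is modelled simply as any page that is not on the LRU.

```c
#include <stdbool.h>

enum save_set { SAVE_SEPARATELY, SAVE_ATOMIC_COPY };

/* Hypothetical per-page summary; the real code would inspect struct page. */
struct page_info {
	bool on_lru;		/* page is on the LRU lists */
	bool used_by_suspender;	/* used by the thread doing the suspend */
	bool used_by_userui;	/* used by the userui helper */
};

enum save_set classify_page(const struct page_info *p)
{
	/*
	 * LRU pages not touched by the suspending thread or userui are
	 * assumed to stay static while userspace is frozen, so they can
	 * be written out before the atomic copy is made.
	 */
	if (p->on_lru && !p->used_by_suspender && !p->used_by_userui)
		return SAVE_SEPARATELY;

	/* Everything else that must be saved goes into the atomic copy. */
	return SAVE_ATOMIC_COPY;
}
```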

Regards,

Nigel

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 20:16                                         ` Nigel Cunningham
@ 2007-04-26 20:37                                           ` Rafael J. Wysocki
  2007-04-26 20:49                                             ` David Lang
  2007-04-26 20:55                                             ` Nigel Cunningham
  0 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 20:37 UTC (permalink / raw)
  To: nigel
  Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar,
	Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote:
> Hi.
> 
> On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote:
> > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> > > 
> > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > > > schedulers, for example), but I think for this purpose it should be reviewed
> > > > properly.
> > > 
> > > Yeah, this makes sense.
> > > 
> > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > > > work, but at least to me it seems to be potentially dangerous.
> > > 
> > > I am new to suspend2 so can you please explain what exactly is dangerous
> > > about it?
> > 
> > After freezing tasks, it first saves the contents of the LRU pages, freezes
> > devices and then uses the LRU pages for storing the suspend image (if more
> > memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
> > warranty that the LRU pages are not updated after we've saved their contents
> > (first potential problem here).
> > 
> > After the image has been created, we have to unfreeze devices and save the
> > image.  Now, we have no warranty that no one will be writing to the LRU pages
> > that we have used to store the image, for whatever reasons known to him, so the
> > image can potentially get corrupted while it's being saved.
> > 
> > In principle, device drivers can do this and there are some kernel threads that
> > also can do this (we don't freeze them, because they're needed for the image
> > saving).
> > 
> > The design is conceptually really really complicated and it makes strong
> > assumptions about the behavior of different subsystems.  While these
> > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
> > of them in the future if suspend2 were merged.
> 
> That's a good description of the issue, although I think _may_ and
> _seems_ are stating things a bit more pessimistically than is
> necessary. 

I've used them to express my personal concerns.

> You see, we need to remember that the pages which are saved separately
> are LRU pages. Because userspace is frozen, their contents are going to
> be static. The only possibilities for modifying them come from timer
> routines, improperly frozen filesystems and device drivers.

And kernel threads that we don't freeze deliberately.  Currently, these are
all worker threads, dm-related kernel threads and some others.

> We have code to check that the LRU isn't changing, and I've only seen
> one report of modifications to about 20 LRU pages. I haven't had the
> time yet to chase down the cause, but hope to do so soon.

I didn't say that would be common.  If it had been, you'd have seen problems
with it.  To me the problem is the lack of any guarantee that it won't happen.

> The general scheme has been working for four or five years - if there
> was a fundamental issue, we would have found it by now.
> 
> The scheme isn't complicated.

Conceptually, it is complicated just because you're using the LRU.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 20:37                                           ` Rafael J. Wysocki
@ 2007-04-26 20:49                                             ` David Lang
  2007-04-26 20:55                                             ` Nigel Cunningham
  1 sibling, 0 replies; 712+ messages in thread
From: David Lang @ 2007-04-26 20:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: nigel, Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu,
	Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Thu, 26 Apr 2007, Rafael J. Wysocki wrote:

>> The general scheme has been working for four or five years - if there
>> was a fundamental issue, we would have found it by now.
>>
>> The scheme isn't complicated.
>
> Conceptually, it is complicated just because you're using the LRU.

I know that I've seen many projects that are working on, or claim to have 
succeeded in, live migration of processes from one system to 
another. Has anyone looked at using any of these mechanisms for snapshotting the 
user processes for the std situation? If you can do this a process at a time, you 
may be able to avoid the massive blob of a write.

Instead of what Linus was saying

buff = snapshot()
write(buff)

it would be
start_snapshot()   /* stops all userspace schedulers except for this process */
foreach(pid) {
   buff = snapshotpid(pid)
   write(buff)
}

David Lang

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 20:37                                           ` Rafael J. Wysocki
  2007-04-26 20:49                                             ` David Lang
@ 2007-04-26 20:55                                             ` Nigel Cunningham
  2007-04-26 21:22                                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26 20:55 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar,
	Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

Hi.

On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote:
> On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote:
> > Hi.
> > 
> > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote:
> > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> > > > 
> > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > > > > schedulers, for example), but I think for this purpose it should be reviewed
> > > > > properly.
> > > > 
> > > > Yeah, this makes sense.
> > > > 
> > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > > > > work, but at least to me it seems to be potentially dangerous.
> > > > 
> > > > I am new to suspend2 so can you please explain what exactly is dangerous
> > > > about it?
> > > 
> > > After freezing tasks, it first saves the contents of the LRU pages, freezes
> > > devices and then uses the LRU pages for storing the suspend image (if more
> > > memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
> > > warranty that the LRU pages are not updated after we've saved their contents
> > > (first potential problem here).
> > > 
> > > After the image has been created, we have to unfreeze devices and save the
> > > image.  Now, we have no warranty that no one will be writing to the LRU pages
> > > that we have used to store the image, for whatever reasons known to him, so the
> > > image can potentially get corrupted while it's being saved.
> > > 
> > > In principle, device drivers can do this and there are some kernel threads that
> > > also can do this (we don't freeze them, because they're needed for the image
> > > saving).
> > > 
> > > The design is conceptually really really complicated and it makes strong
> > > assumptions about the behavior of different subsystems.  While these
> > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
> > > of them in the future if suspend2 were merged.
> > 
> > That's a good description of the issue, although I think _may_ and
> > _seems_ are stating things a bit more pessimistically than is
> > necessary. 
> 
> I've used them to express my personal concerns.
> 
> > You see, we need to remember that the pages which are saved separately
> > are LRU pages. Because userspace is frozen, their contents are going to
> > be static. The only possibilities for modifying them come from timer
> > routines, improperly frozen filesystems and device drivers.
> 
> And kernel threads that we don't freeze deliberately.  Currently, these are
> all worker threads, dm-related kernel threads and some others.
> 
> > We have code to check that the LRU isn't changing, and I've only seen
> > one report of modifications to about 20 LRU pages. I haven't had the
> > time yet to chase down the cause, but hope to do so soon.
> 
> I didn't say that would be common.  If it had been, you'd have seen problems
> with it.  To me the problem is the lack of warranty that it won't happen.
> 
> > The general scheme has been working for four or five years - if there
> > was a fundamental issue, we would have found it by now.
> > 
> > The scheme isn't complicated.
> 
> Conceptually, it is complicated just because you're using the LRU.

Well, I'm willing to look at other ideas. I actually selected LRU
because it was simple. Prior to this, we did have a try at just
iterating over the pages of frozen processes, but it didn't yield enough
pages to be viable. I wouldn't be surprised if hunting down the cause of
these changing pages leads to doing the opposite - starting with LRU
pages and then removing all the ones owned by processes. (Am I right in
thinking the remainder would be anonymous pages? I must learn more mm
innards :>).

Nigel

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 16:10                                                 ` Linus Torvalds
@ 2007-04-26 21:00                                                   ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 21:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Lord, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Hi!

> > > See? Two *totally* different cases. They have *nothing* in common. Not the
> > > call sequence, not the logic, not *anything*.
> > 
> > Except that both methods cannot rely upon hot-pluggable devices
> > still being present on resume/restore.  It is exceptionally common
> > to unplug all USB/firewire cables, mouse, keyboard, docking cables etc..
> > after a machine is in S2R state.
> 
> Right, and that has nothing to do with suspend/resume. You'd better be 
> able to handle unexpected hotplugs _regardless_.

Actually, with suspend/resume it is quite easy to cheat, and just
"unplug" the hardware on suspend, then "plug it back" on resume. That
works very well for devices like keyboards and mice (where you can't
tell if you are talking to the same hw, anyway).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 15:56                                     ` Linus Torvalds
@ 2007-04-26 21:06                                       ` Theodore Tso
  2007-04-26 21:12                                         ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: Theodore Tso @ 2007-04-26 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Johannes Berg, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

On Thu, Apr 26, 2007 at 08:56:48AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 26 Apr 2007, Pavel Machek wrote:
> > 
> > Yes, probably will. The other option is to break existing 32-bit
> > userspace, which is a bit more common AFAICT.
> 
> And *this* is why kernel/userspace things simply should not be done.
> 
> It's simply better to do things entirely in the kernel. Because you add 
> bugs and complications otherwise, and doing it in the kernel allows you 
> to just switch things around.
> 
> As it is, it appears that user-space suspend is just broken whichever way 
> we turn.

Well, in that case maybe suspend2 should be very seriously considered,
since it has the features of uswsusp --- basic features which every
single Microsoft and MacOSX user is used to, like progress bars
--- and it's all done in the kernel.

					- Ted


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 21:06                                       ` Theodore Tso
@ 2007-04-26 21:12                                         ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26 21:12 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Linus Torvalds, Pavel Machek, Johannes Berg, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven,
	Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

Hi.

On Thu, 2007-04-26 at 17:06 -0400, Theodore Tso wrote:
> On Thu, Apr 26, 2007 at 08:56:48AM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Thu, 26 Apr 2007, Pavel Machek wrote:
> > > 
> > > Yes, probably will. The other option is to break existing 32-bit
> > > userspace, which is a bit more common AFAICT.
> > 
> > And *this* is why kernel/userspace things simply should not be done.
> > 
> > It's simply better to do things entirely in the kernel. Because you add 
> > bugs and complications otherwise, and doing it in the kernel allows you 
> > to just switch things around.
> > 
> > As it is, it appears that user-space suspend is just broken whichever way 
> > we turn.
> 
> Well, in that case maybe suspend2 should be very seriously considered,
> since it has the features of uswsusp --- basic features which every
> single Microsoft and MacOSX user is used to, like progress bars
> --- and it's all done in the kernel.

Umm. I don't want to be picky, but that's not quite true. The progress
bar is done in userspace.

There's also the possibility of using a userspace app to manage storage
too (I did work on establishing/tearing down an NBD connection as
necessary but didn't quite get it finished and have never released it).
That said, this bit can be torn out by simply removing a file and the
Makefile line.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 20:55                                             ` Nigel Cunningham
@ 2007-04-26 21:22                                               ` Rafael J. Wysocki
  2007-04-26 22:08                                                 ` Nigel Cunningham
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 21:22 UTC (permalink / raw)
  To: nigel
  Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar,
	Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

On Thursday, 26 April 2007 22:55, Nigel Cunningham wrote:
> Hi.
> 
> On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote:
> > On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote:
> > > Hi.
> > > 
> > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote:
> > > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> > > > > 
> > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > > > > > schedulers, for example), but I think for this purpose it should be reviewed
> > > > > > properly.
> > > > > 
> > > > > Yeah, this makes sense.
> > > > > 
> > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > > > > > work, but at least to me it seems to be potentially dangerous.
> > > > > 
> > > > > I am new to suspend2 so can you please explain what exactly is dangerous
> > > > > about it?
> > > > 
> > > > After freezing tasks, it first saves the contents of the LRU pages, freezes
> > > > devices and then uses the LRU pages for storing the suspend image (if more
> > > > memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
> > > > warranty that the LRU pages are not updated after we've saved their contents
> > > > (first potential problem here).
> > > > 
> > > > After the image has been created, we have to unfreeze devices and save the
> > > > image.  Now, we have no warranty that no one will be writing to the LRU pages
> > > > that we have used to store the image, for whatever reasons known to him, so the
> > > > image can potentially get corrupted while it's being saved.
> > > > 
> > > > In principle, device drivers can do this and there are some kernel threads that
> > > > also can do this (we don't freeze them, because they're needed for the image
> > > > saving).
> > > > 
> > > > The design is conceptually really really complicated and it makes strong
> > > > assumptions about the behavior of different subsystems.  While these
> > > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
> > > > of them in the future if suspend2 were merged.
> > > 
> > > That's a good description of the issue, although I think _may_ and
> > > _seems_ are stating things a bit more pessimistically than is
> > > necessary. 
> > 
> > I've used them to express my personal concerns.
> > 
> > > You see, we need to remember that the pages which are saved separately
> > > are LRU pages. Because userspace is frozen, their contents are going to
> > > be static. The only possibilities for modifying them come from timer
> > > routines, improperly frozen filesystems and device drivers.
> > 
> > And kernel threads that we don't freeze deliberately.  Currently, these are
> > all worker threads, dm-related kernel threads and some others.
> > 
> > > We have code to check that the LRU isn't changing, and I've only seen
> > > one report of modifications to about 20 LRU pages. I haven't had the
> > > time yet to chase down the cause, but hope to do so soon.
> > 
> > I didn't say that would be common.  If it had been, you'd have seen problems
> > with it.  To me the problem is the lack of warranty that it won't happen.
> > 
> > > The general scheme has been working for four or five years - if there
> > > was a fundamental issue, we would have found it by now.
> > > 
> > > The scheme isn't complicated.
> > 
> > Conceptually, it is complicated just because you're using the LRU.
> 
> Well, I'm willing to look at other ideas. I actually selected LRU
> because it was simple. Prior to this, we did have a try at just
> iterating over the pages of frozen processes, but it didn't yield enough
> pages to be viable. I wouldn't be surprised if hunting down the cause of
> these changing pages leads to doing the opposite - starting with LRU
> pages and then removing all the ones owned by processes. (Am I right in
> thinking the remainder would be anonymous pages? I must learn more mm
> innards :>).

I think we can discuss that, and the other things too.  I'm open to
cooperation. :-)

Greetings,
Rafael
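
The check Nigel mentions in the quoted text ("code to check that the LRU
isn't changing") is not shown anywhere in this thread. As a rough
illustration only, one way such a check could work is to checksum every
saved page once and compare again later. All names below (DEMO_PAGE_SIZE,
page_checksum, lru_record, lru_count_changed) are hypothetical, the pages
are modelled as one flat userspace buffer, and this is not suspend2's
actual code:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096	/* assumed page size, for illustration */

/* FNV-1a hash over one page; any simple checksum would do here. */
uint32_t page_checksum(const unsigned char *page)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < DEMO_PAGE_SIZE; i++) {
		h ^= page[i];
		h *= 16777619u;
	}
	return h;
}

/* Record one checksum per page right after the contents were saved. */
void lru_record(const unsigned char *base, size_t npages, uint32_t *sums)
{
	for (size_t i = 0; i < npages; i++)
		sums[i] = page_checksum(base + i * DEMO_PAGE_SIZE);
}

/* Later: count the pages whose contents no longer match the record. */
size_t lru_count_changed(const unsigned char *base, size_t npages,
			 const uint32_t *sums)
{
	size_t changed = 0;

	for (size_t i = 0; i < npages; i++)
		if (page_checksum(base + i * DEMO_PAGE_SIZE) != sums[i])
			changed++;
	return changed;
}
```

Note that a check like this only reports a modification after it has
happened. It does not give the guarantee asked for in the quoted
discussion, namely that nothing writes to those pages in the first place.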

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 18:21                               ` Olivier Galibert
@ 2007-04-26 21:30                                 ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-26 21:30 UTC (permalink / raw)
  To: Olivier Galibert, Johannes Berg, Linus Torvalds, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven,
	Rafael J. Wysocki

Hi!

> > #define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)
> 
> So I'm not supposed to be able to suspend the 16Gb-ram, 32bits servers
> I have here?

(You are right, this should have been u64)

The snapshot image is by design limited by the amount of lowmem. If you
want to change that, this unsigned long limit will be the least of your
problems.

(And no, I'd not expect a loaded 6GB box to suspend properly. It will
just realize it does not have enough lowmem and refuse to suspend).
									Pavel
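
To make the "unsigned long" concern concrete: on a 32-bit userspace that
type is 32 bits wide, so a byte count of 4 GiB or more cannot be passed
through it. The sketch below only models that arithmetic with fixed-width
types. The helper names are made up for illustration and are not part of
the snapshot interface:

```c
#include <stdbool.h>
#include <stdint.h>

/* Would this byte count fit in a 32-bit `unsigned long`? */
bool image_size_fits_32bit(uint64_t bytes)
{
	return bytes <= UINT32_MAX;
}

/* What a 32-bit `unsigned long` would hold after silent truncation. */
uint32_t image_size_truncated(uint64_t bytes)
{
	return (uint32_t)bytes;
}
```

A 500 MiB image size fits, while 16 GiB (2^34 bytes) does not and would
truncate to 0. As said above, the lowmem limit is hit long before this
one matters.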
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 21:22                                               ` Rafael J. Wysocki
@ 2007-04-26 22:08                                                 ` Nigel Cunningham
  0 siblings, 0 replies; 712+ messages in thread
From: Nigel Cunningham @ 2007-04-26 22:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar,
	Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith,
	linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton,
	Thomas Gleixner, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 4505 bytes --]

Hi.

On Thu, 2007-04-26 at 23:22 +0200, Rafael J. Wysocki wrote:
> On Thursday, 26 April 2007 22:55, Nigel Cunningham wrote:
> > Hi.
> > 
> > On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote:
> > > On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote:
> > > > Hi.
> > > > 
> > > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote:
> > > > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote:
> > > > > > 
> > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O
> > > > > > > schedulers, for example), but I think for this purpose it should be reviewed
> > > > > > > properly.
> > > > > > 
> > > > > > Yeah, this makes sense.
> > > > > > 
> > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > > > > > > There also is a real problem with how it uses the LRU pages.  It _seems_ to
> > > > > > > work, but at least to me it seems to be potentially dangerous.
> > > > > > 
> > > > > > I am new to suspend2 so can you please explain what exactly is dangerous
> > > > > > about it?
> > > > > 
> > > > > After freezing tasks, it first saves the contents of the LRU pages, freezes
> > > > > devices and then uses the LRU pages for storing the suspend image (if more
> > > > > memory is needed, it's allocated, but that's irrelevant here).  Now, we have no
> > > > > warranty that the LRU pages are not updated after we've saved their contents
> > > > > (first potential problem here).
> > > > > 
> > > > > After the image has been created, we have to unfreeze devices and save the
> > > > > image.  Now, we have no warranty that no one will be writing to the LRU pages
> > > > > that we have used to store the image, for whatever reasons known to him, so the
> > > > > image can potentially get corrupted while it's being saved.
> > > > > 
> > > > > In principle, device drivers can do this and there are some kernel threads that
> > > > > also can do this (we don't freeze them, because they're needed for the image
> > > > > saving).
> > > > > 
> > > > > The design is conceptually really really complicated and it makes strong
> > > > > assumptions about the behavior of different subsystems.  While these
> > > > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction
> > > > > of them in the future if suspend2 were merged.
> > > > 
> > > > That's a good description of the issue, although I think _may_ and
> > > > _seems_ are stating things a bit more pessimistically than is
> > > > necessary. 
> > > 
> > > I've used them to express my personal concerns.
> > > 
> > > > You see, we need to remember that the pages which are saved separately
> > > > are LRU pages. Because userspace is frozen, their contents are going to
> > > > be static. The only possibilities for modifying them come from timer
> > > > routines, improperly frozen filesystems and device drivers.
> > > 
> > > And kernel threads that we don't freeze deliberately.  Currently, these are
> > > all worker threads, dm-related kernel threads and some others.
> > > 
> > > > We have code to check that the LRU isn't changing, and I've only seen
> > > > one report of modifications to about 20 LRU pages. I haven't had the
> > > > time yet to chase down the cause, but hope to do so soon.
> > > 
> > > I didn't say that would be common.  If it had been, you'd have seen problems
> > > with it.  To me the problem is the lack of warranty that it won't happen.
> > > 
> > > > The general scheme has been working for four or five years - if there
> > > > was a fundamental issue, we would have found it by now.
> > > > 
> > > > The scheme isn't complicated.
> > > 
> > > Conceptually, it is complicated just because you're using the LRU.
> > 
> > Well, I'm willing to look at other ideas. I actually selected LRU
> > because it was simple. Prior to this, we did have a try at just
> > iterating over the pages of frozen processes, but it didn't yield enough
> > pages to be viable. I wouldn't be surprised if hunting down the cause of
> > these changing pages leads to doing the opposite - starting with LRU
> > pages and then removing all the ones owned by processes. (Am I right in
> > thinking the remainder would be anonymous pages? I must learn more mm
> > innards :>).
> 
> I think we can discuss that, and the other things too.  I'm open to
> cooperation. :-)

Great!

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-26 19:02                                                     ` Rafael J. Wysocki
@ 2007-04-27  9:41                                                       ` Johannes Berg
  2007-04-27 10:09                                                         ` [linux-pm] " Johannes Berg
                                                                           ` (3 more replies)
  2007-04-27  9:41                                                       ` Johannes Berg
  1 sibling, 4 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27  9:41 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

On Thu, 2007-04-26 at 21:02 +0200, Rafael J. Wysocki wrote:

> Yes.  That's because we want to be able to repeat creating the image
> without closing the fd in some situations.

Oh yeah, I just checked and it's not in fact necessary. I'm just
confused.

> Still, we could use a global var 'platform_hibernation' or something like this,
> I think.  Then, we can do
> 
> #define platform_hibernation	0
> 
> on the architectures that don't need it and make ACPI use it instead of this
> "dynamic linking".

No, because acpi doesn't know at build time whether it can actually do
S4 or not.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-27  9:41                                                       ` Johannes Berg
@ 2007-04-27 10:09                                                         ` Johannes Berg
  2007-04-27 10:09                                                         ` Johannes Berg
                                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 10:09 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Pavel Machek,
	Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel,
	linux-pm, Andrew Morton, Linus Torvalds, Thomas Gleixner,
	Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 415 bytes --]

On Fri, 2007-04-27 at 11:41 +0200, Johannes Berg wrote:

> No, because acpi doesn't know at build time whether it can actually do
> S4 or not.

Actually, you could probably do it by making some weak symbol for it
that only ACPI overrides, and then check in the ACPI code if S4 is
possible, otherwise somehow invoke the old symbol or copy the code or
something. Seems a bit more fragile though.

johannes
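
A minimal sketch of the weak-symbol idea, assuming GCC/Clang attribute
syntax in a plain userspace file. The function names are hypothetical and
only stand in for whatever the common hibernation code would call:

```c
/*
 * Generic default as the common code might provide it.  A strong
 * definition of the same function in another object file (ACPI's,
 * in this scenario) would replace this one at link time.
 */
__attribute__((weak)) int platform_hibernate_enter(void)
{
	return 0;	/* nothing platform-specific to do */
}

/* Common code simply calls the hook, whoever ended up providing it. */
int hibernation_enter(void)
{
	return platform_hibernate_enter();
}
```

The fragile part is visible here: once a strong definition is linked in,
the weak default is no longer reachable, so the overriding code has to
handle the "S4 not possible" case itself.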

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-27  9:41                                                       ` Johannes Berg
                                                                           ` (2 preceding siblings ...)
  2007-04-27 10:18                                                         ` Rafael J. Wysocki
@ 2007-04-27 10:18                                                         ` Rafael J. Wysocki
  2007-04-27 10:19                                                           ` Johannes Berg
  2007-04-27 10:19                                                           ` Johannes Berg
  3 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 10:18 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

On Friday, 27 April 2007 11:41, Johannes Berg wrote:
> On Thu, 2007-04-26 at 21:02 +0200, Rafael J. Wysocki wrote:
> 
> > Yes.  That's because we want to be able to repeat creating the image
> > without closing the fd in some situations.
> 
> Oh yeah, I just checked and it's not in fact necessary. I'm just
> confused.
> 
> > Still, we could use a global var 'platform_hibernation' or something like this,
> > I think.  Then, we can do
> > 
> > #define platform_hibernation	0
> > 
> > on the architectures that don't need it and make ACPI use it instead of this
> > "dynamic linking".
> 
> No, because acpi doesn't know at build time whether it can actually do
> S4 or not.

That's not a problem, I think.

1) We define platform_hibernation if CONFIG_ACPI is set.

2) In the ACPI code we do

if (can do S4)
	platform_hibernation = 1;

3) We have functions arch_platform_prepare()/finish()/enter() that are defined
to be noops for anything but ACPI systems and for ACPI systems they are
defined like this:

int arch_platform_enter(void)
{
	if (!platform_hibernation)
		return 0;

	...
}

I think it should work.

Greetings,
Rafael
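
Steps 1) to 3) above could be put together roughly like this. It is only
a userspace sketch: acpi_can_do_s4(), acpi_hibernation_init(),
s4_supported and platform_enter_calls are hypothetical stand-ins, and the
elided part of arch_platform_enter() is replaced by a counter so that the
effect is observable:

```c
#include <stdbool.h>

/* Step 1: the flag, defined when ACPI is built in; starts out as 0. */
int platform_hibernation;

/* Illustration only: counts how often the platform path was taken. */
int platform_enter_calls;

/* Hypothetical stand-in for what the firmware reports at run time. */
bool s4_supported;

bool acpi_can_do_s4(void)
{
	return s4_supported;
}

/* Step 2: ACPI decides at run time, not at build time. */
void acpi_hibernation_init(void)
{
	if (acpi_can_do_s4())
		platform_hibernation = 1;
}

/* Step 3: the hook is a no-op unless the flag has been set. */
int arch_platform_enter(void)
{
	if (!platform_hibernation)
		return 0;

	platform_enter_calls++;	/* stands in for the elided "..." body */
	return 0;
}
```

With the flag left at 0 the hook returns immediately, which is the
behavior wanted for systems that turn out not to support S4.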

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-27 10:18                                                         ` Rafael J. Wysocki
  2007-04-27 10:19                                                           ` Johannes Berg
@ 2007-04-27 10:19                                                           ` Johannes Berg
  2007-04-27 12:09                                                             ` Rafael J. Wysocki
  2007-04-27 12:09                                                             ` Rafael J. Wysocki
  1 sibling, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 10:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

[-- Attachment #1: Type: text/plain, Size: 1248 bytes --]

On Fri, 2007-04-27 at 12:18 +0200, Rafael J. Wysocki wrote:

> 1) We define platform_hibernation if CONFIG_ACPI is set.

Let's just define it always then in the common code so we don't have
even more magic bits platforms need to define even if they don't care at
all. And please don't put #ifdef CONFIG_ACPI into the common code ;)
Maybe #ifdef CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS or something.

> 2) In the ACPI code we do
> 
> if (can do S4)
> 	platform_hibernation = 1;

Gotcha.

> 3) We have functions arch_platform_prepare()/finish()/enter() that are defined
> to be noops for anything but ACPI systems and for ACPI systems they are
> defined like this:
> 
> int arch_platform_enter(void)
> {
> 	if (!platform_hibernation)
> 		return 0;
> 
> 	...
> }
> 
> I think it should work.

You could reduce code churn in all other platforms by making these weak
symbols like the irq hooks I did for pm_ops. It looks like it can work
and possibly is even less intrusive than my hibernate_ops patch. Though
then again my hibernate_ops patch removed a lot of stuff that is now no
longer necessary, and also completely removed the PM_SUSPEND_DISK foo...
we probably want that regardless of how we invoke ACPI.

johannes
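
One possible reading of the "define it always in the common code"
suggestion, as a sketch. CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS is the name
floated above and does not exist anywhere; hibernation_uses_platform()
is a made-up helper:

```c
/* Common header: every platform sees the name, only some pay for it. */
#ifdef CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS
extern int platform_hibernation;	/* set at run time by platform code */
#else
#define platform_hibernation 0		/* constant, dead branches vanish */
#endif

/* Common code can test the flag without any platform-specific #ifdef. */
int hibernation_uses_platform(void)
{
	if (platform_hibernation)
		return 1;
	return 0;
}
```

Platforms that don't select the option need no extra definitions, and
the common code stays free of #ifdef CONFIG_ACPI.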

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* driver power operations (was Re: suspend2 merge)
  2007-04-25 20:31                                           ` Pavel Machek
  2007-04-27 10:21                                             ` driver power operations (was Re: suspend2 merge) Johannes Berg
@ 2007-04-27 10:21                                             ` Johannes Berg
  2007-04-27 12:06                                               ` Rafael J. Wysocki
                                                                 ` (6 more replies)
  1 sibling, 7 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 10:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

[-- Attachment #1: Type: text/plain, Size: 2996 bytes --]

On Wed, 2007-04-25 at 22:31 +0200, Pavel Machek wrote:

> > So, the "suspend" and "resume" for the functions being called for that are
> > wrong, but then we call them with PMSG_FREEZE. ;-)  Still, we could add
> > .freeze() and .thaw() callbacks for hibernation just fine.  This wouldn't even
> > be that difficult ...
> 
> It would be ugly big patch I'm afraid.

It'd be a lot of code churn, but well worth it. And most of the changes
would be trivial too. You need to start looking beyond "this is ugly in
the short term" and realise that it's much more maintainable in the long
term if driver writers know what they're supposed to do as opposed to
just hacking at it until it mostly works or just doing a full device
down/up cycle including resetting full driver state.

Look at it now:

 * FREEZE       Quiesce operations so that a consistent image can be saved;
 *              but do NOT otherwise enter a low power device state, and do
 *              NOT emit system wakeup events.
 *
 * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
 *              the system from a snapshot taken after an earlier FREEZE.
 *              Some drivers will need to reset their hardware state instead
 *              of preserving it, to ensure that it's never mistaken for the
 *              state which that earlier snapshot had set up.

Why is prethaw even necessary? As far as I can tell it's only necessary
because resume() can't tell you whether you just want to thaw or need to
reset since it doesn't tell you at what point it's invoked.

Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
better name?) that are called at the appropriate places (with
freeze/thaw around preparing the image and freeze/restart around
restoring) would go a long way toward clearing up the confusion in all
the drivers. Of course, it'd have to be documented that freeze/thaw isn't
the only valid combination but that freeze/restart is used too, but
that's not hard to do nor hard to understand.

And, incidentally, it could possibly make both suspend and hibernate
work much faster too. The comments there talk about "minimally power
management aware" drivers which always do the wrong thing for suspend,
in that they always reset everything... Of course, some drivers will
actually need to do that, but if freeze/suspend and thaw/restart/resume
have the same prototypes (probably just int <function>(void)) then
drivers can trivially assign the same function to each of them. And
hibernate would benefit since a lot of drivers could do a lot less
work for freeze/thaw.

Or, if we don't want to have five calls and use 40 bytes (on 64-bit)
just for these callback pointers for each device we could just as well
have a single callback ->pm(what) and make "what" indicate which one of
these five things... But then drivers can't make that code depend on the
swsusp configuration which would be doable with five callbacks.
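
To make the five-callback idea a bit more concrete, here is a minimal
userspace sketch. Everything in it is hypothetical: struct device_sketch,
struct pm_callbacks_sketch, the dumb_*/smart_* functions and the
snapshot_cycle()/restore_cycle() helpers are illustration names, not
existing kernel interfaces, and the callbacks take a device pointer
(rather than void) only so the example can track state.

```c
#include <assert.h>

/* Hypothetical stand-in for a device; only tracks what the callbacks did. */
struct device_sketch {
	int quiesced;	/* 1 while I/O is stopped */
	int resets;	/* number of full hardware re-initialisations */
};

/* One prototype for all five operations (an assumption of this sketch). */
typedef int (*pm_cb_t)(struct device_sketch *dev);

struct pm_callbacks_sketch {
	pm_cb_t suspend;
	pm_cb_t resume;
	pm_cb_t freeze;
	pm_cb_t thaw;
	pm_cb_t restart;
};

/* Five pointers per table: 40 bytes where a function pointer is 8 bytes. */
static_assert(sizeof(struct pm_callbacks_sketch) == 5 * sizeof(pm_cb_t),
	      "five callback pointers, no padding");

/* "Minimally PM aware" driver: one full down/up pair reused everywhere. */
static int dumb_down(struct device_sketch *dev)
{
	dev->quiesced = 1;
	return 0;
}

static int dumb_up(struct device_sketch *dev)
{
	dev->resets++;		/* always re-initialises the hardware */
	dev->quiesced = 0;
	return 0;
}

static const struct pm_callbacks_sketch dumb_ops = {
	.suspend = dumb_down, .freeze = dumb_down,
	.resume = dumb_up, .thaw = dumb_up, .restart = dumb_up,
};

/* Smarter driver: thaw only un-quiesces, restart still resets. */
static int smart_thaw(struct device_sketch *dev)
{
	dev->quiesced = 0;
	return 0;
}

static const struct pm_callbacks_sketch smart_ops = {
	.suspend = dumb_down, .freeze = dumb_down,
	.resume = dumb_up, .thaw = smart_thaw, .restart = dumb_up,
};

/* Creating the image: freeze, (snapshot), thaw. */
static int snapshot_cycle(const struct pm_callbacks_sketch *ops,
			  struct device_sketch *dev)
{
	int err = ops->freeze(dev);

	if (err)
		return err;
	/* the image would be created here */
	return ops->thaw(dev);
}

/* Restoring the image: freeze, (load image), restart. */
static int restore_cycle(const struct pm_callbacks_sketch *ops,
			 struct device_sketch *dev)
{
	int err = ops->freeze(dev);

	if (err)
		return err;
	/* the saved image would be loaded here */
	return ops->restart(dev);
}
```

With smart_ops a snapshot cycle never resets the hardware, while dumb_ops
pays for a reset every time; that difference is the speed-up I mean.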

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 10:21                                             ` Johannes Berg
@ 2007-04-27 12:06                                               ` Rafael J. Wysocki
  2007-04-27 12:40                                                 ` Pavel Machek
  2007-04-27 12:40                                                 ` Pavel Machek
  2007-04-27 12:06                                               ` Rafael J. Wysocki
                                                                 ` (5 subsequent siblings)
  6 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 12:06 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

On Friday, 27 April 2007 12:21, Johannes Berg wrote:
> On Wed, 2007-04-25 at 22:31 +0200, Pavel Machek wrote:
> 
> > > So, the "suspend" and "resume" for the functions being called for that are
> > > wrong, but then we call them with PMSG_FREEZE. ;-)  Still, we could add
> > > .freeze() and .thaw() callbacks for hibernation just fine.  This wouldn't even
> > > be that difficult ...
> > 
> > It would be ugly big patch I'm afraid.
> 
> It'd be a lot of code churn, but well worth it. And most of the changes
> would be trivial too. You need to start looking beyond "this is ugly in
> the short term" and realise that it's much more maintainable in the long
> term if driver writers know what they're supposed to do as opposed to
> just hacking at it until it mostly works or just doing a full device
> down/up cycle including resetting full driver state.
> 
> Look at it now:
> 
>  * FREEZE       Quiesce operations so that a consistent image can be saved;
>  *              but do NOT otherwise enter a low power device state, and do
>  *              NOT emit system wakeup events.
>  *
>  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
>  *              the system from a snapshot taken after an earlier FREEZE.
>  *              Some drivers will need to reset their hardware state instead
>  *              of preserving it, to ensure that it's never mistaken for the
>  *              state which that earlier snapshot had set up.
> 
> Why is prethaw even necessary? As far as I can tell it's only necessary
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.
> 
> Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
> better name?) that are called at the appropriate places (with
> freeze/thaw around preparing the image and freeze/restart around
> restoring it) would go a long way toward clearing up the confusion in all
> the drivers. Of course, it'd have to be documented that freeze/thaw isn't
> the only valid combination and that freeze/restart is used too, but
> that's neither hard to do nor hard to understand.
> 
> And, incidentally, it could possibly make both suspend and hibernate
> work much faster too. The comments there talk about "minimally power
> management aware" drivers which always do the wrong thing for suspend,
> in that they always reset everything... Of course, some drivers will
> actually need to do that, but if freeze/suspend and thaw/restart/resume
> have the same prototypes (probably just int <function>(void)) then
> drivers can trivially assign the same function to each of them. And
> hibernate would benefit since a lot of drivers could do a lot less
> work for freeze/thaw.

I violently agree with all of the above.

Moreover, hibernation has two special cases that are of no interest
for suspend:
1) drivers compiled as modules and not loaded before we restore the image
2) drivers that need to allocate a lot of memory in .freeze()

> Or, if we don't want to have five calls and use 40 bytes (on 64-bit)
> just for these callback pointers for each device we could just as well
> have a single callback ->pm(what) and make "what" indicate which one of
> these five things... But then drivers can't make that code depend on the
> swsusp configuration which would be doable with five callbacks.

Five callbacks are fine by me, especially if we can define reasonable defaults
for the hibernation (and can we?).

Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-27 12:09                                                             ` Rafael J. Wysocki
@ 2007-04-27 12:07                                                               ` Johannes Berg
  2007-04-27 12:07                                                               ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 12:07 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

On Fri, 2007-04-27 at 14:09 +0200, Rafael J. Wysocki wrote:

> Yes.  Still, I'd like to rework your patch to deal with ACPI without
> introducing hibernate_ops.  I'm going to do this later today if you don't
> mind. :-)

Not at all :) That's why I actually sent it out instead of just saying
"well, I give up, it breaks user.c"

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-27 10:19                                                           ` Johannes Berg
@ 2007-04-27 12:09                                                             ` Rafael J. Wysocki
  2007-04-27 12:07                                                               ` Johannes Berg
  2007-04-27 12:07                                                               ` Johannes Berg
  2007-04-27 12:09                                                             ` Rafael J. Wysocki
  1 sibling, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 12:09 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin,
	suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven,
	linux-pm

On Friday, 27 April 2007 12:19, Johannes Berg wrote:
> On Fri, 2007-04-27 at 12:18 +0200, Rafael J. Wysocki wrote:
> 
> > 1) We define platform_hibernation if CONFIG_ACPI is set.
> 
> Let's just define it always then in the common code so we don't have
> even more magic bits platforms need to define even if they don't care at
> all. And please don't put #ifdef CONFIG_ACPI into the common code ;)
> Maybe #ifdef CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS or something.
> 
> > 2) In the ACPI code we do
> > 
> > if (can do S4)
> > 	platform_hibernation = 1;
> 
> Gotcha.
> 
> > 3) We have functions arch_platform_prepare()/finish()/enter() that are defined
> > to be noops for anything but ACPI systems and for ACPI systems they are
> > defined like this:
> > 
> > int arch_platform_enter(void)
> > {
> > 	if (!platform_hibernation)
> > 		return 0;
> > 
> > 	...
> > }
> > 
> > I think it should work.
> 
> You could reduce code churn in all other platforms by making these weak
> symbols like the irq hooks I did for pm_ops. It looks like it can work
> and possibly is even less intrusive than my hibernate_ops patch. Though
> then again my hibernate_ops patch removed a lot of stuff that is now no
> longer necessary, and also completely removed the PM_SUSPEND_DISK foo...
> we probably want that regardless of how we invoke ACPI.

Yes.  Still, I'd like to rework your patch to deal with ACPI without
introducing hibernate_ops.  I'm going to do this later today if you don't
mind. :-)
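
For reference, the weak-symbol variant suggested in the quoted text might
look roughly like the sketch below. It is only an illustration under
several assumptions: the return types of arch_platform_prepare() and
arch_platform_finish() and the hibernate_platform_sketch() caller are made
up here, and the GCC/Clang attribute is spelled out where the kernel would
use its __weak shorthand.

```c
#include <assert.h>

/* Set to 1 by platform code (e.g. ACPI when S4 is available); 0 otherwise. */
int platform_hibernation;

/*
 * Weak no-op defaults in common code. A platform that cares provides
 * strong definitions with the same names in its own file, and the linker
 * picks those instead; nobody else has to define anything.
 */
__attribute__((weak)) int arch_platform_prepare(void)
{
	return 0;
}

__attribute__((weak)) int arch_platform_enter(void)
{
	return 0;
}

__attribute__((weak)) void arch_platform_finish(void)
{
}

/* Hypothetical caller in the common hibernation code. */
int hibernate_platform_sketch(void)
{
	int err;

	/* The flag is only ever 0 or 1 in this proposal. */
	assert(platform_hibernation == 0 || platform_hibernation == 1);

	err = arch_platform_prepare();
	if (err)
		return err;

	err = arch_platform_enter();
	arch_platform_finish();
	return err;
}
```

The ACPI versions would start with the "if (!platform_hibernation) return
0;" check from point 3 above and do the real work after it.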

Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: suspend2 merge
  2007-04-25 19:38                                     ` Linus Torvalds
                                                         ` (2 preceding siblings ...)
  2007-04-25 20:23                                       ` Adrian Bunk
@ 2007-04-27 12:36                                       ` Martin Steigerwald
  3 siblings, 0 replies; 712+ messages in thread
From: Martin Steigerwald @ 2007-04-27 12:36 UTC (permalink / raw)
  To: suspend2-devel
  Cc: Linus Torvalds, Adrian Bunk, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek,
	Ingo Molnar, Andrew Morton, Arjan van de Ven

Am Mittwoch 25 April 2007 schrieb Linus Torvalds:

> And that's a *fundamental* problem. If the STD people cannot even
> realize that they have less to do with "suspend" than to "reboot", how
> do you ever expect them to get anything to work, and not affect other
> things negatively?
>
> Yeah, I'm down on it. I'm down on it because every person involved with
> the whole STD thing seems to have basically zero taste, and a total
> inability to work with anybody else.

Hello Linus!

I am no kernel developer. But I understand what you are trying to say
here.

I agree that suspend to RAM and snapshotting should be handled differently
by drivers. And unlike schedulers, whether I/O or process related, I think
it should be quite easy to settle on *one* implementation for each
feature. At least it doesn't look as difficult to me as deciding on a
scheduler that works for all the different workloads.

I do not believe that the reasons that have prevented this so far are
purely technical.

I think snapshotting is a very important feature. I would patch it into my 
kernels if it was removed. But then I am using suspend2 anyway.

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 12:06                                               ` Rafael J. Wysocki
  2007-04-27 12:40                                                 ` Pavel Machek
@ 2007-04-27 12:40                                                 ` Pavel Machek
  2007-04-27 12:46                                                   ` Johannes Berg
  2007-04-27 12:46                                                   ` Johannes Berg
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-27 12:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

Hi!

> > And, incidentally, it could possibly make both suspend and hibernate
> > work much faster too. The comments there talk about "minimally power
> > management aware" drivers which always do the wrong thing for suspend,
> > in that they always reset everything... Of course, some drivers will
> > actually need to do that, but if freeze/suspend and thaw/restart/resume
> > have the same prototypes (probably just int <function>(void)) then
> > drivers can trivially assign the same function to each of them. And
> > hibernate would benefit since a lot of drivers could do a lot less
> > work for freeze/thaw.
> 
> I violently agree with all of the above.
> 
> Moreover, hibernation has two special cases that are of no interest
> for suspend:
> 1) drivers compiled as modules and not loaded before we restore the image
> 2) drivers that need to allocate a lot of memory in .freeze()
> 
> > Or, if we don't want to have five calls and use 40 bytes (on 64-bit)
> > just for these callback pointers for each device we could just as well
> > have a single callback ->pm(what) and make "what" indicate which one of
> > these five things... But then drivers can't make that code depend on the
> > swsusp configuration which would be doable with five callbacks.
> 
> Five callbacks are fine by me, especially if we can define reasonable defaults
> for the hibernation (and can we?).

Well, we can still default to suspend(PMSG_FREEZE) for freeze(), and
resume() for thaw(). Anything else is just not a sane way forward.
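
Roughly, the fallback could look like the sketch below. The types and
names (struct dev_sketch, struct driver_sketch, do_freeze(), do_thaw(),
enum pm_event_sketch) are stand-ins invented for illustration and are not
the kernel's definitions; only the idea of falling back to the legacy
suspend()/resume() pair is what I mean.

```c
#include <assert.h>

/* Stand-in for the pm_message_t event; not the kernel's type. */
enum pm_event_sketch { EVENT_SUSPEND, EVENT_FREEZE };

/* Hypothetical device that records what was done to it. */
struct dev_sketch {
	int last_event;	/* event passed to the last suspend() call */
	int resumed;	/* number of resume() calls */
};

struct driver_sketch {
	/* legacy callbacks */
	int (*suspend)(struct dev_sketch *dev, enum pm_event_sketch event);
	int (*resume)(struct dev_sketch *dev);
	/* proposed callbacks; left unset by unconverted drivers */
	int (*freeze)(struct dev_sketch *dev);
	int (*thaw)(struct dev_sketch *dev);
};

/* Use ->freeze() if present, else fall back to suspend(FREEZE). */
static int do_freeze(const struct driver_sketch *drv, struct dev_sketch *dev)
{
	assert(drv && dev);
	if (drv->freeze)
		return drv->freeze(dev);
	if (drv->suspend)
		return drv->suspend(dev, EVENT_FREEZE);
	return 0;
}

/* Use ->thaw() if present, else fall back to resume(). */
static int do_thaw(const struct driver_sketch *drv, struct dev_sketch *dev)
{
	assert(drv && dev);
	if (drv->thaw)
		return drv->thaw(dev);
	if (drv->resume)
		return drv->resume(dev);
	return 0;
}

/* An unconverted driver that only knows suspend()/resume(). */
static int legacy_suspend(struct dev_sketch *dev, enum pm_event_sketch event)
{
	dev->last_event = event;
	return 0;
}

static int legacy_resume(struct dev_sketch *dev)
{
	dev->resumed++;
	return 0;
}

static const struct driver_sketch legacy_driver = {
	.suspend = legacy_suspend,
	.resume = legacy_resume,
};
```

Unconverted drivers keep working as they do today, and converted ones get
the new callbacks.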
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 12:40                                                 ` Pavel Machek
@ 2007-04-27 12:46                                                   ` Johannes Berg
  2007-04-27 12:50                                                       ` Pavel Machek
  2007-04-27 12:46                                                   ` Johannes Berg
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 12:46 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

On Fri, 2007-04-27 at 14:40 +0200, Pavel Machek wrote:

> > Five callbacks are fine by me, especially if we can define reasonable defaults
> > for the hibernation (and can we?).
> 
> Well, we can still default to suspend(PMSG_FREEZE) for freeze(), and
> resume() for thaw(). Anything else is just not a sane way forward.

I think we should remove the argument to suspend() in the same patch
series. Yes, that would mean porting all drivers that currently use it,
but that's not actually all that many since most drivers are dumbed-down
wrt. power management.

And realistically, resume for thaw makes no sense, nor does suspend for
freeze, so we probably want to change those over to suspend/restart and
use them, or something.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 12:46                                                   ` Johannes Berg
@ 2007-04-27 12:50                                                       ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-27 12:50 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

Hi!

> > > Five callbacks are fine by me, especially if we can define reasonable defaults
> > > for the hibernation (and can we?).
> > 
> > Well, we still can default to suspend(PMSG_FREEZE) for freeze(), and
> > resume() for thaw(). Anything else is just not sane way forward.
> 
> I think we should remove the argument to suspend() in the same patch
> series. Yes, that would mean porting all drivers that currently use
> it,

Well, if you can do it in one patch series, go ahead. But I think such a
massive change all over the kernel will take slightly longer, so I'd
prefer to keep the dummy argument for a while (so it still compiles)
and fix it slowly.

Of course, it is up to the person doing the series. And yes, such
series would be welcome.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
@ 2007-04-27 12:50                                                       ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-04-27 12:50 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nick Piggin, Andrew Morton, Mike Galbraith, linux-kernel,
	Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm,
	Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven

Hi!

> > > Five callbacks are fine by me, especially if we can define reasonable defaults
> > > for the hibernation (and can we?).
> > 
> > Well, we still can default to suspend(PMSG_FREEZE) for freeze(), and
> > resume() for thaw(). Anything else is just not sane way forward.
> 
> I think we should remove the argument to suspend() in the same patch
> series. Yes, that would mean porting all drivers that currently use
> it,

Well, if you can do it in one patch series, go ahead. But I think such a
massive change all over the kernel will take slightly longer, so I'd
prefer to keep the dummy argument for a while (so it still compiles)
and fix it slowly.

Of course, it is up to the person doing the series. And yes, such
series would be welcome.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 10:21                                             ` Johannes Berg
@ 2007-04-27 14:34                                                 ` Alan Stern
  2007-04-27 12:06                                               ` Rafael J. Wysocki
                                                                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-04-27 14:34 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Pavel Machek, Nick Piggin, Andrew Morton, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk,
	suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds,
	Ingo Molnar, Arjan van de Ven

On Fri, 27 Apr 2007, Johannes Berg wrote:

> Look at it now:
> 
>  * FREEZE       Quiesce operations so that a consistent image can be saved;
>  *              but do NOT otherwise enter a low power device state, and do
>  *              NOT emit system wakeup events.
>  *
>  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
>  *              the system from a snapshot taken after an earlier FREEZE.
>  *              Some drivers will need to reset their hardware state instead
>  *              of preserving it, to ensure that it's never mistaken for the
>  *              state which that earlier snapshot had set up.
> 
> Why is prethaw even necessary? As far as I can tell it's only necessary
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.

I think you're wrong here.  It's a little hard to say because the 
terminology is confusing and not yet standardized.

For the sake of argument, let's call the stages of STD and STR by these 
names (also noted are the current PMSG values):

	Suspend to disk:
	"prepare to create snapshot" (= FREEZE)
	"continue after snapshot" (= RESUME)

	Resume from disk:
	"prepare to restore snapshot" (= PRETHAW)
	"continue after restore" (= RESUME)

	Suspend to RAM:
	"suspend" (= SUSPEND)
	"resume" (= RESUME)

The real reason for adding PRETHAW was that drivers couldn't distinguish
between "continue after restore" and "resume", other than by examining the
device's state -- since the PM core doesn't pass any information to the
resume() method.

I suppose we could have modified the "prepare to create snapshot" code 
instead, but doing so would mean that "continue after snapshot" and 
"continue after restore" would always do the same thing, which is not 
necessarily a good idea.

Anyway, based on this analysis it seems reasonable to have Six (6) method 
pointers.  Suggested names (in the same order as above):

	pre_snapshot()
	post_snapshot()
	pre_restore()
	post_restore()
	suspend()
	resume()

People apparently assume that pre_snapshot() and pre_restore() would 
always do the same thing and hence be redundant.  I'm not so sure; time 
will tell.  Doing it this way certainly is more clear.

Then there's the question of having early_ and late_ versions of some of 
these things (i.e., one called with interrupts enabled, the other with 
interrupts disabled).  I don't know to what extent that would be 
necessary; perhaps each method call should occur in two phases with 
the interrupt-enable status changed in between.  Then the interrupt-enable 
setting could be passed as an argument.

Alan Stern


^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
@ 2007-04-27 14:34                                                 ` Alan Stern
  0 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-04-27 14:34 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk,
	Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds,
	linux-pm, Arjan van de Ven

On Fri, 27 Apr 2007, Johannes Berg wrote:

> Look at it now:
> 
>  * FREEZE       Quiesce operations so that a consistent image can be saved;
>  *              but do NOT otherwise enter a low power device state, and do
>  *              NOT emit system wakeup events.
>  *
>  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
>  *              the system from a snapshot taken after an earlier FREEZE.
>  *              Some drivers will need to reset their hardware state instead
>  *              of preserving it, to ensure that it's never mistaken for the
>  *              state which that earlier snapshot had set up.
> 
> Why is prethaw even necessary? As far as I can tell it's only necessary
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.

I think you're wrong here.  It's a little hard to say because the 
terminology is confusing and not yet standardized.

For the sake of argument, let's call the stages of STD and STR by these 
names (also noted are the current PMSG values):

	Suspend to disk:
	"prepare to create snapshot" (= FREEZE)
	"continue after snapshot" (= RESUME)

	Resume from disk:
	"prepare to restore snapshot" (= PRETHAW)
	"continue after restore" (= RESUME)

	Suspend to RAM:
	"suspend" (= SUSPEND)
	"resume" (= RESUME)

The real reason for adding PRETHAW was that drivers couldn't distinguish
between "continue after restore" and "resume", other than by examining the
device's state -- since the PM core doesn't pass any information to the
resume() method.

I suppose we could have modified the "prepare to create snapshot" code 
instead, but doing so would mean that "continue after snapshot" and 
"continue after restore" would always do the same thing, which is not 
necessarily a good idea.

Anyway, based on this analysis it seems reasonable to have Six (6) method 
pointers.  Suggested names (in the same order as above):

	pre_snapshot()
	post_snapshot()
	pre_restore()
	post_restore()
	suspend()
	resume()

People apparently assume that pre_snapshot() and pre_restore() would 
always do the same thing and hence be redundant.  I'm not so sure; time 
will tell.  Doing it this way certainly is more clear.

Then there's the question of having early_ and late_ versions of some of 
these things (i.e., one called with interrupts enabled, the other with 
interrupts disabled).  I don't know to what extent that would be 
necessary; perhaps each method call should occur in two phases with 
the interrupt-enable status changed in between.  Then the interrupt-enable 
setting could be passed as an argument.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34                                                 ` Alan Stern
  (?)
@ 2007-04-27 14:39                                                 ` Johannes Berg
  2007-04-27 14:49                                                     ` Johannes Berg
  -1 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 14:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Pavel Machek, Nick Piggin, Andrew Morton, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk,
	suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds,
	Ingo Molnar, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Fri, 2007-04-27 at 10:34 -0400, Alan Stern wrote:

> For the sake of argument, let's call the stages of STD and STR by these 
> names (also noted are the current PMSG values):
> 
> 	Suspend to disk:
> 	"prepare to create snapshot" (= FREEZE)
> 	"continue after snapshot" (= RESUME)
> 
> 	Resume from disk:
> 	"prepare to restore snapshot" (= PRETHAW)
> 	"continue after restore" (= RESUME)
> 
> 	Suspend to RAM:
> 	"suspend" (= SUSPEND)
> 	"resume" (= RESUME)
> 
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.

That's pretty much what I said about prethaw though, no? Anyway,

> Anyway, based on this analysis it seems reasonable to have Six (6) method 
> pointers.  Suggested names (in the same order as above):
> 
> 	pre_snapshot()
> 	post_snapshot()
> 	pre_restore()
> 	post_restore()
> 	suspend()
> 	resume()
> 
> People apparently assume that pre_snapshot() and pre_restore() would 
> always do the same thing and hence be redundant.  I'm not so sure; time 
> will tell.  Doing it this way certainly is more clear.

Right. I did assume that pre_snapshot and pre_restore would be
effectively the same since they both have to quiesce the device and
assume not much more. I'm not averse to making it explicit, many drivers
that don't care can just assign the same function.
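
As a rough userspace sketch of that idea (not kernel code): struct
pm_ops_sketch, struct fake_dev and the fake_* functions are hypothetical
names made up for illustration; only the six callback names come from the
list quoted above. A driver that sees no difference between the two
"prepare" stages points both at one function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
	int quiesced;	/* 1 while the device is stopped */
	int low_power;	/* 1 while the device is in a low-power state */
};

/* Six explicit callbacks, one per stage named in the thread. */
struct pm_ops_sketch {
	int (*pre_snapshot)(struct fake_dev *dev);
	int (*post_snapshot)(struct fake_dev *dev);
	int (*pre_restore)(struct fake_dev *dev);
	int (*post_restore)(struct fake_dev *dev);
	int (*suspend)(struct fake_dev *dev);
	int (*resume)(struct fake_dev *dev);
};

/* Stop the device without entering a low-power state. */
static int fake_quiesce(struct fake_dev *dev)
{
	dev->quiesced = 1;
	return 0;
}

/* Bring the device back to normal operation. */
static int fake_activate(struct fake_dev *dev)
{
	dev->quiesced = 0;
	dev->low_power = 0;
	return 0;
}

/* Stop the device and put it into a low-power state. */
static int fake_suspend(struct fake_dev *dev)
{
	dev->quiesced = 1;
	dev->low_power = 1;
	return 0;
}

/* pre_snapshot and pre_restore share one function in this driver. */
static const struct pm_ops_sketch fake_ops = {
	.pre_snapshot	= fake_quiesce,
	.post_snapshot	= fake_activate,
	.pre_restore	= fake_quiesce,
	.post_restore	= fake_activate,
	.suspend	= fake_suspend,
	.resume		= fake_activate,
};
```

A driver that later finds it does need different behaviour before a
restore would only have to swap in a second function for .pre_restore.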

> Then there's the question of having early_ and late_ versions of some of 
> these things (i.e., one called with interrupts enabled, the other with 
> interrupts disabled).  I don't know to what extent that would be 
> necessary; perhaps each method call should occur in two phases with 
> the interrupt-enable status changed in between.  Then the interrupt-enable 
> setting could be passed as an argument.

Good point. Though if we go for passing the interrupt-enable setting as
an argument then many drivers will have the same
"if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
strictly necessary.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34                                                 ` Alan Stern
  (?)
  (?)
@ 2007-04-27 14:39                                                 ` Johannes Berg
  -1 siblings, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 14:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk,
	Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds,
	linux-pm, Arjan van de Ven


[-- Attachment #1.1: Type: text/plain, Size: 2168 bytes --]

On Fri, 2007-04-27 at 10:34 -0400, Alan Stern wrote:

> For the sake of argument, let's call the stages of STD and STR by these 
> names (also noted are the current PMSG values):
> 
> 	Suspend to disk:
> 	"prepare to create snapshot" (= FREEZE)
> 	"continue after snapshot" (= RESUME)
> 
> 	Resume from disk:
> 	"prepare to restore snapshot" (= PRETHAW)
> 	"continue after restore" (= RESUME)
> 
> 	Suspend to RAM:
> 	"suspend" (= SUSPEND)
> 	"resume" (= RESUME)
> 
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.

That's pretty much what I said about prethaw though, no? Anyway,

> Anyway, based on this analysis it seems reasonable to have Six (6) method 
> pointers.  Suggested names (in the same order as above):
> 
> 	pre_snapshot()
> 	post_snapshot()
> 	pre_restore()
> 	post_restore()
> 	suspend()
> 	resume()
> 
> People apparently assume that pre_snapshot() and pre_restore() would 
> always do the same thing and hence be redundant.  I'm not so sure; time 
> will tell.  Doing it this way certainly is more clear.

Right. I did assume that pre_snapshot and pre_restore would be
effectively the same since they both have to quiesce the device and
assume not much more. I'm not averse to making it explicit, many drivers
that don't care can just assign the same function.

> Then there's the question of having early_ and late_ versions of some of 
> these things (i.e., one called with interrupts enabled, the other with 
> interrupts disabled).  I don't know to what extent that would be 
> necessary; perhaps each method call should occur in two phases with 
> the interrupt-enable status changed in between.  Then the interrupt-enable 
> setting could be passed as an argument.

Good point. Though if we go for passing the interrupt-enable setting as
an argument then many drivers will have the same
"if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
strictly necessary.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:39                                                 ` [linux-pm] " Johannes Berg
@ 2007-04-27 14:49                                                     ` Johannes Berg
  0 siblings, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 14:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk,
	Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds,
	linux-pm, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:

> Good point. Though if we go for passing the interrupt-enable setting as
> an argument then many drivers will have the same
> "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> strictly necessary.

Eh, the point I actually wanted to make is that many drivers don't care
for the irqs disabled case and would have to add code to exclude it.
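
To make the concern concrete, here is a small userspace sketch, not kernel
code. The bool irqs_off parameter, call_two_phase() and struct fake_dev are
all hypothetical; they only model the "two phases with the interrupt-enable
setting passed as an argument" idea discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; it only counts real quiesce work. */
struct fake_dev {
	int quiesce_calls;
};

/*
 * A driver that has nothing to do with interrupts disabled still has to
 * filter that phase out by hand.
 */
static int fake_pre_snapshot(struct fake_dev *dev, bool irqs_off)
{
	if (irqs_off)
		return 0;	/* the boilerplate every such driver would repeat */

	dev->quiesce_calls++;
	return 0;
}

/* Hypothetical core helper: call the same method once per phase. */
static int call_two_phase(int (*cb)(struct fake_dev *, bool),
			  struct fake_dev *dev)
{
	int err;

	assert(cb != NULL && dev != NULL);

	err = cb(dev, false);	/* phase 1: interrupts still enabled */
	if (err)
		return err;

	/* a real core would disable interrupts at this point */
	return cb(dev, true);	/* phase 2: interrupts disabled */
}
```

With separate early/late method pointers instead, such a driver would
simply leave the interrupts-disabled pointer unset.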

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
@ 2007-04-27 14:49                                                     ` Johannes Berg
  0 siblings, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 14:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nick Piggin, Pavel Machek, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton,
	suspend2-devel, linux-pm, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven


[-- Attachment #1.1: Type: text/plain, Size: 439 bytes --]

On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:

> Good point. Though if we go for passing the interrupt-enable setting as
> an argument then many drivers will have the same
> "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> strictly necessary.

Eh, the point I actually wanted to make is that many drivers don't care
for the irqs disabled case and would have to add code to exclude it.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34                                                 ` Alan Stern
                                                                   ` (2 preceding siblings ...)
  (?)
@ 2007-04-27 15:12                                                 ` Rafael J. Wysocki
  2007-04-27 15:24                                                   ` Johannes Berg
  2007-04-27 15:24                                                   ` Johannes Berg
  -1 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:12 UTC (permalink / raw)
  To: linux-pm
  Cc: Alan Stern, Johannes Berg, Nick Piggin, Ingo Molnar,
	suspend2-devel, Mike Galbraith, Kernel development list,
	Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek,
	Andrew Morton, Linus Torvalds, Arjan van de Ven

On Friday, 27 April 2007 16:34, Alan Stern wrote:
> On Fri, 27 Apr 2007, Johannes Berg wrote:
> 
> > Look at it now:
> > 
> >  * FREEZE       Quiesce operations so that a consistent image can be saved;
> >  *              but do NOT otherwise enter a low power device state, and do
> >  *              NOT emit system wakeup events.
> >  *
> >  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
> >  *              the system from a snapshot taken after an earlier FREEZE.
> >  *              Some drivers will need to reset their hardware state instead
> >  *              of preserving it, to ensure that it's never mistaken for the
> >  *              state which that earlier snapshot had set up.
> > 
> > Why is prethaw even necessary? As far as I can tell it's only necessary
> > because resume() can't tell you whether you just want to thaw or need to
> > reset since it doesn't tell you at what point it's invoked.
> 
> I think you're wrong here.  It's a little hard to say because the 
> terminology is confusing and not yet standardized.
> 
> For the sake of argument, let's call the stages of STD and STR by these 
> > names (also noted are the current PMSG values):
> 
> 	Suspend to disk:
> 	"prepare to create snapshot" (= FREEZE)
> 	"continue after snapshot" (= RESUME)
> 
> 	Resume from disk:
> 	"prepare to restore snapshot" (= PRETHAW)
> 	"continue after restore" (= RESUME)
> 
> 	Suspend to RAM:
> 	"suspend" (= SUSPEND)
> 	"resume" (= RESUME)
> 
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.
> 
> I suppose we could have modified the "prepare to create snapshot" code 
> instead, but doing so would mean that "continue after snapshot" and 
> "continue after restore" would always do the same thing, which is not 
> necessarily a good idea.
> 
> Anyway, based on this analysis it seems reasonable to have Six (6) method 
> pointers.  Suggested names (in the same order as above):
> 
> > 	pre_snapshot()
> 	post_snapshot()
> 	pre_restore()
> 	post_restore()
> 	suspend()
> 	resume()
> 
> People apparently assume that pre_snapshot() and pre_restore() would 
> always do the same thing and hence be redundant.  I'm not so sure; time 
> will tell.  Doing it this way certainly is more clear.

How do we differentiate between post_snapshot() and post_restore()?
I mean, after the restore we're entering the same code path as after the
snapshot, so do we use a global var for this purpose?

Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34                                                 ` Alan Stern
                                                                   ` (3 preceding siblings ...)
  (?)
@ 2007-04-27 15:12                                                 ` Rafael J. Wysocki
  -1 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:12 UTC (permalink / raw)
  To: linux-pm
  Cc: Nick Piggin, Thomas Gleixner, Pavel Machek, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton,
	suspend2-devel, Johannes Berg, Linus Torvalds, Ingo Molnar,
	Arjan van de Ven

On Friday, 27 April 2007 16:34, Alan Stern wrote:
> On Fri, 27 Apr 2007, Johannes Berg wrote:
> 
> > Look at it now:
> > 
> >  * FREEZE       Quiesce operations so that a consistent image can be saved;
> >  *              but do NOT otherwise enter a low power device state, and do
> >  *              NOT emit system wakeup events.
> >  *
> >  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
> >  *              the system from a snapshot taken after an earlier FREEZE.
> >  *              Some drivers will need to reset their hardware state instead
> >  *              of preserving it, to ensure that it's never mistaken for the
> >  *              state which that earlier snapshot had set up.
> > 
> > Why is prethaw even necessary? As far as I can tell it's only necessary
> > because resume() can't tell you whether you just want to thaw or need to
> > reset since it doesn't tell you at what point it's invoked.
> 
> I think you're wrong here.  It's a little hard to say because the 
> terminology is confusing and not yet standardized.
> 
> For the sake of argument, let's call the stages of STD and STR by these 
> > names (also noted are the current PMSG values):
> 
> 	Suspend to disk:
> 	"prepare to create snapshot" (= FREEZE)
> 	"continue after snapshot" (= RESUME)
> 
> 	Resume from disk:
> 	"prepare to restore snapshot" (= PRETHAW)
> 	"continue after restore" (= RESUME)
> 
> 	Suspend to RAM:
> 	"suspend" (= SUSPEND)
> 	"resume" (= RESUME)
> 
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.
> 
> I suppose we could have modified the "prepare to create snapshot" code 
> instead, but doing so would mean that "continue after snapshot" and 
> "continue after restore" would always do the same thing, which is not 
> necessarily a good idea.
> 
> Anyway, based on this analysis it seems reasonable to have Six (6) method 
> pointers.  Suggested names (in the same order as above):
> 
> > 	pre_snapshot()
> 	post_snapshot()
> 	pre_restore()
> 	post_restore()
> 	suspend()
> 	resume()
> 
> People apparently assume that pre_snapshot() and pre_restore() would 
> always do the same thing and hence be redundant.  I'm not so sure; time 
> will tell.  Doing it this way certainly is more clear.

How do we differentiate between post_snapshot() and post_restore()?
I mean, after the restore we're entering the same code path as after the
snapshot, so do we use a global var for this purpose?

Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:49                                                     ` Johannes Berg
  (?)
@ 2007-04-27 15:20                                                     ` Rafael J. Wysocki
  2007-04-27 15:27                                                       ` Johannes Berg
                                                                         ` (2 more replies)
  -1 siblings, 3 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:20 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel,
	Mike Galbraith, Kernel development list, Con Kolivas,
	Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton,
	Linus Torvalds, linux-pm, Arjan van de Ven

On Friday, 27 April 2007 16:49, Johannes Berg wrote:
> On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:
> 
> > Good point. Though if we go for passing the interrupt-enable setting as
> > an argument then many drivers will have the same
> > "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> > strictly necessary.
> 
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

I think we can use 'stages' and pass them as arguments to the functions.

In that case we can have two callbacks for hibernation (I'd prefer to say
'hibernation' instead of 'suspend to disk' from now on), one 'quiesce' callback
and one 'activate' callback that can be called many times in one
snapshot/restore cycle with different arguments, for example:

quiesce(PREPARE) -- that may be needed for drivers that allocate a lot of
memory before quiescing devices (if any)
...
quiesce(PRE_SNAPSHOT)
...
quiesce(PRE_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT)
...
activate(FINISH)

etc.
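
A minimal userspace sketch of the stage-argument idea, not kernel code. The
stage names are the ones listed above; enum hib_stage, struct fake_dev and
the fake_* functions are hypothetical, and what each stage does to the fake
device is invented purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stage names from the list above; the enum itself is hypothetical. */
enum hib_stage {
	PREPARE,
	PRE_SNAPSHOT,
	PRE_SNAPSHOT_IRQ_OFF,
	POST_SNAPSHOT_IRQ_OFF,
	POST_SNAPSHOT,
	FINISH,
};

/* Hypothetical stand-in for a device. */
struct fake_dev {
	int buffers_reserved;
	int running;
	int irq_masked;
};

/* One 'quiesce' callback, switched on the stage it is called for. */
static int fake_quiesce(struct fake_dev *dev, enum hib_stage stage)
{
	switch (stage) {
	case PREPARE:
		dev->buffers_reserved = 1;
		return 0;
	case PRE_SNAPSHOT:
		dev->running = 0;
		return 0;
	case PRE_SNAPSHOT_IRQ_OFF:
		dev->irq_masked = 1;
		return 0;
	default:
		return -1;	/* not a quiesce stage */
	}
}

/* One 'activate' callback that undoes the stages in reverse order. */
static int fake_activate(struct fake_dev *dev, enum hib_stage stage)
{
	switch (stage) {
	case POST_SNAPSHOT_IRQ_OFF:
		dev->irq_masked = 0;
		return 0;
	case POST_SNAPSHOT:
		dev->running = 1;
		return 0;
	case FINISH:
		dev->buffers_reserved = 0;
		return 0;
	default:
		return -1;	/* not an activate stage */
	}
}

/* Walk one snapshot cycle in the order listed above. */
static int fake_snapshot_cycle(struct fake_dev *dev)
{
	static const enum hib_stage down[] = {
		PREPARE, PRE_SNAPSHOT, PRE_SNAPSHOT_IRQ_OFF,
	};
	static const enum hib_stage up[] = {
		POST_SNAPSHOT_IRQ_OFF, POST_SNAPSHOT, FINISH,
	};
	size_t i;

	assert(dev != NULL);

	for (i = 0; i < sizeof(down) / sizeof(down[0]); i++)
		if (fake_quiesce(dev, down[i]) != 0)
			return -1;

	/* the snapshot itself would be created here */

	for (i = 0; i < sizeof(up) / sizeof(up[0]); i++)
		if (fake_activate(dev, up[i]) != 0)
			return -1;

	return 0;
}
```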

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 14:49                                                     ` Johannes Berg
  (?)
  (?)
@ 2007-04-27 15:20                                                     ` Rafael J. Wysocki
  -1 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:20 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nick Piggin, Pavel Machek, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton,
	suspend2-devel, linux-pm, Ingo Molnar, Linus Torvalds,
	Thomas Gleixner, Arjan van de Ven

On Friday, 27 April 2007 16:49, Johannes Berg wrote:
> On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:
> 
> > Good point. Though if we go for passing the interrupt-enable setting as
> > an argument then many drivers will have the same
> > "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> > strictly necessary.
> 
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

I think we can use 'stages' and pass them as arguments to the functions.

In that case we can have two callbacks for hibernation (I'd prefer to say
'hibernation' instead of 'suspend to disk' from now on), one 'quiesce' callback
and one 'activate' callback that can be called many times in one
snapshot/restore cycle with different arguments, for example:

quiesce(PREPARE) -- that may be needed for drivers that allocate a lot of
memory before quiescing devices (if any)
...
quiesce(PRE_SNAPSHOT)
...
quiesce(PRE_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT)
...
activate(FINISH)

etc.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:12                                                 ` [linux-pm] " Rafael J. Wysocki
@ 2007-04-27 15:24                                                   ` Johannes Berg
  2007-04-27 15:24                                                   ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 15:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel,
	Mike Galbraith, Kernel development list, Con Kolivas,
	Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton,
	Linus Torvalds, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 432 bytes --]

On Fri, 2007-04-27 at 17:12 +0200, Rafael J. Wysocki wrote:

> How do we differentiate between post_snapshot() and post_restore()?
> I mean, after the restore we're entering the same code path as after the
> snapshot, so do we use a global var for this purpose?

That's pretty easy to do though, we already know at which point we are
so we just put an if(...) invoke_post_snapshot() else
invoke_post_restore().
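
Roughly like this userspace sketch (not kernel code). The "restored" flag
stands for whatever the core already knows about which path it is on; that
flag, invoke_post(), struct post_ops_sketch and struct fake_dev are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; it only counts callback invocations. */
struct fake_dev {
	int snapshots_seen;
	int restores_seen;
};

struct post_ops_sketch {
	void (*post_snapshot)(struct fake_dev *dev);
	void (*post_restore)(struct fake_dev *dev);
};

static void fake_post_snapshot(struct fake_dev *dev)
{
	dev->snapshots_seen++;
}

static void fake_post_restore(struct fake_dev *dev)
{
	dev->restores_seen++;
}

static const struct post_ops_sketch fake_post_ops = {
	.post_snapshot	= fake_post_snapshot,
	.post_restore	= fake_post_restore,
};

/*
 * Both paths continue through the same code, so the core branches on a
 * flag it maintains itself; drivers never have to guess.
 */
static void invoke_post(const struct post_ops_sketch *ops,
			struct fake_dev *dev, int restored)
{
	assert(ops != NULL && dev != NULL);

	if (restored)
		ops->post_restore(dev);
	else
		ops->post_snapshot(dev);
}
```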

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 15:12                                                 ` [linux-pm] " Rafael J. Wysocki
  2007-04-27 15:24                                                   ` Johannes Berg
@ 2007-04-27 15:24                                                   ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 15:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nick Piggin, Thomas Gleixner, Pavel Machek, Mike Galbraith,
	Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton,
	suspend2-devel, linux-pm, Linus Torvalds, Ingo Molnar,
	Arjan van de Ven


[-- Attachment #1.1: Type: text/plain, Size: 432 bytes --]

On Fri, 2007-04-27 at 17:12 +0200, Rafael J. Wysocki wrote:

> How do we differentiate between post_snapshot() and post_restore()?
> I mean, after the restore we're entering the same code path as after the
> snapshot, so do we use a global var for this purpose?

That's pretty easy to do though, we already know at which point we are
so we just put an if(...) invoke_post_snapshot() else
invoke_post_restore().

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:20                                                     ` [linux-pm] " Rafael J. Wysocki
@ 2007-04-27 15:27                                                       ` Johannes Berg
  2007-04-27 15:27                                                       ` Johannes Berg
  2007-04-27 15:52                                                         ` Linus Torvalds
  2 siblings, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-04-27 15:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel,
	Mike Galbraith, Kernel development list, Con Kolivas,
	Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton,
	Linus Torvalds, linux-pm, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Fri, 2007-04-27 at 17:20 +0200, Rafael J. Wysocki wrote:

> I think we can use 'stages' and pass them as arguments to the functions.
> 
> In that case we can have two callbacks for hibernation (I'd prefer to say
> 'hibernation' instead of 'suspend to disk' from now on), one 'quiesce' callback
> and one 'activate' callback that can be called many times in one
> snapshot/restore cycle with different arguments, for example:

But you're not proposing to add suspend/resume to this interface too, I
hope :)

> quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
> before quiescing devices (if any)
> ...
> quiesce(PRE_SNAPSHOT)
> ...
> quiesce(PRE_SNAPSHOT_IRQ_OFF)
> ...
> activate(POST_SNAPSHOT_IRQ_OFF)
> ...
> activate(POST_SNAPSHOT)
> ...
> activate(FINISH)

I'm still not sure I like having to switch on the argument for every
implementation. Is it really worth it?

johannes


* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:49                                                     ` Johannes Berg
@ 2007-04-27 15:41                                                       ` Linus Torvalds
  -1 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:41 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel,
	Mike Galbraith, Kernel development list, Con Kolivas,
	Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton,
	linux-pm, Arjan van de Ven



On Fri, 27 Apr 2007, Johannes Berg wrote:
> 
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

You really *really* want to do a two-phase thing, at least for the case I 
care about. Whether that snapshotting thing does or not, I could care 
less.

There's a damn good reason why the kernel uses 

	/* phase 1 */
	for_each_dev()
		dev->suspend(dev);

	cli();
	/* phase 2 */
	for_each_dev()
		dev->suspend_late(dev);

(and the reverse case on resume).
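
As a rough user-space sketch of that two-phase loop (struct toy_dev, the
toy_* helpers and the two example devices are assumptions made for
illustration; a plain flag stands in for cli()):

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a device; not the kernel's struct device. */
struct toy_dev {
	const char *name;
	int (*suspend)(struct toy_dev *dev);      /* phase 1: may sleep */
	int (*suspend_late)(struct toy_dev *dev); /* phase 2: interrupts off */
};

static int toy_irqs_on = 1;
static char toy_trace[64];

static int disk_suspend(struct toy_dev *dev)
{
	(void)dev;
	strcat(toy_trace, "disk:spin-down "); /* slow, high-level work */
	return toy_irqs_on ? 0 : -1;          /* needs interrupts enabled */
}

static int console_suspend_late(struct toy_dev *dev)
{
	(void)dev;
	strcat(toy_trace, "console:off ");    /* usable until the very end */
	return toy_irqs_on ? -1 : 0;          /* expects interrupts off */
}

static struct toy_dev toy_devs[] = {
	{ .name = "disk",    .suspend = disk_suspend },
	{ .name = "console", .suspend_late = console_suspend_late },
};

#define TOY_NDEVS (sizeof(toy_devs) / sizeof(toy_devs[0]))

static int toy_suspend_all(void)
{
	size_t i;
	int err;

	/* phase 1 */
	for (i = 0; i < TOY_NDEVS; i++) {
		if (!toy_devs[i].suspend)
			continue;
		err = toy_devs[i].suspend(&toy_devs[i]);
		if (err)
			return err;
	}

	toy_irqs_on = 0; /* stands in for cli() */

	/* phase 2 */
	for (i = 0; i < TOY_NDEVS; i++) {
		if (!toy_devs[i].suspend_late)
			continue;
		err = toy_devs[i].suspend_late(&toy_devs[i]);
		if (err)
			return err;
	}
	return 0;
}
```

The disk only fills in the early slot and the console only the late one,
so neither callback has to ask which phase it is in.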

The reason is simply that there are two totally different cases: things 
like disks etc want to spin down and do slow and high-level operations, 
while things like USB controllers and console devices do *not* want to be 
suspended early, because if you do, you lose debuggability.

So some things really *really* want to be done when they know that there 
isn't anything else going on any more, and they want to delay the shutdown 
to the very end. While other things really *require* that they can send 
requests that can take time, and cannot run with interrupts disabled.

I actually think that "snapshot" is totally different, exactly because for 
snapshotting, the slow operations like spinning down disks etc probably 
don't really even exist, and would always be no-ops. But who knows..

Anyway, I do have a final comment: 

     DO NOT PASS "STATE FLAGS" TO DRIVERS 

(or, even worse, assume that drivers would test "implicit" state by 
calling the same function under two different states, and then have the 
drivers test for "are interrupts disabled? Then I need to do something 
else").

If drivers are possibly going to do two different things, make it two 
different entry-points. There are absolutely no downsides. It's _clearer_ to 
the device writer when he gets called two different ways that it's not the 
same case, and in case a particular device can do the same thing for both 
cases, he can just set the function pointer to the same entry for both.
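
A small sketch of that last point, assuming a made-up ops table
(struct pm_demo_ops and simple_quiesce() are hypothetical names): a
device that behaves the same in both cases just puts one function in
both slots.

```c
#include <stddef.h>

/* Hypothetical two-entry-point table, mirroring suspend()/suspend_late(). */
struct pm_demo_ops {
	int (*suspend)(void *dev);
	int (*suspend_late)(void *dev);
};

static int simple_quiesce_calls;

/* A device that does the same thing in both cases. */
static int simple_quiesce(void *dev)
{
	(void)dev;
	simple_quiesce_calls++;
	return 0;
}

/* No flag and no test inside the driver: both slots name the same function. */
static const struct pm_demo_ops simple_ops = {
	.suspend      = simple_quiesce,
	.suspend_late = simple_quiesce,
};
```

The sharing is visible in the static initializer instead of being buried
in a run-time branch.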

Never EVER pass dynamic flags that modify behaviour. It's simply bad 
programming. A function should do *one* thing, and do it well. 

		Linus


* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:20                                                     ` [linux-pm] " Rafael J. Wysocki
@ 2007-04-27 15:52                                                         ` Linus Torvalds
  2007-04-27 15:27                                                       ` Johannes Berg
  2007-04-27 15:52                                                         ` Linus Torvalds
  2 siblings, 0 replies; 712+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Alan Stern, Nick Piggin, Ingo Molnar,
	suspend2-devel, Mike Galbraith, Kernel development list,
	Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek,
	Andrew Morton, linux-pm, Arjan van de Ven



On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> 
> I think we can use 'stages' and pass them as arguments to the functions.

No, no NOOOO!

If you use stages, just describe them in the function name instead.

> quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
> before quiescing devices (if any)
> ...
> quiesce(PRE_SNAPSHOT)
> ...
> quiesce(PRE_SNAPSHOT_IRQ_OFF)

There is *no* advantage to this (and _lots_ of disadvantages) compared to 
saying

	dev->snapshot_prepare(dev);
	dev->snapshot_freeze(dev);
	dev->snapshot(dev);

The latter is
 - more readable
 - MUCH easier for programmers to write readable code for (if-statements 
   and case-statements are *by*definition* more complicated to parse both 
   for humans and for CPUs - static information is good)
 - allows for the different stages to have different arguments, and 
   somewhat related to that, to have better static C type checking.
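
To illustrate the type-checking point, here is a sketch under stated
assumptions: the bytes_needed argument, struct snap_dev and the demo_*
functions are invented for the example, and only the three callback
names come from the lines above.

```c
#include <stddef.h>

/* Toy device state; all names here are hypothetical. */
struct snap_dev {
	size_t reserved_bytes;
	int frozen;
	int saved;
};

/* One slot per stage, each with the signature that stage actually needs. */
struct snap_ops {
	int (*snapshot_prepare)(struct snap_dev *dev, size_t bytes_needed);
	int (*snapshot_freeze)(struct snap_dev *dev);
	int (*snapshot)(struct snap_dev *dev);
};

static int demo_prepare(struct snap_dev *dev, size_t bytes_needed)
{
	dev->reserved_bytes = bytes_needed; /* only this stage takes a size */
	return 0;
}

static int demo_freeze(struct snap_dev *dev)
{
	dev->frozen = 1;
	return 0;
}

static int demo_snapshot(struct snap_dev *dev)
{
	if (!dev->frozen)
		return -1; /* refuse to save state from a live device */
	dev->saved = 1;
	return 0;
}

static const struct snap_ops demo_snap_ops = {
	.snapshot_prepare = demo_prepare,
	.snapshot_freeze  = demo_freeze,
	.snapshot         = demo_snapshot,
};

/* The stage order is spelled out by the calls, not by a flag value. */
static int run_snapshot(const struct snap_ops *ops, struct snap_dev *dev,
			size_t bytes_needed)
{
	int err;

	err = ops->snapshot_prepare(dev, bytes_needed);
	if (err)
		return err;
	err = ops->snapshot_freeze(dev);
	if (err)
		return err;
	return ops->snapshot(dev);
}
```

Putting demo_freeze in the snapshot_prepare slot should draw an
incompatible-pointer-type diagnostic from the compiler, which a single
quiesce(stage) callback could not give you.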

Look here, which one is more readable:

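	/* One entry point, behaviour selected by a run-time flag. */
	int some_mixed_function(int arg)
	{
		do_one_thing();
		if (arg == SLEEP)
			do_another_thing();
		else
			do_yet_another_thing();
		return 0;
	}

or

	/* Two entry points, each doing exactly one thing. */
	int do_sleep(void)
	{
		do_one_thing();
		do_another_thing();
		return 0;
	}

	int prepare_to_sleep(void)
	{
		do_one_thing();
		do_yet_another_thing();
		return 0;
	}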

and quite frankly, while the second case may take more lines of code, 
anybody who says that it's not clearer what it does (because it can 
"self-document" with function names etc) is either lying, or just a really 
bad programmer. The second case is also likely faster and probably not 
larger code-size-wise either, since it does static decisions _statically_ 
(since all callers are realistically going to use a constant argument 
anyway, and the argument really is static).

Finally, the second case is *much* easier to fix, exactly because it 
doesn't mix up the cases. You can change the arguments, you can have 
totally different locking, you don't need things like

	int gfp = (arg == SLEEP) ? GFP_ATOMIC : GFP_KERNEL;

etc, and it's just more logical.

So don't overload a function. That's the *bug* with the current 
"dev->suspend()" interface already. Don't re-create it. The current one 
overloads two *totally*different* operations onto one function. 

Just don't do it. Not in the suspend part, not *ever*.

		Linus


* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 10:21                                             ` Johannes Berg
                                                                 ` (3 preceding siblings ...)
  2007-04-27 15:56                                               ` David Brownell
@ 2007-04-27 15:56                                               ` David Brownell
  2007-04-27 18:31                                                 ` Rafael J. Wysocki
  2007-04-27 18:31                                                 ` Rafael J. Wysocki
  2007-05-07 12:29                                               ` Pavel Machek
  2007-05-07 12:29                                               ` Pavel Machek
  6 siblings, 2 replies; 712+ messages in thread
From: David Brownell @ 2007-04-27 15:56 UTC (permalink / raw)
  To: linux-pm
  Cc: Johannes Berg, Pavel Machek, Nick Piggin, Andrew Morton,
	Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk,
	suspend2-devel, Thomas Gleixner, Linus Torvalds, Ingo Molnar,
	Arjan van de Ven

On Friday 27 April 2007, Johannes Berg wrote:
> 
>  * FREEZE       Quiesce operations so that a consistent image can be saved;
>  *              but do NOT otherwise enter a low power device state, and do
>  *              NOT emit system wakeup events.
>  *
>  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
>  *              the system from a snapshot taken after an earlier FREEZE.
>  *              Some drivers will need to reset their hardware state instead
>  *              of preserving it, to ensure that it's never mistaken for the
>  *              state which that earlier snapshot had set up.
> 
> Why is prethaw even necessary?

Read the patch comments for the patch adding that transition.  Briefly,
adding that transition to swsusp resume was a significant bugfix for
all drivers that rely on controller state to determine how to resume.

(That's mostly drivers that are intelligent about wakeup events... so
unless you're working with such drivers, the issue may be unclear.)


> As far as I can tell it's only necessary 
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.

More like:  because swsusp overloaded the suspend()/resume() code paths
to do double duty.

Instead of just putting devices into low power states (just *which* state
is another discussion), they evolved into support for swsusp transitions...
causing trouble (and sometimes breakage) for non-swsusp models.


> Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
> better name?) that are called at the appropriate places (with
> freeze/thaw around preparing the image and freeze/restart around
> restoring) would go a long way toward clearing up the confusion in all the
> drivers. Of course, it'd have to be documented that freeze/thaw isn't
> the only valid combination but that freeze/restart is used too, but
> that's not hard to do nor hard to understand.

I suspect that after snapshot resume, restart() should always be used.
That shouldn't be hard to understand at all.  It'd be sub-optimal in
the same cases today's system resume is sub-optimal:  devices that
were in low power states before system suspend wouldn't be that way
after system resume.
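
A toy sketch of the two pairings being discussed, with freeze/thaw around
creating the image and freeze/restart around restoring it. Only the
callback names come from the quoted proposal. struct hib_dev_ops, the
hib_* helpers and the trace buffer are assumptions, and the real restore
path spans two kernels rather than one function.

```c
#include <string.h>

/* Hypothetical table using the callback names from the quoted proposal. */
struct hib_dev_ops {
	int (*freeze)(void *dev);
	int (*thaw)(void *dev);
	int (*restart)(void *dev);
};

static char hib_trace[16];

static int demo_freeze(void *dev)  { (void)dev; strcat(hib_trace, "F"); return 0; }
static int demo_thaw(void *dev)    { (void)dev; strcat(hib_trace, "T"); return 0; }
static int demo_restart(void *dev) { (void)dev; strcat(hib_trace, "R"); return 0; }

static const struct hib_dev_ops demo_hib_ops = {
	.freeze  = demo_freeze,
	.thaw    = demo_thaw,
	.restart = demo_restart,
};

/* Creating the image: quiesce, save, then carry on with preserved state. */
static int hib_create_image(const struct hib_dev_ops *ops, void *dev)
{
	int err = ops->freeze(dev);

	if (err)
		return err;
	/* ... the image would be written here ... */
	return ops->thaw(dev);
}

/* Restoring: quiesce, load the image, then reinitialize the hardware. */
static int hib_restore_image(const struct hib_dev_ops *ops, void *dev)
{
	int err = ops->freeze(dev);

	if (err)
		return err;
	/* ... the saved image would be loaded here ... */
	return ops->restart(dev);
}
```

Each path names its closing callback directly, so a driver never has to
guess from resume() alone whether it is thawing or starting over.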


> And, incidentally, it could possibly make both suspend and hibernate
> work much faster too. The comments there talk about "minimally power
> management aware" drivers which always do the wrong thing for suspend,
> in that they always reset everything...

That comment was purely about existing practice ... and was mostly
about resume() processing, not suspend() paths.

It's an unfortunate reality that most device drivers are stupid in
terms of power management, so we need to be clear about just how
stupid they're allowed to be without being terminally broken.

Additionally, it would be a Good Thing if changes to clean up the
swsusp-related code paths didn't make "real suspend" more painful.


> Of course, some drivers will 
> actually need to do that, but if freeze/suspend and thaw/restart/resume
> have the same prototypes (probably just int <function>(void)) then
> drivers can trivially assign the same there.
> And hibernate would benefit since a lot of drivers could do a lot less
> work for freeze/thaw.

That actually gets into discussions from a while back about wanting
to be able to quiesce() devices, as separate from actually putting
them into low power states.

- Dave


* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:56                                               ` [linux-pm] " David Brownell
@ 2007-04-27 18:31                                                 ` Rafael J. Wysocki
  2007-04-27 18:31                                                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 18:31 UTC (permalink / raw)
  To: David Brownell
  Cc: linux-pm, Johannes Berg, Pavel Machek, Nick Piggin,
	Andrew Morton, Mike Galbraith, linux-kernel, Con Kolivas,
	Adrian Bunk, suspend2-devel, Thomas Gleixner, Linus Torvalds,
	Ingo Molnar, Arjan van de Ven

On Friday, 27 April 2007 17:56, David Brownell wrote:
> On Friday 27 April 2007, Johannes Berg wrote:
> > 
> >  * FREEZE       Quiesce operations so that a consistent image can be saved;
> >  *              but do NOT otherwise enter a low power device state, and do
> >  *              NOT emit system wakeup events.
> >  *
> >  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
> >  *              the system from a snapshot taken after an earlier FREEZE.
> >  *              Some drivers will need to reset their hardware state instead
> >  *              of preserving it, to ensure that it's never mistaken for the
> >  *              state which that earlier snapshot had set up.
> > 
> > Why is prethaw even necessary?
> 
> Read the patch comments for the patch adding that transition.  Briefly,
> adding that transition to swsusp resume was a significant bugfix for
> all drivers that rely on controller state to determine how to resume.
> 
> (That's mostly drivers that are intelligent about wakeup events... so
> unless you're working with such drivers, the issue may be unclear.)
> 
> 
> > As far as I can tell it's only necessary 
> > because resume() can't tell you whether you just want to thaw or need to
> > reset since it doesn't tell you at what point it's invoked.
> 
> More like:  because swsusp overloaded the suspend()/resume() code paths
> to do double duty.
> 
> Instead of just putting devices into low power states (just *which* state
> is another discussion), they evolved into support for swsusp transitions...
> causing trouble (and sometimes breakage) for non-swsusp models.
> 
> 
> > Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
> > better name?) that are called at the appropriate places (with
> > freeze/thaw around preparing the image and freeze/restart around
> > restoring) would go a long way toward clearing up the confusion in all the
> > drivers. Of course, it'd have to be documented that freeze/thaw isn't
> > the only valid combination but that freeze/restart is used too, but
> > that's not hard to do nor hard to understand.
> 
> I suspect that after snapshot resume restart() should always be used.
> That shouldn't be hard to understand at all.  It'd be sub-optimal in
> the same cases today's system resume is sub-optimal:  devices that
> were in low power states before system suspend wouldn't be that way
> after system resume.
> 
> 
> > And, incidentally, it could possibly make both suspend and hibernate
> > work much faster too. The comments there talk about "minimally power
> > management aware" drivers which always do the wrong thing for suspend,
> > in that they always reset everything...
> 
> That comment was purely about existing practice ... and was mostly
> about resume() processing, not suspend() paths.
> 
> It's an unfortunate reality that most device drivers are stupid in
> terms of power management, so we need to be clear about just how
> stupid they're allowed to be without being terminally broken.
> 
> Additionally, it would be a Good Thing if changes to clean up the
> swsusp-related code paths didn't make "real suspend" more painful.
> 
> 
> > Of course, some drivers will 
> > actually need to do that, but if freeze/suspend and thaw/restart/resume
> > have the same prototypes (probably just int <function>(void)) then
> > drivers can trivially assign the same there.
> > And hibernate would benefit since a lot of drivers could do a lot less
> > work for freeze/thaw.
> 
> That actually gets into discussions from a while back about wanting
> to be able to quiesce() devices, as separate from actually putting
> them into low power states.

Yes, exactly.

Moreover, I think we should separate the current suspend code from the
hibernation (aka STD) code paths we're discussing.  I mean, we need
hibernation-specific equivalents of drivers/base/power/suspend.c etc.

Greetings,
Rafael


* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 15:56                                               ` [linux-pm] " David Brownell
  2007-04-27 18:31                                                 ` Rafael J. Wysocki
@ 2007-04-27 18:31                                                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 18:31 UTC (permalink / raw)
  To: David Brownell
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith,
	linux-kernel, Con Kolivas, Adrian Bunk, Johannes Berg,
	Thomas Gleixner, Pavel Machek, linux-pm, Linus Torvalds,
	Andrew Morton, Arjan van de Ven

On Friday, 27 April 2007 17:56, David Brownell wrote:
> On Friday 27 April 2007, Johannes Berg wrote:
> > 
> >  * FREEZE       Quiesce operations so that a consistent image can be saved;
> >  *              but do NOT otherwise enter a low power device state, and do
> >  *              NOT emit system wakeup events.
> >  *
> >  * PRETHAW      Quiesce as if for FREEZE; additionally, prepare for restoring
> >  *              the system from a snapshot taken after an earlier FREEZE.
> >  *              Some drivers will need to reset their hardware state instead
> >  *              of preserving it, to ensure that it's never mistaken for the
> >  *              state which that earlier snapshot had set up.
> > 
> > Why is prethaw even necessary?
> 
> Read the patch comments for the patch adding that transition.  Briefly,
> adding that transition to swsusp resume was a significant bugfix for
> all drivers that rely on controller state to determine how to resume.
> 
> (That's mostly drivers that are intelligent about wakeup events... so
> unless you're working with such drivers, the issue may be unclear.)
> 
> 
> > As far as I can tell it's only necessary 
> > because resume() can't tell you whether you just want to thaw or need to
> > reset since it doesn't tell you at what point it's invoked.
> 
> More like:  because swsusp overloaded the suspend()/resume() code paths
> to do double duty.
> 
> Instead of just putting devices into low power states (just *which* state
> is another discussion), they evolved into support for swsusp transitions...
> causing trouble (and sometimes breakage) for non-swsusp models.
> 
> 
> > Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
> > better name?) that are called at the appropriate places (with
> > freeze/thaw around preparing the image and freeze/restart around
> > restoring would go a long way of clearing up the confusion in all the
> > drivers. Of course, it'd have to be documented that freeze/thaw isn't
> > the only valid combination but that freeze/restart is used too, but
> > that's not hard to do nor hard to understand.
> 
> I suspect that after snapshot resume restart() should always be used.
> That shouldn't be hard to understand at all.  It'd be sub-optimal in
> the same cases today's system resume is sub-optimal:  devices that
> were in low power states before system suspend wouldn't be that way
> after system resume.
> 
> 
> > And, incidentally, it could possibly make both suspend and hibernate
> > work much faster too. The comments there talk about "minimally power
> > management aware" drivers which always do the wrong thing for suspend,
> > in that they always reset everything...
> 
> That comment was purely about existing practice ... and was mostly
> about resume() processing, not suspend() paths.
> 
> It's an unfortunate reality that most device drivers are stupid in
> terms of power management, so we need to be clear about just how
> stupid they're allowed to be without being terminally broken.
> 
> Additionally, it would be a Good Thing if changes to clean up the
> swsusp-related code paths didn't make "real suspend" more painful.
> 
> 
> > Of course, some drivers will 
> > actually need to do that, but if freeze/suspend and thaw/restart/resume
> > have the same prototypes (probably just int <function>(void)) then
> > drivers can trivially assign the same there.
> > And hibernate would benefit since a lot of drivers could do a lot less
> > work for freeze/thaw.
> 
> That actually gets into discussions from a while back about wanting
> to be able to quiesce() devices, as separate from actually putting
> them into low power states.

Yes, exactly.

Moreover, I think we should separate the current suspend code from the
hibernation (aka STD) code paths we're discussing.  I mean, we need
hibernation-specific equivalents of drivers/base/power/suspend.c etc.
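
To make the freeze/thaw/restart idea above concrete, here is a rough userspace sketch, not kernel code. All names prefixed with sketch_ are made up for illustration; the only things taken from the discussion are the callback names, the "int <function>(void)" prototype, and the pairing of freeze/thaw around creating the image and freeze/restart around restoring it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback table; every callback shares one prototype. */
struct sketch_pm_callbacks {
	int (*freeze)(void);	/* quiesce before the image is created or restored */
	int (*thaw)(void);	/* undo freeze after the image has been created */
	int (*restart)(void);	/* bring the device back after the image is restored */
	int (*suspend)(void);	/* enter a low power state */
	int (*resume)(void);	/* leave a low power state */
};

static int sketch_calls;	/* counts callback invocations, illustration only */

static int sketch_simple_stop(void)
{
	sketch_calls++;
	return 0;
}

static int sketch_simple_start(void)
{
	sketch_calls++;
	return 0;
}

/* A simple driver can assign the same two functions everywhere. */
static const struct sketch_pm_callbacks sketch_simple_driver = {
	.freeze  = sketch_simple_stop,
	.suspend = sketch_simple_stop,
	.thaw    = sketch_simple_start,
	.restart = sketch_simple_start,
	.resume  = sketch_simple_start,
};

/* Creating the image: freeze, (snapshot would happen here), thaw. */
static int sketch_snapshot_cycle(const struct sketch_pm_callbacks *cb)
{
	int error;

	assert(cb && cb->freeze && cb->thaw);
	error = cb->freeze();
	if (error)
		return error;
	return cb->thaw();
}

/* Restoring the image: freeze, (restore would happen here), restart. */
static int sketch_restore_cycle(const struct sketch_pm_callbacks *cb)
{
	int error;

	assert(cb && cb->freeze && cb->restart);
	error = cb->freeze();
	if (error)
		return error;
	return cb->restart();
}
```

A smarter driver would point freeze/thaw at something cheaper than its suspend/resume code, which is where hibernation could save work.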

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:52                                                         ` Linus Torvalds
  (?)
  (?)
@ 2007-04-27 18:34                                                         ` Rafael J. Wysocki
  -1 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 18:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Berg, Alan Stern, Nick Piggin, Ingo Molnar,
	suspend2-devel, Mike Galbraith, Kernel development list,
	Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek,
	Andrew Morton, linux-pm, Arjan van de Ven

On Friday, 27 April 2007 17:52, Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > I think we can use 'stages' and pass them as arguments to the functions.
> 
> No, no NOOOO!
> 
> If you use stages, just describe them in the function name instead.	
> 
> > quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
> > before quiescing devices (if any)
> > ...
> > quiesce(PRE_SNAPSHOT)
> > ...
> > quiesce(PRE_SNAPSHOT_IRQ_OFF)
> 
> There is *no* advantage to this (and _lots_ of disadvantages) compared to 
> saying
> 
> 	dev->snapshot_prepare(dev);
> 	dev->snapshot_freeze(dev);
> 	dev->snapshot(dev)
> 
> The latter is
>  - more readable
>  - MUCH easier for programmers to write readable code for (if-statements 
>    and case-statements are *by*definition* more complicated to parse both 
>    for humans and for CPU's - static information is good)
>  - allows for the different stages to have different arguments, and 
>    somewhat related to that, to have better static C type checking.
> 
> Look here, which one is more readable:
> 
> 	int some_mixed_function(int arg)
> 	{
> 		do_one_thing();
> 		if (arg == SLEEP)
> 			do_another_thing();
> 		else
> 			do_yet_another_thing();
> 	}
> 
> or
> 
> 	int do_sleep(void)
> 	{
> 		do_one_thing();
> 		do_another_thing();
> 	}
> 
> 	int prepare_to_sleep(void)
> 	{
> 		do_one_thing();
> 		do_yet_another_thing();
> 	}
> 
> and quite frankly, while the second case may take more lines of code, 
> anybody who says that it's not clearer what it does (because it can 
> "self-document" with function names etc) is either lying, or just a really 
> bad programmer. The second case is also likely faster and probably not 
> larger code-size-wise either, since it does static decisions _statically_ 
> (since all callers are realistically going to use a constant argument 
> anyway, and the argument really is static).
> 
> Finally, the second case is *much* easier to fix, exactly because it 
> doesn't mix up the cases. You can change the arguments, you can have 
> totally different locking, you don't need things like
> 
> 	int gfp = (arg == SLEEP) ? GFP_ATOMIC : GFP_KERNEL;
> 
> etc, and it's just more logical.
> 
> So don't overload a function. That's the *bug* with the current 
> "dev->suspend()" interface already. Don't re-create it. The current one 
> overloads two *totally*different* operations onto one function. 
> 
> Just don't do it. Not in the suspend part, not *ever*.

OK, I won't.
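
For the record, a minimal userspace sketch of the "one callback per stage" layout described above could look like the following. The callback names are the ones from the quoted example; the struct names, the log buffer and the sketch_ helpers are assumptions made up for illustration and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_device;

/* One named callback per stage instead of quiesce(STAGE). */
struct sketch_snapshot_ops {
	int (*snapshot_prepare)(struct sketch_device *dev);
	int (*snapshot_freeze)(struct sketch_device *dev);
	int (*snapshot)(struct sketch_device *dev);
};

struct sketch_device {
	const struct sketch_snapshot_ops *ops;
	char log[32];	/* records which stages ran, illustration only */
};

static int sketch_prepare(struct sketch_device *dev)
{
	strcat(dev->log, "P");
	return 0;
}

static int sketch_freeze(struct sketch_device *dev)
{
	strcat(dev->log, "F");
	return 0;
}

static int sketch_snapshot(struct sketch_device *dev)
{
	strcat(dev->log, "S");
	return 0;
}

static const struct sketch_snapshot_ops sketch_ops = {
	.snapshot_prepare = sketch_prepare,
	.snapshot_freeze  = sketch_freeze,
	.snapshot         = sketch_snapshot,
};

/* The core calls the stages in order and stops at the first error. */
static int sketch_run_snapshot(struct sketch_device *dev)
{
	int error;

	assert(dev && dev->ops);
	error = dev->ops->snapshot_prepare(dev);
	if (error)
		return error;
	error = dev->ops->snapshot_freeze(dev);
	if (error)
		return error;
	return dev->ops->snapshot(dev);
}
```

Each stage can later grow its own arguments or locking rules without touching the others, and no driver needs a switch on a stage constant.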

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-26 16:31                                                 ` Johannes Berg
  (?)
  (?)
@ 2007-04-29 12:48                                                 ` R. J. Wysocki
  2007-04-29 12:53                                                   ` Rafael J. Wysocki
  2007-04-30  8:29                                                   ` Johannes Berg
  -1 siblings, 2 replies; 712+ messages in thread
From: R. J. Wysocki @ 2007-04-29 12:48 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek

[Trimmed the CC list to a reasonable minimum]

On Thursday, 26 April 2007 18:31, Johannes Berg wrote:
> On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:
> 
> > > From looking at pm_ops which I was recently working with a lot, it seems
> > > that it was designed by somebody who was reading the ACPI documentation
> > > and was otherwise pretty clueless, even at that level std tries to look
> > > like suspend. IMHO that is one of the first things that should be ripped
> > > out, no pm_ops for STD, it's a pain to work with.
> > 
> > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> > low-level enter is pretty similar).
> > 
> > Patches would be welcome
> 
> That was easier than I thought. This applies on top of a patch that
> makes kernel/power/user.c optional since I had no idea how to fix it,
> problems I see:
>  * it surfaces kernel implementation details about pm_ops and thus makes
>    the whole thing very fragile
>  * it has yet another interface (yuck) to determine whether to reboot,
>    shut down etc, doesn't use /sys/power/disk
>  * I generally had no idea wtf it is doing in some places
> 
> Anyway, this patch is only compile tested, it
>  * introduces include/linux/hibernate.h with hibernate_ops and
>    a new hibernate() function to hibernate the system
>  * rips apart a lot of the suspend code and puts it back together using
>    the hibernate_ops
>  * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
>  * might apply/compile against -mm, I have all my and some of Rafael's
>    suspend/hibernate work in my tree.
>  * breaks user suspend as I noted above
>  * is incomplete, somewhere pm_suspend_disk() is still defined iirc

OK, I reworked it a bit.

Main changes:

- IMHO 'hibernation_ops' sounds better than 'hibernate_ops', for example, so
now the new names start with 'hibernation_' (or 'HIBERNATION_')

- Moved the hibernation-related definitions to include/linux/suspend.h, since
some hibernation-specific definitions are already there.  We can introduce
hibernation.h in a separate patch (it'll have to #include suspend.h IMO).

- Renamed the identifiers starting with 'pm_disk_' (or 'PM_DISK_').

- Cleaned up the new ACPI code (it didn't compile and included some things
unrelated to hibernation).  I'm still not sure about acpi_hibernation_finish()
(is the code after acpi_disable_wakeup_device() really needed?)

- Made kernel/power/user.c compile (and hopefully work too)

It looks like we'll have to change CONFIG_SOFTWARE_SUSPEND into
CONFIG_HIBERNATION, since some pieces of code now look silly.

The appended patch is against 2.6.21-rc7-mm2 with two freezer patches applied
(should not affect this one).  Compilation tested on x86_64.
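
As a rough illustration of how the new interface is meant to be used, here is a userspace sketch, not the patch itself. The callback table shape (prepare and enter returning int, finish returning void, none taking a state argument) is inferred from the ACPI part of the patch; the sketch_ names, the reduced two-value mode enum and the fake platform callbacks are assumptions made up for the example, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed shape of the callback table: three callbacks, no state argument. */
struct sketch_hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* Reduced, hypothetical set of modes; the patch has more of them. */
enum sketch_mode { SKETCH_MODE_SHUTDOWN, SKETCH_MODE_PLATFORM };

static struct sketch_hibernation_ops *sketch_ops;
static enum sketch_mode sketch_current_mode = SKETCH_MODE_SHUTDOWN;

/* Registration: a non-NULL table must provide all three callbacks. */
static void sketch_set_ops(struct sketch_hibernation_ops *ops)
{
	if (ops)
		assert(ops->prepare && ops->enter && ops->finish);
	sketch_ops = ops;
}

/* 'platform' can only be selected when a platform driver registered ops. */
static int sketch_set_mode(enum sketch_mode mode)
{
	if (mode == SKETCH_MODE_PLATFORM && !sketch_ops)
		return -EINVAL;
	sketch_current_mode = mode;
	return 0;
}

/* Same test as platform_prepare() in the patch: mode first, then ops. */
static int sketch_platform_prepare(void)
{
	return (sketch_current_mode == SKETCH_MODE_PLATFORM && sketch_ops) ?
		sketch_ops->prepare() : 0;
}

/* Fake platform driver, for illustration only. */
static int sketch_prepare_calls;

static int sketch_fake_prepare(void)
{
	sketch_prepare_calls++;
	return 0;
}

static int sketch_fake_enter(void)
{
	return 0;
}

static void sketch_fake_finish(void)
{
}

static struct sketch_hibernation_ops sketch_fake_platform = {
	.prepare = sketch_fake_prepare,
	.enter   = sketch_fake_enter,
	.finish  = sketch_fake_finish,
};
```

The point is that the presence of the ops table, rather than a mode field inside pm_ops, decides whether 'platform' is available.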

Greetings,
Rafael

---
 Documentation/power/userland-swsusp.txt |   26 ++--
 drivers/acpi/sleep/main.c               |   79 +++++++++++--
 drivers/acpi/sleep/proc.c               |    2 
 drivers/i2c/chips/tps65010.c            |    2 
 include/linux/pm.h                      |   31 -----
 kernel/power/disk.c                     |  184 +++++++++++++++++---------------
 kernel/power/main.c                     |   42 ++-----
 kernel/power/power.h                    |    7 -
 kernel/power/user.c                     |   13 +-
 kernel/sys.c                            |    2 
 10 files changed, 204 insertions(+), 184 deletions(-)

Index: linux-2.6.21-rc7-mm2/include/linux/pm.h
===================================================================
--- linux-2.6.21-rc7-mm2.orig/include/linux/pm.h	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/include/linux/pm.h	2007-04-29 13:39:17.000000000 +0200
@@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_ON		((__force suspend_state_t) 0)
 #define PM_SUSPEND_STANDBY	((__force suspend_state_t) 1)
 #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
-#define PM_SUSPEND_DISK		((__force suspend_state_t) 4)
-#define PM_SUSPEND_MAX		((__force suspend_state_t) 5)
-
-typedef int __bitwise suspend_disk_method_t;
-
-/* invalid must be 0 so struct pm_ops initialisers can leave it out */
-#define PM_DISK_INVALID		((__force suspend_disk_method_t) 0)
-#define	PM_DISK_PLATFORM	((__force suspend_disk_method_t) 1)
-#define	PM_DISK_SHUTDOWN	((__force suspend_disk_method_t) 2)
-#define	PM_DISK_REBOOT		((__force suspend_disk_method_t) 3)
-#define	PM_DISK_TEST		((__force suspend_disk_method_t) 4)
-#define	PM_DISK_TESTPROC	((__force suspend_disk_method_t) 5)
-#define	PM_DISK_MAX		((__force suspend_disk_method_t) 6)
+#define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
 
 /**
  * struct pm_ops - Callbacks for managing platform dependent suspend states.
  * @valid: Callback to determine whether the given state can be entered.
- * 	If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
- *	always valid and never passed to this call. If not assigned,
- *	no suspend states are valid.
  *	Valid states are advertised in /sys/power/state but can still
  *	be rejected by prepare or enter if the conditions aren't right.
  *	There is a %pm_valid_only_mem function available that can be assigned
@@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
  *
  * @finish: Called when the system has left the given state and all devices
  *	are resumed. The return value is ignored.
- *
- * @pm_disk_mode: The generic code always allows one of the shutdown methods
- *	%PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
- *	%PM_DISK_TESTPROC. If this variable is set, the mode it is set
- *	to is allowed in addition to those modes and is also made default.
- *	When this mode is sent selected, the @prepare call will be called
- *	before suspending to disk (if present), the @enter call should be
- *	present and will be called after all state has been saved and the
- *	machine is ready to be powered off; the @finish callback is called
- *	after state has been restored. All these calls are called with
- *	%PM_SUSPEND_DISK as the state.
  */
 struct pm_ops {
 	int (*valid)(suspend_state_t state);
 	int (*prepare)(suspend_state_t state);
 	int (*enter)(suspend_state_t state);
 	int (*finish)(suspend_state_t state);
-	suspend_disk_method_t pm_disk_mode;
 };
 
 /**
@@ -258,8 +231,6 @@ extern void device_power_up(void);
 extern void device_resume(void);
 
 #ifdef CONFIG_PM
-extern suspend_disk_method_t pm_disk_mode;
-
 extern int device_suspend(pm_message_t state);
 extern int device_prepare_suspend(pm_message_t state);
 
Index: linux-2.6.21-rc7-mm2/kernel/power/main.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/power/main.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/kernel/power/main.c	2007-04-29 13:43:34.000000000 +0200
@@ -30,7 +30,6 @@
 DEFINE_MUTEX(pm_mutex);
 
 struct pm_ops *pm_ops;
-suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
 
 /**
  *	pm_set_ops - Set the global power method table. 
@@ -41,10 +40,6 @@ void pm_set_ops(struct pm_ops * ops)
 {
 	mutex_lock(&pm_mutex);
 	pm_ops = ops;
-	if (ops && ops->pm_disk_mode != PM_DISK_INVALID) {
-		pm_disk_mode = ops->pm_disk_mode;
-	} else
-		pm_disk_mode = PM_DISK_SHUTDOWN;
 	mutex_unlock(&pm_mutex);
 }
 
@@ -196,24 +191,12 @@ static void suspend_finish(suspend_state
 static const char * const pm_states[PM_SUSPEND_MAX] = {
 	[PM_SUSPEND_STANDBY]	= "standby",
 	[PM_SUSPEND_MEM]	= "mem",
-	[PM_SUSPEND_DISK]	= "disk",
 };
 
 static inline int valid_state(suspend_state_t state)
 {
-	/* Suspend-to-disk does not really need low-level support.
-	 * It can work with shutdown/reboot if needed. If it isn't
-	 * configured, then it cannot be supported.
-	 */
-	if (state == PM_SUSPEND_DISK)
-#ifdef CONFIG_SOFTWARE_SUSPEND
-		return 1;
-#else
-		return 0;
-#endif
-
-	/* all other states need lowlevel support and need to be
-	 * valid to the lowlevel implementation, no valid callback
+	/* All states need lowlevel support and need to be valid
+	 * to the lowlevel implementation, no valid callback
 	 * implies that none are valid. */
 	if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state))
 		return 0;
@@ -241,11 +224,6 @@ static int enter_state(suspend_state_t s
 	if (!mutex_trylock(&pm_mutex))
 		return -EBUSY;
 
-	if (state == PM_SUSPEND_DISK) {
-		error = pm_suspend_disk();
-		goto Unlock;
-	}
-
 	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
 	if ((error = suspend_prepare(state)))
 		goto Unlock;
@@ -263,7 +241,7 @@ static int enter_state(suspend_state_t s
 
 /**
  *	pm_suspend - Externally visible function for suspending system.
- *	@state:		Enumarted value of state to enter.
+ *	@state:		Enumerated value of state to enter.
  *
  *	Determine whether or not value is within range, get state 
  *	structure, and enter (above).
@@ -301,7 +279,13 @@ static ssize_t state_show(struct subsyst
 		if (pm_states[i] && valid_state(i))
 			s += sprintf(s,"%s ", pm_states[i]);
 	}
-	s += sprintf(s,"\n");
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	s += sprintf(s, "%s\n", "disk");
+#else
+	if (s != buf)
+		/* convert the last space to a newline */
+		*(s-1) = '\n';
+#endif
 	return (s - buf);
 }
 
@@ -316,6 +300,12 @@ static ssize_t state_store(struct subsys
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
+	/* First, check if we are requested to hibernate */
+	if (len == 4 && !strncmp(buf, "disk", len)) {
+		error = hibernate();
+		return error ? error : n;
+	}
+
 	for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) {
 		if (*s && !strncmp(buf, *s, len))
 			break;
Index: linux-2.6.21-rc7-mm2/kernel/power/disk.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/power/disk.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/kernel/power/disk.c	2007-04-29 13:54:50.000000000 +0200
@@ -30,30 +30,60 @@ char resume_file[256] = CONFIG_PM_STD_PA
 dev_t swsusp_resume_device;
 sector_t swsusp_resume_block;
 
+static int hibernation_mode;
+
+enum {
+	HIBERNATION_INVALID,
+	HIBERNATION_PLATFORM,
+	HIBERNATION_TEST,
+	HIBERNATION_TESTPROC,
+	HIBERNATION_SHUTDOWN,
+	HIBERNATION_REBOOT,
+	/* keep last */
+	__HIBERNATION_AFTER_LAST
+};
+#define HIBERNATION_MAX (__HIBERNATION_AFTER_LAST-1)
+#define HIBERNATION_FIRST (HIBERNATION_INVALID + 1)
+
+struct hibernation_ops *hibernation_ops;
+
+void hibernation_set_ops(struct hibernation_ops *ops)
+{
+	mutex_lock(&pm_mutex);
+	hibernation_ops = ops;
+	mutex_unlock(&pm_mutex);
+	if (hibernation_ops) {
+		BUG_ON(!hibernation_ops->prepare);
+		BUG_ON(!hibernation_ops->enter);
+		BUG_ON(!hibernation_ops->finish);
+	}
+}
+
+
 /**
  *	platform_prepare - prepare the machine for hibernation using the
  *	platform driver if so configured and return an error code if it fails
  */
 
-static inline int platform_prepare(void)
+static int platform_prepare(void)
 {
-	int error = 0;
+	return (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops) ?
+		hibernation_ops->prepare() : 0;
+}
 
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->prepare)
-			error = pm_ops->prepare(PM_SUSPEND_DISK);
-	}
-	return error;
+/**
+ *	platform_finish - switch the machine to the normal mode of operation
+ *	using the platform driver (must be called after platform_prepare())
+ */
+
+static void platform_finish(void)
+{
+	if (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops)
+		hibernation_ops->finish();
 }
 
 /**
- *	power_down - Shut machine down for hibernate.
+ *	power_down - Shut the machine down for hibernation.
  *
  *	Use the platform driver, if configured so; otherwise try
  *	to power off or reboot.
@@ -61,20 +91,20 @@ static inline int platform_prepare(void)
 
 static void power_down(void)
 {
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
+	switch (hibernation_mode) {
+	case HIBERNATION_TEST:
+	case HIBERNATION_TESTPROC:
 		break;
-	case PM_DISK_SHUTDOWN:
+	case HIBERNATION_SHUTDOWN:
 		kernel_power_off();
 		break;
-	case PM_DISK_REBOOT:
+	case HIBERNATION_REBOOT:
 		kernel_restart(NULL);
 		break;
-	default:
-		if (pm_ops && pm_ops->enter) {
+	case HIBERNATION_PLATFORM:
+		if (hibernation_ops) {
 			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-			pm_ops->enter(PM_SUSPEND_DISK);
+			hibernation_ops->enter();
 			break;
 		}
 	}
@@ -87,20 +117,6 @@ static void power_down(void)
 	while(1);
 }
 
-static inline void platform_finish(void)
-{
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->finish)
-			pm_ops->finish(PM_SUSPEND_DISK);
-	}
-}
-
 static void unprepare_processes(void)
 {
 	thaw_processes();
@@ -120,13 +136,10 @@ static int prepare_processes(void)
 }
 
 /**
- *	pm_suspend_disk - The granpappy of hibernation power management.
- *
- *	If not, then call swsusp to do its thing, then figure out how
- *	to power down the system.
+ *	hibernate - The granpappy of the built-in hibernation management
  */
 
-int pm_suspend_disk(void)
+int hibernate(void)
 {
 	int error;
 
@@ -151,7 +164,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Thaw;
 
-	if (pm_disk_mode == PM_DISK_TESTPROC) {
+	if (hibernation_mode == HIBERNATION_TESTPROC) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Thaw;
@@ -176,7 +189,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Enable_cpus;
 
-	if (pm_disk_mode == PM_DISK_TEST) {
+	if (hibernation_mode == HIBERNATION_TEST) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Enable_cpus;
@@ -230,7 +243,7 @@ int pm_suspend_disk(void)
  *	Called as a late_initcall (so all devices are discovered and
  *	initialized), we call swsusp to see if we have a saved image or not.
  *	If so, we quiesce devices, the restore the saved image. We will
- *	return above (in pm_suspend_disk() ) if everything goes well.
+ *	return above (in hibernate() ) if everything goes well.
  *	Otherwise, we fail gracefully and return to the normally
  *	scheduled program.
  *
@@ -336,25 +349,26 @@ static int software_resume(void)
 late_initcall(software_resume);
 
 
-static const char * const pm_disk_modes[] = {
-	[PM_DISK_PLATFORM]	= "platform",
-	[PM_DISK_SHUTDOWN]	= "shutdown",
-	[PM_DISK_REBOOT]	= "reboot",
-	[PM_DISK_TEST]		= "test",
-	[PM_DISK_TESTPROC]	= "testproc",
+static const char * const hibernation_modes[] = {
+	[HIBERNATION_PLATFORM]	= "platform",
+	[HIBERNATION_SHUTDOWN]	= "shutdown",
+	[HIBERNATION_REBOOT]	= "reboot",
+	[HIBERNATION_TEST]	= "test",
+	[HIBERNATION_TESTPROC]	= "testproc",
 };
 
 /**
- *	disk - Control suspend-to-disk mode
+ *	disk - Control hibernation mode
  *
  *	Suspend-to-disk can be handled in several ways. We have a few options
  *	for putting the system to sleep - using the platform driver (e.g. ACPI
- *	or other pm_ops), powering off the system or rebooting the system
- *	(for testing) as well as the two test modes.
+ *	or other hibernation_ops), powering off the system or rebooting the
+ *	system (for testing) as well as the two test modes.
  *
  *	The system can support 'platform', and that is known a priori (and
- *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
- *	as alternatives, as well as the test modes 'test' and 'testproc'.
+ *	encoded by the presence of hibernation_ops). However, the user may
+ *	choose 'shutdown' or 'reboot' as alternatives, as well as one of the
+ *	test modes, 'test' or 'testproc'.
  *
  *	show() will display what the mode is currently set to.
  *	store() will accept one of
@@ -366,7 +380,7 @@ static const char * const pm_disk_modes[
  *	'testproc'
  *
  *	It will only change to 'platform' if the system
- *	supports it (as determined from pm_ops->pm_disk_mode).
+ *	supports it (as determined by having hibernation_ops).
  */
 
 static ssize_t disk_show(struct subsystem * subsys, char * buf)
@@ -374,27 +388,26 @@ static ssize_t disk_show(struct subsyste
 	int i;
 	char *start = buf;
 
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!pm_disk_modes[i])
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!hibernation_modes[i])
 			continue;
 		switch (i) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (i == pm_ops->pm_disk_mode))
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
 				break;
 			/* not a valid mode, continue with loop */
 			continue;
 		}
-		if (i == pm_disk_mode)
-			buf += sprintf(buf, "[%s]", pm_disk_modes[i]);
+		if (i == hibernation_mode)
+			buf += sprintf(buf, "[%s]", hibernation_modes[i]);
 		else
-			buf += sprintf(buf, "%s", pm_disk_modes[i]);
-		if (i+1 != PM_DISK_MAX)
+			buf += sprintf(buf, "%s", hibernation_modes[i]);
+		if (i != HIBERNATION_MAX)
 			buf += sprintf(buf, " ");
 	}
 	buf += sprintf(buf, "\n");
@@ -408,39 +421,38 @@ static ssize_t disk_store(struct subsyst
 	int i;
 	int len;
 	char *p;
-	suspend_disk_method_t mode = 0;
+	int mode = HIBERNATION_INVALID;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
 	mutex_lock(&pm_mutex);
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!strncmp(buf, pm_disk_modes[i], len)) {
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!strncmp(buf, hibernation_modes[i], len)) {
 			mode = i;
 			break;
 		}
 	}
-	if (mode) {
+	if (mode != HIBERNATION_INVALID) {
 		switch (mode) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
-			pm_disk_mode = mode;
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
+			hibernation_mode = mode;
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (mode == pm_ops->pm_disk_mode))
-				pm_disk_mode = mode;
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
+				hibernation_mode = mode;
 			else
 				error = -EINVAL;
 		}
-	} else {
+	} else
 		error = -EINVAL;
-	}
 
-	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
-		 pm_disk_modes[mode]);
+	if (!error)
+		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
+			 hibernation_modes[mode]);
 	mutex_unlock(&pm_mutex);
 	return error ? error : n;
 }
Index: linux-2.6.21-rc7-mm2/Documentation/power/userland-swsusp.txt
===================================================================
--- linux-2.6.21-rc7-mm2.orig/Documentation/power/userland-swsusp.txt	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/Documentation/power/userland-swsusp.txt	2007-04-29 13:39:18.000000000 +0200
@@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
 	to resume the system from RAM if there's enough battery power or restore
 	its state on the basis of the saved suspend image otherwise)
 
-SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
-	pmops->finish methods (the in-kernel swsusp knows these as the "platform
-	method") which are needed on many machines to (among others) speed up
-	the resume by letting the BIOS skip some steps or to let the system
-	recognise the correct state of the hardware after the resume (in
-	particular on many machines this ensures that unplugged AC
-	adapters get correctly detected and that kacpid does not run wild after
-	the resume).  The last ioctl() argument can take one of the three
-	values, defined in kernel/power/power.h:
+SNAPSHOT_PMOPS - enable the usage of the hibernation_ops->prepare,
+	hibernation_ops->enter and hibernation_ops->finish methods (the in-kernel
+	swsusp knows these as the "platform method") which are needed on many
+	machines to (among others) speed up the resume by letting the BIOS skip
+	some steps or to let the system recognise the correct state of the
+	hardware after the resume (in particular on many machines this ensures
+	that unplugged AC adapters get correctly detected and that kacpid does
+	not run wild after the resume).  The last ioctl() argument can take one
+	of the three values, defined in kernel/power/power.h:
 	PMOPS_PREPARE - make the kernel carry out the
-		pm_ops->prepare(PM_SUSPEND_DISK) operation
+		hibernation_ops->prepare() operation
 	PMOPS_ENTER - make the kernel power off the system by calling
-		pm_ops->enter(PM_SUSPEND_DISK)
+		hibernation_ops->enter()
 	PMOPS_FINISH - make the kernel carry out the
-		pm_ops->finish(PM_SUSPEND_DISK) operation
+		hibernation_ops->finish() operation
+	Note that the actual constants are misnamed because they surface
+	internal kernel implementation details that have changed.
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
Index: linux-2.6.21-rc7-mm2/drivers/i2c/chips/tps65010.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/drivers/i2c/chips/tps65010.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/drivers/i2c/chips/tps65010.c	2007-04-29 13:39:18.000000000 +0200
@@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
 			 * also needs to get error handling and probably
 			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
 			 */
-			pm_suspend(PM_SUSPEND_DISK);
+			hibernate();
 #endif
 			poll = 1;
 		}
Index: linux-2.6.21-rc7-mm2/kernel/sys.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/sys.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/kernel/sys.c	2007-04-29 13:39:18.000000000 +0200
@@ -942,7 +942,7 @@ asmlinkage long sys_reboot(int magic1, i
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		{
-			int ret = pm_suspend(PM_SUSPEND_DISK);
+			int ret = hibernate();
 			unlock_kernel();
 			return ret;
 		}
Index: linux-2.6.21-rc7-mm2/drivers/acpi/sleep/main.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/drivers/acpi/sleep/main.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/drivers/acpi/sleep/main.c	2007-04-29 14:16:30.000000000 +0200
@@ -29,7 +29,6 @@ static u32 acpi_suspend_states[] = {
 	[PM_SUSPEND_ON] = ACPI_STATE_S0,
 	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
 	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
-	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
 	[PM_SUSPEND_MAX] = ACPI_STATE_S5
 };
 
@@ -94,14 +93,6 @@ static int acpi_pm_enter(suspend_state_t
 		do_suspend_lowlevel();
 		break;
 
-	case PM_SUSPEND_DISK:
-		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
-			status = acpi_enter_sleep_state(acpi_state);
-		break;
-	case PM_SUSPEND_MAX:
-		acpi_power_off();
-		break;
-
 	default:
 		return -EINVAL;
 	}
@@ -157,12 +148,13 @@ int acpi_suspend(u32 acpi_state)
 	suspend_state_t states[] = {
 		[1] = PM_SUSPEND_STANDBY,
 		[3] = PM_SUSPEND_MEM,
-		[4] = PM_SUSPEND_DISK,
 		[5] = PM_SUSPEND_MAX
 	};
 
 	if (acpi_state < 6 && states[acpi_state])
 		return pm_suspend(states[acpi_state]);
+	if (acpi_state == 4)
+		return hibernate();
 	return -EINVAL;
 }
 
@@ -189,6 +181,61 @@ static struct pm_ops acpi_pm_ops = {
 	.finish = acpi_pm_finish,
 };
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int acpi_hibernation_prepare(void)
+{
+	return acpi_sleep_prepare(ACPI_STATE_S4);
+}
+
+static int acpi_hibernation_enter(void)
+{
+	acpi_status status = AE_OK;
+	unsigned long flags = 0;
+	int error;
+
+	ACPI_FLUSH_CPU_CACHE();
+
+	/* Do arch specific saving of state. */
+	error = acpi_save_state_mem();
+	if (error)
+		return error;
+
+	local_irq_save(flags);
+	acpi_enable_wakeup_device(ACPI_STATE_S4);
+	status = acpi_enter_sleep_state(ACPI_STATE_S4);
+	local_irq_restore(flags);
+
+	/*
+	 * Restore processor state
+	 * We should only be here if we're coming back from hibernation and
+	 * the memory image should have already been loaded from disk.
+	 */
+	acpi_restore_state_mem();
+
+	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
+}
+
+static void acpi_hibernation_finish(void)
+{
+	acpi_leave_sleep_state(ACPI_STATE_S4);
+	acpi_disable_wakeup_device(ACPI_STATE_S4);
+
+	/* reset firmware waking vector */
+	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
+
+	if (init_8259A_after_S1) {
+		printk("Broken toshiba laptop -> kicking interrupts\n");
+		init_8259A(0);
+	}
+}
+
+static struct hibernation_ops acpi_hibernation_ops = {
+	.prepare = acpi_hibernation_prepare,
+	.enter = acpi_hibernation_enter,
+	.finish = acpi_hibernation_finish,
+};
+#endif /* CONFIG_SOFTWARE_SUSPEND */
+
 /*
  * Toshiba fails to preserve interrupts over S1, reinitialization
  * of 8259 is needed after S1 resume.
@@ -227,14 +274,18 @@ int __init acpi_sleep_init(void)
 			sleep_states[i] = 1;
 			printk(" S%d", i);
 		}
-		if (i == ACPI_STATE_S4) {
-			if (sleep_states[i])
-				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
-		}
 	}
 	printk(")\n");
 
 	pm_set_ops(&acpi_pm_ops);
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	if (sleep_states[ACPI_STATE_S4])
+		hibernation_set_ops(&acpi_hibernation_ops);
+#else
+	sleep_states[ACPI_STATE_S4] = 0;
+#endif
+
 	return 0;
 }
 
Index: linux-2.6.21-rc7-mm2/kernel/power/power.h
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/power/power.h	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/kernel/power/power.h	2007-04-29 13:55:55.000000000 +0200
@@ -25,12 +25,7 @@ struct swsusp_info {
  */
 #define SPARE_PAGES	((1024 * 1024) >> PAGE_SHIFT)
 
-extern int pm_suspend_disk(void);
-#else
-static inline int pm_suspend_disk(void)
-{
-	return -EPERM;
-}
+extern struct hibernation_ops *hibernation_ops;
 #endif
 
 extern int pfn_is_nosave(unsigned long);
Index: linux-2.6.21-rc7-mm2/drivers/acpi/sleep/proc.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/drivers/acpi/sleep/proc.c	2007-04-29 13:39:02.000000000 +0200
+++ linux-2.6.21-rc7-mm2/drivers/acpi/sleep/proc.c	2007-04-29 13:49:42.000000000 +0200
@@ -60,7 +60,7 @@ acpi_system_write_sleep(struct file *fil
 	state = simple_strtoul(str, NULL, 0);
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	if (state == 4) {
-		error = pm_suspend(PM_SUSPEND_DISK);
+		error = hibernate();
 		goto Done;
 	}
 #endif
Index: linux-2.6.21-rc7-mm2/kernel/power/user.c
===================================================================
--- linux-2.6.21-rc7-mm2.orig/kernel/power/user.c	2007-04-29 13:43:34.000000000 +0200
+++ linux-2.6.21-rc7-mm2/kernel/power/user.c	2007-04-29 14:00:42.000000000 +0200
@@ -138,16 +138,16 @@ static inline int platform_prepare(void)
 {
 	int error = 0;
 
-	if (pm_ops && pm_ops->prepare)
-		error = pm_ops->prepare(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		error = hibernation_ops->prepare();
 
 	return error;
 }
 
 static inline void platform_finish(void)
 {
-	if (pm_ops && pm_ops->finish)
-		pm_ops->finish(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		hibernation_ops->finish();
 }
 
 static inline int snapshot_suspend(int platform_suspend)
@@ -407,7 +407,7 @@ static int snapshot_ioctl(struct inode *
 		switch (arg) {
 
 		case PMOPS_PREPARE:
-			if (pm_ops && pm_ops->enter) {
+			if (hibernation_ops) {
 				data->platform_suspend = 1;
 				error = 0;
 			} else {
@@ -418,8 +418,7 @@ static int snapshot_ioctl(struct inode *
 		case PMOPS_ENTER:
 			if (data->platform_suspend) {
 				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-				error = pm_ops->enter(PM_SUSPEND_DISK);
-				error = 0;
+				error = hibernation_ops->enter();
 			}
 			break;
 




> 
> johannes
> ---
>  Documentation/power/userland-swsusp.txt |   26 +++----
>  drivers/acpi/sleep/main.c               |   89 ++++++++++++++++++++----
>  drivers/acpi/sleep/proc.c               |    3 
>  drivers/i2c/chips/tps65010.c            |    2 
>  include/linux/hibernate.h               |   36 +++++++++
>  include/linux/pm.h                      |   31 --------
>  kernel/power/disk.c                     |  117 +++++++++++++++++++-------------
>  kernel/power/main.c                     |   47 +++++-------
>  kernel/power/power.h                    |   13 ---
>  kernel/power/user.c                     |   28 +------
>  kernel/sys.c                            |    3 
>  11 files changed, 231 insertions(+), 164 deletions(-)
> 
> --- wireless-dev.orig/include/linux/pm.h	2007-04-26 18:15:00.440691185 +0200
> +++ wireless-dev/include/linux/pm.h	2007-04-26 18:15:09.410691185 +0200
> @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
>  #define PM_SUSPEND_ON		((__force suspend_state_t) 0)
>  #define PM_SUSPEND_STANDBY	((__force suspend_state_t) 1)
>  #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
> -#define PM_SUSPEND_DISK		((__force suspend_state_t) 4)
> -#define PM_SUSPEND_MAX		((__force suspend_state_t) 5)
> -
> -typedef int __bitwise suspend_disk_method_t;
> -
> -/* invalid must be 0 so struct pm_ops initialisers can leave it out */
> -#define PM_DISK_INVALID		((__force suspend_disk_method_t) 0)
> -#define	PM_DISK_PLATFORM	((__force suspend_disk_method_t) 1)
> -#define	PM_DISK_SHUTDOWN	((__force suspend_disk_method_t) 2)
> -#define	PM_DISK_REBOOT		((__force suspend_disk_method_t) 3)
> -#define	PM_DISK_TEST		((__force suspend_disk_method_t) 4)
> -#define	PM_DISK_TESTPROC	((__force suspend_disk_method_t) 5)
> -#define	PM_DISK_MAX		((__force suspend_disk_method_t) 6)
> +#define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
>  
>  /**
>   * struct pm_ops - Callbacks for managing platform dependent suspend states.
>   * @valid: Callback to determine whether the given state can be entered.
> - * 	If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
> - *	always valid and never passed to this call. If not assigned,
> - *	no suspend states are valid.
>   *	Valid states are advertised in /sys/power/state but can still
>   *	be rejected by prepare or enter if the conditions aren't right.
>   *	There is a %pm_valid_only_mem function available that can be assigned
> @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
>   *
>   * @finish: Called when the system has left the given state and all devices
>   *	are resumed. The return value is ignored.
> - *
> - * @pm_disk_mode: The generic code always allows one of the shutdown methods
> - *	%PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
> - *	%PM_DISK_TESTPROC. If this variable is set, the mode it is set
> - *	to is allowed in addition to those modes and is also made default.
> - *	When this mode is sent selected, the @prepare call will be called
> - *	before suspending to disk (if present), the @enter call should be
> - *	present and will be called after all state has been saved and the
> - *	machine is ready to be powered off; the @finish callback is called
> - *	after state has been restored. All these calls are called with
> - *	%PM_SUSPEND_DISK as the state.
>   */
>  struct pm_ops {
>  	int (*valid)(suspend_state_t state);
>  	int (*prepare)(suspend_state_t state);
>  	int (*enter)(suspend_state_t state);
>  	int (*finish)(suspend_state_t state);
> -	suspend_disk_method_t pm_disk_mode;
>  };
>  
>  /**
> @@ -276,8 +249,6 @@ extern void device_power_up(void);
>  extern void device_resume(void);
>  
>  #ifdef CONFIG_PM
> -extern suspend_disk_method_t pm_disk_mode;
> -
>  extern int device_suspend(pm_message_t state);
>  extern int device_prepare_suspend(pm_message_t state);
>  
> --- wireless-dev.orig/kernel/power/main.c	2007-04-26 18:15:00.790691185 +0200
> +++ wireless-dev/kernel/power/main.c	2007-04-26 18:15:09.410691185 +0200
> @@ -21,6 +21,7 @@
>  #include <linux/resume-trace.h>
>  #include <linux/freezer.h>
>  #include <linux/vmstat.h>
> +#include <linux/hibernate.h>
>  
>  #include "power.h"
>  
> @@ -30,7 +31,6 @@
>  DEFINE_MUTEX(pm_mutex);
>  
>  struct pm_ops *pm_ops;
> -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
>  
>  /**
>   *	pm_set_ops - Set the global power method table. 
> @@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops)
>  {
>  	mutex_lock(&pm_mutex);
>  	pm_ops = ops;
> -	if (ops && ops->pm_disk_mode != PM_DISK_INVALID) {
> -		pm_disk_mode = ops->pm_disk_mode;
> -	} else
> -		pm_disk_mode = PM_DISK_SHUTDOWN;
>  	mutex_unlock(&pm_mutex);
>  }
>  
> @@ -184,24 +180,12 @@ static void suspend_finish(suspend_state
>  static const char * const pm_states[PM_SUSPEND_MAX] = {
>  	[PM_SUSPEND_STANDBY]	= "standby",
>  	[PM_SUSPEND_MEM]	= "mem",
> -	[PM_SUSPEND_DISK]	= "disk",
>  };
>  
>  static inline int valid_state(suspend_state_t state)
>  {
> -	/* Suspend-to-disk does not really need low-level support.
> -	 * It can work with shutdown/reboot if needed. If it isn't
> -	 * configured, then it cannot be supported.
> -	 */
> -	if (state == PM_SUSPEND_DISK)
> -#ifdef CONFIG_SOFTWARE_SUSPEND
> -		return 1;
> -#else
> -		return 0;
> -#endif
> -
> -	/* all other states need lowlevel support and need to be
> -	 * valid to the lowlevel implementation, no valid callback
> +	/* All states need lowlevel support and need to be valid
> +	 * to the lowlevel implementation, no valid callback
>  	 * implies that none are valid. */
>  	if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state))
>  		return 0;
> @@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s
>  	if (!mutex_trylock(&pm_mutex))
>  		return -EBUSY;
>  
> -	if (state == PM_SUSPEND_DISK) {
> -		error = pm_suspend_disk();
> -		goto Unlock;
> -	}
> -
>  	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
>  	if ((error = suspend_prepare(state)))
>  		goto Unlock;
> @@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s
>  
>  /**
>   *	pm_suspend - Externally visible function for suspending system.
> - *	@state:		Enumarted value of state to enter.
> + *	@state:		Enumerated value of state to enter.
>   *
>   *	Determine whether or not value is within range, get state 
>   *	structure, and enter (above).
> @@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL);
>  static ssize_t state_show(struct subsystem * subsys, char * buf)
>  {
>  	int i;
> -	char * s = buf;
> +	char *s = buf;
>  
>  	for (i = 0; i < PM_SUSPEND_MAX; i++) {
>  		if (pm_states[i] && valid_state(i))
> -			s += sprintf(s,"%s ", pm_states[i]);
> +			s += sprintf(s, "%s ", pm_states[i]);
>  	}
> -	s += sprintf(s,"\n");
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +	s += sprintf(s, "%s\n", "disk");
> +#else
> +	if (s != buf)
> +		/* convert the last space to a newline */
> +		*(s-1) = '\n';
> +#endif
>  	return (s - buf);
>  }
>  
> @@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys
>  	p = memchr(buf, '\n', n);
>  	len = p ? p - buf : n;
>  
> +	/* first check hibernate */
> +	if (!strncmp(buf, "disk", len)) {
> +		error = hibernate();
> +		return error ? error : n;
> +	}
> +
>  	for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) {
>  		if (*s && !strncmp(buf, *s, len))
>  			break;
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ wireless-dev/include/linux/hibernate.h	2007-04-26 18:21:38.130691185 +0200
> @@ -0,0 +1,36 @@
> +#ifndef __LINUX_HIBERNATE
> +#define __LINUX_HIBERNATE
> +/*
> + * hibernate ('suspend to disk') functionality
> + */
> +
> +/**
> + * struct hibernate_ops - hibernate platform support
> + *
> + * The methods in this structure allow a platform to override what
> + * happens for shutting down the machine when going into hibernation.
> + *
> + * All three methods must be assigned.
> + *
> + * @prepare: prepare system for hibernation
> + * @enter: shut down system after state has been saved to disk
> + * @finish: finish/clean up after state has been reloaded
> + */
> +struct hibernate_ops {
> +	int (*prepare)(void);
> +	int (*enter)(void);
> +	void (*finish)(void);
> +};
> +
> +/**
> + * hibernate_set_ops - set the global hibernate operations
> + * @ops: the hibernate operations to use from now on.
> + */
> +void hibernate_set_ops(struct hibernate_ops *ops);
> +
> +/**
> + * hibernate - hibernate the system
> + */
> +int hibernate(void);
> +
> +#endif /* __LINUX_HIBERNATE */
> --- wireless-dev.orig/kernel/power/disk.c	2007-04-26 18:15:00.800691185 +0200
> +++ wireless-dev/kernel/power/disk.c	2007-04-26 18:15:09.420691185 +0200
> @@ -21,45 +21,72 @@
>  #include <linux/console.h>
>  #include <linux/cpu.h>
>  #include <linux/freezer.h>
> +#include <linux/hibernate.h>
>  
>  #include "power.h"
>  
>  
> -static int noresume = 0;
> +static int noresume;
>  char resume_file[256] = CONFIG_PM_STD_PARTITION;
>  dev_t swsusp_resume_device;
>  sector_t swsusp_resume_block;
>  
> +static struct hibernate_ops *hibernate_ops;
> +static int pm_disk_mode;
> +
> +enum {
> +	PM_DISK_INVALID,
> +	PM_DISK_PLATFORM,
> +	PM_DISK_TEST,
> +	PM_DISK_TESTPROC,
> +	PM_DISK_SHUTDOWN,
> +	PM_DISK_REBOOT,
> +	/* keep last */
> +	__PM_DISK_AFTER_LAST
> +};
> +#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1)
> +#define PM_DISK_FIRST (PM_DISK_INVALID + 1)
> +
> +void hibernate_set_ops(struct hibernate_ops *ops)
> +{
> +	BUG_ON(!ops->prepare);
> +	BUG_ON(!ops->enter);
> +	BUG_ON(!ops->finish);
> +	mutex_lock(&pm_mutex);
> +	hibernate_ops = ops;
> +	mutex_unlock(&pm_mutex);
> +}
> +
> +
>  /**
> - *	platform_prepare - prepare the machine for hibernation using the
> - *	platform driver if so configured and return an error code if it fails
> + *	hibernate_platform_prepare - prepare the machine for hibernation using
> + *	the platform driver if so configured and return an error code if it
> + *	fails.
>   */
>  
> -static inline int platform_prepare(void)
> +int hibernate_platform_prepare(void)
>  {
> -	int error = 0;
> -
>  	switch (pm_disk_mode) {
>  	case PM_DISK_TEST:
>  	case PM_DISK_TESTPROC:
>  	case PM_DISK_SHUTDOWN:
>  	case PM_DISK_REBOOT:
>  		break;
> -	default:
> -		if (pm_ops && pm_ops->prepare)
> -			error = pm_ops->prepare(PM_SUSPEND_DISK);
> +	case PM_DISK_PLATFORM:
> +		if (hibernate_ops)
> +			return hibernate_ops->prepare();
>  	}
> -	return error;
> +	return 0;
>  }
>  
>  /**
> - *	power_down - Shut machine down for hibernate.
> + *	hibernate_power_down - Shut machine down for hibernate.
>   *
>   *	Use the platform driver, if configured so; otherwise try
>   *	to power off or reboot.
>   */
>  
> -static void power_down(void)
> +static void hibernate_power_down(void)
>  {
>  	switch (pm_disk_mode) {
>  	case PM_DISK_TEST:
> @@ -70,11 +97,10 @@ static void power_down(void)
>  	case PM_DISK_REBOOT:
>  		kernel_restart(NULL);
>  		break;
> -	default:
> -		if (pm_ops && pm_ops->enter) {
> +	case PM_DISK_PLATFORM:
> +		if (hibernate_ops) {
>  			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
> -			pm_ops->enter(PM_SUSPEND_DISK);
> -			break;
> +			hibernate_ops->enter();
>  		}
>  	}
>  
> @@ -85,7 +111,7 @@ static void power_down(void)
>  	while(1);
>  }
>  
> -static inline void platform_finish(void)
> +void hibernate_platform_finish(void)
>  {
>  	switch (pm_disk_mode) {
>  	case PM_DISK_TEST:
> @@ -93,9 +119,9 @@ static inline void platform_finish(void)
>  	case PM_DISK_SHUTDOWN:
>  	case PM_DISK_REBOOT:
>  		break;
> -	default:
> -		if (pm_ops && pm_ops->finish)
> -			pm_ops->finish(PM_SUSPEND_DISK);
> +	case PM_DISK_PLATFORM:
> +		if (hibernate_ops)
> +			hibernate_ops->finish();
>  	}
>  }
>  
> @@ -118,13 +144,13 @@ static int prepare_processes(void)
>  }
>  
>  /**
> - *	pm_suspend_disk - The granpappy of hibernation power management.
> + *	hibernate - The granpappy of hibernation power management.
>   *
>   *	If not, then call swsusp to do its thing, then figure out how
>   *	to power down the system.
>   */
>  
> -int pm_suspend_disk(void)
> +int hibernate(void)
>  {
>  	int error;
>  
> @@ -147,7 +173,7 @@ int pm_suspend_disk(void)
>  	if (error)
>  		goto Finish;
>  
> -	error = platform_prepare();
> +	error = hibernate_platform_prepare();
>  	if (error)
>  		goto Finish;
>  
> @@ -175,13 +201,13 @@ int pm_suspend_disk(void)
>  
>  	if (in_suspend) {
>  		enable_nonboot_cpus();
> -		platform_finish();
> +		hibernate_platform_finish();
>  		device_resume();
>  		resume_console();
>  		pr_debug("PM: writing image.\n");
>  		error = swsusp_write();
>  		if (!error)
> -			power_down();
> +			hibernate_power_down();
>  		else {
>  			swsusp_free();
>  			goto Finish;
> @@ -194,7 +220,7 @@ int pm_suspend_disk(void)
>   Enable_cpus:
>  	enable_nonboot_cpus();
>   Resume_devices:
> -	platform_finish();
> +	hibernate_platform_finish();
>  	device_resume();
>  	resume_console();
>   Finish:
> @@ -211,7 +237,7 @@ int pm_suspend_disk(void)
>   *	Called as a late_initcall (so all devices are discovered and
>   *	initialized), we call swsusp to see if we have a saved image or not.
>   *	If so, we quiesce devices, the restore the saved image. We will
> - *	return above (in pm_suspend_disk() ) if everything goes well.
> + *	return above (in hibernate() ) if everything goes well.
>   *	Otherwise, we fail gracefully and return to the normally
>   *	scheduled program.
>   *
> @@ -311,12 +337,13 @@ static const char * const pm_disk_modes[
>   *
>   *	Suspend-to-disk can be handled in several ways. We have a few options
>   *	for putting the system to sleep - using the platform driver (e.g. ACPI
> - *	or other pm_ops), powering off the system or rebooting the system
> - *	(for testing) as well as the two test modes.
> + *	or other hibernate_ops), powering off the system or rebooting the
> + *	system (for testing) as well as the two test modes.
>   *
>   *	The system can support 'platform', and that is known a priori (and
> - *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
> - *	as alternatives, as well as the test modes 'test' and 'testproc'.
> + *	encoded by the presence of hibernate_ops). However, the user may choose
> + *	'shutdown' or 'reboot' as alternatives, as well as the test modes 'test'
> + *	and 'testproc'.
>   *
>   *	show() will display what the mode is currently set to.
>   *	store() will accept one of
> @@ -328,7 +355,7 @@ static const char * const pm_disk_modes[
>   *	'testproc'
>   *
>   *	It will only change to 'platform' if the system
> - *	supports it (as determined from pm_ops->pm_disk_mode).
> + *	supports it (as determined by having hibernate_ops).
>   */
>  
>  static ssize_t disk_show(struct subsystem * subsys, char * buf)
> @@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste
>  	int i;
>  	char *start = buf;
>  
> -	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
> +	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
>  		if (!pm_disk_modes[i])
>  			continue;
>  		switch (i) {
> @@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste
>  		case PM_DISK_TEST:
>  		case PM_DISK_TESTPROC:
>  			break;
> -		default:
> -			if (pm_ops && pm_ops->enter &&
> -			    (i == pm_ops->pm_disk_mode))
> +		case PM_DISK_PLATFORM:
> +			if (hibernate_ops)
>  				break;
>  			/* not a valid mode, continue with loop */
>  			continue;
> @@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst
>  	int i;
>  	int len;
>  	char *p;
> -	suspend_disk_method_t mode = 0;
> +	int mode = PM_DISK_INVALID;
>  
>  	p = memchr(buf, '\n', n);
>  	len = p ? p - buf : n;
>  
>  	mutex_lock(&pm_mutex);
> -	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
> +	for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) {
>  		if (!strncmp(buf, pm_disk_modes[i], len)) {
>  			mode = i;
>  			break;
>  		}
>  	}
> -	if (mode) {
> +	if (mode != PM_DISK_INVALID) {
>  		switch (mode) {
>  		case PM_DISK_SHUTDOWN:
>  		case PM_DISK_REBOOT:
> @@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst
>  		case PM_DISK_TESTPROC:
>  			pm_disk_mode = mode;
>  			break;
> -		default:
> -			if (pm_ops && pm_ops->enter &&
> -			    (mode == pm_ops->pm_disk_mode))
> +		case PM_DISK_PLATFORM:
> +			if (hibernate_ops)
>  				pm_disk_mode = mode;
>  			else
>  				error = -EINVAL;
>  		}
> -	} else {
> +	} else
>  		error = -EINVAL;
> -	}
>  
> -	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
> -		 pm_disk_modes[mode]);
> +	if (!error)
> +		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
> +			 pm_disk_modes[mode]);
>  	mutex_unlock(&pm_mutex);
>  	return error ? error : n;
>  }
> --- wireless-dev.orig/kernel/power/user.c	2007-04-26 18:15:01.130691185 +0200
> +++ wireless-dev/kernel/power/user.c	2007-04-26 18:15:09.420691185 +0200
> @@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil
>  	return res;
>  }
>  
> -static inline int platform_prepare(void)
> -{
> -	int error = 0;
> -
> -	if (pm_ops && pm_ops->prepare)
> -		error = pm_ops->prepare(PM_SUSPEND_DISK);
> -
> -	return error;
> -}
> -
> -static inline void platform_finish(void)
> -{
> -	if (pm_ops && pm_ops->finish)
> -		pm_ops->finish(PM_SUSPEND_DISK);
> -}
> -
>  static inline int snapshot_suspend(int platform_suspend)
>  {
>  	int error;
> @@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p
>  		goto Finish;
>  
>  	if (platform_suspend) {
> -		error = platform_prepare();
> +		error = hibernate_platform_prepare();
>  		if (error)
>  			goto Finish;
>  	}
> @@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p
>  	enable_nonboot_cpus();
>   Resume_devices:
>  	if (platform_suspend)
> -		platform_finish();
> +		hibernate_platform_finish();
>  
>  	device_resume();
>  	resume_console();
> @@ -188,7 +172,7 @@ static inline int snapshot_restore(int p
>  	mutex_lock(&pm_mutex);
>  	pm_prepare_console();
>  	if (platform_suspend) {
> -		error = platform_prepare();
> +		error = hibernate_platform_prepare();
>  		if (error)
>  			goto Finish;
>  	}
> @@ -204,7 +188,7 @@ static inline int snapshot_restore(int p
>  	enable_nonboot_cpus();
>   Resume_devices:
>  	if (platform_suspend)
> -		platform_finish();
> +		hibernate_platform_finish();
>  
>  	device_resume();
>  	resume_console();
> @@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode *
>  		case PMOPS_ENTER:
>  			if (data->platform_suspend) {
>  				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
> -				error = pm_ops->enter(PM_SUSPEND_DISK);
> +				error = hibernate_ops->enter();
> +				/* how can this possibly do the right thing? */
>  				error = 0;
>  			}
>  			break;
>  
>  		case PMOPS_FINISH:
>  			if (data->platform_suspend)
> +				/* and why doesn't this invoke anything??? */
>  				error = 0;
>  
>  			break;
> --- wireless-dev.orig/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:02.120691185 +0200
> +++ wireless-dev/Documentation/power/userland-swsusp.txt	2007-04-26 18:15:09.440691185 +0200
> @@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
>  	to resume the system from RAM if there's enough battery power or restore
>  	its state on the basis of the saved suspend image otherwise)
>  
> -SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
> -	pmops->finish methods (the in-kernel swsusp knows these as the "platform
> -	method") which are needed on many machines to (among others) speed up
> -	the resume by letting the BIOS skip some steps or to let the system
> -	recognise the correct state of the hardware after the resume (in
> -	particular on many machines this ensures that unplugged AC
> -	adapters get correctly detected and that kacpid does not run wild after
> -	the resume).  The last ioctl() argument can take one of the three
> -	values, defined in kernel/power/power.h:
> +SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare,
> +	hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel
> +	swsusp knows these as the "platform method") which are needed on many
> +	machines to (among others) speed up the resume by letting the BIOS skip
> +	some steps or to let the system recognise the correct state of the
> +	hardware after the resume (in particular on many machines this ensures
> +	that unplugged AC adapters get correctly detected and that kacpid does
> +	not run wild after the resume).  The last ioctl() argument can take one
> +	of the three values, defined in kernel/power/power.h:
>  	PMOPS_PREPARE - make the kernel carry out the
> -		pm_ops->prepare(PM_SUSPEND_DISK) operation
> +		hibernate_ops->prepare() operation
>  	PMOPS_ENTER - make the kernel power off the system by calling
> -		pm_ops->enter(PM_SUSPEND_DISK)
> +		hibernate_ops->enter()
>  	PMOPS_FINISH - make the kernel carry out the
> -		pm_ops->finish(PM_SUSPEND_DISK) operation
> +		hibernate_ops->finish() operation
> +	Note that the actual constants are misnamed because they surface
> +	internal kernel implementation details that have changed.
>  
>  The device's read() operation can be used to transfer the snapshot image from
>  the kernel.  It has the following limitations:
> --- wireless-dev.orig/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:02.150691185 +0200
> +++ wireless-dev/drivers/i2c/chips/tps65010.c	2007-04-26 18:15:09.440691185 +0200
> @@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
>  			 * also needs to get error handling and probably
>  			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
>  			 */
> -			pm_suspend(PM_SUSPEND_DISK);
> +			hibernate();
>  #endif
>  			poll = 1;
>  		}
> --- wireless-dev.orig/kernel/sys.c	2007-04-26 18:15:01.310691185 +0200
> +++ wireless-dev/kernel/sys.c	2007-04-26 18:15:09.450691185 +0200
> @@ -25,6 +25,7 @@
>  #include <linux/security.h>
>  #include <linux/dcookies.h>
>  #include <linux/suspend.h>
> +#include <linux/hibernate.h>
>  #include <linux/tty.h>
>  #include <linux/signal.h>
>  #include <linux/cn_proc.h>
> @@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i
>  #ifdef CONFIG_SOFTWARE_SUSPEND
>  	case LINUX_REBOOT_CMD_SW_SUSPEND:
>  		{
> -			int ret = pm_suspend(PM_SUSPEND_DISK);
> +			int ret = hibernate();
>  			unlock_kernel();
>  			return ret;
>  		}
> --- wireless-dev.orig/drivers/acpi/sleep/main.c	2007-04-26 18:15:02.290691185 +0200
> +++ wireless-dev/drivers/acpi/sleep/main.c	2007-04-26 18:15:09.630691185 +0200
> @@ -15,6 +15,7 @@
>  #include <linux/dmi.h>
>  #include <linux/device.h>
>  #include <linux/suspend.h>
> +#include <linux/hibernate.h>
>  #include <acpi/acpi_bus.h>
>  #include <acpi/acpi_drivers.h>
>  #include "sleep.h"
> @@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = {
>  	[PM_SUSPEND_ON] = ACPI_STATE_S0,
>  	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
>  	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
> -	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
>  	[PM_SUSPEND_MAX] = ACPI_STATE_S5
>  };
>  
> @@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t
>  		do_suspend_lowlevel();
>  		break;
>  
> -	case PM_SUSPEND_DISK:
> -		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
> -			status = acpi_enter_sleep_state(acpi_state);
> -		break;
> -	case PM_SUSPEND_MAX:
> -		acpi_power_off();
> -		break;
> -
>  	default:
>  		return -EINVAL;
>  	}
> @@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state)
>  	suspend_state_t states[] = {
>  		[1] = PM_SUSPEND_STANDBY,
>  		[3] = PM_SUSPEND_MEM,
> -		[4] = PM_SUSPEND_DISK,
>  		[5] = PM_SUSPEND_MAX
>  	};
>  
>  	if (acpi_state < 6 && states[acpi_state])
>  		return pm_suspend(states[acpi_state]);
> +	if (acpi_state == 4)
> +		return hibernate();
>  	return -EINVAL;
>  }
>  
> @@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = {
>  	.finish = acpi_pm_finish,
>  };
>  
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +static int acpi_hib_prepare(void)
> +{
> +	return acpi_sleep_prepare(ACPI_STATE_S4);
> +}
> +
> +static int acpi_hib_enter(void)
> +{
> +	acpi_status status = AE_OK;
> +	unsigned long flags = 0;
> +	u32 acpi_state = ACPI_STATE_S4;
> +	int error;
> +
> +	ACPI_FLUSH_CPU_CACHE();
> +
> +	/* Do arch specific saving of state. */
> +	error = acpi_save_state_mem();
> +	if (error)
> +		return error;
> +
> +	local_irq_save(flags);
> +	acpi_enable_wakeup_device(acpi_state);
> +	status = acpi_enter_sleep_state(acpi_state);
> +
> +	/* ACPI 3.0 specs (P62) says that it's the responsibility
> +	 * of the OSPM to clear the status bit [ implying that the
> +	 * POWER_BUTTON event should not reach userspace ]
> +	 */
> +	if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3))
> +		acpi_clear_event(ACPI_EVENT_POWER_BUTTON);
> +
> +	local_irq_restore(flags);
> +	printk(KERN_DEBUG "Back to C!\n");
> +
> +	/* restore processor state
> +	 * We should only be here if we're coming back from STR or STD.
> +	 * And, in the case of the latter, the memory image should have already
> +	 * been loaded from disk.
> +	 */
> +	acpi_restore_state_mem();
> +
> +	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
> +}
> +
> +static void acpi_hib_finish(void)
> +{
> +	acpi_leave_sleep_state(ACPI_STATE_S4);
> +	acpi_disable_wakeup_device(ACPI_STATE_S4);
> +
> +	/* reset firmware waking vector */
> +	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
> +
> +	if (init_8259A_after_S1) {
> +		printk("Broken toshiba laptop -> kicking interrupts\n");
> +		init_8259A(0);
> +	}
> +}
> +
> +static struct hibernate_ops acpi_hib_ops = {
> +	.prepare = acpi_hib_prepare,
> +	.enter = acpi_hib_enter,
> +	.finish = acpi_hib_finish,
> +};
> +#endif /* CONFIG_SOFTWARE_SUSPEND */
> +
>  /*
>   * Toshiba fails to preserve interrupts over S1, reinitialization
>   * of 8259 is needed after S1 resume.
> @@ -227,13 +285,16 @@ int __init acpi_sleep_init(void)
>  			sleep_states[i] = 1;
>  			printk(" S%d", i);
>  		}
> -		if (i == ACPI_STATE_S4) {
> -			if (sleep_states[i])
> -				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
> -		}
>  	}
>  	printk(")\n");
>  
> +#ifdef CONFIG_SOFTWARE_SUSPEND
> +	if (sleep_states[ACPI_STATE_S4])
> +		hibernate_set_ops(&acpi_hib_ops);
> +#else
> +	sleep_states[ACPI_STATE_S4] = 0;
> +#endif
> +
>  	pm_set_ops(&acpi_pm_ops);
>  	return 0;
>  }
> --- wireless-dev.orig/kernel/power/power.h	2007-04-26 18:15:01.240691185 +0200
> +++ wireless-dev/kernel/power/power.h	2007-04-26 18:15:09.630691185 +0200
> @@ -13,16 +13,6 @@ struct swsusp_info {
>  
>  
>  
> -#ifdef CONFIG_SOFTWARE_SUSPEND
> -extern int pm_suspend_disk(void);
> -
> -#else
> -static inline int pm_suspend_disk(void)
> -{
> -	return -EPERM;
> -}
> -#endif
> -
>  extern struct mutex pm_mutex;
>  
>  #define power_attr(_name) \
> @@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t
>  struct timeval;
>  extern void swsusp_show_speed(struct timeval *, struct timeval *,
>  				unsigned int, char *);
> +
> +extern int hibernate_platform_prepare(void);
> +extern void hibernate_platform_finish(void);
> --- wireless-dev.orig/drivers/acpi/sleep/proc.c	2007-04-26 18:15:02.720691185 +0200
> +++ wireless-dev/drivers/acpi/sleep/proc.c	2007-04-26 18:15:09.630691185 +0200
> @@ -1,6 +1,7 @@
>  #include <linux/proc_fs.h>
>  #include <linux/seq_file.h>
>  #include <linux/suspend.h>
> +#include <linux/hibernate.h>
>  #include <linux/bcd.h>
>  #include <asm/uaccess.h>
>  
> @@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil
>  	state = simple_strtoul(str, NULL, 0);
>  #ifdef CONFIG_SOFTWARE_SUSPEND
>  	if (state == 4) {
> -		error = pm_suspend(PM_SUSPEND_DISK);
> +		error = hibernate();
>  		goto Done;
>  	}
>  #endif
> 
> 
> 
> 
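
A side note on the "first check hibernate" step in state_store() above: it
depends on strncmp() returning 0 on a match, and on the token length being
taken up to the newline. Below is a minimal userspace sketch of that parsing,
not the kernel code. The names demo_target and demo_parse_state() are made up
for illustration, and unlike the kernel loop this sketch assumes that only
exact-length tokens should match (so "d" is rejected).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes; the kernel uses suspend_state_t instead. */
enum demo_target { DEMO_NONE, DEMO_STANDBY, DEMO_MEM, DEMO_DISK };

/*
 * Parse a /sys/power/state-style write of n bytes.
 * The token ends at the first '\n', or after n bytes if there is none.
 */
static enum demo_target demo_parse_state(const char *buf, size_t n)
{
	const char *p = memchr(buf, '\n', n);
	size_t len = p ? (size_t)(p - buf) : n;

	/* strncmp() returns 0 when the strings match, hence the '!'. */
	if (len == 4 && !strncmp(buf, "disk", len))
		return DEMO_DISK;
	if (len == 3 && !strncmp(buf, "mem", len))
		return DEMO_MEM;
	if (len == 7 && !strncmp(buf, "standby", len))
		return DEMO_STANDBY;
	return DEMO_NONE;
}
```

With this shape, "disk" would be routed to hibernate() and every other token
would fall through to the pm_states[] lookup.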

-- 
Rafael J. Wysocki, Ph.D.
Institute of Theoretical Physics
Faculty of Physics of Warsaw University
ul. Hoza 69, 00-681 Warsaw
[tel: +48 22 55 32 263]
[mob: +48 60 50 53 693]
----------------------------
One should not increase, beyond what is necessary,
the number of entities required to explain anything.
			-- William of Ockham

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-29 12:48                                                 ` [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) R. J. Wysocki
@ 2007-04-29 12:53                                                   ` Rafael J. Wysocki
  2007-04-30  8:29                                                   ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-29 12:53 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Sunday, 29 April 2007 14:48, R. J. Wysocki wrote:
> [Trimmed the CC list to a reasonable minimum]
> 
> On Thursday, 26 April 2007 18:31, Johannes Berg wrote:
> > On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:
> > 
> > > > From looking at pm_ops which I was recently working with a lot, it seems
> > > > that it was designed by somebody who was reading the ACPI documentation
> > > > and was otherwise pretty clueless, even at that level std tries to look
> > > > like suspend. IMHO that is one of the first things that should be ripped
> > > > out, no pm_ops for STD, it's a pain to work with.
> > > 
> > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> > > low-level enter is pretty similar).
> > > 
> > > Patches would be welcome
> > 
> > That was easier than I thought. This applies on top of a patch that
> > makes kernel/power/user.c optional since I had no idea how to fix it,
> > problems I see:
> >  * it surfaces kernel implementation details about pm_ops and thus makes
> >    the whole thing very fragile
> >  * it has yet another interface (yuck) to determine whether to reboot,
> >    shut down etc, doesn't use /sys/power/disk
> >  * I generally had no idea wtf it is doing in some places
> > 
> > Anyway, this patch is only compile tested, it
> >  * introduces include/linux/hibernate.h with hibernate_ops and
> >    a new hibernate() function to hibernate the system
> >  * rips apart a lot of the suspend code and puts it back together using
> >    the hibernate_ops
> >  * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
> >  * might apply/compile against -mm, I have all my and some of Rafael's
> >    suspend/hibernate work in my tree.
> >  * breaks user suspend as I noted above
> >  * is incomplete, somewhere pm_suspend_disk() is still defined iirc
> 
> OK, I reworked it a bit.
> 
> Main changes:
> 
> - IMHO 'hibernation_ops' sounds better than 'hibernate_ops', for example, so
> now the new names start with 'hibernation_' (or 'HIBERNATION_')
> 
> - Moved the hibernation-related definitions to include/linux/suspend.h, since
> some hibernation-specific definitions are already there.  We can introduce
> hibernation.h in a separate patch (it'll have to #include suspend.h IMO).
> 
> - Changed the names starting from 'pm_disk_' (or 'PM_DISK_').
> 
> - Cleaned up the new ACPI code (it didn't compile and included some things
> unrelated to hibernation).  I'm still not sure about acpi_hibernation_finish()
> (is the code after acpi_disable_wakeup_device() really needed?)
> 
> - Made kernel/power/user.c compile (and hopefully work too)

Forgot to say that hibernation_ops is needed, IMO, because ACPI can be modular.
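
To illustrate the point about modular ACPI: generic code cannot link against
callbacks that may live in a module, so it keeps a pointer to a table that the
platform driver registers at run time.  Below is a minimal user-space sketch
of that pattern, not the kernel code.  The demo_* callbacks and broken_ops are
hypothetical, the int return value of hibernation_set_ops() is an assumption
made for the sketch, and the pm_mutex locking is left out.

```c
#include <stddef.h>

/* User-space stand-in for the callback table discussed in this thread. */
struct hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* NULL until a platform driver (e.g. a loaded module) registers a table. */
static struct hibernation_ops *hibernation_ops;

/*
 * Accept NULL (unregister) or a fully populated table.
 * Returning an error code is an assumption made for this sketch.
 */
static int hibernation_set_ops(struct hibernation_ops *ops)
{
	if (ops && !(ops->prepare && ops->enter && ops->finish))
		return -1;
	hibernation_ops = ops;
	return 0;
}

/* Generic code only calls into the platform when a table is present. */
static int platform_prepare(void)
{
	return hibernation_ops ? hibernation_ops->prepare() : 0;
}

/* Hypothetical platform callbacks standing in for the ACPI ones. */
static int demo_calls;
static int demo_prepare(void) { demo_calls++; return 0; }
static int demo_enter(void) { return 0; }
static void demo_finish(void) { }

static struct hibernation_ops demo_ops = {
	.prepare = demo_prepare,
	.enter = demo_enter,
	.finish = demo_finish,
};

/* A table with missing callbacks, which registration should reject. */
static struct hibernation_ops broken_ops = {
	.prepare = demo_prepare,
};
```

With nothing registered, platform_prepare() is a no-op; once demo_ops is
registered it calls through the table, and passing NULL unregisters it again,
which is roughly what a module would do on unload.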

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-29 12:48                                                 ` [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) R. J. Wysocki
  2007-04-29 12:53                                                   ` Rafael J. Wysocki
@ 2007-04-30  8:29                                                   ` Johannes Berg
  2007-04-30 14:51                                                     ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-30  8:29 UTC (permalink / raw)
  To: R. J. Wysocki; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek


[-- Attachment #1.1: Type: text/plain, Size: 787 bytes --]

On Sun, 2007-04-29 at 14:48 +0200, R. J. Wysocki wrote:

> +	status = acpi_enter_sleep_state(ACPI_STATE_S4);
> +	local_irq_restore(flags);
> +
> +	/*
> +	 * Restore processor state
> +	 * We should only be here if we're coming back from hibernation and
> +	 * the memory image should have already been loaded from disk.

That comment doesn't seem right. This is in ->enter so afaict the image
hasn't been loaded yet at this point. I don't know if you just moved
code but if you did then I don't think it was correct before.

> +	 */
> +	acpi_restore_state_mem();

Maybe that needs to be in ->finish then? Or somewhere in the deeper arch
code?

Other than that it looks good to me on a cursory look. I'll give it a
try on my G5 on Wednesday or Thursday.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-30  8:29                                                   ` Johannes Berg
@ 2007-04-30 14:51                                                     ` Rafael J. Wysocki
  2007-04-30 14:59                                                       ` Johannes Berg
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-04-30 14:51 UTC (permalink / raw)
  To: Johannes Berg, Pavel Machek; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham

On Monday, 30 April 2007 10:29, Johannes Berg wrote:
> On Sun, 2007-04-29 at 14:48 +0200, R. J. Wysocki wrote:
> 
> > +	status = acpi_enter_sleep_state(ACPI_STATE_S4);
> > +	local_irq_restore(flags);
> > +
> > +	/*
> > +	 * Restore processor state
> > +	 * We should only be here if we're coming back from hibernation and
> > +	 * the memory image should have already been loaded from disk.
> 
> That comment doesn't seem right. This is in ->enter so afaict the image
> hasn't been loaded yet at this point. I don't know if you just moved
> code but if you did then I don't think it was correct before.

It was in your patch, so I kept it, but I don't think it's correct either.

Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
only needed by s2ram, so we can safely remove them from the hibernation code
path.  Pavel, is that correct?

> > +	 */
> > +	acpi_restore_state_mem();
> 
> Maybe that needs to be in ->finish then? Or somewhere in the deeper arch
> code?
> 
> Other than that it looks good to me on a cursory look. I'll give it a
> try on my G5 on Wednesday or Thursday.

I think I'll have an improved version by then. :-)

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-30 14:51                                                     ` Rafael J. Wysocki
@ 2007-04-30 14:59                                                       ` Johannes Berg
  2007-05-01 14:05                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-04-30 14:59 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek


[-- Attachment #1.1: Type: text/plain, Size: 736 bytes --]

On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:

> > That comment doesn't seem right. This is in ->enter so afaict the image
> > hasn't been loaded yet at this point. I don't know if you just moved
> > code but if you did then I don't think it was correct before.
> 
> It was in your patch, so I kept it, but I don't think it's correct either.

If it was in my patch then it must be there in the original code, iirc I
just shuffled it a bit :)

> Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> only needed by s2ram, so we can safely remove them from the hibernation code
> path.  Pavel, is that correct?

This I don't know. They seemed to be done on hibernate too.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-30 14:59                                                       ` Johannes Berg
@ 2007-05-01 14:05                                                         ` Rafael J. Wysocki
  2007-05-01 22:02                                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-01 14:05 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek

On Monday, 30 April 2007 16:59, Johannes Berg wrote:
> On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:
> 
> > > That comment doesn't seem right. This is in ->enter so afaict the image
> > > hasn't been loaded yet at this point. I don't know if you just moved
> > > code but if you did then I don't think it was correct before.
> > 
> > It was in your patch, so I kept it, but I don't think it's correct either.
> 
> If it was in my patch then it must be there in the original code, iirc I
> just shuffled it a bit :)
> 
> > Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> > only needed by s2ram, so we can safely remove them from the hibernation code
> > path.  Pavel, is that correct?
> 
> This I don't know. They seemed to be done on hibernate too.

The previous version of the patch was missing the changes in suspend.h.

Apart from this I've cleaned up some changes in disk.c and main.c to make
the sysfs interface work again and dropped some ACPI code that I think was
not necessary.
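
For reference, the mode listing that the sysfs 'disk' attribute is meant to
produce can be sketched in user space as follows.  This is only an
illustration of the formatting logic, assuming the enum order used in disk.c;
show_modes(), its parameters and the MODE_* names are hypothetical and do not
appear in the patch.

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Same ordering as the enum in disk.c; the names are hypothetical. */
enum {
	MODE_INVALID,
	MODE_PLATFORM,
	MODE_TEST,
	MODE_TESTPROC,
	MODE_SHUTDOWN,
	MODE_REBOOT,
	MODE_AFTER_LAST
};

static const char *const mode_names[] = {
	[MODE_PLATFORM]	= "platform",
	[MODE_TEST]	= "test",
	[MODE_TESTPROC]	= "testproc",
	[MODE_SHUTDOWN]	= "shutdown",
	[MODE_REBOOT]	= "reboot",
};

/*
 * List the selectable modes, bracketing the current one.  'platform' is
 * listed only if a platform driver is present.  Returns the number of
 * characters written, or -1 if the buffer is too small.
 */
static int show_modes(char *buf, size_t size, int current, int have_platform)
{
	size_t len = 0;
	int i, n;

	for (i = MODE_INVALID + 1; i < MODE_AFTER_LAST; i++) {
		if (i == MODE_PLATFORM && !have_platform)
			continue;
		if (i == current)
			n = snprintf(buf + len, size - len, "[%s] ", mode_names[i]);
		else
			n = snprintf(buf + len, size - len, "%s ", mode_names[i]);
		if (n < 0 || (size_t)n >= size - len)
			return -1;
		len += (size_t)n;
	}
	/* Like disk_show(), end the list with a newline after the last space. */
	n = snprintf(buf + len, size - len, "\n");
	if (n < 0 || (size_t)n >= size - len)
		return -1;
	return (int)(len + (size_t)n);
}
```

With 'shutdown' selected and a platform driver present, the buffer holds all
five names with [shutdown] in brackets; without a driver, 'platform' is
simply left out of the list.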

Patch appended (tested on x86_64, but not extensively), comments welcome. :-)

Greetings,
Rafael

---
This patch:
 * removes the definitions related to hibernation (aka suspend to disk) from
   include/linux/pm.h
 * introduces struct hibernation_ops and a new function to hibernate the system
   called hibernate(), defined in include/linux/suspend.h
 * separates suspend code in kernel/power/main.c from hibernation-related code
   in kernel/power/disk.c and kernel/power/user.c (with the help of
   hibernation_ops)
 * switches ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops
---

 Documentation/power/userland-swsusp.txt |   26 ++--
 drivers/acpi/sleep/main.c               |   67 +++++++++--
 drivers/acpi/sleep/proc.c               |    2 
 drivers/i2c/chips/tps65010.c            |    2 
 include/linux/pm.h                      |   31 -----
 include/linux/suspend.h                 |   32 +++++
 kernel/power/disk.c                     |  186 +++++++++++++++++---------------
 kernel/power/main.c                     |   42 ++-----
 kernel/power/power.h                    |    7 -
 kernel/power/user.c                     |   13 +-
 kernel/sys.c                            |    2 
 11 files changed, 225 insertions(+), 185 deletions(-)

Index: linux-2.6.21/include/linux/pm.h
===================================================================
--- linux-2.6.21.orig/include/linux/pm.h	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/include/linux/pm.h	2007-05-01 13:35:33.000000000 +0200
@@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_ON		((__force suspend_state_t) 0)
 #define PM_SUSPEND_STANDBY	((__force suspend_state_t) 1)
 #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
-#define PM_SUSPEND_DISK		((__force suspend_state_t) 4)
-#define PM_SUSPEND_MAX		((__force suspend_state_t) 5)
-
-typedef int __bitwise suspend_disk_method_t;
-
-/* invalid must be 0 so struct pm_ops initialisers can leave it out */
-#define PM_DISK_INVALID		((__force suspend_disk_method_t) 0)
-#define	PM_DISK_PLATFORM	((__force suspend_disk_method_t) 1)
-#define	PM_DISK_SHUTDOWN	((__force suspend_disk_method_t) 2)
-#define	PM_DISK_REBOOT		((__force suspend_disk_method_t) 3)
-#define	PM_DISK_TEST		((__force suspend_disk_method_t) 4)
-#define	PM_DISK_TESTPROC	((__force suspend_disk_method_t) 5)
-#define	PM_DISK_MAX		((__force suspend_disk_method_t) 6)
+#define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
 
 /**
  * struct pm_ops - Callbacks for managing platform dependent suspend states.
  * @valid: Callback to determine whether the given state can be entered.
- * 	If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is
- *	always valid and never passed to this call. If not assigned,
- *	no suspend states are valid.
  *	Valid states are advertised in /sys/power/state but can still
  *	be rejected by prepare or enter if the conditions aren't right.
  *	There is a %pm_valid_only_mem function available that can be assigned
@@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho
  *
  * @finish: Called when the system has left the given state and all devices
  *	are resumed. The return value is ignored.
- *
- * @pm_disk_mode: The generic code always allows one of the shutdown methods
- *	%PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and
- *	%PM_DISK_TESTPROC. If this variable is set, the mode it is set
- *	to is allowed in addition to those modes and is also made default.
- *	When this mode is sent selected, the @prepare call will be called
- *	before suspending to disk (if present), the @enter call should be
- *	present and will be called after all state has been saved and the
- *	machine is ready to be powered off; the @finish callback is called
- *	after state has been restored. All these calls are called with
- *	%PM_SUSPEND_DISK as the state.
  */
 struct pm_ops {
 	int (*valid)(suspend_state_t state);
 	int (*prepare)(suspend_state_t state);
 	int (*enter)(suspend_state_t state);
 	int (*finish)(suspend_state_t state);
-	suspend_disk_method_t pm_disk_mode;
 };
 
 /**
@@ -276,8 +249,6 @@ extern void device_power_up(void);
 extern void device_resume(void);
 
 #ifdef CONFIG_PM
-extern suspend_disk_method_t pm_disk_mode;
-
 extern int device_suspend(pm_message_t state);
 extern int device_prepare_suspend(pm_message_t state);
 
Index: linux-2.6.21/kernel/power/main.c
===================================================================
--- linux-2.6.21.orig/kernel/power/main.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/main.c	2007-05-01 15:14:00.000000000 +0200
@@ -30,7 +30,6 @@
 DEFINE_MUTEX(pm_mutex);
 
 struct pm_ops *pm_ops;
-suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN;
 
 /**
  *	pm_set_ops - Set the global power method table. 
@@ -41,10 +40,6 @@ void pm_set_ops(struct pm_ops * ops)
 {
 	mutex_lock(&pm_mutex);
 	pm_ops = ops;
-	if (ops && ops->pm_disk_mode != PM_DISK_INVALID) {
-		pm_disk_mode = ops->pm_disk_mode;
-	} else
-		pm_disk_mode = PM_DISK_SHUTDOWN;
 	mutex_unlock(&pm_mutex);
 }
 
@@ -184,24 +179,12 @@ static void suspend_finish(suspend_state
 static const char * const pm_states[PM_SUSPEND_MAX] = {
 	[PM_SUSPEND_STANDBY]	= "standby",
 	[PM_SUSPEND_MEM]	= "mem",
-	[PM_SUSPEND_DISK]	= "disk",
 };
 
 static inline int valid_state(suspend_state_t state)
 {
-	/* Suspend-to-disk does not really need low-level support.
-	 * It can work with shutdown/reboot if needed. If it isn't
-	 * configured, then it cannot be supported.
-	 */
-	if (state == PM_SUSPEND_DISK)
-#ifdef CONFIG_SOFTWARE_SUSPEND
-		return 1;
-#else
-		return 0;
-#endif
-
-	/* all other states need lowlevel support and need to be
-	 * valid to the lowlevel implementation, no valid callback
+	/* All states need lowlevel support and need to be valid
+	 * to the lowlevel implementation, no valid callback
 	 * implies that none are valid. */
 	if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state))
 		return 0;
@@ -229,11 +212,6 @@ static int enter_state(suspend_state_t s
 	if (!mutex_trylock(&pm_mutex))
 		return -EBUSY;
 
-	if (state == PM_SUSPEND_DISK) {
-		error = pm_suspend_disk();
-		goto Unlock;
-	}
-
 	pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]);
 	if ((error = suspend_prepare(state)))
 		goto Unlock;
@@ -251,7 +229,7 @@ static int enter_state(suspend_state_t s
 
 /**
  *	pm_suspend - Externally visible function for suspending system.
- *	@state:		Enumarted value of state to enter.
+ *	@state:		Enumerated value of state to enter.
  *
  *	Determine whether or not value is within range, get state 
  *	structure, and enter (above).
@@ -289,7 +267,13 @@ static ssize_t state_show(struct subsyst
 		if (pm_states[i] && valid_state(i))
 			s += sprintf(s,"%s ", pm_states[i]);
 	}
-	s += sprintf(s,"\n");
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	s += sprintf(s, "%s\n", "disk");
+#else
+	if (s != buf)
+		/* convert the last space to a newline */
+		*(s-1) = '\n';
+#endif
 	return (s - buf);
 }
 
@@ -304,6 +288,12 @@ static ssize_t state_store(struct subsys
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
+	/* First, check if we are requested to hibernate */
+	if (!strncmp(buf, "disk", len)) {
+		error = hibernate();
+		return error ? error : n;
+	}
+
 	for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) {
 		if (*s && !strncmp(buf, *s, len))
 			break;
Index: linux-2.6.21/kernel/power/disk.c
===================================================================
--- linux-2.6.21.orig/kernel/power/disk.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/disk.c	2007-05-01 15:14:13.000000000 +0200
@@ -30,30 +30,60 @@ char resume_file[256] = CONFIG_PM_STD_PA
 dev_t swsusp_resume_device;
 sector_t swsusp_resume_block;
 
+enum {
+	HIBERNATION_INVALID,
+	HIBERNATION_PLATFORM,
+	HIBERNATION_TEST,
+	HIBERNATION_TESTPROC,
+	HIBERNATION_SHUTDOWN,
+	HIBERNATION_REBOOT,
+	/* keep last */
+	__HIBERNATION_AFTER_LAST
+};
+#define HIBERNATION_MAX (__HIBERNATION_AFTER_LAST-1)
+#define HIBERNATION_FIRST (HIBERNATION_INVALID + 1)
+
+static int hibernation_mode = HIBERNATION_SHUTDOWN;
+
+struct hibernation_ops *hibernation_ops;
+
+void hibernation_set_ops(struct hibernation_ops *ops)
+{
+	if (ops && !(ops->prepare && ops->enter && ops->finish)) {
+		printk(KERN_ERR "Wrong definition of hibernation operations! "
+			"Using defaults\n");
+		return;
+	}
+	mutex_lock(&pm_mutex);
+	hibernation_ops = ops;
+	mutex_unlock(&pm_mutex);
+}
+
+
 /**
  *	platform_prepare - prepare the machine for hibernation using the
  *	platform driver if so configured and return an error code if it fails
  */
 
-static inline int platform_prepare(void)
+static int platform_prepare(void)
 {
-	int error = 0;
+	return (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops) ?
+		hibernation_ops->prepare() : 0;
+}
 
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->prepare)
-			error = pm_ops->prepare(PM_SUSPEND_DISK);
-	}
-	return error;
+/**
+ *	platform_finish - switch the machine to the normal mode of operation
+ *	using the platform driver (must be called after platform_prepare())
+ */
+
+static void platform_finish(void)
+{
+	if (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops)
+		hibernation_ops->finish();
 }
 
 /**
- *	power_down - Shut machine down for hibernate.
+ *	power_down - Shut the machine down for hibernation.
  *
  *	Use the platform driver, if configured so; otherwise try
  *	to power off or reboot.
@@ -61,20 +91,20 @@ static inline int platform_prepare(void)
 
 static void power_down(void)
 {
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
+	switch (hibernation_mode) {
+	case HIBERNATION_TEST:
+	case HIBERNATION_TESTPROC:
 		break;
-	case PM_DISK_SHUTDOWN:
+	case HIBERNATION_SHUTDOWN:
 		kernel_power_off();
 		break;
-	case PM_DISK_REBOOT:
+	case HIBERNATION_REBOOT:
 		kernel_restart(NULL);
 		break;
-	default:
-		if (pm_ops && pm_ops->enter) {
+	case HIBERNATION_PLATFORM:
+		if (hibernation_ops) {
 			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-			pm_ops->enter(PM_SUSPEND_DISK);
+			hibernation_ops->enter();
 			break;
 		}
 	}
@@ -87,20 +117,6 @@ static void power_down(void)
 	while(1);
 }
 
-static inline void platform_finish(void)
-{
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->finish)
-			pm_ops->finish(PM_SUSPEND_DISK);
-	}
-}
-
 static void unprepare_processes(void)
 {
 	thaw_processes();
@@ -120,13 +136,10 @@ static int prepare_processes(void)
 }
 
 /**
- *	pm_suspend_disk - The granpappy of hibernation power management.
- *
- *	If not, then call swsusp to do its thing, then figure out how
- *	to power down the system.
+ *	hibernate - The granpappy of the built-in hibernation management
  */
 
-int pm_suspend_disk(void)
+int hibernate(void)
 {
 	int error;
 
@@ -143,7 +156,8 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Finish;
 
-	if (pm_disk_mode == PM_DISK_TESTPROC) {
+	mutex_lock(&pm_mutex);
+	if (hibernation_mode == HIBERNATION_TESTPROC) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Thaw;
@@ -168,7 +182,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Enable_cpus;
 
-	if (pm_disk_mode == PM_DISK_TEST) {
+	if (hibernation_mode == HIBERNATION_TEST) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Enable_cpus;
@@ -205,6 +219,7 @@ int pm_suspend_disk(void)
 	device_resume();
 	resume_console();
  Thaw:
+	mutex_unlock(&pm_mutex);
 	unprepare_processes();
  Finish:
 	free_basic_memory_bitmaps();
@@ -220,7 +235,7 @@ int pm_suspend_disk(void)
  *	Called as a late_initcall (so all devices are discovered and
  *	initialized), we call swsusp to see if we have a saved image or not.
  *	If so, we quiesce devices, the restore the saved image. We will
- *	return above (in pm_suspend_disk() ) if everything goes well.
+ *	return above (in hibernate() ) if everything goes well.
  *	Otherwise, we fail gracefully and return to the normally
  *	scheduled program.
  *
@@ -315,25 +330,26 @@ static int software_resume(void)
 late_initcall(software_resume);
 
 
-static const char * const pm_disk_modes[] = {
-	[PM_DISK_PLATFORM]	= "platform",
-	[PM_DISK_SHUTDOWN]	= "shutdown",
-	[PM_DISK_REBOOT]	= "reboot",
-	[PM_DISK_TEST]		= "test",
-	[PM_DISK_TESTPROC]	= "testproc",
+static const char * const hibernation_modes[] = {
+	[HIBERNATION_PLATFORM]	= "platform",
+	[HIBERNATION_SHUTDOWN]	= "shutdown",
+	[HIBERNATION_REBOOT]	= "reboot",
+	[HIBERNATION_TEST]	= "test",
+	[HIBERNATION_TESTPROC]	= "testproc",
 };
 
 /**
- *	disk - Control suspend-to-disk mode
+ *	disk - Control hibernation mode
  *
  *	Suspend-to-disk can be handled in several ways. We have a few options
  *	for putting the system to sleep - using the platform driver (e.g. ACPI
- *	or other pm_ops), powering off the system or rebooting the system
- *	(for testing) as well as the two test modes.
+ *	or other hibernation_ops), powering off the system or rebooting the
+ *	system (for testing) as well as the two test modes.
  *
  *	The system can support 'platform', and that is known a priori (and
- *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
- *	as alternatives, as well as the test modes 'test' and 'testproc'.
+ *	encoded by the presence of hibernation_ops). However, the user may
+ *	choose 'shutdown' or 'reboot' as alternatives, as well as one of the
+ *	test modes, 'test' or 'testproc'.
  *
  *	show() will display what the mode is currently set to.
  *	store() will accept one of
@@ -345,7 +361,7 @@ static const char * const pm_disk_modes[
  *	'testproc'
  *
  *	It will only change to 'platform' if the system
- *	supports it (as determined from pm_ops->pm_disk_mode).
+ *	supports it (as determined by having hibernation_ops).
  */
 
 static ssize_t disk_show(struct subsystem * subsys, char * buf)
@@ -353,28 +369,25 @@ static ssize_t disk_show(struct subsyste
 	int i;
 	char *start = buf;
 
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!pm_disk_modes[i])
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!hibernation_modes[i])
 			continue;
 		switch (i) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (i == pm_ops->pm_disk_mode))
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
 				break;
 			/* not a valid mode, continue with loop */
 			continue;
 		}
-		if (i == pm_disk_mode)
-			buf += sprintf(buf, "[%s]", pm_disk_modes[i]);
+		if (i == hibernation_mode)
+			buf += sprintf(buf, "[%s] ", hibernation_modes[i]);
 		else
-			buf += sprintf(buf, "%s", pm_disk_modes[i]);
-		if (i+1 != PM_DISK_MAX)
-			buf += sprintf(buf, " ");
+			buf += sprintf(buf, "%s ", hibernation_modes[i]);
 	}
 	buf += sprintf(buf, "\n");
 	return buf-start;
@@ -387,39 +400,38 @@ static ssize_t disk_store(struct subsyst
 	int i;
 	int len;
 	char *p;
-	suspend_disk_method_t mode = 0;
+	int mode = HIBERNATION_INVALID;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
 	mutex_lock(&pm_mutex);
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!strncmp(buf, pm_disk_modes[i], len)) {
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!strncmp(buf, hibernation_modes[i], len)) {
 			mode = i;
 			break;
 		}
 	}
-	if (mode) {
+	if (mode != HIBERNATION_INVALID) {
 		switch (mode) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
-			pm_disk_mode = mode;
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
+			hibernation_mode = mode;
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (mode == pm_ops->pm_disk_mode))
-				pm_disk_mode = mode;
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
+				hibernation_mode = mode;
 			else
 				error = -EINVAL;
 		}
-	} else {
+	} else
 		error = -EINVAL;
-	}
 
-	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
-		 pm_disk_modes[mode]);
+	if (!error)
+		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
+			 hibernation_modes[mode]);
 	mutex_unlock(&pm_mutex);
 	return error ? error : n;
 }
Index: linux-2.6.21/Documentation/power/userland-swsusp.txt
===================================================================
--- linux-2.6.21.orig/Documentation/power/userland-swsusp.txt	2007-05-01 13:35:25.000000000 +0200
+++ linux-2.6.21/Documentation/power/userland-swsusp.txt	2007-05-01 13:35:33.000000000 +0200
@@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
 	to resume the system from RAM if there's enough battery power or restore
 	its state on the basis of the saved suspend image otherwise)
 
-SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
-	pmops->finish methods (the in-kernel swsusp knows these as the "platform
-	method") which are needed on many machines to (among others) speed up
-	the resume by letting the BIOS skip some steps or to let the system
-	recognise the correct state of the hardware after the resume (in
-	particular on many machines this ensures that unplugged AC
-	adapters get correctly detected and that kacpid does not run wild after
-	the resume).  The last ioctl() argument can take one of the three
-	values, defined in kernel/power/power.h:
+SNAPSHOT_PMOPS - enable the usage of the hibernation_ops->prepare,
+	hibernation_ops->enter and hibernation_ops->finish methods (the in-kernel
+	swsusp knows these as the "platform method") which are needed on many
+	machines to (among others) speed up the resume by letting the BIOS skip
+	some steps or to let the system recognise the correct state of the
+	hardware after the resume (in particular on many machines this ensures
+	that unplugged AC adapters get correctly detected and that kacpid does
+	not run wild after the resume).  The last ioctl() argument can take one
+	of the three values, defined in kernel/power/power.h:
 	PMOPS_PREPARE - make the kernel carry out the
-		pm_ops->prepare(PM_SUSPEND_DISK) operation
+		hibernation_ops->prepare() operation
 	PMOPS_ENTER - make the kernel power off the system by calling
-		pm_ops->enter(PM_SUSPEND_DISK)
+		hibernation_ops->enter()
 	PMOPS_FINISH - make the kernel carry out the
-		pm_ops->finish(PM_SUSPEND_DISK) operation
+		hibernation_ops->finish() operation
+	Note that the actual constants are misnamed because they surface
+	internal kernel implementation details that have changed.
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
Index: linux-2.6.21/drivers/i2c/chips/tps65010.c
===================================================================
--- linux-2.6.21.orig/drivers/i2c/chips/tps65010.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/i2c/chips/tps65010.c	2007-05-01 13:35:33.000000000 +0200
@@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
 			 * also needs to get error handling and probably
 			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
 			 */
-			pm_suspend(PM_SUSPEND_DISK);
+			hibernate();
 #endif
 			poll = 1;
 		}
Index: linux-2.6.21/kernel/sys.c
===================================================================
--- linux-2.6.21.orig/kernel/sys.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/sys.c	2007-05-01 13:35:33.000000000 +0200
@@ -881,7 +881,7 @@ asmlinkage long sys_reboot(int magic1, i
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		{
-			int ret = pm_suspend(PM_SUSPEND_DISK);
+			int ret = hibernate();
 			unlock_kernel();
 			return ret;
 		}
Index: linux-2.6.21/drivers/acpi/sleep/main.c
===================================================================
--- linux-2.6.21.orig/drivers/acpi/sleep/main.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/acpi/sleep/main.c	2007-05-01 14:20:45.000000000 +0200
@@ -29,7 +29,6 @@ static u32 acpi_suspend_states[] = {
 	[PM_SUSPEND_ON] = ACPI_STATE_S0,
 	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
 	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
-	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
 	[PM_SUSPEND_MAX] = ACPI_STATE_S5
 };
 
@@ -94,14 +93,6 @@ static int acpi_pm_enter(suspend_state_t
 		do_suspend_lowlevel();
 		break;
 
-	case PM_SUSPEND_DISK:
-		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
-			status = acpi_enter_sleep_state(acpi_state);
-		break;
-	case PM_SUSPEND_MAX:
-		acpi_power_off();
-		break;
-
 	default:
 		return -EINVAL;
 	}
@@ -157,12 +148,13 @@ int acpi_suspend(u32 acpi_state)
 	suspend_state_t states[] = {
 		[1] = PM_SUSPEND_STANDBY,
 		[3] = PM_SUSPEND_MEM,
-		[4] = PM_SUSPEND_DISK,
 		[5] = PM_SUSPEND_MAX
 	};
 
 	if (acpi_state < 6 && states[acpi_state])
 		return pm_suspend(states[acpi_state]);
+	if (acpi_state == 4)
+		return hibernate();
 	return -EINVAL;
 }
 
@@ -189,6 +181,49 @@ static struct pm_ops acpi_pm_ops = {
 	.finish = acpi_pm_finish,
 };
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int acpi_hibernation_prepare(void)
+{
+	return acpi_sleep_prepare(ACPI_STATE_S4);
+}
+
+static int acpi_hibernation_enter(void)
+{
+	acpi_status status = AE_OK;
+	unsigned long flags = 0;
+
+	ACPI_FLUSH_CPU_CACHE();
+
+	local_irq_save(flags);
+	acpi_enable_wakeup_device(ACPI_STATE_S4);
+	/* This shouldn't return.  If it returns, we have a problem */
+	status = acpi_enter_sleep_state(ACPI_STATE_S4);
+	local_irq_restore(flags);
+
+	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
+}
+
+static void acpi_hibernation_finish(void)
+{
+	acpi_leave_sleep_state(ACPI_STATE_S4);
+	acpi_disable_wakeup_device(ACPI_STATE_S4);
+
+	/* reset firmware waking vector */
+	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
+
+	if (init_8259A_after_S1) {
+		printk("Broken toshiba laptop -> kicking interrupts\n");
+		init_8259A(0);
+	}
+}
+
+static struct hibernation_ops acpi_hibernation_ops = {
+	.prepare = acpi_hibernation_prepare,
+	.enter = acpi_hibernation_enter,
+	.finish = acpi_hibernation_finish,
+};
+#endif /* CONFIG_SOFTWARE_SUSPEND */
+
 /*
  * Toshiba fails to preserve interrupts over S1, reinitialization
  * of 8259 is needed after S1 resume.
@@ -227,14 +262,18 @@ int __init acpi_sleep_init(void)
 			sleep_states[i] = 1;
 			printk(" S%d", i);
 		}
-		if (i == ACPI_STATE_S4) {
-			if (sleep_states[i])
-				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
-		}
 	}
 	printk(")\n");
 
 	pm_set_ops(&acpi_pm_ops);
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	if (sleep_states[ACPI_STATE_S4])
+		hibernation_set_ops(&acpi_hibernation_ops);
+#else
+	sleep_states[ACPI_STATE_S4] = 0;
+#endif
+
 	return 0;
 }
 
Index: linux-2.6.21/kernel/power/power.h
===================================================================
--- linux-2.6.21.orig/kernel/power/power.h	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/power.h	2007-05-01 13:35:33.000000000 +0200
@@ -25,12 +25,7 @@ struct swsusp_info {
  */
 #define SPARE_PAGES	((1024 * 1024) >> PAGE_SHIFT)
 
-extern int pm_suspend_disk(void);
-#else
-static inline int pm_suspend_disk(void)
-{
-	return -EPERM;
-}
+extern struct hibernation_ops *hibernation_ops;
 #endif
 
 extern struct mutex pm_mutex;
Index: linux-2.6.21/drivers/acpi/sleep/proc.c
===================================================================
--- linux-2.6.21.orig/drivers/acpi/sleep/proc.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/acpi/sleep/proc.c	2007-05-01 13:35:33.000000000 +0200
@@ -60,7 +60,7 @@ acpi_system_write_sleep(struct file *fil
 	state = simple_strtoul(str, NULL, 0);
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	if (state == 4) {
-		error = pm_suspend(PM_SUSPEND_DISK);
+		error = hibernate();
 		goto Done;
 	}
 #endif
Index: linux-2.6.21/kernel/power/user.c
===================================================================
--- linux-2.6.21.orig/kernel/power/user.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/user.c	2007-05-01 13:35:33.000000000 +0200
@@ -130,16 +130,16 @@ static inline int platform_prepare(void)
 {
 	int error = 0;
 
-	if (pm_ops && pm_ops->prepare)
-		error = pm_ops->prepare(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		error = hibernation_ops->prepare();
 
 	return error;
 }
 
 static inline void platform_finish(void)
 {
-	if (pm_ops && pm_ops->finish)
-		pm_ops->finish(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		hibernation_ops->finish();
 }
 
 static inline int snapshot_suspend(int platform_suspend)
@@ -384,7 +384,7 @@ static int snapshot_ioctl(struct inode *
 		switch (arg) {
 
 		case PMOPS_PREPARE:
-			if (pm_ops && pm_ops->enter) {
+			if (hibernation_ops) {
 				data->platform_suspend = 1;
 				error = 0;
 			} else {
@@ -395,8 +395,7 @@ static int snapshot_ioctl(struct inode *
 		case PMOPS_ENTER:
 			if (data->platform_suspend) {
 				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-				error = pm_ops->enter(PM_SUSPEND_DISK);
-				error = 0;
+				error = hibernation_ops->enter();
 			}
 			break;
 
Index: linux-2.6.21/include/linux/suspend.h
===================================================================
--- linux-2.6.21.orig/include/linux/suspend.h	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/include/linux/suspend.h	2007-05-01 13:35:33.000000000 +0200
@@ -32,6 +32,24 @@ static inline int pm_prepare_console(voi
 static inline void pm_restore_console(void) {}
 #endif
 
+/**
+ * struct hibernation_ops - hibernation platform support
+ *
+ * The methods in this structure allow a platform to override the default
+ * mechanism of shutting down the machine during a hibernation transition.
+ *
+ * All three methods must be assigned.
+ *
+ * @prepare: prepare system for hibernation
+ * @enter: shut down system after state has been saved to disk
+ * @finish: finish/clean up after state has been reloaded
+ */
+struct hibernation_ops {
+	int (*prepare)(void);
+	int (*enter)(void);
+	void (*finish)(void);
+};
+
 #if defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND)
 /* kernel/power/snapshot.c */
 extern void __init register_nosave_region(unsigned long, unsigned long);
@@ -39,11 +57,25 @@ extern int swsusp_page_is_forbidden(stru
 extern void swsusp_set_page_free(struct page *);
 extern void swsusp_unset_page_free(struct page *);
 extern unsigned long get_safe_page(gfp_t gfp_mask);
+
+/**
+ * hibernation_set_ops - set the global hibernate operations
+ * @ops: the hibernation operations to use in subsequent hibernation transitions
+ */
+void hibernation_set_ops(struct hibernation_ops *ops);
+
+/**
+ * hibernate - hibernate the system
+ */
+extern int hibernate(void);
 #else
 static inline void register_nosave_region(unsigned long b, unsigned long e) {}
 static inline int swsusp_page_is_forbidden(struct page *p) { return 0; }
 static inline void swsusp_set_page_free(struct page *p) {}
 static inline void swsusp_unset_page_free(struct page *p) {}
+
+static inline void hibernation_set_ops(struct hibernation_ops *ops) {}
+static inline int hibernate(void) { return -ENOSYS; }
 #endif /* defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND) */
 
 void save_processor_state(void);

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-01 14:05                                                         ` Rafael J. Wysocki
@ 2007-05-01 22:02                                                           ` Rafael J. Wysocki
  2007-05-02  5:13                                                             ` Alexey Starikovskiy
  2007-05-02  8:21                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  0 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-01 22:02 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Tuesday, 1 May 2007 16:05, Rafael J. Wysocki wrote:
> On Monday, 30 April 2007 16:59, Johannes Berg wrote:
> > On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:
> > 
> > > > That comment doesn't seem right. This is in ->enter so afaict the image
> > > > hasn't been loaded yet at this point. I don't know if you just moved
> > > > code but if you did then I don't think it was correct before.
> > > 
> > > It was in your patch, so I kept it, but I don't think it's correct too.
> > 
> > If it was in my patch then it must be there in the original code, iirc I
> > just shuffled it a bit :)
> > 
> > > Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> > > only needed by s2ram, so we can safely remove them from the hibernation code
> > > path.  Pavel, is that correct?
> > 
> > This I don't know. They seemed to be done on hibernate too.
> 
> The previous version of the patch was missing the changes in suspend.h.
> 
> Apart from this I've cleaned up some changes in disk.c and main.c to make
> the sysfs interface work again and dropped some ACPI code that I think was
> not necessary.
> 
> Patch appended (tested on x86_64, but not extensively), comments welcome. :-)

Well, having looked at the ACPI spec, I think that what we're trying to do
with this patch is actually wrong.

Instead, we should rip out all of the invocations of pm_ops->whatever() from
the hibernation code paths (with the below exceptions) and *if* the platform
method is to be used, call pm_ops to make the system go down, in the following
way:
1) call pm_ops->prepare(PM_SUSPEND_DISK)
2) suspend devices (ie. call device_suspend() etc.)
3) call pm_ops->enter(PM_SUSPEND_DISK)
and if that *fails* (ie. pm_ops->enter() returns):
4) call pm_ops->finish(PM_SUSPEND_DISK)
5) halt the system
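
The five steps above can be sketched in userspace C as follows.  This is only
an illustration under stated assumptions: struct platform_ops, the mock_*
callbacks and power_down_demo() are hypothetical names invented here, not
kernel interfaces, and bailing out when step 1 or step 2 fails is a guess,
since the list does not cover those cases.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the platform callbacks (not the kernel's pm_ops). */
struct platform_ops {
	int (*prepare)(void);
	int (*enter)(void);	/* returns only if the power-down failed */
	void (*finish)(void);
};

/* One letter per step taken, so the ordering can be inspected afterwards. */
static char trace[16];
static size_t trace_len;

static void record(char step)
{
	if (trace_len < sizeof(trace) - 1) {
		trace[trace_len++] = step;
		trace[trace_len] = '\0';
	}
}

/* Mock callbacks: each one only records that it ran. */
static int mock_prepare(void) { record('P'); return 0; }
static int mock_suspend_devices(void) { record('S'); return 0; }
static int mock_enter(void) { record('E'); return -1; }	/* pretend it failed */
static void mock_finish(void) { record('F'); }
static void mock_halt(void) { record('H'); }

/*
 * Steps 1-5 from the list above.  Returning early when prepare or the
 * device suspend fails is an assumption made for this sketch.
 */
static int platform_power_down(const struct platform_ops *ops,
			       int (*suspend_devices)(void),
			       void (*halt)(void))
{
	int error;

	error = ops->prepare();		/* 1) prepare the platform */
	if (error)
		return error;
	error = suspend_devices();	/* 2) suspend devices */
	if (error) {
		ops->finish();
		return error;
	}
	error = ops->enter();		/* 3) should not return */
	ops->finish();			/* 4) enter() returned: clean up */
	halt();				/* 5) halt the system instead */
	return error;
}

/* Runs the sequence against the mocks and returns the recorded order. */
const char *power_down_demo(void)
{
	const struct platform_ops ops = {
		.prepare = mock_prepare,
		.enter = mock_enter,
		.finish = mock_finish,
	};

	trace_len = 0;
	trace[0] = '\0';
	platform_power_down(&ops, mock_suspend_devices, mock_halt);
	return trace;
}
```

With the mock enter() reporting failure, power_down_demo() records prepare,
device suspend, enter, finish and halt, in that order.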

Formally, after restoring the image, *if* the platform method was used (ie. the
above was executed as the last hibernation step), we should call
pm_ops->finish(PM_SUSPEND_DISK) before resuming devices, but
since we get control from the "old kernel" rather than from the BIOS,
this doesn't seem to be the right thing to do.

I'll try to create a patch along these lines and see if it breaks anything on
my boxes.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-01 22:02                                                           ` Rafael J. Wysocki
@ 2007-05-02  5:13                                                             ` Alexey Starikovskiy
  2007-05-02 13:42                                                               ` Rafael J. Wysocki
  2007-05-02  8:21                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  1 sibling, 1 reply; 712+ messages in thread
From: Alexey Starikovskiy @ 2007-05-02  5:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Rafael,

On resume, ACPI expects:
 - the boot kernel to call pm_prepare();
 - the resumed kernel to call pm_finish() before device_resume().

Thanks,
Alex.

On 5/2/07, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Tuesday, 1 May 2007 16:05, Rafael J. Wysocki wrote:
> > On Monday, 30 April 2007 16:59, Johannes Berg wrote:
> > > On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:
> > >
> > > > > That comment doesn't seem right. This is in ->enter so afaict the image
> > > > > hasn't been loaded yet at this point. I don't know if you just moved
> > > > > code but if you did then I don't think it was correct before.
> > > >
> > > > It was in your patch, so I kept it, but I don't think it's correct too.
> > >
> > > If it was in my patch then it must be there in the original code, iirc I
> > > just shuffled it a bit :)
> > >
> > > > Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> > > > only needed by s2ram, so we can safely remove them from the hibernation code
> > > > path.  Pavel, is that correct?
> > >
> > > This I don't know. They seemed to be done on hibernate too.
> >
> > The previous version of the patch was missing the changes in suspend.h.
> >
> > Apart from this I've cleaned up some changes in disk.c and main.c to make
> > the sysfs interface work again and dropped some ACPI code that I think was
> > not necessary.
> >
> > Patch appended (tested on x86_64, but not extensively), comments welcome. :-)
>
> Well, having a look on the ACPI spec I'm thinking that what we're trying to do
> with this patch is actually wrong.
>
> Instead, we should rip off all of the invocations of pm_ops->whatever() from
> the hibernation code paths (with the below exceptions) and *if* the platform
> method is to be used, call pm_ops to make the system go down, in the following
> way:
> 1) call pm_ops->prepare(PM_SUSPEND_DISK)
> 2) suspend devices (ie. call device_suspend() etc.)
> 3) call pm_ops->enter(PM_SUSPEND_DISK)
> and if that *fails* (ie. pm_ops->enter() returns):
> 4) call pm_ops->finish(PM_SUSPEND_DISK)
> 5) halt the system
>
> Formally, after restoring the image, *if* the platform method was used (ie. the
> above was executed as the last hibernation step), we should call
> pm_ops->finish(PM_SUSPEND_DISK) before resuming devices, but
> since we get the control from the "old kernel" rather than from the BIOS,
> this doesn't seem to be the right thing to do.
>
> I'll try to create a patch along these lines and see if it breaks anything on
> my boxes.
>
> Greetings,
> Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-01 22:02                                                           ` Rafael J. Wysocki
  2007-05-02  5:13                                                             ` Alexey Starikovskiy
@ 2007-05-02  8:21                                                             ` Johannes Berg
  2007-05-02  9:02                                                               ` Rafael J. Wysocki
  2007-05-02  9:16                                                               ` Pavel Machek
  1 sibling, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-05-02  8:21 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


On Wed, 2007-05-02 at 00:02 +0200, Rafael J. Wysocki wrote:

> Well, having a look on the ACPI spec I'm thinking that what we're trying to do
> with this patch is actually wrong.

No idea :)

> Instead, we should rip off all of the invocations of pm_ops->whatever() from
> the hibernation code paths (with the below exceptions) and *if* the platform
> method is to be used, call pm_ops to make the system go down, in the following
> way:
> 1) call pm_ops->prepare(PM_SUSPEND_DISK)
> 2) suspend devices (ie. call device_suspend() etc.)
> 3) call pm_ops->enter(PM_SUSPEND_DISK)
> and if that *fails* (ie. pm_ops->enter() returns):
> 4) call pm_ops->finish(PM_SUSPEND_DISK)
> 5) halt the system

Can we still split that off to another method so we don't use pm_ops? No
matter how we invoke hibernation_ops or in what order, imho we shouldn't
use pm_ops.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  8:21                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
@ 2007-05-02  9:02                                                               ` Rafael J. Wysocki
  2007-05-02  9:16                                                               ` Pavel Machek
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-02  9:02 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Wednesday, 2 May 2007 10:21, Johannes Berg wrote:
> On Wed, 2007-05-02 at 00:02 +0200, Rafael J. Wysocki wrote:
> 
> > Well, having a look on the ACPI spec I'm thinking that what we're trying to do
> > with this patch is actually wrong.
> 
> No idea :)
> 
> > Instead, we should rip off all of the invocations of pm_ops->whatever() from
> > the hibernation code paths (with the below exceptions) and *if* the platform
> > method is to be used, call pm_ops to make the system go down, in the following
> > way:
> > 1) call pm_ops->prepare(PM_SUSPEND_DISK)
> > 2) suspend devices (ie. call device_suspend() etc.)
> > 3) call pm_ops->enter(PM_SUSPEND_DISK)
> > and if that *fails* (ie. pm_ops->enter() returns):
> > 4) call pm_ops->finish(PM_SUSPEND_DISK)
> > 5) halt the system
> 
> Can we still split that off to another method so we don't use pm_ops? No
> matter how we invoke hibernation_ops or in what order, imho we shouldn't
> use pm_ops.

OK, I think we can go ahead with the patch if nobody objects.  It's been tested
to some extent and seems to work.  More testing will be appreciated.

Later on we can do what I said above using hibernation_ops instead of pm_ops,
if it turns out to really make sense.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  8:21                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  2007-05-02  9:02                                                               ` Rafael J. Wysocki
@ 2007-05-02  9:16                                                               ` Pavel Machek
  2007-05-02  9:25                                                                 ` Johannes Berg
  2007-05-02 13:43                                                                 ` Rafael J. Wysocki
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-02  9:16 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham

Hi!

> > Well, having a look on the ACPI spec I'm thinking that what we're trying to do
> > with this patch is actually wrong.
> 
> No idea :)
> 
> > Instead, we should rip off all of the invocations of pm_ops->whatever() from
> > the hibernation code paths (with the below exceptions) and *if* the platform
> > method is to be used, call pm_ops to make the system go down, in the following
> > way:
> > 1) call pm_ops->prepare(PM_SUSPEND_DISK)
> > 2) suspend devices (ie. call device_suspend() etc.)
> > 3) call pm_ops->enter(PM_SUSPEND_DISK)
> > and if that *fails* (ie. pm_ops->enter() returns):
> > 4) call pm_ops->finish(PM_SUSPEND_DISK)
> > 5) halt the system
> 
> Can we still split that off to another method so we don't use pm_ops? No
> matter how we invoke hibernation_ops or in what order, imho we shouldn't
> use pm_ops.

Well... the powerdown during hibernation... does not have _anything_
to do with snapshot/restore. It is really a very deep sleep; similar
to soft powerdown, but not quite.

So this usage of pm_ops seems ok.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  9:16                                                               ` Pavel Machek
@ 2007-05-02  9:25                                                                 ` Johannes Berg
  2007-05-03 14:00                                                                   ` Alan Stern
  2007-05-02 13:43                                                                 ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-02  9:25 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham


On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote:

> Well... the powerdown during hibernation... does not have _anything_
> to do with snapshot/restore. It is really a very deep sleep; similar
> to soft powerdown, but not quite.

It's also horribly confusing to intermingle hibernation and suspend into
one operation struct when there's only a single user for it anyway. Just
look at what all the arm platforms had there, trying to veto suspend to
disk through pm_ops etc. I don't technically disagree with you, but for
the sake of understanding this whole thing I'd rather have hibernate
and suspend be totally orthogonal.
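
To illustrate the separation argued for here, a minimal userspace sketch of a
dedicated hibernation callback table follows.  The struct layout and the NULL
check in platform_prepare() mirror the patch earlier in the thread; the body
of hibernation_set_ops(), the mock_* callbacks and fallback_demo() are
assumptions made for the example, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the struct in the patch, redeclared here for userspace. */
struct hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* NULL means "no platform support registered": use the default shutdown. */
static struct hibernation_ops *hibernation_ops;

/* Assumed body: the patch hunks quoted in this thread only show the prototype. */
void hibernation_set_ops(struct hibernation_ops *ops)
{
	hibernation_ops = ops;
}

/* Mirrors platform_prepare() in kernel/power/user.c after the patch. */
int platform_prepare(void)
{
	int error = 0;

	if (hibernation_ops)
		error = hibernation_ops->prepare();

	return error;
}

/* Mock platform used only to exercise the table. */
static int prepare_calls;
static int mock_prepare(void) { prepare_calls++; return 0; }
static int mock_enter(void) { return 0; }
static void mock_finish(void) { }

static struct hibernation_ops mock_ops = {
	.prepare = mock_prepare,
	.enter = mock_enter,
	.finish = mock_finish,
};

/* Returns how many times the mock prepare() ran across both calls. */
int fallback_demo(void)
{
	prepare_calls = 0;

	hibernation_set_ops(NULL);
	if (platform_prepare() != 0)	/* no ops registered: nothing is called */
		return -1;

	hibernation_set_ops(&mock_ops);
	if (platform_prepare() != 0)	/* ops registered: prepare() runs once */
		return -1;

	return prepare_calls;
}
```

Nothing in this table refers to suspend states, so a platform that only
supports suspend-to-RAM never has to veto hibernation through it.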

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  5:13                                                             ` Alexey Starikovskiy
@ 2007-05-02 13:42                                                               ` Rafael J. Wysocki
  2007-05-02 14:11                                                                 ` Alexey Starikovskiy
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-02 13:42 UTC (permalink / raw)
  To: Alexey Starikovskiy
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Hi,

On Wednesday, 2 May 2007 07:13, Alexey Starikovskiy wrote:
> Rafael,
> 
> On resume ACPI expects
> boot kernel do pm_prepare().
> resumed kernel do pm_finish() before device_resume().

Well, let's analyse what pm_prepare() actually does.  If my understanding of
the code in there and the ACPI spec [1] is correct, it does the following:

(1) Sets the firmware waking vector (doesn't matter for hibernation)
(2) Prepares the wake-up devices for a state transition, by calling their _PSW
methods ("to enable wake" according to the spec)
(3) Disables the GPEs that cannot wake up the system
(4) Runs the _PTS and _GTS methods
(5) Runs the _SST method
(6) Disables all GPEs

Now, there are a couple of problems with that regardless of what it's used for,
as far as I can see:

a) The spec (in Section 7.2) says that (2) should be done *after* the _PTS
method is called

b) The spec (Section 7.3.2) says:

"This [_PTS] method is called after OSPM has notified native device drivers of
the sleep state transition and before the OSPM has had a chance to fully
prepare the system for a sleep state transition."

We don't seem to be doing this.  Moreover, Section 15.1.6 of the spec says that
"OSPM places all device drivers into their respective Dx state" *before* _PTS
is executed, so it doesn't look like _PTS should be executed before
device_suspend().

c) According to the spec (Section 15.1.6) "OSPM saves any other processor’s
context (other than the local processor) to memory" *after* executing _PTS,
but *before* _GTS is executed, but we do this after _GTS is executed.
Moreover, the waking vector should be written into FACS after the "other
processor’s context" has been saved, but *before* _GTS is executed.

d) The spec (Section 7.3.3) says literally this:

" _GTS allows ACPI system firmware to perform any required system specific
functions prior to entering a system sleep state. OSPM will set the sleep
enable (SLP_EN) bit in the PM1 control register immediately following the
execution of the _GTS control method without performing any other physical I/O
or allowing any interrupt servicing."

However, in our code _GTS is executed *waaay* before setting the SLP_EN bit
in PM1, which only happens in acpi_enter_sleep_state() called from
acpi_pm_enter(), *after* we've executed device_suspend() with IRQs enabled
and, in the hibernation case, called device_resume() and saved the image
(oh, dear).

e) It implicitly follows from d) that _SST should be executed before _GTS
and after we run device_suspend().

f) I'm not sure if the disabling of all GPEs before device_suspend() is
actually a good idea.

Next, we can consider acpi_pm_finish().  Again, if my understanding of the code
in there and the ACPI spec [1] is correct, it does the following:

(7) Sets SLP_EN and SLP_TYPE to state S0
(8) Executes the _SST method (Waking)
(9) Executes the _BFS (Back From Sleep) method
(10) Executes the _WAK method
(11) Enables the runtime GPEs
(12) Enables the power button
(13) Executes the _SST method (Working)
(14) Disables the wake-up capability of devices
(15) Resets the firmware waking vector (doesn't matter for hibernation)

Now, there seems to be nothing wrong with that *if* it's executed while
resuming from RAM, for example, but it doesn't seem to be suitable for using
in such a way as we do this in the resume-during-hibernation code path.

Consider a hibernation (aka suspend to disk) transition (ie. an operation in
which we snapshot the system memory, save the image and shut the system down).

Currently, we call acpi_pm_prepare(PM_SUSPEND_DISK) and run device_suspend(),
which seems to be in many ways against the ACPI spec.  The spec, as I understand
it, indicates that we should run device_suspend() first and then execute the
_PTS method.  We shouldn't, however, execute either _GTS, or _SST just yet.

Next, we suspend sysdevs etc., and create the memory snapshot.  We want
to be able to save it, so we call acpi_pm_finish(), which causes _BFS and _WAK
to be executed *after* _GTS, which is clearly against the spec (might this be the
reason why (7) is sometimes necessary?).  Moreover, calling _BFS at this stage
makes no sense, IMHO, since there hasn't been any transition (the system has
not slept).  What I think we should do at this point is to execute _WAK only,
which means "power transition aborted" to the firmware, and continue with
device_resume().

Next, we save the image and now we'd like to put the system to "sleep", so
we use acpi_pm_enter(PM_SUSPEND_DISK), but we shouldn't do that, since the
power transition has been aborted by _WAK in acpi_pm_finish()!  Thus we should
start the transition again, run device_suspend(), execute _PTS, do (2) and (3),
save the "other processor's context" etc., execute _SST(S4), execute _GTS and
set SLP_EN in PM1 etc.
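
The ordering constraints listed in a) through e) can be written down as a
small checker.  This is a userspace sketch only: the enum names are invented
for the example, and current_order is a reading of steps (1)-(6) above, with
the position of the context save relative to device_suspend() being an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Steps of the final power-down, named for this sketch only. */
enum step {
	DEVICE_SUSPEND,	/* device_suspend() */
	PTS,		/* _PTS */
	ENABLE_WAKE,	/* (2): _PSW on the wake-up devices */
	DISABLE_GPES,	/* (3): disable GPEs that cannot wake the system */
	SAVE_CONTEXT,	/* save the other processors' context */
	SST,		/* _SST for the target state */
	GTS,		/* _GTS */
	SET_SLP_EN,	/* set SLP_EN in PM1 */
	NR_STEPS
};

/* Position of step s in seq, or -1 if it is missing. */
static int pos(const enum step *seq, size_t n, enum step s)
{
	for (size_t i = 0; i < n; i++)
		if (seq[i] == s)
			return (int)i;
	return -1;
}

/* True if seq contains every step and respects constraints a) to e). */
bool order_follows_spec(const enum step *seq, size_t n)
{
	int p[NR_STEPS];

	for (int s = 0; s < NR_STEPS; s++) {
		p[s] = pos(seq, n, (enum step)s);
		if (p[s] < 0)
			return false;
	}

	return p[PTS] < p[ENABLE_WAKE] &&		/* a) */
	       p[DEVICE_SUSPEND] < p[PTS] &&		/* b) */
	       p[PTS] < p[SAVE_CONTEXT] &&		/* c) */
	       p[SAVE_CONTEXT] < p[GTS] &&		/* c) */
	       p[SET_SLP_EN] == p[GTS] + 1 &&		/* d) */
	       p[DEVICE_SUSPEND] < p[SST] &&		/* e) */
	       p[SST] < p[GTS];				/* e) */
}

/* The order proposed in the paragraph above. */
const enum step proposed_order[NR_STEPS] = {
	DEVICE_SUSPEND, PTS, ENABLE_WAKE, DISABLE_GPES,
	SAVE_CONTEXT, SST, GTS, SET_SLP_EN,
};

/* The current order, as read from steps (1)-(6); partly an assumption. */
const enum step current_order[NR_STEPS] = {
	ENABLE_WAKE, DISABLE_GPES, PTS, GTS,
	SST, DEVICE_SUSPEND, SAVE_CONTEXT, SET_SLP_EN,
};
```

Under these constraints the proposed order passes the check and the current
one does not, the first violation being that _PTS runs before
device_suspend().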

When we restore the system state from a hibernation image, the "boot kernel" is
first started.  It loads the image into memory, calls
device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with
the "resumed kernel".  It doesn't call acpi_pm_prepare(), which I think is
right, because it doesn't want to start any power transition, not even a
fake one.  Now, the "resumed kernel" takes control, resumes sysdevs and calls
acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS
should be executed in that case (the ACPI spec seems to assume that the
hibernation image will be loaded into memory by a boot loader).

Concluding, it seems to me that the "restore" code path is correct, but the
"hibernate" code path is not and should be reworked.  Also, it seems that
acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case
as well (_PTS should be executed after device_suspend() and _GTS should
be executed in acpi_pm_enter(), right before the transition is completed).

Greetings,
Rafael


[1] Advanced Configuration and Power Interface Specification, Revision 3.0,
September 2, 2004

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  9:16                                                               ` Pavel Machek
  2007-05-02  9:25                                                                 ` Johannes Berg
@ 2007-05-02 13:43                                                                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-02 13:43 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham

On Wednesday, 2 May 2007 11:16, Pavel Machek wrote:
> Hi!
> 
> > > Well, having a look on the ACPI spec I'm thinking that what we're trying to do
> > > with this patch is actually wrong.
> > 
> > No idea :)
> > 
> > > Instead, we should rip off all of the invocations of pm_ops->whatever() from
> > > the hibernation code paths (with the below exceptions) and *if* the platform
> > > method is to be used, call pm_ops to make the system go down, in the following
> > > way:
> > > 1) call pm_ops->prepare(PM_SUSPEND_DISK)
> > > 2) suspend devices (ie. call device_suspend() etc.)
> > > 3) call pm_ops->enter(PM_SUSPEND_DISK)
> > > and if that *fails* (ie. pm_ops->enter() returns):
> > > 4) call pm_ops->finish(PM_SUSPEND_DISK)
> > > 5) halt the system
> > 
> > Can we still split that off to another method so we don't use pm_ops? No
> > matter how we invoke hibernation_ops or in what order, imho we shouldn't
> > use pm_ops.
> 
> Well... the powerdown during hibernation... does not have _anything_
> to do with snapshot/restore.

Agreed.

> It is really a very deep sleep; similar to soft powerdown, but not quite.

Yeah, not quite.  For example, we may want to use some devices for waking up
the system, but with the current code it's impossible, because pm_ops->finish()
disables this capability of devices.

I think we shouldn't confuse the quiescing of devices before the image creation
with a power transition.  This is not a power transition, since it's not
completed by calling pm_ops->enter().  Instead, we kinda-sorta abort it with
pm_ops->finish() which confuses the heck out of the ACPI firmware (please
see my reply to Alexey in the same thread for a detailed analysis).

> So this usage of pm_ops seems ok.

To me, it doesn't.  These are the main problems I see with it:
1) device_suspend() should be called before the _PTS method is executed (IMO
it's correct not to execute _PTS at all if we don't want to do a real power
transition)
2) The _GTS method shouldn't be executed in acpi_pm_prepare(), but instead
should be executed in acpi_pm_enter(), right before the transition is completed
3) The _BFS method shouldn't be executed in the resume-during-hibernation
code path
4) The wake-up capability of devices should be enabled before we execute
pm_ops->enter() and shouldn't be enabled before the image creation (what for?).
5) The first part of 4) requires that the transition be started over after the
image has been saved.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02 13:42                                                               ` Rafael J. Wysocki
@ 2007-05-02 14:11                                                                 ` Alexey Starikovskiy
  2007-05-02 19:26                                                                   ` ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) Rafael J. Wysocki
  2007-05-02 19:26                                                                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 712+ messages in thread
From: Alexey Starikovskiy @ 2007-05-02 14:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Rafael,

> Concluding, it seems to me that the "restore" code path is correct, but the
> "hibernate" code path is not and should be reworked.  Also, it seems that
> acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case
> either (_PTS should be executed after device_suspend() and _GTS should
> be executed in acpi_pm_enter(), right before the transition is completed).

The current implementation is not fully up to spec, so I agree we may try to
bring it closer.

> When we restore the system state from a hibernation image, the "boot kernel" is
> first started.  It loads the image into memory, calls
> device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with
> the "resumed kernel".  It doesn't call acpi_pm_prepare(), which I think is
> right, because it doesn't want to start any power transition, not even a
> fake one.  Now, the "resumed kernel" takes control, resumes sysdevs and calls
Currently the call to prepare() is needed to stop ACPI devices from sending
GPEs to ACPI drivers.
If you remove it, Acer laptops will resume without the ACPI interrupt at
all (with all the problems that follow from it).
> acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS
> should be executed in that case (the ACPI spec seems to assume that the
> hibernation image will be loaded into memory by a boot loader).

> Next, we suspend sysdevs etc., and create the memory snapshot.  We want
> to be able to save it, so w call acpi_pm_finish(), which causes _BFS and _WAK
> to be executed *after* _GTS, which is clearly against the spec (might this be the
> reason why (7) is sometimes necessary?).  Moreover, calling _BFS at this stage
> makes no sense, IMHO, since there hasn't been any transition (the system has
> not slept).  What I think we should do at this point is to execute _WAK only,
> which means "power transition aborted" to the firmware, and continue with
> device_resume().

But I don't get your idea about executing _finish() between _prepare()
and _enter()...
_finish() is executed only if _prepare() fails, so we are rolling back,
or it is executed after we have loaded the image and transferred execution
to it, so again -- we are going from the _prepare() state to the running
state...

Regards,
Alex.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-02 14:11                                                                 ` Alexey Starikovskiy
  2007-05-02 19:26                                                                   ` ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) Rafael J. Wysocki
@ 2007-05-02 19:26                                                                   ` Rafael J. Wysocki
  2007-05-03 22:48                                                                     ` Pavel Machek
  2007-05-03 22:48                                                                     ` Pavel Machek
  1 sibling, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-02 19:26 UTC (permalink / raw)
  To: Alexey Starikovskiy
  Cc: Johannes Berg, linux-pm, Pekka Enberg, Nigel Cunningham,
	Pavel Machek, ACPI Devel Maling List

Hi,

[Added linux-acpi to the CC list, should be there from the start]

On Wednesday, 2 May 2007 16:11, Alexey Starikovskiy wrote:
> Rafael,
> 
> > Concluding, it seems to me that the "restore" code path is correct, but the
> > "hibernate" code path is not and should be reworked.  Also, it seems that
> > acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case
> > as well (_PTS should be executed after device_suspend() and _GTS should
> > be executed in acpi_pm_enter(), right before the transition is completed).
> 
> The current implementation is not fully up to spec, so we may try to get
> it closer, I agree.

Okay.  Since we're trying to separate the hibernation code from the
suspend code anyway, we can use the opportunity to introduce some new
callbacks for the hibernation and/or redefine the existing ones.

The spec suggests that we need the following callbacks:

(1) prepare() - called after device_suspend(), execute _PTS and disable GPEs
(2) cancel() - called at any time after prepare() if there's an error, execute
    _WAK and enable run-time GPEs
(3) enter() - called after the image has been saved, execute _GTS and do what's
    currently done in pm_enter()
(4) finish() - called after the image has been restored, do what's currently
    done in pm_finish()
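
To make the intended ordering concrete, here is a minimal user-space sketch
of the four callbacks listed above.  Everything in it is an assumption made
for illustration: struct hib_ops, the stub callbacks, hibernate_path() and
restore_path() are hypothetical names, not existing kernel interfaces, and
the image-saving step is reduced to a single error code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical callback set mirroring items (1)-(4); not a kernel structure. */
struct hib_ops {
	int  (*prepare)(void);	/* (1) after device_suspend(): _PTS, disable GPEs */
	void (*cancel)(void);	/* (2) on error after prepare(): _WAK, enable GPEs */
	int  (*enter)(void);	/* (3) after the image is saved: _GTS, transition */
	void (*finish)(void);	/* (4) after the image has been restored */
};

/* Trace of the steps taken, so the ordering can be inspected. */
char hib_trace[128];

static void trace_add(const char *step)
{
	/* Append "step " to the trace; silently truncate if it is full. */
	strncat(hib_trace, step, sizeof(hib_trace) - strlen(hib_trace) - 1);
	strncat(hib_trace, " ", sizeof(hib_trace) - strlen(hib_trace) - 1);
}

/* Stub callbacks that only record that they ran. */
static int stub_prepare(void) { trace_add("prepare"); return 0; }
static void stub_cancel(void) { trace_add("cancel"); }
static int stub_enter(void) { trace_add("enter"); return 0; }
static void stub_finish(void) { trace_add("finish"); }

const struct hib_ops stub_ops = {
	.prepare = stub_prepare,
	.cancel = stub_cancel,
	.enter = stub_enter,
	.finish = stub_finish,
};

/*
 * Hibernate path.  save_error stands in for the result of writing the
 * image (0 on success).  Returns 0 or the first error encountered.
 */
int hibernate_path(const struct hib_ops *ops, int save_error)
{
	int error;

	assert(ops && ops->prepare && ops->cancel && ops->enter);

	hib_trace[0] = '\0';
	trace_add("device_suspend");
	error = ops->prepare();
	if (error)
		return error;		/* assumed: nothing to cancel yet */
	trace_add("save_image");
	if (save_error) {
		ops->cancel();		/* abort the power transition */
		return save_error;
	}
	return ops->enter();
}

/* Restore path: only finish() runs, after the image has been restored. */
void restore_path(const struct hib_ops *ops)
{
	assert(ops && ops->finish);

	hib_trace[0] = '\0';
	trace_add("restore_image");
	ops->finish();
}
```

In this sketch finish() is never reached on the hibernate path; it belongs
to the restore path only, and cancel() is the sole way back once prepare()
has succeeded.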

[At least, the execution of _GTS in pm_prepare() seems to be dangerous at first
sight.]

We also might need a callback that will be run before device_suspend() to
invoke some ACPI-related magic needed at that point, but I have no idea what
it would have to do. :-)

> > When we restore the system state from a hibernation image, the "boot kernel" is
> > first started.  It loads the image into memory, calls
> > device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with
> > the "resumed kernel".  It doesn't call acpi_pm_prepare(), which I think is
> > right, because it doesn't want to start any power transition, not even a
> > fake one.  Now, the "resumed kernel" takes control, resumes sysdevs and calls

> Currently the call to prepare() is needed to stop ACPI devices from
> sending GPEs to ACPI drivers.

Does it mean that we need to call pm_prepare() (or an equivalent function)
before device_suspend()?  If that's the case, then which part of pm_prepare()
is essential here?

> If you remove it, Acer laptops will resume without any ACPI interrupts at
> all (with all the problems that follow from that).

A recent discussion on the LKML led to the conclusion that for the
hibernation we shouldn't use .suspend() callbacks before snapshotting the
system memory.  Instead, we should use some other callbacks to quiesce devices,
create the snapshot, reactivate devices, save the image and carry out the
actual power transition after that.  Would something like this be viable from
the ACPI point of view?

> > acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS
> > should be executed in that case (the ACPI spec seems to assume that the
> > hibernation image will be loaded into memory by a boot loader).
> 
> > Next, we suspend sysdevs etc., and create the memory snapshot.  We want
> > to be able to save it, so we call acpi_pm_finish(), which causes _BFS and _WAK
> > to be executed *after* _GTS, which is clearly against the spec (might this be the
> > reason why (7) is sometimes necessary?).  Moreover, calling _BFS at this stage
> > makes no sense, IMHO, since there hasn't been any transition (the system has
> > not slept).  What I think we should do at this point is to execute _WAK only,
> > which means "power transition aborted" to the firmware, and continue with
> > device_resume().
> 
> But I don't get your idea about executing _finish() between _prepare()
> and _enter()...
> _finish is executed only if _prepare() fails, so we are rolling back,

Well, no.  Please have a look at the code in kernel/power/disk.c.

Should we remove it from the non-error code paths?

> or it is executed after we loaded the image and transferred execution
> to it, so again -- we are going from _prepare() state to running
> state...

Currently that's not the case.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  9:25                                                                 ` Johannes Berg
@ 2007-05-03 14:00                                                                   ` Alan Stern
  2007-05-03 17:17                                                                     ` Rafael J. Wysocki
                                                                                       ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-03 14:00 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Wed, 2 May 2007, Johannes Berg wrote:

> On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote:
> 
> > Well... the powerdown during hibernation... does not have _anything_
> > to do with snapshot/restore. It is really a very deep sleep; similar
> > to soft powerdown, but not quite.

Is this really a good idea?

For that matter, what are the differences among the various sorts of 
poweroff?

	Which devices remain minimally powered for wakeup purposes?

	Anything else?

In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
same as a normal non-hibernate poweroff?  Aren't drivers required to 
assume (during the processing after the snapshot has been restored) that 
power could have been lost and devices might need to be completely 
reinitialized?

We are letting ourselves in for problems if we say that when the snapshot
is restored, devices may or may not need to be reinitialized.  Drivers
might not be able to tell which, so they would have to reinitialize
regardless, losing any advantage.  Even worse, the device may _appear_ not
to need reinitialization because the firmware (BIOS) has already
initialized it but left it in a state that's useless for the kernel's
purposes.  (That's part of the reason why PRETHAW was added.)

If the only remaining difference between poweroff for hibernate and normal 
poweroff is which wakeup devices will function, then it seems pointless.  
Why shouldn't the same devices work for wakeup from hibernate and wakeup 
from normal poweroff?

Or have I misunderstood something and is this all nonsense?

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 14:00                                                                   ` Alan Stern
@ 2007-05-03 17:17                                                                     ` Rafael J. Wysocki
  2007-05-03 18:33                                                                       ` Alan Stern
  2007-05-03 20:33                                                                     ` David Brownell
  2007-05-03 22:18                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
  2 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-03 17:17 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thursday, 3 May 2007 16:00, Alan Stern wrote:
> On Wed, 2 May 2007, Johannes Berg wrote:
> 
> > On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote:
> > 
> > > Well... the powerdown during hibernation... does not have _anything_
> > > to do with snapshot/restore. It is really a very deep sleep; similar
> > > to soft powerdown, but not quite.
> 
> Is this really a good idea?
> 
> For that matter, what are the differences among the various sorts of 
> poweroff?
> 
> 	Which devices remain minimally powered for wakeup purposes?
> 
> 	Anything else?
> 
> In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> same as a normal non-hibernate poweroff?

Not quite (see (*) below).

> Aren't drivers required to assume (during the processing after the snapshot
> has been restored) that power could have been lost and devices might need to
> be completely reinitialized?
> 
> We are letting ourselves in for problems if we say that when the snapshot
> is restored, devices may or may not need to be reinitialized.

Agreed.

> Drivers might not be able to tell which, so they would have to reinitialize
> regardless, losing any advantage.  Even worse, the device may _appear_ not
> to need reinitialization because the firmware (BIOS) has already
> initialized it but left it in a state that's useless for the kernel's
> purposes.  (That's part of the reason why PRETHAW was added.)

Yes.

> If the only remaining difference between poweroff for hibernate and normal 
> poweroff is which wakeup devices will function, then it seems pointless.

No, this is not the only difference (*).

> Why shouldn't the same devices work for wakeup from hibernate and wakeup 
> from normal poweroff?
> 
> Or have I misunderstood something and is this all nonsense?

The problem, generally speaking, is that we have to prepare devices for waking
up the system.  On an ACPI system this is done, among other things, by
executing the devices' _PSW control methods after the system-level _PTS method
has run.  For this purpose the devices must be in (low) power states from which
the wake is possible, so in particular they must not be powered off.  Later, by
making the platform enter the suspend-to-disk (ACPI S4) state we prevent it
from powering off the wake-up devices, among other things.

That's why I'm thinking that it might be a good idea to do a
suspend-before-poweroff, but it doesn't mean that device drivers would be
allowed to make any assumptions regarding the state of the device after the
resume.  IMO, if this is a resume from disk, devices should be initialized from
scratch.

(*) Another issue is that, for example, on my notebook the status of the AC
power supply (and sometimes of the battery too) is not reported correctly by
the platform after the resume if the suspend-to-disk (ACPI S4) state has not
been entered during the hibernation. I don't understand why this happens, but
I'm going to find out.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 17:17                                                                     ` Rafael J. Wysocki
@ 2007-05-03 18:33                                                                       ` Alan Stern
  2007-05-03 19:47                                                                         ` Rafael J. Wysocki
  2007-05-03 20:33                                                                         ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) David Brownell
  0 siblings, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-03 18:33 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thu, 3 May 2007, Rafael J. Wysocki wrote:

> > In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> > same as a normal non-hibernate poweroff?
> 
> Not quite (see (*) below).

> The problem, generally speaking, is that we have to prepare devices for waking
> up the system.  On an ACPI system this is done, among other things, by
> executing the devices' _PSW control methods after the system-level _PTS method
> has run.  For this purpose the devices must be in (low) power states from which
> the wake is possible, so in particular they must not be powered off.  Later, by
> making the platform enter the suspend-to-disk (ACPI S4) state we prevent it
> from powering off the wake-up devices, among other things.
> 
> That's why I'm thinking that it might be a good idea to do a
> suspend-before-poweroff, but it doesn't mean that device drivers would be
> allowed to make any assumptions regarding the state of the device after the
> resume.  IMO, if this is a resume from disk, devices should be initialized from
> scratch.

I generally agree with your last sentence, but with one reservation (see 
below).

As for the rest, you missed my point.  Granted that all these special 
activities are required on ACPI systems in order to support proper 
operation of wakeup devices -- Why shouldn't these same steps also be 
followed during a normal poweroff?

There really are two orthogonal issues here:

	(1) Is this a "hibernate" poweroff (as opposed to a "normal" 
	    poweroff)? 

	(2) Should some devices remain minimally powered and be capable
	    of waking up the system?

I don't see any necessary relation between the answers to (1) and (2).  In 
particular, I don't see why a Yes answer to (1) should imply a Yes answer 
to (2).

This suggests that the poweroff methods be completely independent of
hibernation_ops (or whatever you are now calling it).  Perhaps there
should be a separate sysfs attribute controlling whether or not wakeup is
enabled.  If it is then poweroff should go through all the ACPI (or the
platform's equivalent) hoops, otherwise everything should just be turned
off completely.  Regardless of whether the poweroff is part of a
hibernate sequence.
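
A minimal sketch of that separation, under stated assumptions: the types
and choose_poweroff_path() below are hypothetical, and wakeup_enabled
stands in for the suggested sysfs attribute, which does not exist as such.
The point is only that the reason for powering off is carried along but
never consulted when the path is chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum poweroff_path {
	POWEROFF_PLAIN,		/* turn everything off completely */
	POWEROFF_WAKE_CAPABLE,	/* go through the platform's wake-up hoops */
};

/* Hypothetical request: the two orthogonal questions (1) and (2). */
struct poweroff_request {
	bool hibernating;	/* (1) is this a "hibernate" poweroff? */
	bool wakeup_enabled;	/* (2) should wake-up devices stay powered? */
};

enum poweroff_path choose_poweroff_path(const struct poweroff_request *req)
{
	assert(req != NULL);

	/* Deliberately ignores req->hibernating: only (2) matters here. */
	return req->wakeup_enabled ? POWEROFF_WAKE_CAPABLE : POWEROFF_PLAIN;
}
```

With this shape, a hibernate poweroff with wakeup disabled and a normal
poweroff with wakeup disabled take exactly the same path.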

> (*) Another issue is that, for example, on my notebook the status of the AC
> power supply (and sometimes of the battery too) is not reported correctly by
> the platform after the resume if the suspend-to-disk (ACPI S4) state has not
> been entered during the hibernation. I don't understand why this happens, but
> I'm going to find out.

Hopefully it's not directly related to the matter under discussion. :-)


There remains one issue associated with always reinitializing devices 
during resume from hibernation.  In the one area I know a lot about (USB) 
this actually does matter, at least a little.

The USB specs include the notion of a "power session", which is
essentially an uninterrupted continuous connection between the host and
the device.  As long as a power session exists, the host is guaranteed
that the device has not been unplugged or replaced with a different
device.

On most systems, hibernation breaks power sessions.  When the system wakes 
back up it sees a bunch of USB devices connected, but it is not allowed 
(by the spec!) to assume that these are the same devices as were attached 
before.  In fact, some of them might not be.

Mostly this doesn't make any difference, but for mass-storage it does.  
Memory mappings and filesystem mounts will be disrupted when the
underlying logical device goes away, even if the same physical device is
still attached to the same port.  This has caused significant headaches
for USB users in the past.

On the other hand, some systems are designed cleverly enough to maintain
power sessions across hibernation.  Not many -- the only ones I've heard
about were all PPC Macs.  The USB drivers have always tried to keep power
sessions intact across hibernation whenever the hardware and firmware
would permit, but of course reinitializing the USB controller would
destroy them.

There are a couple of reasons for not worrying about this very much.  
First, as mentioned before this issue exists on only a small number of
systems.  Second, I have submitted to Greg KH a couple of patches to
maintain persistence of USB devices even when the power sessions are lost
(they're still in his queue so you can't try them out yet).  This feature
violates the USB spec and it is potentially dangerous -- users could
easily lose data for example by changing the card in a USB card reader
while the system is hibernating -- so it is a non-default Kconfig option.  
Nevertheless, it does solve the problem.
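
The decision described above can be sketched roughly as follows.  This is
not the code from the patches mentioned; the structure, the field names and
usb_resume_decide() are all hypothetical, and looks_identical is an assumed
placeholder for whatever identity check is applied, which by its nature
cannot prove that the device was not swapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum usb_resume_action {
	USB_KEEP_DEVICE,	/* keep the logical device and its mounts */
	USB_TREAT_AS_NEW,	/* drop the old logical device, enumerate afresh */
};

/* Hypothetical summary of what the host knows at resume time. */
struct usb_resume_info {
	bool power_session_intact;	/* hardware/firmware kept the session */
	bool persist_enabled;		/* stand-in for the non-default option */
	bool looks_identical;		/* assumption: an identity check passed */
};

enum usb_resume_action usb_resume_decide(const struct usb_resume_info *info)
{
	assert(info != NULL);

	/* An unbroken power session guarantees the device was not replaced. */
	if (info->power_session_intact)
		return USB_KEEP_DEVICE;

	/*
	 * Session lost: the spec says to assume a new device.  Keeping it
	 * anyway is the opt-in, spec-violating behaviour, and it is only
	 * as safe as the identity check.
	 */
	if (info->persist_enabled && info->looks_identical)
		return USB_KEEP_DEVICE;

	return USB_TREAT_AS_NEW;
}
```

The card-reader case is exactly where the second branch can go wrong: the
reader looks identical even though the medium inside it has changed.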

In the end, this is a long-winded way of saying that always reinitializing
devices while resuming from a hibernation is probably the best overall
approach, even if it may not be optimal in a few cases.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 18:33                                                                       ` Alan Stern
@ 2007-05-03 19:47                                                                         ` Rafael J. Wysocki
  2007-05-03 19:59                                                                           ` Alan Stern
  2007-05-03 20:33                                                                         ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-03 19:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thursday, 3 May 2007 20:33, Alan Stern wrote:
> On Thu, 3 May 2007, Rafael J. Wysocki wrote:
> 
> > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> > > same as a normal non-hibernate poweroff?
> > 
> > Not quite (see (*) below).
> 
> > The problem, generally speaking, is that we have to prepare devices for waking
> > up the system.  On an ACPI system this is done, among other things, by
> > executing the devices' _PSW control methods after the system-level _PTS method
> > has run.  For this purpose the devices must be in (low) power states from which
> > the wake is possible, so in particular they must not be powered off.  Later, by
> > making the platform enter the suspend-to-disk (ACPI S4) state we prevent it
> > from powering off the wake-up devices, among other things.

The last sentence in the above paragraph is not actually true, sorry for the
confusion.

> > That's why I'm thinking that it might be a good idea to do a
> > suspend-before-poweroff, but it doesn't mean that device drivers would be
> > allowed to make any assumptions regarding the state of the device after the
> > resume.  IMO, if this is a resume from disk, devices should be initialized from
> > scratch.
> 
> I generally agree with your last sentence, but with one reservation (see 
> below).
> 
> As for the rest, you missed my point.  Granted that all these special 
> activities are required on ACPI systems in order to support proper 
> operation of wakeup devices -- Why shouldn't these same steps also be 
> followed during a normal poweroff?
> 
> There really are two orthogonal issues here:
> 
> 	(1) Is this a "hibernate" poweroff (as opposed to a "normal" 
> 	    poweroff)? 
> 
> 	(2) Should some devices remain minimally powered and be capable
> 	    of waking up the system?
> 
> I don't see any necessary relation between the answers to (1) and (2).  In 
> particular, I don't see why a Yes answer to (1) should imply a Yes answer 
> to (2).
> 
> This suggests that the poweroff methods be completely independent of
> hibernation_ops (or whatever you are now calling it).  Perhaps there
> should be a separate sysfs attribute controlling whether or not wakeup is
> enabled.  If it is then poweroff should go through all the ACPI (or the
> platform's equivalent) hoops, otherwise everything should just be turned
> off completely.  Regardless of whether the poweroff is part of a
> hibernate sequence.

Well, after reviewing the code once again I see that we already do it this
way on ACPI systems, since the 'normal' power off is done by entering the
ACPI S5 state.  Moreover, there shouldn't be any difference between
ACPI S4 and 'power off' with respect to the wake-up devices, so you're
absolutely right.

It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4)
to finish the hibernation in order to avoid problems like (*) and for this purpose
we need to use hibernation_ops earlier during the hibernation.

> > (*) Another issue is that, for example, on my notebook the status of the AC
> > power supply (and sometimes of the battery too) is not reported correctly by
> > the platform after the resume if the suspend-to-disk (ACPI S4) state has not
> > been entered during the hibernation. I don't understand why this happens, but
> > I'm going to find out.
> 
> Hopefully it's not directly related to the matter under discussion. :-)
> 
> 
> There remains one issue associated with always reinitializing devices 
> during resume from hibernation.  In the one area I know a lot about (USB) 
> this actually does matter, at least a little.
> 
> The USB specs include the notion of a "power session", which is
> essentially an uninterrupted continuous connection between the host and
> the device.  As long as a power session exists, the host is guaranteed
> that the device has not been unplugged or replaced with a different
> device.
> 
> On most systems, hibernation breaks power sessions.  When the system wakes 
> back up it sees a bunch of USB devices connected, but it is not allowed 
> (by the spec!) to assume that these are the same devices as were attached 
> before.  In fact, some of them might not be.
> 
> Mostly this doesn't make any difference, but for mass-storage it does.  
> Memory mappings and filesystem mounts will be disrupted when the
> underlying logical device goes away, even if the same physical device is
> still attached to the same port.  This has caused significant headaches
> for USB users in the past.
> 
> On the other hand, some systems are designed cleverly enough to maintain
> power sessions across hibernation.  Not many -- the only ones I've heard
> about were all PPC Macs.  The USB drivers have always tried to keep power
> sessions intact across hibernation whenever the hardware and firmware
> would permit, but of course reinitializing the USB controller would
> destroy them.

That seems to be one of the really rare cases in which a device driver can
actually make sure that the device is in certain state after the hibernation on
the basis of information provided by the device itself, so it doesn't need to
make any assumptions.  In such cases it might be possible not to reinitialize
the device, but that would have to be handled with much care.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 19:47                                                                         ` Rafael J. Wysocki
@ 2007-05-03 19:59                                                                           ` Alan Stern
  2007-05-03 20:21                                                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-03 19:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thu, 3 May 2007, Rafael J. Wysocki wrote:

> > This suggests that the poweroff methods be completely independent of
> > hibernation_ops (or whatever you are now calling it).  Perhaps there
> > should be a separate sysfs attribute controlling whether or not wakeup is
> > enabled.  If it is then poweroff should go through all the ACPI (or the
> > platform's equivalent) hoops, otherwise everything should just be turned
> > off completely.  Regardless of whether the poweroff is part of a
> > hibernate sequence.
> 
> Well, after reviewing the code once again I see that we already do it this
> way on ACPI systems, since the 'normal' power off is done by entering the
> ACPI S5 state.  Moreover, there shouldn't be any difference between
> ACPI S4 and 'power off' with respect to the wake-up devices, so you're
> absolutely right.
> 
> It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4)
> to finish the hibernation in order to avoid problems like (*) and for this purpose
> we need to use hibernation_ops earlier during the hibernation.

But why shouldn't a "normal" poweroff enter ACPI S4?  And why shouldn't a 
"hibernate" poweroff enter ACPI S5?  The choice of which state to enter is 
independent of the reason for shutting down, right?

In other words, the choice for whether or not to call
acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not 
you're hibernating.  So it shouldn't affect the usage of hibernation_ops 
at all.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 19:59                                                                           ` Alan Stern
@ 2007-05-03 20:21                                                                             ` Rafael J. Wysocki
  2007-05-04 14:40                                                                               ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-03 20:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thursday, 3 May 2007 21:59, Alan Stern wrote:
> On Thu, 3 May 2007, Rafael J. Wysocki wrote:
> 
> > > This suggests that the poweroff methods be completely independent of
> > > hibernation_ops (or whatever you are now calling it).  Perhaps there
> > > should be a separate sysfs attribute controlling whether or not wakeup is
> > > enabled.  If it is then poweroff should go through all the ACPI (or the
> > > platform's equivalent) hoops, otherwise everything should just be turned
> > > off completely.  Regardless of whether the poweroff is part of a
> > > hibernate sequence.
> > 
> > Well, after reviewing the code once again I see that we already do it this
> > way on ACPI systems, since the 'normal' power off is done by entering the
> > ACPI S5 state.  Moreover, there shouldn't be any difference between
> > ACPI S4 and 'power off' with respect to the wake-up devices, so you're
> > absolutely right.
> > 
> > It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4)
> > to finish the hibernation in order to avoid problems like (*) and for this purpose
> > we need to use hibernation_ops earlier during the hibernation.
> 
> But why shouldn't a "normal" poweroff enter ACPI S4?  And why shouldn't a 
> "hibernate" poweroff enter ACPI S5?  The choice of which state to enter is 
> independent of the reason for shutting down, right?

Well, not exactly.

> In other words, the choice for whether or not to call
> acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not 
> you're hibernating.  So it shouldn't affect the usage of hibernation_ops 
> at all.

This works the other way around, I think. :-)

Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4)
as a 'power off method' so that they work correctly after the 'return' from hibernation.
If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might
not work on them (this is an experimental observation; I don't know exactly
what the reason for it is).

Now, since I have such a box, I need to do the
acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off
method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power
off method).  *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used
without previous preparations, which are made with the help of hibernation_ops.

IOW, all hibernation_ops, including the ->enter() method that actually calls
acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one
(complicated) 'platform' power off method.  It doesn't make sense to use the
(other) hibernation_ops without the ->enter() method.
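
[ Editorial sketch, not code from the thread: the point above can be
  illustrated with a tiny C model.  struct hib_ops_sketch, its two fields
  and the trace strings are hypothetical names chosen for illustration;
  they only assume that the 'platform' method is prepare-then-enter-S4
  while the 'shutdown' method is a bare S5 transition.  Snapshot creation
  and image writing, which happen between the two steps, are left out. ]

```c
#include <string.h>

/* Hypothetical stand-in for hibernation_ops; field names are illustrative. */
struct hib_ops_sketch {
	int (*prepare)(void);	/* preparations S4 needs before it can be entered */
	int (*enter)(void);	/* would call acpi_enter_sleep_state(ACPI_STATE_S4) */
};

enum poweroff_method { POWEROFF_SHUTDOWN, POWEROFF_PLATFORM };

static char trace[64];

static void note(const char *step)
{
	/* Append "step;" to the trace; steps are short, so 64 bytes suffice. */
	strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
	strncat(trace, ";", sizeof(trace) - strlen(trace) - 1);
}

static int sketch_prepare(void) { note("prepare"); return 0; }
static int sketch_enter(void)   { note("enter_S4"); return 0; }

static const struct hib_ops_sketch acpi_like_ops = {
	.prepare = sketch_prepare,
	.enter   = sketch_enter,
};

/* Returns the recorded order of steps for the chosen power-off method. */
static const char *power_off_sketch(enum poweroff_method method)
{
	trace[0] = '\0';
	if (method == POWEROFF_PLATFORM) {
		/* ->enter() only makes sense after ->prepare() succeeded. */
		if (acpi_like_ops.prepare() == 0)
			acpi_like_ops.enter();
	} else {
		/* 'shutdown' method: plain S5, no hibernation_ops involved. */
		note("enter_S5");
	}
	return trace;
}
```

Calling power_off_sketch(POWEROFF_PLATFORM) records the prepare step
before the S4 entry, which is the coupling described above; the
POWEROFF_SHUTDOWN path never touches the ops at all.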

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 14:00                                                                   ` Alan Stern
  2007-05-03 17:17                                                                     ` Rafael J. Wysocki
@ 2007-05-03 20:33                                                                     ` David Brownell
  2007-05-03 20:51                                                                       ` Rafael J. Wysocki
  2007-05-04 14:51                                                                       ` Alan Stern
  2007-05-03 22:18                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
  2 siblings, 2 replies; 712+ messages in thread
From: David Brownell @ 2007-05-03 20:33 UTC (permalink / raw)
  To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Thursday 03 May 2007, Alan Stern wrote:

> In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> same as a normal non-hibernate poweroff? 

No.  One of the differences between ACPI S4 (hibernate)
and S5 (poweroff) states is for example how wakeup behaves.
Look for example at /proc/acpi/wakeup and see how many
devices are listed as "can wake from S5" vs from S4 ...
most systems support some S4 events, not so for S5.

Another is that S4 can consume more power.

(Although I believe I noticed a regression there in recent
kernels ... previously I was able to trigger wakeup from
hibernation using the RTC, but not with 2.6.21 patches.)

Non-ACPI systems can make the same natural distinctions.
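
[ Editorial sketch, not part of the original message: the S4-vs-S5
  comparison suggested for /proc/acpi/wakeup can be done mechanically.
  The code assumes a simplified table layout of "NAME  Sn  status" rows,
  and that the Sn column names the deepest sleep state the device can
  wake the system from; the real file layout varies between kernel
  versions, so treat both as assumptions.  count_wake_capable() is a
  hypothetical helper name. ]

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the rows of an in-memory wakeup table whose S-state column is
 * at least min_state.  Rows that do not match "NAME Sn" (such as the
 * header line) are skipped.
 */
static int count_wake_capable(const char *table, int min_state)
{
	const char *line = table;
	int count = 0;

	while (line && *line) {
		char dev[32];
		int state;

		if (sscanf(line, "%31s S%d", dev, &state) == 2 &&
		    state >= min_state)
			count++;

		/* Advance to the start of the next line, if there is one. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return count;
}
```

Feeding it the text of the file and comparing count_wake_capable(text, 4)
with count_wake_capable(text, 5) gives the two numbers being contrasted
above.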


> We are letting ourselves in for problems if we say that when the snapshot
> is restored, devices may or may not need to be reinitialized. 

We have those problems already.  Of course, most of the
time S4/hibernate involves device re-init, while S3/STR
doesn't.


> Drivers 
> might not be able to tell which, so they would have to reinitialize
> regardless, losing any advantage.

For those specific devices.  Of course, not many drivers
are power-aware enough to notice.  Most will re-init.
On PCs the exceptions are USB and, maybe, network drivers.

Drivers for embedded platforms more often leverage the
"retention" states which don't require complete re-init,
since those systems generally don't "hibernate".


> Even worse, the device may _appear_ not 
> to need reinitialization because the firmware (BIOS) has already
> initialized it but left it in a state that's useless for the kernel's
> purposes.  (That's part of the reason why PRETHAW was added.)

That's *ALL* of the reason for PRETHAW.  I asked the
guy who did it.  ;)


> If the only remaining difference between poweroff for hibernate and normal 
> poweroff is which wakeup devices will function, then it seems pointless.

There's the additional power usage involved in enabling additional
wakeup sources, plus the additional system components that are
expected (possibly unreasonably!) to work.


> Why shouldn't the same devices work for wakeup from hibernate and wakeup 
> from normal poweroff?

You're suggesting Linux not use the S5 state, essentially.

So the question is really "why should Linux use S5 (and similar
states on non-ACPI systems), instead of disregarding the ACPI
spec?"

The short answer:  having a "true OFF" state is valuable, if
for no other reason than to cope with buggy "partial-ON" states
like S4.  Also, it's not clear that disregarding ACPI's guidance
here would be a good thing.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 18:33                                                                       ` Alan Stern
  2007-05-03 19:47                                                                         ` Rafael J. Wysocki
@ 2007-05-03 20:33                                                                         ` David Brownell
  1 sibling, 0 replies; 712+ messages in thread
From: David Brownell @ 2007-05-03 20:33 UTC (permalink / raw)
  To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg

On Thursday 03 May 2007, Alan Stern wrote:

> First, as mentioned before this issue exists on only a small number of
> systems.  Second, I have submitted to Greg KH a couple of patches to
> maintain persistence of USB devices even when the power sessions are lost
> (they're still in his queue so you can't try them out yet).  This feature
> violates the USB spec and it is potentially dangerous -- users could
> easily lose data

... which is why I don't like having it as any kind of option.

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 20:33                                                                     ` David Brownell
@ 2007-05-03 20:51                                                                       ` Rafael J. Wysocki
  2007-05-04 14:51                                                                       ` Alan Stern
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-03 20:51 UTC (permalink / raw)
  To: David Brownell
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thursday, 3 May 2007 22:33, David Brownell wrote:
> On Thursday 03 May 2007, Alan Stern wrote:
> 
> > In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> > same as a normal non-hibernate poweroff? 
> 
> No.  One of the differences between ACPI S4 (hibernate)
> and S5 (poweroff) states is for example how wakeup behaves.
> Look for example at /proc/acpi/wakeup and see how many
> devices are listed as "can wake from S5" vs from S4 ...
> most systems support some S4 events, not so for S5.
> 
> Another is that S4 can consume more power.
> 
> (Although I believe I noticed a regression there in recent
> kernels ... previously I was able to trigger wakeup from
> hibernation using the RTC, but not with 2.6.21 patches.)

May I ask you to test a patch (appended)?

Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>

In the platform mode of hibernation swsusp calls (indirectly) the function
acpi_pm_finish() in the nonerror resume-during-hibernation code paths, which
is wrong, because this function effectively aborts the power transition and
disables the wake-up capability of devices.  Fix it.

Remove references to the platform functions from the snapshot restore code path
in kernel/power/user.c, since they should not be there.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---

 kernel/power/disk.c |    1 -
 kernel/power/user.c |   15 +++------------
 2 files changed, 3 insertions(+), 13 deletions(-)

Index: linux-2.6.21/kernel/power/disk.c
===================================================================
--- linux-2.6.21.orig/kernel/power/disk.c	2007-05-03 12:24:05.000000000 +0200
+++ linux-2.6.21/kernel/power/disk.c	2007-05-03 14:42:26.000000000 +0200
@@ -195,7 +195,6 @@ int hibernate(void)
 
 	if (in_suspend) {
 		enable_nonboot_cpus();
-		platform_finish();
 		device_resume();
 		resume_console();
 		pr_debug("PM: writing image.\n");
Index: linux-2.6.21/kernel/power/user.c
===================================================================
--- linux-2.6.21.orig/kernel/power/user.c	2007-05-03 12:22:57.000000000 +0200
+++ linux-2.6.21/kernel/power/user.c	2007-05-03 14:40:49.000000000 +0200
@@ -169,7 +169,7 @@ static inline int snapshot_suspend(int p
 	}
 	enable_nonboot_cpus();
  Resume_devices:
-	if (platform_suspend)
+	if (platform_suspend && (!in_suspend || error))
 		platform_finish();
 
 	device_resume();
@@ -179,17 +179,12 @@ static inline int snapshot_suspend(int p
 	return error;
 }
 
-static inline int snapshot_restore(int platform_suspend)
+static inline int snapshot_restore(void)
 {
 	int error;
 
 	mutex_lock(&pm_mutex);
 	pm_prepare_console();
-	if (platform_suspend) {
-		error = platform_prepare();
-		if (error)
-			goto Finish;
-	}
 	suspend_console();
 	error = device_suspend(PMSG_PRETHAW);
 	if (error)
@@ -201,12 +196,8 @@ static inline int snapshot_restore(int p
 
 	enable_nonboot_cpus();
  Resume_devices:
-	if (platform_suspend)
-		platform_finish();
-
 	device_resume();
 	resume_console();
- Finish:
 	pm_restore_console();
 	mutex_unlock(&pm_mutex);
 	return error;
@@ -272,7 +263,7 @@ static int snapshot_ioctl(struct inode *
 			error = -EPERM;
 			break;
 		}
-		error = snapshot_restore(data->platform_suspend);
+		error = snapshot_restore();
 		break;
 
 	case SNAPSHOT_FREE:
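
[ Editorial sketch, not part of the patch: the new condition in
  snapshot_suspend() can be read as a small predicate.  The helper name
  should_call_platform_finish() is hypothetical, and the reading of
  in_suspend as "nonzero right after the snapshot was created, zero after
  control returned from a restored image" is an assumption based on the
  changelog. ]

```c
#include <stdbool.h>

/*
 * Mirrors the patched test: platform_finish() runs only in platform
 * mode, and then only when control came back from a restored image
 * (in_suspend == false) or when snapshot creation failed (error).
 */
static bool should_call_platform_finish(bool platform_suspend,
					bool in_suspend, bool error)
{
	return platform_suspend && (!in_suspend || error);
}
```

The case the patch changes is platform mode with in_suspend set and no
error: the old test called platform_finish() there and so aborted the
power transition, while the new test skips it.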

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 14:00                                                                   ` Alan Stern
  2007-05-03 17:17                                                                     ` Rafael J. Wysocki
  2007-05-03 20:33                                                                     ` David Brownell
@ 2007-05-03 22:18                                                                     ` Pavel Machek
  2007-05-04 14:57                                                                       ` Alan Stern
  2 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-03 22:18 UTC (permalink / raw)
  To: Alan Stern; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham

Hi!

> > > Well... the powerdown during hibernation... does not have _anything_
> > > to do with snapshot/restore. It is really a very deep sleep; similar
> > > to soft powerdown, but not quite.
> 
> Is this really a good idea?

We have no other choice. ACPI spec says we should use S4.

> For that matter, what are the differences among the various sorts of 
> poweroff?
> 
> 	Which devices remain minimally powered for wakeup purposes?
> 
> 	Anything else?

Blinking moon LED.

Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-02 19:26                                                                   ` Rafael J. Wysocki
  2007-05-03 22:48                                                                     ` Pavel Machek
@ 2007-05-03 22:48                                                                     ` Pavel Machek
  2007-05-03 23:14                                                                       ` Rafael J. Wysocki
                                                                                         ` (3 more replies)
  1 sibling, 4 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-03 22:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Alexey Starikovskiy, Johannes Berg, linux-pm, Pekka Enberg,
	Nigel Cunningham, ACPI Devel Maling List

Hi!

Crazy idea... could we kill hibernate_ops-like struct, and just create
a device for ACPI, using its suspend()/resume()/whatever callbacks to
do the ACPI magic?

> Okay.  Since we're trying to separate the hibernation code from the
> suspend code anyway, we can use the opportunity to introduce some new
> callbacks for the hibernation and/or redefine the existing ones.
> 
> The spec suggests that we need the following callbacks:
> 
> (1) prepare() - called after device_suspend(), execute _PTS and
> disable GPEs

sysdev .suspend() method would do the trick.

> (2) cancel() - called at any time after prepare() if there's an error, execute
>     _WAK and enable run-time GPEs

sysdev .resume() should do the trick. 

> (3) enter() - called after the image has been saved, execute _GTS and do what's
>     currently done in pm_enter()

This one is tricky. It is essentially
powerdown_but_enter_S4_instead. I guess we can live with if()... as we
need to special-case reboot in the same place.

> (4) finish() - called after the image has been restored, do what's currently
>     done in pm_finish()

platform (?) device .resume() method should work.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-03 22:48                                                                     ` Pavel Machek
  2007-05-03 23:14                                                                       ` Rafael J. Wysocki
@ 2007-05-03 23:14                                                                       ` Rafael J. Wysocki
  2007-05-04 10:54                                                                       ` Johannes Berg
  2007-05-04 10:54                                                                       ` Johannes Berg
  3 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-03 23:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alexey Starikovskiy, Johannes Berg, linux-pm, Pekka Enberg,
	Nigel Cunningham, ACPI Devel Maling List

Hi,

On Friday, 4 May 2007 00:48, Pavel Machek wrote:
> Hi!
> 
> Crazy idea... could we kill hibernate_ops-like struct, and just create
> a device for ACPI, using its suspend()/resume()/whatever callbacks to
> do the ACPI magic?

Hmm, I didn't think about that.  It seems to be viable at first sight.

Still, I think we can first separate hibernation_ops from pm_ops, figure out
what they should be and then try to replace them with a cleaner solution.

> > Okay.  Since we're trying to separate the hibernation code from the
> > suspend code anyway, we can use the opportunity to introduce some new
> > callbacks for the hibernation and/or redefine the existing ones.
> > 
> > The spec suggests that we need the following callbacks:

In fact, I should have added

(0) start() - called before device_suspend(), execute _TTS(S4)

and I'm not sure if the GPEs should be disabled here or in prepare().

In principle this could be done as a device's .resume() call, but that would
have to be the very first device registered (can we do that?).

> > (1) prepare() - called after device_suspend(), execute _PTS and
> > disable GPEs
> 
> sysdev .suspend() method would do the trick.

Yes.

> > (2) cancel() - called at any time after prepare() if there's an error, execute
> >     _WAK and enable run-time GPEs
> 
> sysdev .resume() should do the trick.

But .resume() would be called unconditionally, so there should be a way of
figuring out what to do - looks complicated.
 
> > (3) enter() - called after the image has been saved, execute _GTS and do what's
> >     currently done in pm_enter()
> 
> This one is tricky. It is essentially
> powerdown_but_enter_S4_instead. I guess we can live with if()... as we
> need to special-case reboot in the same place.

Yes.

> > (4) finish() - called after the image has been restored, do what's currently
> >     done in pm_finish()
>
> platform (?) device .resume() method should work.

Hmm, perhaps.

And we need one more (in fact this one should be called finish() and the
previous one wake() or something like that):

(5) finish() - called after device_resume(), but only after the image has been
restored or in case of a hibernation error, execute _TTS(S0).  It looks like
this also should enable the GPEs (or the previous one; that's the information
I'm looking for).
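
[ Editorial sketch, not code from the thread: the ordering rules stated
  for callbacks (0) to (3) can be written down as a checker over a
  sequence of events.  The enum values and function names are
  hypothetical; only the four "called before/after" constraints are taken
  from the list above. ]

```c
#include <stddef.h>

enum hib_event {
	EV_START,		/* (0) start(): _TTS(S4) */
	EV_DEVICE_SUSPEND,	/* device_suspend() */
	EV_PREPARE,		/* (1) prepare(): _PTS */
	EV_CANCEL,		/* (2) cancel(): _WAK, error path only */
	EV_IMAGE_SAVED,		/* the image has been written out */
	EV_ENTER,		/* (3) enter(): _GTS, then enter S4 */
};

/* Index of the first occurrence of ev in seq, or -1 if it is absent. */
static int pos(const enum hib_event *seq, size_t n, enum hib_event ev)
{
	for (size_t i = 0; i < n; i++)
		if (seq[i] == ev)
			return (int)i;
	return -1;
}

/* If b occurs at all, a must occur somewhere earlier in the sequence. */
static int before(const enum hib_event *seq, size_t n,
		  enum hib_event a, enum hib_event b)
{
	int pa = pos(seq, n, a);
	int pb = pos(seq, n, b);

	if (pb < 0)
		return 1;
	return pa >= 0 && pa < pb;
}

/* Returns 1 if seq respects the constraints listed for (0) to (3). */
static int order_ok(const enum hib_event *seq, size_t n)
{
	return before(seq, n, EV_START, EV_DEVICE_SUSPEND) &&
	       before(seq, n, EV_DEVICE_SUSPEND, EV_PREPARE) &&
	       before(seq, n, EV_PREPARE, EV_CANCEL) &&
	       before(seq, n, EV_IMAGE_SAVED, EV_ENTER);
}
```

A checker like this says nothing about where the GPEs get disabled or
enabled, which is the open question above; it only captures the relative
order of the callbacks.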

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-03 22:48                                                                     ` Pavel Machek
  2007-05-03 23:14                                                                       ` Rafael J. Wysocki
  2007-05-03 23:14                                                                       ` Rafael J. Wysocki
@ 2007-05-04 10:54                                                                       ` Johannes Berg
  2007-05-04 12:08                                                                         ` Pavel Machek
  2007-05-04 12:08                                                                         ` Pavel Machek
  2007-05-04 10:54                                                                       ` Johannes Berg
  3 siblings, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 10:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Alexey Starikovskiy, linux-pm, Pekka Enberg,
	Nigel Cunningham, ACPI Devel Maling List

On Fri, 2007-05-04 at 00:48 +0200, Pavel Machek wrote:

> Crazy idea... could we kill hibernate_ops-like struct, and just create
> a device for ACPI, using its suspend()/resume()/whatever callbacks to
> do the ACPI magic?

Doesn't that have the ordering problem again? You must ensure that this
sysdev is suspended as the last one, and that's currently impossible if
ACPI is modular.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-04 10:54                                                                       ` Johannes Berg
  2007-05-04 12:08                                                                         ` Pavel Machek
@ 2007-05-04 12:08                                                                         ` Pavel Machek
  2007-05-04 12:29                                                                           ` Rafael J. Wysocki
  2007-05-04 12:29                                                                           ` Rafael J. Wysocki
  1 sibling, 2 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-04 12:08 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Rafael J. Wysocki, Alexey Starikovskiy, linux-pm, Pekka Enberg,
	Nigel Cunningham, ACPI Devel Maling List

Hi!

> > Crazy idea... could we kill hibernate_ops-like struct, and just create
> > a device for ACPI, using its suspend()/resume()/whatever callbacks to
> > do the ACPI magic?
> 
> Doesn't that have the ordering problem again? You must ensure that this
> sysdev is suspended as the last one, and that's currently impossible if
> ACPI is modular.

I do not think acpi has these kinds of ordering requirements... (And I
do not see what it has to do with module or not).



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops)
  2007-05-04 12:08                                                                         ` Pavel Machek
@ 2007-05-04 12:29                                                                           ` Rafael J. Wysocki
  2007-05-04 12:29                                                                           ` Rafael J. Wysocki
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 12:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Berg, Alexey Starikovskiy, linux-pm, Pekka Enberg,
	Nigel Cunningham, ACPI Devel Maling List

Hi,

On Friday, 4 May 2007 14:08, Pavel Machek wrote:
> Hi!
> 
> > > Crazy idea... could we kill hibernate_ops-like struct, and just create
> > > a device for ACPI, using its suspend()/resume()/whatever callbacks to
> > > do the ACPI magic?
> > 
> > Doesn't that have the ordering problem again? You must ensure that this
> > sysdev is suspended as the last one, and that's currently impossible if
> > ACPI is modular.
> 
> I do not think acpi has these kinds of ordering requirements... (And I
> do not see what it has to do with module or not).

Theoretically, ACPI has some ordering requirements.  For example, according to
the spec, the _PTS system-control method should be executed *after* devices are
placed in the appropriate Dx states, which (theoretically) requires us to
execute it after device_suspend() (we don't do this in practice, but I think we
should).

There are some more ordering assumptions like this in the spec and I think
we should at least try to follow them or, if that breaks things, document why
we don't.

That's why I think we should try to do what's needed using hibernation_ops 
(perhaps we'll need to add a couple of callbacks to hibernation_ops) and
then try to replace hibernation_ops with another mechanism allowing us to do
the same.  We first need to determine which operations have to be carried out
at what points so that things don't break.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 20:21                                                                             ` Rafael J. Wysocki
@ 2007-05-04 14:40                                                                               ` Alan Stern
  2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
                                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-04 14:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

Rafael, David, and Pavel:

You all misunderstood the point I was trying to make.

On Thu, 3 May 2007, Rafael J. Wysocki wrote:

> > But why shouldn't a "normal" poweroff enter ACPI S4?  And why shouldn't a 
> > "hibernate" poweroff enter ACPI S5?  The choice of which state to enter is 
> > independent of the reason for shutting down, right?
> 
> Well, not exactly.
> 
> > In other words, the choice for whether or not to call
> > acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not 
> > you're hibernating.  So it shouldn't affect the usage of hibernation_ops 
> > at all.
> 
> This works the other way around, I think. :-)
> 
> Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4)
> as a 'power off method' so that they work correctly after the 'return' from hibernation.
> If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might
> not work on them (this is an experimental observation, I don't know what
> exactly the reason of it is).
> 
> Now, since I have such a box, I need to do the
> acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off
> method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power
> off method).  *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used
> without previous preparations, which are made with the help of hibernation_ops.
> 
> IOW, all hibernation_ops, including the ->enter() method that actually calls
> acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one
> (complicated) 'platform' power off method.  It doesn't make sense to use the
> (other) hibernation_ops without the ->enter() method.

Let's look at the big picture.

Entering hibernation basically involves these steps:

	1. Freeze tasks

	2. Quiesce devices and drivers

	3. Create snapshot

	4. Reactivate devices and drivers

	5. Save snapshot to disk

	6. Prepare devices for wakeup

	7. Power down (ACPI S4 on systems which support it)

Leaving hibernation involves a similar sequence which I won't discuss.

Notice that steps 1-5 above are _completely_ independent of all issues 
concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
carried out for hibernation to work, no matter how the system ends up 
getting shut down.

On the other hand, steps 6 and 7 aren't really needed for hibernation.  
You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
wouldn't work, but the next time the user turned the computer on manually
it would still resume from hibernation.

Conversely, steps 6 and 7 can make sense even in situations where you
don't want to hibernate.  For example, you might want a normal shutdown in
which the operating system does a full restart when the firmware is
signalled by a wakeup device.

So there should be separate data structures associated with 1-5 and 6-7.  
Maybe the one associated with 6-7 is what you are calling hibernation_ops;  
if so then fine.  But I still think that it should be usable for
situations where you are not entering hibernation, and it should be
possible to enter hibernation without using it.  The system administrator
should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
regardless of whether it's to start hibernating.
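
The split described above can be sketched in C as two independent pieces:
one routine for steps 1-5 and a pluggable power-off method for steps 6-7.
This is only an illustration under stated assumptions: every name in it
(struct poweroff_method, save_image(), hibernate(), plain_shutdown(), the
trace helper) is hypothetical and is not the kernel's hibernation_ops or
any existing interface.  The steps are recorded as strings rather than
performed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of this is a kernel interface. */

#define MAX_TRACE 16

/* Records which steps ran, in order, instead of touching hardware. */
struct trace {
    const char *step[MAX_TRACE];
    size_t n;
};

static void record(struct trace *t, const char *name)
{
    assert(t->n < MAX_TRACE);
    t->step[t->n++] = name;
}

/* Steps 6-7: how the machine is finally powered down. */
struct poweroff_method {
    const char *name;
    void (*prepare_wakeup)(struct trace *t);  /* step 6, may be NULL */
    void (*power_down)(struct trace *t);      /* step 7 */
};

static void s4_prepare(struct trace *t) { record(t, "prepare wakeup devices"); }
static void s4_enter(struct trace *t)   { record(t, "enter S4"); }
static void s5_enter(struct trace *t)   { record(t, "enter S5"); }

const struct poweroff_method poweroff_s4 = { "S4", s4_prepare, s4_enter };
const struct poweroff_method poweroff_s5 = { "S5", NULL, s5_enter };

static void do_poweroff(const struct poweroff_method *m, struct trace *t)
{
    if (m->prepare_wakeup)
        m->prepare_wakeup(t);
    m->power_down(t);
}

/* Steps 1-5: identical whichever power-off method is chosen. */
static void save_image(struct trace *t)
{
    record(t, "freeze tasks");
    record(t, "quiesce devices");
    record(t, "create snapshot");
    record(t, "reactivate devices");
    record(t, "save snapshot");
}

/* Hibernation = steps 1-5 followed by any power-off method. */
void hibernate(const struct poweroff_method *m, struct trace *t)
{
    save_image(t);
    do_poweroff(m, t);
}

/* A normal shutdown reuses the same method without steps 1-5. */
void plain_shutdown(const struct poweroff_method *m, struct trace *t)
{
    do_poweroff(m, t);
}
```

With this shape, hibernate(&poweroff_s5, ...) and
plain_shutdown(&poweroff_s4, ...) are both expressible, which is the
independence being argued for; whether real platforms tolerate every
combination is a separate question raised elsewhere in the thread.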

The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy
to check and see), but that doesn't mean we have to use the terms
synonymously.

Does this make sense, or am I missing something very basic?

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 20:33                                                                     ` David Brownell
  2007-05-03 20:51                                                                       ` Rafael J. Wysocki
@ 2007-05-04 14:51                                                                       ` Alan Stern
  2007-05-04 14:56                                                                         ` Johannes Berg
  2007-05-04 22:00                                                                         ` David Brownell
  1 sibling, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-04 14:51 UTC (permalink / raw)
  To: David Brownell
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Thu, 3 May 2007, David Brownell wrote:

> On Thursday 03 May 2007, Alan Stern wrote:
> 
> > In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> > same as a normal non-hibernate poweroff? 
> 
> No.  One of the differences between ACPI S4 (hibernate)
> and S5 (poweroff) states is for example how wakeup behaves.
> Look for example at /proc/acpi/wakeup and see how many
> devices are listed as "can wake from S5" vs from S4 ...
> most systems support some S4 events, not so for S5.
> 
> Another is that S4 can consume more power.

You are describing the difference between ACPI S4 and S5, but I was 
talking about the difference between "normal" poweroff and "hibernate" 
poweroff.  There doesn't seem to be any reason why we must always have

	hibernate = S4    and     normal = S5.

> Non-ACPI systems can make the same natural distinctions.

On such systems there seems to be even less reason for those equalities 
(or rather, their analogs).


> > We are letting ourselves in for problems if we say that when the snapshot
> > is restored, devices may or may not need to be reinitialized. 
> 
> We have those problems already.

Exactly because we are waffling on this issue.  If we settled the matter 
once and for all (devices must ALWAYS be reinitialized after the snapshot 
is restored) then we wouldn't have those problems.  (We might have other 
problems though...)


> > Even worse, the device may _appear_ not 
> > to need reinitialization because the firmware (BIOS) has already
> > initialized it but left it in a state that's useless for the kernel's
> > purposes.  (That's part of the reason why PRETHAW was added.)
> 
> That's *ALL* of the reason for PRETHAW.  I asked the
> guy who did it.  ;)

Well, be fair.  If your resume methods had some way to know whether or not 
a snapshot had just been restored then you wouldn't have needed to add 
PRETHAW.  So another part of the reason is that resume() methods don't
take a pm_message_t argument.


> > Why shouldn't the same devices work for wakeup from hibernate and wakeup 
> > from normal poweroff?
> 
> You're suggesting Linux not use the S5 state, essentially.

No, I'm suggesting that the user should be able to control whether Linux 
uses S4 vs. S5 at poweroff time.  If the user selected always to use S4 
then wakeup devices would function in both hibernation and normal 
shutdown.  If the user selected always to use S5 then wakeup devices would 
not function in either hibernation or normal shutdown.

> So the question is really "why should Linux use S5 (and similar
> states on non-ACPI systems), instead of disregarding the ACPI
> spec?"
> 
> The short answer:  having a "true OFF" state is valuable, if
> for no other reason than to cope with buggy "partial-ON" states
> like S4.  Also, it's not clear that disregarding ACPI's guidance
> here would be a good thing.

Which part of ACPI's so-called guidance are you referring to?

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:51                                                                       ` Alan Stern
@ 2007-05-04 14:56                                                                         ` Johannes Berg
  2007-05-04 20:27                                                                           ` Rafael J. Wysocki
  2007-05-04 22:00                                                                         ` David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 14:56 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


On Fri, 2007-05-04 at 10:51 -0400, Alan Stern wrote:

> Exactly because we are waffling on this issue.  If we settled the matter 
> once and for all (devices must ALWAYS be reinitialized after the snapshot 
> is restored) then we wouldn't have those problems.  (We might have other 
> problems though...)

From what I've understood so far, ACPI is very unhappy on some machines
if you go to S5 after hibernation. I still don't understand why: if the
ACPI code properly re-initialised itself (treat ACPI as a device
and apply your "devices must ALWAYS be reinitialized after the snapshot
is restored"), then this shouldn't be able to happen.

And at that point I agree that the issue becomes completely orthogonal.

(btw, it's always possible right now to go to S5 instead of S4 when
doing hibernation simply by changing /sys/power/disk to "shutdown")
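
As a rough illustration of that switch from a program rather than a shell,
here is a minimal C helper that writes a mode string to a sysfs-style file.
The function name set_hibernate_mode() is made up for this sketch, and the
path is passed in as a parameter (an assumption made so the helper can be
tried on an ordinary file); only the "shutdown" value and the
/sys/power/disk path come from the discussion itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a hibernation power-off mode (e.g. "shutdown") to 'path'.
 * Returns 0 on success, -1 on failure.
 * Hypothetical helper, not part of any existing library.
 */
int set_hibernate_mode(const char *path, const char *mode)
{
    FILE *f;
    int ok;

    assert(path != NULL && mode != NULL);

    if (strlen(mode) == 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fputs(mode, f) >= 0;

    /* Buffered data is written out at fclose(), so check it too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

On a real system the call would be
set_hibernate_mode("/sys/power/disk", "shutdown"), which needs sufficient
privileges to write to that file.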

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-03 22:18                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
@ 2007-05-04 14:57                                                                       ` Alan Stern
  2007-05-04 20:50                                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-04 14:57 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham

On Fri, 4 May 2007, Pavel Machek wrote:

> Hi!
> 
> > > > Well... the powerdown during hibernation... does not have _anything_
> > > > to do with snapshot/restore. It is really a very deep sleep; similar
> > > > to soft powerdown, but not quite.
> > 
> > Is this really a good idea?
> 
> We have no other choice. ACPI spec says we should use S4.

I haven't checked the spec, but I find it hard to believe.  What could 
possibly be wrong with using S5?  It works just fine for normal poweroff, 
with no wakeup devices enabled.  Provided you don't enable the wakeup 
devices during hibernation, why not use S5?

> Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS.

We do normal powerdown whenever someone shuts off his computer without 
hibernating.  I haven't noticed any ACPI BIOS confusion from that...

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:40                                                                               ` Alan Stern
@ 2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
  2007-05-04 20:21                                                                                   ` Johannes Berg
  2007-05-04 20:58                                                                                 ` Pavel Machek
  2007-05-04 21:40                                                                                 ` David Brownell
  2 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 20:20 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Friday, 4 May 2007 16:40, Alan Stern wrote:
> Rafael, David, and Pavel:
> 
> You all misunderstood the point I was trying to make.
> 
> On Thu, 3 May 2007, Rafael J. Wysocki wrote:
> 
> > > But why shouldn't a "normal" poweroff enter ACPI S4?  And why shouldn't a 
> > > "hibernate" poweroff enter ACPI S5?  The choice of which state to enter is 
> > > independent of the reason for shutting down, right?
> > 
> > Well, not exactly.
> > 
> > > In other words, the choice for whether or not to call
> > > acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not 
> > > you're hibernating.  So it shouldn't affect the usage of hibernation_ops 
> > > at all.
> > 
> > This works the other way around, I think. :-)
> > 
> > Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4)
> > as a 'power off method' so that they work correctly after the 'return' from hibernation.
> > If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might
> > not work on them (this is an experimental observation, I don't know what
> > exactly the reason of it is).
> > 
> > Now, since I have such a box, I need to do the
> > acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off
> > method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power
> > off method).  *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used
> > without previous preparations, which are made with the help of hibernation_ops.
> > 
> > IOW, all hibernation_ops, including the ->enter() method that actually calls
> > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one
> > (complicated) 'platform' power off method.  It doesn't make sense to use the
> > (other) hibernation_ops without the ->enter() method.
> 
> Let's look at the big picture.
> 
> Entering hibernation basically involves these steps:
> 
> 	1. Freeze tasks
> 
> 	2. Quiesce devices and drivers
> 
> 	3. Create snapshot
> 
> 	4. Reactivate devices and drivers
> 
> 	5. Save snapshot to disk
> 
> 	6. Prepare devices for wakeup
> 
> 	7. Power down (ACPI S4 on systems which support it)
> 
> Leaving hibernation involves a similar sequence which I won't discuss.
> 
> Notice that steps 1-5 above are _completely_ independent of all issues 
> concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> carried out for hibernation to work, no matter how the system ends up 
> getting shut down.
> 
> On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> wouldn't work, but the next time the user turned the computer on manually
> it would still resume from hibernation.

That's correct, with the exception that the user may find the system not fully
functional after the resume in that case.

> Conversely, steps 6 and 7 can make sense even in situations where you
> don't want to hibernate.  For example, you might want a normal shutdown in
> which the operating system does a full restart when the firmware is
> signalled by a wakeup device.
> 
> So there should be separate data structures associated with 1-5 and 6-7.  
> Maybe the one associated with 6-7 is what you are calling hibernation_ops;  
> if so then fine.  But I still think that it should be usable for
> situations where you are not entering hibernation, and it should be
> possible to enter hibernation without using it.  The system administrator
> should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
> regardless of whether it's to start hibernating.

Yes, this should be doable.
 
> The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy
> to check and see),

Not directly.  The word "hibernation" is never used in the ACPI specification
(as of ACPI 2.0).

> but that doesn't mean we have to use the terms synonymously.

Agreed.

> Does this make sense, or am I missing something very basic?

Hmm, I think it makes sense.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
@ 2007-05-04 20:21                                                                                   ` Johannes Berg
  2007-05-04 20:55                                                                                     ` Pavel Machek
  2007-05-04 21:06                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Rafael J. Wysocki
  0 siblings, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 20:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list


On Fri, 2007-05-04 at 22:20 +0200, Rafael J. Wysocki wrote:

> > On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> > You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> > wouldn't work, but the next time the user turned the computer on manually
> > it would still resume from hibernation.
> 
> That's correct, with the exception that the user may find the system not fully
> functional after the resume in that case.

Why is that anyway? Is it just a matter of the acpi code getting
confused about the acpi bios state? How can the acpi bios possibly be
screwed up after what it must see as a fresh boot? Does the acpi code
poke it in ways it's not supposed to be poked after a fresh boot?

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:56                                                                         ` Johannes Berg
@ 2007-05-04 20:27                                                                           ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 20:27 UTC (permalink / raw)
  To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Friday, 4 May 2007 16:56, Johannes Berg wrote:
> On Fri, 2007-05-04 at 10:51 -0400, Alan Stern wrote:
> 
> > Exactly because we are waffling on this issue.  If we settled the matter 
> > once and for all (devices must ALWAYS be reinitialized after the snapshot 
> > is restored) then we wouldn't have those problems.  (We might have other 
> > problems though...)
> 
> From what I've understood so far, ACPI is very unhappy on some machines
> if you go to S5 after hibernation. I still don't understand why: if the
> ACPI code properly re-initialised itself (treat ACPI as a device
> and apply your "devices must ALWAYS be reinitialized after the snapshot
> is restored"), then this shouldn't be able to happen.

I agree, and that's why I suspect that the ACPI driver's .resume() routines
make some, well, ACPIish assumptions about the resume from hibernation, which
is the source of the problem.  If we separate the hibernation code from the
suspend (s2ram, standby) code completely, this issue will have to be resolved
somehow.

> And at that point I agree that the issue becomes completely orthogonal.
> 
> (btw, it's always possible right now to go to S5 instead of S4 when
> doing hibernation simply by changing /sys/power/disk to "shutdown")

That's correct.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:50                                                                         ` Rafael J. Wysocki
@ 2007-05-04 20:49                                                                           ` Johannes Berg
  2007-05-04 21:11                                                                             ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 20:49 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


On Fri, 2007-05-04 at 22:50 +0200, Rafael J. Wysocki wrote:

> To prevent this from happening, we need a separate set of hibernation callbacks
> in device drivers.

You *can* actually do that now with prethaw and all that afaict. But all
the more argument for splitting up the callbacks as discussed
previously.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:57                                                                       ` Alan Stern
@ 2007-05-04 20:50                                                                         ` Rafael J. Wysocki
  2007-05-04 20:49                                                                           ` Johannes Berg
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 20:50 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Friday, 4 May 2007 16:57, Alan Stern wrote:
> On Fri, 4 May 2007, Pavel Machek wrote:
> 
> > Hi!
> > 
> > > > > Well... the powerdown during hibernation... does not have _anything_
> > > > > to do with snapshot/restore. It is really a very deep sleep; similar
> > > > > to soft powerdown, but not quite.
> > > 
> > > Is this really a good idea?
> > 
> > We have no other choice. ACPI spec says we should use S4.
> 
> I haven't checked the spec, but I find it hard to believe.  What could 
> possibly be wrong with using S5?  It works just fine for normal poweroff, 
> with no wakeup devices enabled.  Provided you don't enable the wakeup 
> devices during hibernation, why not use S5?

I think the problem is the "reinitialize from scratch after the resume" part.

If we're waking up from the hibernation, device drivers should reinitialize
their devices, but if we're waking up from a suspend (eg. s2ram), it would be
wrong to reinitialize, for example, the ACPI subsystem from scratch.  Now,
the problem is that the drivers (including ACPI drivers) cannot tell whether
the resume is from hibernation or from suspend so they try to do something
"generic".  This may lead to having the system not fully functional after the
resume from hibernation if we don't tell the ACPI BIOS that we're hibernating
(by entering the S4 state instead of S5).

> > Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS.
> 
> We do normal powerdown whenever someone shuts off his computer without 
> hibernating.  I haven't noticed any ACPI BIOS confusion from that...

In fact, I think, the BIOS isn't confused, but it may preserve some state
information that the OS can use later on.  By entering S4 we tell the BIOS
to tell the "next" kernel that we've hibernated and to preserve some
configuration information for it.  If this information is not present, our own
ACPI drivers get confused during the resume.

To prevent this from happening, we need a separate set of hibernation callbacks
in device drivers.
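
A minimal sketch of what such a split could look like, with the caveat
that every name here (struct sketch_pm_ops, the restore callback,
wake_device()) is hypothetical and chosen only for illustration; it is not
an existing driver interface.  The point it shows is the fallback: a
driver with only a generic resume callback is handled the same way in
both cases, because nothing tells it which kind of wakeup it is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; these are not the kernel's driver callbacks. */

enum init_action {
    KEEP_STATE,          /* hardware context survived, e.g. s2ram */
    REINIT_FROM_SCRATCH  /* image restored, hardware state unknown */
};

struct sketch_pm_ops {
    enum init_action (*resume)(void);   /* wake from suspend */
    enum init_action (*restore)(void);  /* wake from hibernation, may be NULL */
};

static enum init_action acpi_like_resume(void)  { return KEEP_STATE; }
static enum init_action acpi_like_restore(void) { return REINIT_FROM_SCRATCH; }

/* A driver that provides both callbacks. */
const struct sketch_pm_ops split_ops = { acpi_like_resume, acpi_like_restore };

/* A driver with only the generic callback. */
const struct sketch_pm_ops generic_ops = { acpi_like_resume, NULL };

enum init_action wake_device(const struct sketch_pm_ops *ops, int from_hibernation)
{
    assert(ops != NULL && ops->resume != NULL);

    if (from_hibernation && ops->restore)
        return ops->restore();

    /* No dedicated callback: the driver cannot tell the two cases apart. */
    return ops->resume();
}
```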

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:21                                                                                   ` Johannes Berg
@ 2007-05-04 20:55                                                                                     ` Pavel Machek
  2007-05-04 21:08                                                                                       ` Johannes Berg
  2007-05-04 21:06                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-04 20:55 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list

Hi!

> > > On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> > > You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> > > wouldn't work, but the next time the user turned the computer on manually
> > > it would still resume from hibernation.
> > 
> > That's correct, with the exception that the user may find the system not fully
> > functional after the resume in that case.
> 
> Why is that anyway? Is it just a matter of the acpi code getting
> confused about the acpi bios state? How can the acpi bios possibly be
> screwed up after what it must see as a fresh boot? Does the acpi code
> poke it in ways it's not supposed to be poked after a fresh boot?

No, ACPI BIOS does not see a fresh boot.

ACPI BIOS communicates with hw, too. Suppose it generates a random
number, stores it in memory and tells it to the keyboard controller
during bootup (more specifically during the ACPI enable phase).

Now, it periodically checks if the number in memory is the same as the
number known by the keyboard controller.

If you suspend/resume without telling acpi, it will find out, because
numbers will not match.

(And now, ACPI is probably not crazy enough to store random numbers --
but it could -- but for example "I had AC power, now I do not, and I
did not see an interrupt telling me it went away" can be counted as
confusing for ACPI).
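
The thought experiment can be modelled as a toy in C.  Everything here is
hypothetical (struct machine, the function names, the token values), and
it assumes the firmware's in-memory copy is among the RAM contents that
the restored image brings back, while the keyboard controller keeps
whatever it was told during the most recent boot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the thought experiment; all names are hypothetical. */

struct machine {
    unsigned firmware_token;    /* copy kept in RAM by the firmware */
    unsigned controller_token;  /* copy held by the keyboard controller */
};

/* "ACPI enable": firmware picks a token and hands it to the controller. */
void firmware_enable(struct machine *m, unsigned token)
{
    assert(m != NULL);
    m->firmware_token = token;
    m->controller_token = token;
}

/* The periodic check: do the two copies still agree? */
bool firmware_consistent(const struct machine *m)
{
    assert(m != NULL);
    return m->firmware_token == m->controller_token;
}

/* Restoring a memory image puts back the old RAM contents only. */
void restore_image(struct machine *m, unsigned saved_firmware_token)
{
    assert(m != NULL);
    m->firmware_token = saved_firmware_token;
}
```

In this model, enabling with one token, saving it, powering off, enabling
again with a different token on the next boot and then restoring the saved
value leaves the two copies different, which is the mismatch the example
describes.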
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:40                                                                               ` Alan Stern
  2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
@ 2007-05-04 20:58                                                                                 ` Pavel Machek
  2007-05-04 21:24                                                                                   ` Rafael J. Wysocki
  2007-05-04 21:40                                                                                 ` David Brownell
  2 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-04 20:58 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg

Hi!

> You all misunderstood the point I was trying to make.
> 

> > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one
> > (complicated) 'platform' power off method.  It doesn't make sense to use the
> > (other) hibernation_ops without the ->enter() method.
> 
> Let's look at the big picture.
> 
> Entering hibernation basically involves these steps:
> 
> 	1. Freeze tasks
> 
> 	2. Quiesce devices and drivers
> 
> 	3. Create snapshot
> 
> 	4. Reactivate devices and drivers
> 
> 	5. Save snapshot to disk
> 
> 	6. Prepare devices for wakeup
> 
> 	7. Power down (ACPI S4 on systems which support it)
> 
> Leaving hibernation involves a similar sequence which I won't discuss.
> 
> Notice that steps 1-5 above are _completely_ independent of all issues 
> concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be

No, they are not. You probably should tell ACPI at step 2 that you are
suspending, and you definitely need to tell ACPI that you have resumed
(so it can re-scan AC adapters, for example).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:21                                                                                   ` Johannes Berg
  2007-05-04 20:55                                                                                     ` Pavel Machek
@ 2007-05-04 21:06                                                                                     ` Rafael J. Wysocki
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 21:06 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list

On Friday, 4 May 2007 22:21, Johannes Berg wrote:
> On Fri, 2007-05-04 at 22:20 +0200, Rafael J. Wysocki wrote:
> 
> > > On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> > > You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> > > wouldn't work, but the next time the user turned the computer on manually
> > > it would still resume from hibernation.
> > 
> > That's correct, with the exception that the user may find the system not fully
> > functional after the resume in that case.
> 
> Why is that anyway? Is it just a matter of the acpi code getting
> confused about the acpi bios state?

Yes, I think so.

> How can the acpi bios possibly be screwed up after what it must see as a
> fresh boot? Does the acpi code poke it in ways it's not supposed to be poked
> after a fresh boot?

Sort of.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:55                                                                                     ` Pavel Machek
@ 2007-05-04 21:08                                                                                       ` Johannes Berg
  2007-05-04 21:15                                                                                         ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 21:08 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list


[-- Attachment #1.1: Type: text/plain, Size: 1569 bytes --]

On Fri, 2007-05-04 at 22:55 +0200, Pavel Machek wrote:

> > Why is that anyway? Is it just a matter of the acpi code getting
> > confused about the acpi bios state? How can the acpi bios possibly be
> > screwed up after what it must see as a fresh boot? Does the acpi code
> > poke it in ways it's not supposed to be poked after a fresh boot?
> 
> No, ACPI BIOS does not see a fresh boot.

Sure. It just booted the machine so it must see it as a fresh boot.

> ACPI BIOS communicates with hw, too. Suppose it generates random
> number, stores it in memory and tells it to the keyboard controller
> during bootup (more specifically during ACPI enable phase).
> 
> Now, it periodically checks if number in memory is same as the number
> known by keyboard controller.
> 
> If you suspend/resume without telling acpi, it will find out, because
> numbers will not match.
> 
> (And now, ACPI is probably not crazy enough to store random numbers --
> but it could -- but for example "I had AC power, now I do not, and I
> did not see an interrupt telling me it went away" can be counted as
> confusing for ACPI).

I don't follow.

 * you have AC power.
 * you save system state and shut down (S5)
 * you boot up again on battery power
 * you restore system state
 * ...

vs.

 * you have AC power
 * you shut down
 * you boot up again on battery power
 * ...

where's the difference to the ACPI bios? Oh, I see, it stores it
somewhere in the memory that you've stored/restored? Well, that's your
bug then, don't touch it.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:49                                                                           ` Johannes Berg
@ 2007-05-04 21:11                                                                             ` Rafael J. Wysocki
  2007-05-04 21:23                                                                               ` Johannes Berg
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 21:11 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Friday, 4 May 2007 22:49, Johannes Berg wrote:
> On Fri, 2007-05-04 at 22:50 +0200, Rafael J. Wysocki wrote:
> 
> > To prevent this from happening, we need a separate set of hibernation callbacks
> > in device drivers.
> 
> You *can* actually do that now with prethaw and all that afaict.

Actually, prethaw is to prevent drivers loaded before the image is restored
from doing unreasonable things.  It doesn't have any effect on the drivers'
.resume() routines.

Besides, if the drivers in question are compiled as modules and not loaded
before the image is restored, prethaw doesn't have any effect on them and
on their devices at all. ;-)

> But all the more argument for splitting up the callbacks as discussed
> previously.

Yes.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:08                                                                                       ` Johannes Berg
@ 2007-05-04 21:15                                                                                         ` Pavel Machek
  2007-05-04 21:53                                                                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-04 21:15 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list

Hi!

> > ACPI BIOS communicates with hw, too. Suppose it generates random
> > number, stores it in memory and tells it to the keyboard controller
> > during bootup (more specifically during ACPI enable phase).
> > 
> > Now, it periodically checks if number in memory is same as the number
> > known by keyboard controller.
> > 
> > If you suspend/resume without telling acpi, it will find out, because
> > numbers will not match.
> > 
> > (And now, ACPI is probably not crazy enough to store random numbers --
> > but it could -- but for example "I had AC power, now I do not, and I
> > did not see an interrupt telling me it went away" can be counted as
> > confusing for ACPI).
> 
> I don't follow.
> 
>  * you have AC power.
>  * you save system state and shut down (S5)
>  * you boot up again on battery power
>  * you restore system state
>  * ...
> 
> vs.
> 
>  * you have AC power
>  * you shut down
>  * you boot up again on battery power
>  * ...
> 
> where's the difference to the ACPI bios? Oh, I see, it stores it
> somewhere in the memory that you've stored/restored? Well, that's your
> bug then, don't touch it.

Not sure... yes, it stores parts somewhere in memory. Plus, it may
have some parts related to the communications with operating system
(*)... I guess we need to save those, and parts related to hw
state... where your suggestion makes sense.

(*) and yes, there probably are such parts. If we set backlight to
20%, we'll be confused if it is 100% after resume... we probably could
handle those one-by-one...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:11                                                                             ` Rafael J. Wysocki
@ 2007-05-04 21:23                                                                               ` Johannes Berg
  2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
  2007-05-05 16:15                                                                                 ` Alan Stern
  0 siblings, 2 replies; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 21:23 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


[-- Attachment #1.1: Type: text/plain, Size: 389 bytes --]

On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:

> Actually, prethaw is to prevent drivers loaded before the image is restored
> from doing unreasonable things.  It doesn't have any effect on the drivers'
> .resume() routines.

Oh, but it can, you could have a flag in your driver saying "the next
resume is after restore" and you set that flag in prethaw.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 20:58                                                                                 ` Pavel Machek
@ 2007-05-04 21:24                                                                                   ` Rafael J. Wysocki
  2007-05-05 16:19                                                                                     ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 21:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg

Hi,

On Friday, 4 May 2007 22:58, Pavel Machek wrote:
> Hi!
> 
> > You all misunderstood the point I was trying to make.
> > 
> 
> > > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one
> > > (complicated) 'platform' power off method.  It doesn't make sense to use the
> > > (other) hibernation_ops without the ->enter() method.
> > 
> > Let's look at the big picture.
> > 
> > Entering hibernation basically involves these steps:
> > 
> > 	1. Freeze tasks
> > 
> > 	2. Quiesce devices and drivers
> > 
> > 	3. Create snapshot
> > 
> > 	4. Reactivate devices and drivers
> > 
> > 	5. Save snapshot to disk
> > 
> > 	6. Prepare devices for wakeup
> > 
> > 	7. Power down (ACPI S4 on systems which support it)
> > 
> > Leaving hibernation involves a similar sequence which I won't discuss.
> > 
> > Notice that steps 1-5 above are _completely_ independent of all issues 
> > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to
> > be
> 
> No, they are not. You probably should tell ACPI at step 2 that you are
> suspending,

You can, but even if you don't, the BIOS shouldn't have problems.  What might
have problems is our ACPI code during the resume, if it cannot get appropriate
information from the BIOS.

> and you definitely need to tell ACPI that you have resumed 
> (so it can re-scan AC adapters, for example).

Yes, but that can be done in two different ways:

1) "We have restored the hibernation image, but the BIOS state corresponds to
a fresh reboot, so please initialize everything from scratch."

2) "We have restored the hibernation image and the ACPI S4 was used for
powering off (hint: you may try not to initialize everything from scratch)."

Of course, in case 2) we are responsible for ensuring that the contents of
the hibernation image are consistent with the information preserved by the
BIOS.
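
The two cases can be written down as a small decision function.  This is
only a sketch under assumptions: every name in it (toy_image_info,
toy_choose_acpi_init and the enum values) is made up for illustration and is
not an existing kernel or ACPICA interface.  It shows that the "reuse what
the BIOS preserved" path is taken only when S4 was used for powering off
and the image is known to be consistent with the preserved BIOS state;
everything else falls back to initializing from scratch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is kernel or ACPICA API. */
enum toy_poweroff { TOY_OFF_S5, TOY_OFF_S4 };
enum toy_acpi_init { TOY_INIT_FROM_SCRATCH, TOY_INIT_REUSE_BIOS_STATE };

struct toy_image_info {
	enum toy_poweroff poweroff;	/* how the machine was powered down */
	bool matches_bios_state;	/* image agrees with what the BIOS kept */
};

/* Decide how ACPI should be brought up after the image is restored. */
enum toy_acpi_init toy_choose_acpi_init(const struct toy_image_info *info)
{
	assert(info != NULL);

	/* Case 2): S4 was used and the image is consistent with the BIOS. */
	if (info->poweroff == TOY_OFF_S4 && info->matches_bios_state)
		return TOY_INIT_REUSE_BIOS_STATE;

	/* Case 1): treat the BIOS state as that of a fresh boot. */
	return TOY_INIT_FROM_SCRATCH;
}
```

Falling back to case 1) whenever consistency cannot be guaranteed is a
deliberately conservative choice in this sketch.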

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:40                                                                               ` Alan Stern
  2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
  2007-05-04 20:58                                                                                 ` Pavel Machek
@ 2007-05-04 21:40                                                                                 ` David Brownell
  2007-05-04 22:19                                                                                   ` Rafael J. Wysocki
  2007-05-05 16:08                                                                                   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern
  2 siblings, 2 replies; 712+ messages in thread
From: David Brownell @ 2007-05-04 21:40 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Friday 04 May 2007, Alan Stern wrote:
> Rafael, David, and Pavel:
> 
> You all misunderstood the point I was trying to make.
> 
> ...
> 
> Let's look at the big picture.
> 
> Entering hibernation basically involves these steps:
> 
> 	1. Freeze tasks
> 	2. Quiesce devices and drivers
> 	3. Create snapshot
> 	4. Reactivate devices and drivers
> 	5. Save snapshot to disk
> 	6. Prepare devices for wakeup
> 	7. Power down (ACPI S4 on systems which support it)
> 
> Leaving hibernation involves a similar sequence which I won't discuss.
> 
> Notice that steps 1-5 above are _completely_ independent of all issues 
> concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> carried out for hibernation to work, no matter how the system ends up 
> getting shut down.

Not exactly.  Step 2 is supposed to be aware of the target state's
capabilities, including what's wakeup-capable.  ACPI uses target
device states to choose which _SxD methods to execute, etc.  (Or it
should ... though come to think of it, I don't think I ever saw a
hook whereby PCI could trigger that.)
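
To make the disagreement concrete, the seven quoted steps can be modelled as
two phases, with the target state handed to step 2.  This is a toy model
under assumptions, not kernel code: toy_trace, toy_hibernate and friends are
hypothetical names, and skipping step 6 for S5 simply follows the "shut the
system off completely" description in the quoted text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is kernel API. */
enum toy_target { TOY_TARGET_S4, TOY_TARGET_S5 };

struct toy_trace {
	int steps[8];			/* step numbers in the order they ran */
	size_t count;
	enum toy_target quiesce_target;	/* what step 2 was told */
};

void toy_record(struct toy_trace *t, int step)
{
	assert(t != NULL);
	if (t->count < sizeof(t->steps) / sizeof(t->steps[0]))
		t->steps[t->count++] = step;
}

/* Steps 1-5: carried out for every hibernation. */
void toy_snapshot_phase(struct toy_trace *t, enum toy_target target)
{
	toy_record(t, 1);		/* freeze tasks */
	t->quiesce_target = target;	/* step 2 is told the target state */
	toy_record(t, 2);		/* quiesce devices and drivers */
	toy_record(t, 3);		/* create snapshot */
	toy_record(t, 4);		/* reactivate devices and drivers */
	toy_record(t, 5);		/* save snapshot to disk */
}

/* Steps 6-7: how the machine ends up powered down. */
void toy_poweroff_phase(struct toy_trace *t, enum toy_target target)
{
	if (target == TOY_TARGET_S4)
		toy_record(t, 6);	/* prepare devices for wakeup */
	toy_record(t, 7);		/* power down */
}

void toy_hibernate(struct toy_trace *t, enum toy_target target)
{
	assert(t != NULL);
	t->count = 0;
	toy_snapshot_phase(t, target);
	toy_poweroff_phase(t, target);
}
```

Passing the target into the first phase is the point of contention: with
that parameter the two phases are no longer completely independent.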


> On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> wouldn't work, but the next time the user turned the computer on manually
> it would still resume from hibernation.

I believe I did comment on your point that step 7 could use S5.

However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C)
that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state.  (It then
fuzzes the issue in 2.4, but those bits are less relevant here.)
2.2 also mentions G3 = "Mechanical OFF", which is the only state
in which machine disassembly/reassembly is expected to be safe.

ACPI is allowed to distinguish between S4 and S5 in more ways
than just the power usage.  It'd be fair for the AML to store
state in something that retains power, and rely on that.  It'd
be better not to do things that are allowed to confuse ACPI.


> Conversely, steps 6 and 7 can make sense even in situations where you
> don't want to hibernate.  For example, you might want a normal shutdown in
> which the operating system does a full restart when the firmware is
> signalled by a wakeup device.

Wakeup devices in S4 are expected to be a superset of those in S5,
and system documentation often covers that.  Yeah, I know, "who
bothers to RTFM".  Still, the point is that these systems are now
documented to work in a particular way, and there really ought to
be a good reason to invalidate user training and documentation.

 
> So there should be separate data structures associated with 1-5 and 6-7.  
> Maybe the one associated with 6-7 is what you are calling hibernation_ops;  
> if so then fine.  But I still think that it should be usable for
> situations where you are not entering hibernation, and it should be 
> possible to enter hibernation without using it.  The system administrator 
> should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
> regardless of whether it's to start hibernating.

But ... why?  What value would users see from that?

We do have /sys/power/disk today, but that's only for
hibernation.  (And it's a bit confusing, too.)

A "Soft OFF" should be S5 to conform to specs and
documentation.


> The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy
> to check and see), but that doesn't mean we have to use the terms
> synonymously.

It talks about S4 as a "sleeping" state, like S1, S2, and S3,
or as a "Non-Volatile Sleep" state.

I think it also assumes more intelligence on resume-from-S4
than Linux has just now, which may partly explain why it
takes so long for swsusp to finish its thing.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:53                                                                                           ` Rafael J. Wysocki
@ 2007-05-04 21:53                                                                                             ` Johannes Berg
  2007-05-04 22:25                                                                                               ` Rafael J. Wysocki
  2007-05-05 15:52                                                                                             ` Alan Stern
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 21:53 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list


[-- Attachment #1.1: Type: text/plain, Size: 646 bytes --]

On Fri, 2007-05-04 at 23:53 +0200, Rafael J. Wysocki wrote:
> > Plus, it may have some parts related to the communications with operating
> > system (*)... I guess we need to save those, and parts related to hw
> > state... where your suggestion makes sense.
> 
> If they are accessible to us, then we can, but what if they aren't (eg. the
> state information is stored in the embedded controller, can only be read with
> the help of some AML invocations and cannot be changed from the OS level)?

Well, in that case you also haven't overwritten/changed them during
restore so there's no room for mismatches and confusion.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:15                                                                                         ` Pavel Machek
@ 2007-05-04 21:53                                                                                           ` Rafael J. Wysocki
  2007-05-04 21:53                                                                                             ` Johannes Berg
  2007-05-05 15:52                                                                                             ` Alan Stern
  0 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 21:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nigel Cunningham, Pekka Enberg, Johannes Berg, Linux-pm mailing list

On Friday, 4 May 2007 23:15, Pavel Machek wrote:
> Hi!
> 
> > > ACPI BIOS communicates with hw, too. Suppose it generates random
> > > number, stores it in memory and tells it to the keyboard controller
> > > during bootup (more specifically during ACPI enable phase).
> > > 
> > > Now, it periodically checks if number in memory is same as the number
> > > known by keyboard controller.
> > > 
> > > If you suspend/resume without telling acpi, it will find out, because
> > > numbers will not match.
> > > 
> > > (And now, ACPI is probably not crazy enough to store random numbers --
> > > but it could -- but for example "I had AC power, now I do not, and I
> > > did not see an interrupt telling me it went away" can be counted as
> > > confusing for ACPI).
> > 
> > I don't follow.
> > 
> >  * you have AC power.
> >  * you save system state and shut down (S5)
> >  * you boot up again on battery power
> >  * you restore system state
> >  * ...
> > 
> > vs.
> > 
> >  * you have AC power
> >  * you shut down
> >  * you boot up again on battery power
> >  * ...
> > 
> > where's the difference to the ACPI bios? Oh, I see, it stores it
> > somewhere in the memory that you've stored/restored? Well, that's your
> > bug then, don't touch it.
> 
> Not sure... yes, it stores parts somewhere in memory.

These are reserved regions.  On the majority of systems we handle them
correctly.

> Plus, it may have some parts related to the communications with operating
> system (*)... I guess we need to save those, and parts related to hw
> state... where your suggestion makes sense.

If they are accessible to us, then we can, but what if they aren't (eg. the
state information is stored in the embedded controller, can only be read with
the help of some AML invocations and cannot be changed from the OS level)?

> (*) and yes, there probably are such parts. If we set backlight to
> 20%, we'll be confused if it is 100% after resume... we probably could
> handle those one-by-one...

*If* we reinitialize devices *and* ACPI from scratch after restoring the image,
we'll discard the old value (20%) and read the new value (100%) from the BIOS.
The problems occur, IMO, because we try to be smart and use the BIOS
after the resume as though we'd resumed from a real suspend (eg. s2ram).

Which is natural, if we use the same set of .resume() callbacks for both cases.
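
The backlight case can be sketched with two separate callbacks.  This is a
toy model under assumptions: toy_backlight, toy_bl_resume and toy_bl_restore
are hypothetical names, and the model only captures the "trust the cached
value" versus "re-read everything" distinction described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is kernel API. */
struct toy_backlight {
	int hw_percent;		/* what the hardware/BIOS currently has */
	int cached_percent;	/* what the driver believes */
};

/* After a real suspend (e.g. s2ram): keep trusting the cached value. */
void toy_bl_resume(struct toy_backlight *bl)
{
	assert(bl != NULL);
	/* Nothing to do: the cache is assumed to still describe the hardware. */
}

/* After restoring a hibernation image: discard the cache, re-read. */
void toy_bl_restore(struct toy_backlight *bl)
{
	assert(bl != NULL);
	bl->cached_percent = bl->hw_percent;
}

bool toy_bl_consistent(const struct toy_backlight *bl)
{
	assert(bl != NULL);
	return bl->cached_percent == bl->hw_percent;
}
```

With the image saying 20% and the BIOS having set 100% during boot, running
the suspend-style callback leaves the driver believing 20%, while the
restore-style callback picks up 100%.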

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
@ 2007-05-04 21:54                                                                                   ` Johannes Berg
  2007-05-04 22:21                                                                                     ` Rafael J. Wysocki
  2007-05-04 22:12                                                                                   ` David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Johannes Berg @ 2007-05-04 21:54 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


[-- Attachment #1.1: Type: text/plain, Size: 755 bytes --]

On Fri, 2007-05-04 at 23:55 +0200, Rafael J. Wysocki wrote:
> On Friday, 4 May 2007 23:23, Johannes Berg wrote:
> > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> > 
> > > Actually, prethaw is to prevent drivers loaded before the image is restored
> > > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > > .resume() routines.
> > 
> > Oh, but it can, you could have a flag in your driver saying "the next
> > resume is after restore" and you set that flag in prethaw.
> 
> No, you should have set that flag in .suspend(), really. :-)

Yeah, whatever. You can fix the problem but it's ugly. Let's come up
with a good way to do the 6 callbacks mentioned in some other thread
earlier.

johannes

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:23                                                                               ` Johannes Berg
@ 2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
  2007-05-04 21:54                                                                                   ` Johannes Berg
  2007-05-04 22:12                                                                                   ` David Brownell
  2007-05-05 16:15                                                                                 ` Alan Stern
  1 sibling, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 21:55 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Friday, 4 May 2007 23:23, Johannes Berg wrote:
> On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> 
> > Actually, prethaw is to prevent drivers loaded before the image is restored
> > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > .resume() routines.
> 
> Oh, but it can, you could have a flag in your driver saying "the next
> resume is after restore" and you set that flag in prethaw.

No, you should have set that flag in .suspend(), really. :-)

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 14:51                                                                       ` Alan Stern
  2007-05-04 14:56                                                                         ` Johannes Berg
@ 2007-05-04 22:00                                                                         ` David Brownell
  2007-05-05 15:49                                                                           ` Alan Stern
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-04 22:00 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Friday 04 May 2007, Alan Stern wrote:
> On Thu, 3 May 2007, David Brownell wrote:
> 
> > On Thursday 03 May 2007, Alan Stern wrote:
> > 
> > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the 
> > > same as a normal non-hibernate poweroff? 
> > 
> > No.  One of the differences between ACPI S4 (hibernate)
> > and S5 (poweroff) states is for example how wakeup behaves.
> > Look for example at /proc/acpi/wakeup and see how many
> > devices are listed as "can wake from S5" vs from S4 ...
> > most systems support some S4 events, not so for S5.
> > 
> > Another is that S4 can consume more power.
> 
> You are describing the difference between ACPI S4 and S5, but I was 
> talking about the difference between "normal" poweroff and "hibernate" 
> poweroff.  There doesn't seem to be any reason why we must always have
> 
> 	hibernate = S4    and     normal = S5.

What the ACPI spec describes for the "Non-Volatile Sleep" is
that either S4 or S5 could match "hibernate" ... but for
a software-controlled "poweroff", only S5 is appropriate.

That's a reason.  Another:  pretty much all end-user docs
on this stuff match what ACPI says.

Lacking compelling reasons to violate specs (like them
being clearly broken), I avoid breaking them.


> > Non-ACPI systems can make the same natural distinctions.
> 
> On such systems there seems to be even less reason for those equalities 
> (or rather, their analogs).

This is one of those "less is more" things, right?  :)

People doing embedded designs _like_ their flexibility.

It's common to have multiple power levels.  If you mean
that they _could_ give up that flexibility and only use
one of those state analogues, yes they could ... but if
you mean they'd see that as a Good Thing, I doubt it.
 

> 
> > > We are letting ourselves in for problems if we say that when the snapshot
> > > is restored, devices may or may not need to be reinitialized. 
> > 
> > We have those problems already.
> 
> Exactly because we are waffling on this issue.  If we settled the matter 
> once and for all (devices must ALWAYS be reinitialized after the snapshot 
> is restored) then we wouldn't have those problems.  (We might have other 
> problems though...)

We *WOULD* have problems.

I guess I don't see why you want to throw away all the
work the hardware (and/or software) designers did to
ensure that some key devices use a "retention" mode
in their S4-analogue state.

Me, I always thought that leveraging those retention
states was a great way to shrink wakeup times and get
more functionality.


> > > Even worse, the device may _appear_ not 
> > > to need reinitialization because the firmware (BIOS) has already
> > > initialized it but left it in a state that's useless for the kernel's
> > > purposes.  (That's part of the reason why PRETHAW was added.)
> > 
> > That's *ALL* of the reason for PRETHAW.  I asked the
> > guy who did it.  ;)
> 
> Well, be fair.  If your resume methods had some way to know whether or not 
> a snapshot had just been restored then you wouldn't have needed to add 
> PRETHAW.  So another part of the reason is that resume() methods don't 
> take a pm_message_t argument.

Well, to be fair he says he didn't even consider such an
intrusive change.  The entire *reason* was to address that
particular issue.  Implementation tradeoffs are separate.


> > > Why shouldn't the same devices work for wakeup from hibernate and wakeup 
> > > from normal poweroff?
> > 
> > You're suggesting Linux not use the S5 state, essentially.
> 
> No, I'm suggesting that the user should be able to control whether Linux 
> uses S4 vs. S5 at poweroff time.  If the user selected always to use S4 
> then wakeup devices would function in both hibernation and normal 
> shutdown.  If the user selected always to use S5 then wakeup devices would 
> not function in either hibernation or normal shutdown.

That's a different suggestion, yes.  I'm not sure I see any
benefit of that flexibility for "soft off" states though,
especially if it made "off" consume more power.

 
> > So the question is really "why should Linux use S5 (and similar
> > states on non-ACPI systems), instead of disregarding the ACPI
> > spec?"
> > 
> > The short answer:  having a "true OFF" state is valuable, if
> > for no other reason than to cope with buggy "partial-ON" states
> > like S4.  Also, it's not clear that disregarding ACPI's guidance
> > here would be a good thing.
> 
> Which part of ACPI's so-called guidance are you referring to?

Section 2.2 of the spec I looked at, which defines how non-volatile
sleep relates to S4 and S5 states, and to the G3 "Mechanical OFF"
which could also be entered from either of those by flick'o'switch.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
  2007-05-04 21:54                                                                                   ` Johannes Berg
@ 2007-05-04 22:12                                                                                   ` David Brownell
  2007-05-04 22:31                                                                                     ` Rafael J. Wysocki
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-04 22:12 UTC (permalink / raw)
  To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Friday 04 May 2007, Rafael J. Wysocki wrote:
> On Friday, 4 May 2007 23:23, Johannes Berg wrote:
> > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> > 
> > > Actually, prethaw is to prevent drivers loaded before the image is restored
> > > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > > .resume() routines.
> > 
> > Oh, but it can, you could have a flag in your driver saying "the next
> > resume is after restore" and you set that flag in prethaw.
> 
> No, you should have set that flag in .suspend(), really. :-)

That doesn't work very well.  Not only does suspend() not
know the target state, but you don't want to trash the
controller state if you're getting resumed after some kind
of fault in the suspend-to-disk path...
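
A small model of the failure case, assuming a driver that sets a "reinit on
next resume" flag in its suspend routine.  The names (toy_ctrl,
toy_ctrl_suspend, toy_ctrl_resume) are hypothetical and nothing here is an
existing driver or kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is kernel API. */
struct toy_ctrl {
	bool reinit_on_resume;	/* flag set in suspend() */
	int hw_state;		/* nonzero: controller still holds live state */
	int resets;		/* how often resume() reset the controller */
};

void toy_ctrl_suspend(struct toy_ctrl *c)
{
	assert(c != NULL);
	/* suspend() cannot tell whether the hibernation will complete. */
	c->reinit_on_resume = true;
}

void toy_ctrl_resume(struct toy_ctrl *c)
{
	assert(c != NULL);
	if (c->reinit_on_resume) {
		c->hw_state = 0;	/* the reset throws away controller state */
		c->resets++;
		c->reinit_on_resume = false;
	}
}
```

If the suspend-to-disk path fails after toy_ctrl_suspend() and the driver is
simply resumed, the controller never lost power, yet toy_ctrl_resume() still
resets it and discards hw_state.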

I'm hoping that explains the smiley!

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:40                                                                                 ` David Brownell
@ 2007-05-04 22:19                                                                                   ` Rafael J. Wysocki
  2007-05-07  1:05                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell
  2007-05-05 16:08                                                                                   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 22:19 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Friday, 4 May 2007 23:40, David Brownell wrote:
> On Friday 04 May 2007, Alan Stern wrote:
> > Rafael, David, and Pavel:
> > 
> > You all misunderstood the point I was trying to make.
> > 
> > ...
> > 
> > Let's look at the big picture.
> > 
> > Entering hibernation basically involves these steps:
> > 
> > 	1. Freeze tasks
> > 	2. Quiesce devices and drivers
> > 	3. Create snapshot
> > 	4. Reactivate devices and drivers
> > 	5. Save snapshot to disk
> > 	6. Prepare devices for wakeup
> > 	7. Power down (ACPI S4 on systems which support it)
> > 
> > Leaving hibernation involves a similar sequence which I won't discuss.
> > 
> > Notice that steps 1-5 above are _completely_ independent of all issues 
> > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> > carried out for hibernation to work, no matter how the system ends up 
> > getting shut down.
> 
> Not exactly.  Step 2 is supposed to be aware of the target state's
> capabilities, including what's wakeup-capable.  ACPI uses target
> device states to choose which _SxD methods to execute, etc.  (Or it
> should ... though come to think of it, I don't think I ever saw a
> hook whereby PCI could trigger that.)

Still, step 4 effectively undoes at least some things we did in 2.  At least
the GPEs should be enabled for normal operation so that we can save the image.

> > On the other hand, steps 6 and 7 aren't really needed for hibernation.  
> > You _could_ shut the system off completely (ACPI S5).  Automatic wakeup
> > wouldn't work, but the next time the user turned the computer on manually
> > it would still resume from hibernation.
> 
> I believe I did comment on your point that step 7 could use S5.
> 
> However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C)
> that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state.  (Then
> fuzzes the issue in 2.4, but those bits are less relevant here;
> 2.2 also mentions G3 = "Mechanical OFF", which is the only state
> in which machine disassembly/reassembly is expected to be safe.)

But then there's the nice picture in 9.3.3 (OS loading) that shows how OSPM
(that would be us) can verify that the hardware configuration hasn't changed.

In fact we don't do this, because we always go to the "Load OS Images" block
and load the hibernation image from this newly loaded OS (aka the boot kernel).

Thus our resume is always different from the "ACPI wake up from S4".

> ACPI is allowed to distinguish between S4 and S5 in more ways
> than just the power usage.  It'd be fair for the AML to store
> state in something that retains power, and rely on that.  It'd
> be better not to do things that are allowed to confuse ACPI.

As far as I understand the specification, OSPM (i.e. we) can always ignore
the fact that the system has entered S4 and reinitialize everything from
scratch.

> > Conversely, steps 6 and 7 can make sense even in situations where you
> > don't want to hibernate.  For example, you might want a normal shutdown in
> > which the operating system does a full restart when the firmware is
> > signalled by a wakeup device.
> 
> Wakeup devices in S4 are expected to be a superset of those in S5,
> and system documentation often covers that.  Yeah, I know, "who
> bothers to RTFM".  Still, the point is that these systems are now
> documented to work in a particular way, and there really ought to
> be a good reason to invalidate user training and documentation.

That's a very important point, IMO.

> > So there should be separate data structures associated with 1-5 and 6-7.  
> > Maybe the one associated with 6-7 is what you are calling hibernation_ops;  
> > if so then fine.  But I still think that it should be usable for
> > situations where you are not entering hibernation, and it should be 
> > possible to enter hibernation without using it.  The system administrator 
> > should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
> > regardless of whether it's to start hibernating.
> 
> But ... why?  What value would users see from that?
> 
> We do have /sys/power/disk today, but that's only for
> hibernation.  (And it's a bit confusing, too.)
> 
> A "Soft OFF" should be S5 to conform to specs and
> documentation.
> 
> 
> > The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy
> > to check and see), but that doesn't mean we have to use the terms
> > synonymously.
> 
> It talks about S4 as a "sleeping" state, like S1, S2, and S3,
> or about S4 as a "Non-Volatile Sleep" state.
> 
> I think it also assumes more intelligence on resume-from-S4
> than Linux has just now, which may partly explain why it
> takes so long for swsusp to finish its thing.

Well, please look at the picture in 9.3.3 and compare it to what we're
doing. ;-)

Greetings,
Rafael

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:54                                                                                   ` Johannes Berg
@ 2007-05-04 22:21                                                                                     ` Rafael J. Wysocki
  2007-05-05 15:37                                                                                       ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 22:21 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Friday, 4 May 2007 23:54, Johannes Berg wrote:
> On Fri, 2007-05-04 at 23:55 +0200, Rafael J. Wysocki wrote:
> > On Friday, 4 May 2007 23:23, Johannes Berg wrote:
> > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> > > 
> > > > Actually, prethaw is to prevent drivers loaded before the image is restored
> > > > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > > > .resume() routines.
> > > 
> > > Oh, but it can, you could have a flag in your driver saying "the next
> > > resume is after restore" and you set that flag in prethaw.
> > 
> > No, you should have set that flag in .suspend(), really. :-)
> 
> Yeah, whatever. You can fix the problem but it's ugly. Let's come up
> with a good way to do the 6 callbacks mentioned in some other thread
> earlier.

This is the plan, but we need to do some preparations.

For example, I think we should introduce some consistent terminology, so that
we *always* know what we're talking about.

Greetings,
Rafael

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:53                                                                                             ` Johannes Berg
@ 2007-05-04 22:25                                                                                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 22:25 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list

On Friday, 4 May 2007 23:53, Johannes Berg wrote:
> On Fri, 2007-05-04 at 23:53 +0200, Rafael J. Wysocki wrote:
> > > Plus, it may have some parts related to the communications with operating
> > > system (*)... I guess we need to save those, and parts related to hw
> > > state... where your suggestion makes sense.
> > 
> > If they are accessible to us, then we can, but what if they aren't (eg. the
> > state information is stored in the embedded controller, can only be read with
> > the help of some AML invocations and cannot be changed from the OS level)?
> 
> Well, in that case you also haven't overwritten/changed them during
> restore so there's no room for mismatches and confusion.

Not if we go for S5 to finish the hibernation and then try to be smart and
rely on the BIOS-provided information/functionality *as though* we had passed
through S4.

Greetings,
Rafael

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 22:12                                                                                   ` David Brownell
@ 2007-05-04 22:31                                                                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-04 22:31 UTC (permalink / raw)
  To: David Brownell
  Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham

On Saturday, 5 May 2007 00:12, David Brownell wrote:
> On Friday 04 May 2007, Rafael J. Wysocki wrote:
> > On Friday, 4 May 2007 23:23, Johannes Berg wrote:
> > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> > > 
> > > > Actually, prethaw is to prevent drivers loaded before the image is restored
> > > > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > > > .resume() routines.
> > > 
> > > Oh, but it can, you could have a flag in your driver saying "the next
> > > resume is after restore" and you set that flag in prethaw.
> > 
> > No, you should have set that flag in .suspend(), really. :-)
> 
> That doesn't work very well.  Not only does suspend() not
> know the target state, but you don't want to trash the
> controller state if you're getting resumed after some kind
> of fault in the suspend-to-disk path...
> 
> I'm hoping that explains the smiley!

Yes, among other things (like the fact that passing anything from prethaw to
.resume() really doesn't work unless the data are stored in a device ;-)).

Greetings,
Rafael

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 22:21                                                                                     ` Rafael J. Wysocki
@ 2007-05-05 15:37                                                                                       ` Alan Stern
  2007-05-05 18:49                                                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 15:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Sat, 5 May 2007, Rafael J. Wysocki wrote:

> > Yeah, whatever. You can fix the problem but it's ugly. Let's come up
> > with a good way to do the 6 callbacks mentioned in some other thread
> > earlier.
> 
> This is the plan, but we need to do some preparations.
> 
> For example, I think, we should introduce some consistent terminology, so that
> we *always* know what we're talking about.

A proposal:

For suspend-to-RAM we already have suspend() and resume().  At the 
possible cost of introducing some confusion, I think it makes sense to 
keep those method names.

For hibernation we need these:

	pre_snapshot()
	post_snapshot()
	pre_restore()
	post_restore()

In addition we may want to have early/late variations on these (for use 
after interrupts have been disabled), which would lead to:

	pre_snapshot()
	pre_snapshot_late()
	post_snapshot_early()
	post_snapshot()
	pre_restore()
	pre_restore_late()
	post_restore_early()
	post_restore()

Yes, it's a large list...  But it seems to be necessary for providing all 
the information drivers will need.
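
As a rough illustration only, the eight callbacks could be collected in one
structure.  Everything below is hypothetical: struct hibernate_callbacks,
call_cb() and snapshot_device() are names made up for this sketch, and the
choice to undo a failed late step by calling post_snapshot() is an assumption,
not something that has been agreed on.

```c
#include <errno.h>
#include <stddef.h>

struct device;	/* opaque in this sketch */

/* Hypothetical callback set, using the method names from the list above. */
struct hibernate_callbacks {
	int (*pre_snapshot)(struct device *dev);
	int (*pre_snapshot_late)(struct device *dev);	/* interrupts disabled */
	int (*post_snapshot_early)(struct device *dev);	/* interrupts disabled */
	int (*post_snapshot)(struct device *dev);
	int (*pre_restore)(struct device *dev);
	int (*pre_restore_late)(struct device *dev);	/* interrupts disabled */
	int (*post_restore_early)(struct device *dev);	/* interrupts disabled */
	int (*post_restore)(struct device *dev);
};

/* A driver may leave a callback unset; that counts as success. */
static int call_cb(int (*cb)(struct device *), struct device *dev)
{
	return cb ? cb(dev) : 0;
}

/* Snapshot side for one device: quiesce, take the snapshot, reactivate. */
static int snapshot_device(const struct hibernate_callbacks *cb,
			   struct device *dev)
{
	int err;

	err = call_cb(cb->pre_snapshot, dev);
	if (err)
		return err;

	err = call_cb(cb->pre_snapshot_late, dev);
	if (err) {
		/* Assumption: undo the first step before giving up. */
		call_cb(cb->post_snapshot, dev);
		return err;
	}

	/* ... the memory snapshot would be created here ... */

	call_cb(cb->post_snapshot_early, dev);
	call_cb(cb->post_snapshot, dev);
	return 0;
}

/* Example callback for a device that refuses to quiesce. */
static int demo_busy(struct device *dev)
{
	(void)dev;
	return -EBUSY;
}
```

The restore side would mirror this with the four *_restore methods.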

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 22:00                                                                         ` David Brownell
@ 2007-05-05 15:49                                                                           ` Alan Stern
  2007-05-07  1:10                                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 15:49 UTC (permalink / raw)
  To: David Brownell
  Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek,
	Nigel Cunningham

On Fri, 4 May 2007, David Brownell wrote:

> > You are describing the difference between ACPI S4 and S5, but I was 
> > talking about the difference between "normal" poweroff and "hibernate" 
> > poweroff.  There doesn't seem to be any reason why we must always have
> > 
> > 	hibernate = S4    and     normal = S5.
> 
> What the ACPI spec describes for the "Non-Volatile Sleep" is
> that either S4 or S5 could match "hibernate" ... but for
> a software-controlled "poweroff", only S5 is appropriate.
> 
> That's a reason.  Another:  pretty much all end-user docs
> on this stuff match what ACPI says.
> 
> Lacking compelling reasons to violate specs (like them
> being clearly broken), I avoid breaking them.

Again you misunderstand.  I concede that either S4 or S5 is appropriate
for "Non-Volatile Sleep" whereas only S5 is appropriate for
software-controlled "poweroff".

But who says that hibernate has to use "Non-Volatile Sleep" and normal 
shutdown has to use software-controlled "poweroff"?  Why shouldn't the 
user be able to do it the other way 'round?


> > > Non-ACPI systems can make the same natural distinctions.
> > 
> > On such systems there seems to be even less reason for those equalities 
> > (or rather, their analogs).
> 
> This is one of those "less is more" things, right?  :)
> 
> People doing embedded designs _like_ their flexibility.
> 
> It's common to have multiple power levels.  If you mean
> that they _could_ give up that flexibility and only use
> one of those state analogues, yes they could ... but if
> you mean they'd see that as a Good Thing, I doubt it.

No, no!  That's not what I mean.  I'm proposing that we offer the user
_more_ flexibility by giving a choice of power levels.  The user should be
able to choose whether the system uses "Non-Volatile Sleep" vs.
software-controlled "poweroff"; the choice shouldn't be dictated by
whether or not the system is entering hibernation.

> I guess I don't see why you want to throw away all the
> work the hardware (and/or software) designers did to
> ensure that some key devices use a "retention" mode
> in their S4-analogue state.
> 
> Me, I always thought that leveraging those retention
> states was a great way to shrink wakeup times and get
> more functionality.

I can't imagine why you think I proposed anything along those lines.


> > > You're suggesting Linux not use the S5 state, essentially.
> > 
> > No, I'm suggesting that the user should be able to control whether Linux 
> > uses S4 vs. S5 at poweroff time.  If the user selected always to use S4 
> > then wakeup devices would function in both hibernation and normal 
> > shutdown.  If the user selected always to use S5 then wakeup devices would 
> > not function in either hibernation or normal shutdown.
> 
> That's a different suggestion, yes.  I'm not sure I see any
> benefit of that flexibility for "soft off" states though,
> especially if it made "off" consume more power.

The benefit is that it allows more devices to function as wakeup sources, 
right?

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:53                                                                                           ` Rafael J. Wysocki
  2007-05-04 21:53                                                                                             ` Johannes Berg
@ 2007-05-05 15:52                                                                                             ` Alan Stern
  2007-05-07  1:16                                                                                               ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 15:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Fri, 4 May 2007, Rafael J. Wysocki wrote:

> > Plus, it may have some parts related to the communications with operating
> > system (*)... I guess we need to save those, and parts related to hw
> > state... where your suggestion makes sense.
> 
> If they are accessible to us, then we can, but what if they aren't (eg. the
> state information is stored in the embedded controller, can only be read with
> the help of some AML invocations and cannot be changed from the OS level)?
> 
> > (*) and yes, there probably are such parts. If we set backlight to
> > 20%, we'll be confused if it is 100% after resume... we probably could
> > handle those one-by-one...
> 
> *If* we reinitialize devices *and* ACPI from scratch after restoring the image,
> we'll discard the old value (20%) and read the new value (100%) from the BIOS.
> The problems occur, IMO, because we try to be smart and use the BIOS
> after the resume as though we'd resumed from a real suspend (eg. s2ram).
> 
> Which is natural, if we use the same set of .resume() callbacks for both cases.

Agreed, these all sound like problems in the ACPI driver's implementation 
of suspend and resume.  Problems that are caused (at least in part) by the 
fact that the PM core doesn't tell the driver whether it's doing
suspend-to-RAM vs. hibernation.  Once that is straightened out, everything 
else should become much simpler.

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:40                                                                                 ` David Brownell
  2007-05-04 22:19                                                                                   ` Rafael J. Wysocki
@ 2007-05-05 16:08                                                                                   ` Alan Stern
  2007-05-05 17:50                                                                                     ` Rafael J. Wysocki
  2007-05-07  1:31                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
  1 sibling, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-05 16:08 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Fri, 4 May 2007, David Brownell wrote:

> > Entering hibernation basically involves these steps:
> > 
> > 	1. Freeze tasks
> > 	2. Quiesce devices and drivers
> > 	3. Create snapshot
> > 	4. Reactivate devices and drivers
> > 	5. Save snapshot to disk
> > 	6. Prepare devices for wakeup
> > 	7. Power down (ACPI S4 on systems which support it)
> > 
> > Leaving hibernation involves a similar sequence which I won't discuss.
> > 
> > Notice that steps 1-5 above are _completely_ independent of all issues 
> > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> > carried out for hibernation to work, no matter how the system ends up 
> > getting shut down.
> 
> Not exactly.  Step 2 is supposed to be aware of the target state's
> capabilities, including what's wakeup-capable.

Not true.  Step 2 is (or should be) divorced from power-level
considerations.  All it needs to do is quiesce things so that a consistent
snapshot can be obtained; changing power levels would take time and
ideally should be avoided.  Furthermore, anything done in step 2 should be
reversed in step 4.

Did you mean to say that Step _6_ is supposed to be aware of the target 
state's capabilities?  I'll agree to that.


> However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C)
> that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state.  (Then
> fuzzes the issue in 2.4, but those bits are less relevant here;
> 2.2 also mentions G3 = "Mechanical OFF", which is the only state
> in which machine disassembly/reassembly is expected to be safe.)

Sure.  But entering hibernation need not involve putting the system into a 
"sleeping" state.  Going into G3 should also work for hibernation.

> ACPI is allowed to distinguish between S4 and S5 in more ways
> than just the power usage.  It'd be fair for the AML to store
> state in something that retains power, and rely on that.  It'd
> be better not to do things that are allowed to confuse ACPI.

None of that should matter for post-snapshot-restore processing.  The 
boot kernel interacts with ACPI when the system wakes up; the restored 
kernel is handed an already-running BIOS, which it should do its best to 
reinitialize from the existing hardware state.


> > possible to enter hibernation without using it.  The system administrator 
> > should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
> > regardless of whether it's to start hibernating.
> 
> But ... why?  What value would users see from that?
> 
> We do have /sys/power/disk today, but that's only for
> hibernation.  (And it's a bit confusing, too.)

Yes.  I'm proposing that it be generalized.  (And it should be renamed,
too -- that's a separate issue.)

I'm also pointing out that the policy choice decided by the contents of 
/sys/power/disk comes into play during steps 6-7 above, but not at all in 
steps 1-5.  Hence any associated software structures should explicitly be 
connected only with steps 6 and 7.
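
A minimal sketch of that split, with made-up names throughout (struct
poweroff_ops, run_poweroff(), the TARGET_* values and the demo_* backend are
all hypothetical, not existing kernel interfaces).  The point is only that
the target state is consulted in steps 6 and 7 and nowhere earlier:

```c
#include <stddef.h>

/* Hypothetical policy value, e.g. derived from /sys/power/disk. */
enum poweroff_target { TARGET_S4, TARGET_S5 };

/* Hypothetical structure covering only steps 6 and 7. */
struct poweroff_ops {
	int (*prepare_wakeup)(enum poweroff_target t);	/* step 6 */
	int (*enter)(enum poweroff_target t);		/* step 7 */
};

static int run_poweroff(const struct poweroff_ops *ops, enum poweroff_target t)
{
	int err = ops->prepare_wakeup ? ops->prepare_wakeup(t) : 0;

	if (err)
		return err;
	return ops->enter ? ops->enter(t) : 0;
}

/* Hibernation path: steps 1-5 never look at the target state. */
static int hibernate(const struct poweroff_ops *ops, enum poweroff_target t)
{
	/* ... freeze, quiesce, snapshot, reactivate, save image ... */
	return run_poweroff(ops, t);
}

/* Any other poweroff path can reuse the same ops. */
static int shutdown_system(const struct poweroff_ops *ops,
			   enum poweroff_target t)
{
	return run_poweroff(ops, t);
}

/* Demo backend that only records what it was asked to do. */
static int demo_prepared_for = -1;
static int demo_entered = -1;

static int demo_prepare(enum poweroff_target t)
{
	demo_prepared_for = (int)t;
	return 0;
}

static int demo_enter(enum poweroff_target t)
{
	demo_entered = (int)t;
	return 0;
}

static const struct poweroff_ops demo_ops = {
	.prepare_wakeup	= demo_prepare,
	.enter		= demo_enter,
};
```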

And since normal shutdown ought to have its own analog of steps 6 and 7, 
the same software structures should be used there.  Hence naming them 
"hibernation_ops" isn't a good idea.


> I think it also assumes more intelligence on resume-from-S4
> than Linux has just now, which may partly explain why it
> takes so long for swsusp to finish its thing.

And it may explain some of the strange behavior people sometimes observe
when they try to hibernate twice in a row.

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:23                                                                               ` Johannes Berg
  2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
@ 2007-05-05 16:15                                                                                 ` Alan Stern
  1 sibling, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-05 16:15 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Fri, 4 May 2007, Johannes Berg wrote:

> On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote:
> 
> > Actually, prethaw is to prevent drivers loaded before the image is restored
> > from doing unreasonable things.  It doesn't have any effect on the drivers'
> > .resume() routines.
> 
> Oh, but it can, you could have a flag in your driver saying "the next
> resume is after restore" and you set that flag in prethaw.

You're both wrong.  PRETHAW is to prevent drivers present in the image
from doing reasonable-but-wrong things (because they were misled by
actions taken by the boot kernel or the BIOS before the image was
restored).  It gives the boot kernel's driver a chance to put the device
in a state which won't be misleading.

And while you could have a flag in your driver saying "the next resume is
after restore", setting it during PRETHAW would accomplish nothing.  
PRETHAW occurs immediately before the image is restored, which means the
flag would get overwritten by the contents of the image.
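
To make the overwrite concrete, here is a toy userspace model (not kernel
code).  struct demo_driver_state and restore_image() are invented for the
illustration; restore_image() stands in for the image restore simply
replacing memory with what was saved at snapshot time.

```c
#include <string.h>

/* Hypothetical driver state kept in ordinary memory. */
struct demo_driver_state {
	int resume_is_after_restore;
};

/* Stand-in for restoring the image: live memory is replaced wholesale. */
static void restore_image(struct demo_driver_state *live,
			  const struct demo_driver_state *image)
{
	memcpy(live, image, sizeof(*live));
}

/* Returns the flag value that .resume() would see after the restore. */
static int flag_seen_by_resume(void)
{
	/* Contents saved at snapshot time: the flag was not set then. */
	struct demo_driver_state image = { .resume_is_after_restore = 0 };
	/* The boot kernel's copy of the driver state. */
	struct demo_driver_state live = { .resume_is_after_restore = 0 };

	live.resume_is_after_restore = 1;	/* set during PRETHAW */
	restore_image(&live, &image);		/* image contents win */

	return live.resume_is_after_restore;
}
```

The only way for such information to survive is to keep it somewhere the
image does not cover, such as the device itself.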

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-04 21:24                                                                                   ` Rafael J. Wysocki
@ 2007-05-05 16:19                                                                                     ` Alan Stern
  2007-05-05 17:46                                                                                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 16:19 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Fri, 4 May 2007, Rafael J. Wysocki wrote:

> > > Entering hibernation basically involves these steps:
> > > 
> > > 	1. Freeze tasks
> > > 
> > > 	2. Quiesce devices and drivers
> > > 
> > > 	3. Create snapshot
> > > 
> > > 	4. Reactivate devices and drivers
> > > 
> > > 	5. Save snapshot to disk
> > > 
> > > 	6. Prepare devices for wakeup
> > > 
> > > 	7. Power down (ACPI S4 on systems which support it)
> > > 
> > > Leaving hibernation involves a similar sequence which I won't discuss.
> > > 
> > > Notice that steps 1-5 above are _completely_ independent of all issues 
> > > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to
> > > be
> > 
> > No, they are not. You probably should tell ACPI at step 2 that you are
> > suspending,

At step 2 you don't _know_ that you are suspending!  Step 5 might fail.  
You should tell ACPI during step 6 or 7.
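
A toy model of the ordering argument, assuming the seven steps run strictly
in sequence and that the firmware is first told anything in step 6.  The
enum, struct hib_result and run_hibernation() are all hypothetical names for
this sketch:

```c
/* The seven steps, numbered as in the list above. */
enum hib_step {
	FREEZE_TASKS = 1,
	QUIESCE_DEVICES,
	CREATE_SNAPSHOT,
	REACTIVATE_DEVICES,
	SAVE_SNAPSHOT,
	PREPARE_WAKEUP,
	POWER_DOWN,
};

struct hib_result {
	int last_step_done;	/* 0 if nothing completed */
	int firmware_notified;	/* set once step 6 has run */
};

/* Run the steps in order; failing_step == 0 means nothing fails. */
static struct hib_result run_hibernation(int failing_step)
{
	struct hib_result r = { 0, 0 };
	int s;

	for (s = FREEZE_TASKS; s <= POWER_DOWN; s++) {
		if (s == failing_step)
			break;	/* abort: later steps never run */
		if (s == PREPARE_WAKEUP)
			r.firmware_notified = 1;
		r.last_step_done = s;
	}
	return r;
}
```

If saving the image fails, the firmware has not been told anything yet, so
there is nothing to take back.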

> You can, but even if you don't, the BIOS shouldn't have problems.  What might
> have problems is our ACPI code during the resume, if it cannot get appropriate
> information from the BIOS.
> 
> > and you definitely need to tell ACPI that you have resumed 
> > (so it can re-scan AC adapters, for example).
> 
> Yes, but that can be done in two different ways:
> 
> 1) "We have restored the hibernation image, but the BIOS state corresponds to
> a fresh reboot, so please initialize everything from scratch."
> 
> 2) "We have restored the hibernation image and the ACPI S4 was used for
> powering off (hint: you may try not to initialize everything from scratch)."
> 
> Of course, in the case 2) we are responsible for ensuring that the contents of
> the hibernation image are consistent with the information preserved by the
> BIOS.

Keep in mind also that before you can do either 1) or 2), the boot kernel 
has already communicated with the BIOS, possibly changing some of the ACPI 
state.

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 16:19                                                                                     ` Alan Stern
@ 2007-05-05 17:46                                                                                       ` Rafael J. Wysocki
  2007-05-05 21:42                                                                                         ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 17:46 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Saturday, 5 May 2007 18:19, Alan Stern wrote:
> On Fri, 4 May 2007, Rafael J. Wysocki wrote:
> 
> > > > Entering hibernation basically involves these steps:
> > > > 
> > > > 	1. Freeze tasks
> > > > 
> > > > 	2. Quiesce devices and drivers
> > > > 
> > > > 	3. Create snapshot
> > > > 
> > > > 	4. Reactivate devices and drivers
> > > > 
> > > > 	5. Save snapshot to disk
> > > > 
> > > > 	6. Prepare devices for wakeup
> > > > 
> > > > 	7. Power down (ACPI S4 on systems which support it)
> > > > 
> > > > Leaving hibernation involves a similar sequence which I won't discuss.
> > > > 
> > > > Notice that steps 1-5 above are _completely_ independent of all issues 
> > > > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to
> > > > be
> > > 
> > > No, they are not. You probably should tell ACPI at step 2 that you are
> > > suspending,
> 
> At step 2 you don't _know_ that you are suspending!  Step 5 might fail.  
> You should tell ACPI during step 6 or 7.
> 
> > You can, but even if you don't, the BIOS shouldn't have problems.  What might
> > have problems is our ACPI code during the resume, if it cannot get appropriate
> > information from the BIOS.
> > 
> > > and you definitely need to tell ACPI that you have resumed 
> > > (so it can re-scan AC adapters, for example).
> > 
> > Yes, but that can be done in two different ways:
> > 
> > 1) "We have restored the hibernation image, but the BIOS state corresponds to
> > a fresh reboot, so please initialize everything from scratch."
> > 
> > 2) "We have restored the hibernation image and the ACPI S4 was used for
> > powering off (hint: you may try not to initialize everything from scratch)."
> > 
> > Of course, in the case 2) we are responsible for ensuring that the contents of
> > the hibernation image are consistent with the information preserved by the
> > BIOS.
> 
> Keep in mind also that before you can do either 1) or 2), the boot kernel 
> has already communicated with the BIOS, possibly changing some of the ACPI 
> state.

That's correct, but it follows from the ACPI spec that there is a way for the
boot kernel to distinguish 'normal' boot from 'S4 resume' boot.  If this
mechanism is used and the boot kernel states that it's doing a 'S4 resume',
it will be able to leave ACPI alone and restore the hibernation image.
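
Roughly, the decision could look like the sketch below.  This is only an
illustration under stated assumptions: the enum, pick_resume_action() and
both parameters are hypothetical, and boot_was_s4_wake merely stands for
whatever indication the platform gives the boot kernel.

```c
#include <stdbool.h>

/* The two ways of telling ACPI that the image has been restored. */
enum resume_action {
	RESUME_REINIT_FROM_SCRATCH,	/* case 1: BIOS state is a fresh boot */
	RESUME_KEEP_FIRMWARE_STATE,	/* case 2: S4 was used for poweroff */
};

/*
 * image_used_s4:    the hibernating kernel powered off via S4
 * boot_was_s4_wake: the boot kernel found that this boot is an S4 wakeup
 *                   and therefore left ACPI alone
 */
static enum resume_action pick_resume_action(bool image_used_s4,
					     bool boot_was_s4_wake)
{
	if (image_used_s4 && boot_was_s4_wake)
		return RESUME_KEEP_FIRMWARE_STATE;

	/* Anything else is treated as a fresh boot of the firmware. */
	return RESUME_REINIT_FROM_SCRATCH;
}
```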

Greetings,
Rafael

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 16:08                                                                                   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern
@ 2007-05-05 17:50                                                                                     ` Rafael J. Wysocki
  2007-05-05 21:43                                                                                       ` Alan Stern
  2007-05-07  1:31                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 17:50 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Saturday, 5 May 2007 18:08, Alan Stern wrote:
> On Fri, 4 May 2007, David Brownell wrote:
> 
> > > Entering hibernation basically involves these steps:
> > > 
> > > 	1. Freeze tasks
> > > 	2. Quiesce devices and drivers
> > > 	3. Create snapshot
> > > 	4. Reactivate devices and drivers
> > > 	5. Save snapshot to disk
> > > 	6. Prepare devices for wakeup
> > > 	7. Power down (ACPI S4 on systems which support it)
> > > 
> > > Leaving hibernation involves a similar sequence which I won't discuss.
> > > 
> > > Notice that steps 1-5 above are _completely_ independent of all issues 
> > > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> > > carried out for hibernation to work, no matter how the system ends up 
> > > getting shut down.
> > 
> > Not exactly.  Step 2 is supposed to be aware of the target state's
> > capabilities, including what's wakeup-capable.
> 
> Not true.  Step 2 is (or should be) divorced from power-level
> considerations.  All it needs to do is quiesce things so that a consistent
> snapshot can be obtained; changing power levels would take time and
> ideally should be avoided.  Furthermore, anything done in step 2 should be
> reversed in step 4.
> 
> Did you mean to say that Step _6_ is supposed to be aware of the target 
> state's capabilities?  I'll agree to that.
> 
> 
> > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C)
> > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state.  (Then
> > fuzzes the issue in 2.4, but those bits are less relevant here;
> > 2.2 also mentions G3 = "Mechanical OFF", which is the only state
> > in which machine disassembly/reassembly is expected to be safe.)
> 
> Sure.  But entering hibernation need not involve putting the system into a 
> "sleeping" state.  Going into G3 should also work for hibernation.
> 
> > ACPI is allowed to distinguish between S4 and S5 in more ways
> > than just the power usage.  It'd be fair for the AML to store
> > state in something that retains power, and rely on that.  It'd
> > be better not to do things that are allowed to confuse ACPI.
> 
> None of that should matter for post-snapshot-restore processing.  The 
> boot kernel interacts with ACPI when the system wakes up; the restored 
> kernel is handed an already-running BIOS, which it should do its best to 
> reinitialize from the existing hardware state.
> 
> 
> > > possible to enter hibernation without using it.  The system administrator 
> > > should be able to choose which of S4 or S5 gets used for _any_ poweroff, 
> > > regardless of whether it's to start hibernating.
> > 
> > But ... why?  What value would users see from that?
> > 
> > We do have /sys/power/disk today, but that's only for
> > hibernation.  (And it's a bit confusing, too.)
> 
> Yes.  I'm proposing that it be generalized.  (And it should be renamed,
> too -- that's a separate issue.)
> 
> I'm also pointing out that the policy choice decided by the contents of 
> /sys/power/disk comes into play during steps 6-7 above, but not at all in 
> steps 1-5.  Hence any associated software structures should explicitly be 
> connected only with steps 6 and 7.

At present, this policy choice does affect the earlier steps too.
 
> And since normal shutdown ought to have its own analog of steps 6 and 7, 
> the same software structures should be used there.  Hence naming them 
> "hibernation_ops" isn't a good idea.
> 
> 
> > I think it also assumes more intelligence on resume-from-S4
> > than Linux has just now, which may partly explain why it
> > takes so long for swsusp to finish its thing.
> 
> And it may explain some of the strange behavior people sometimes observe
> when they try to hibernate twice in a row.

Yes, this seems to be the case.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 15:37                                                                                       ` Alan Stern
@ 2007-05-05 18:49                                                                                         ` Rafael J. Wysocki
  2007-05-05 21:44                                                                                           ` Alan Stern
  2007-05-07  8:51                                                                                           ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  0 siblings, 2 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 18:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Saturday, 5 May 2007 17:37, Alan Stern wrote:
> On Sat, 5 May 2007, Rafael J. Wysocki wrote:
> 
> > > Yeah, whatever. You can fix the problem but it's ugly. Let's come up
> > > with a good way to do the 6 callbacks mentioned in some other thread
> > > earlier.
> > 
> > This is the plan, but we need to do some preparations.
> > 
> > For example, I think, we should introduce some consistent terminology, so that
> > we *always* know what we're talking about.
> 
> A proposal:
> 
> For suspend-to-RAM we already have suspend() and resume().  At the 
> possible cost of introducing some confusion, I think it makes sense to 
> keep those method names.

I agree.

> For hibernation we need these:
> 
> 	pre_snapshot()
> 	post_snapshot()
> 	pre_restore()
> 	post_restore()
> 
> In addition we may want to have early/late variations on these (for use 
> after interrupts have been disabled), which would lead to:
> 
> 	pre_snapshot()
> 	pre_snapshot_late()
> 	post_snapshot_early()
> 	post_snapshot()
> 	pre_restore()
> 	pre_restore_late()
> 	post_restore_early()
> 	post_restore()
> 
> Yes, it's a large list...  But it seems to be necessary for providing all 
> the information drivers will need.

I think we may need yet another callback, executed before pre_snapshot()
and before we shrink memory during the hibernation, to be used by drivers
that need a lot of additional memory in pre_snapshot().
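
As a rough illustration of the list quoted above, the eight callbacks
could be grouped into one per-driver table.  This is only a sketch under
assumptions: the names hib_dev_ops, hib_call() and the demo_* driver are
hypothetical, the struct device here is a stand-in rather than the
kernel's, and the rule that a NULL pointer means "nothing to do" is an
assumption, not something settled in this thread.  The extra
pre-memory-shrink callback is left out because it has no name yet at
this point.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct device; only a name is needed here. */
struct device {
	const char *name;
};

/* Hypothetical per-driver table; member names follow the proposal. */
struct hib_dev_ops {
	int (*pre_snapshot)(struct device *dev);
	int (*pre_snapshot_late)(struct device *dev);   /* IRQs disabled */
	int (*post_snapshot_early)(struct device *dev); /* IRQs disabled */
	int (*post_snapshot)(struct device *dev);
	int (*pre_restore)(struct device *dev);
	int (*pre_restore_late)(struct device *dev);    /* IRQs disabled */
	int (*post_restore_early)(struct device *dev);  /* IRQs disabled */
	int (*post_restore)(struct device *dev);
};

/* Invoke one optional callback; a missing callback counts as success. */
static int hib_call(int (*cb)(struct device *), struct device *dev)
{
	assert(dev != NULL);
	return cb ? cb(dev) : 0;
}

/* Example driver that only cares about the two outermost snapshot hooks. */
static int demo_calls;

static int demo_pre_snapshot(struct device *dev)
{
	(void)dev;
	demo_calls++;
	return 0;
}

static int demo_post_snapshot(struct device *dev)
{
	(void)dev;
	demo_calls++;
	return 0;
}

static const struct hib_dev_ops demo_ops = {
	.pre_snapshot  = demo_pre_snapshot,
	.post_snapshot = demo_post_snapshot,
};
```

With such a table a driver fills in only the hooks it needs, so the
length of the list costs simple drivers nothing.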

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 17:46                                                                                       ` Rafael J. Wysocki
@ 2007-05-05 21:42                                                                                         ` Alan Stern
  2007-05-05 22:14                                                                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 21:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Sat, 5 May 2007, Rafael J. Wysocki wrote:

> > > Yes, but that can be done in two different ways:
> > > 
> > > 1) "We have restored the hibernation image, but the BIOS state corresponds to
> > > a fresh reboot, so please initialize everything from scratch."
> > > 
> > > 2) "We have restored the hibernation image and the ACPI S4 was used for
> > > powering off (hint: you may try not to initialize everything from scratch)."
> > > 
> > > Of course, in the case 2) we are responsible for ensuring that the contents of
> > > the hibernation image are consistent with the information preserved by the
> > > BIOS.
> > 
> > Keep in mind also that before you can do either 1) or 2), the boot kernel 
> > has already communicated with the BIOS, possibly changing some of the ACPI 
> > state.
> 
> That's correct, but it follows from the ACPI spec that there is a way for the
> boot kernel to distinguish 'normal' boot from 'S4 resume' boot.  If this
> mechanism is used and the boot kernel states that it's doing a 'S4 resume',
> it will be able to leave ACPI alone and restore the hibernation image.

Okay, good.  That means part of the resume-from-hibernation handling must
be included in the standard startup code of the ACPI driver, because it 
runs in the boot kernel rather than the restored kernel.  Does it work 
that way now?  You'd think it must...

The restored kernel could do either 1) or 2), I don't see that it matters
much which.  1) might be safer, because it's possible that external power
was turned off at some point during the hibernation (and no battery power 
was available).
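
The choice between 1) and 2) can be written down as a small decision
function.  This is a sketch only: struct resume_info and both of its
fields are hypothetical, and how the boot kernel would actually learn
that it is doing an 'S4 resume' is not specified here.  The sketch
falls back to 1) whenever either piece of evidence is missing, which
matches the "1) might be safer" reasoning above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum resume_mode {
	RESUME_REINIT_ALL,    /* option 1: treat BIOS state as a fresh boot */
	RESUME_KEEP_S4_STATE  /* option 2: rely on state preserved across S4 */
};

/* Hypothetical inputs gathered by the boot kernel. */
struct resume_info {
	bool image_used_s4;    /* image records that S4 was used to power off */
	bool boot_saw_s4_wake; /* boot kernel detected an 'S4 resume' boot */
};

/* Pick option 2 only when both sides agree that S4 state survived. */
static enum resume_mode choose_resume_mode(const struct resume_info *info)
{
	assert(info != NULL);
	if (info->image_used_s4 && info->boot_saw_s4_wake)
		return RESUME_KEEP_S4_STATE;
	return RESUME_REINIT_ALL;
}
```

If external power was lost during hibernation, the boot would presumably
look like a normal one, boot_saw_s4_wake would be false, and the
restored kernel would reinitialize everything from scratch.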

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 17:50                                                                                     ` Rafael J. Wysocki
@ 2007-05-05 21:43                                                                                       ` Alan Stern
  2007-05-05 22:16                                                                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 21:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Sat, 5 May 2007, Rafael J. Wysocki wrote:

> > I'm also pointing out that the policy choice decided by the contents of 
> > /sys/power/disk comes into play during steps 6-7 above, but not at all in 
> > steps 1-5.  Hence any associated software structures should explicitly be 
> > connected only with steps 6 and 7.
> 
> At present, this policy choice does affect the earlier steps too.

Isn't this then another aspect of hibernation needing to be fixed?  Or is 
there some genuine reason I'm not aware of that the choice of shutdown 
method should affect those steps?

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 18:49                                                                                         ` Rafael J. Wysocki
@ 2007-05-05 21:44                                                                                           ` Alan Stern
  2007-05-05 22:36                                                                                             ` Rafael J. Wysocki
  2007-05-07  8:51                                                                                           ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  1 sibling, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-05 21:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Sat, 5 May 2007, Rafael J. Wysocki wrote:

> > In addition we may want to have early/late variations on these (for use 
> > after interrupts have been disabled), which would lead to:
> > 
> > 	pre_snapshot()
> > 	pre_snapshot_late()
> > 	post_snapshot_early()
> > 	post_snapshot()
> > 	pre_restore()
> > 	pre_restore_late()
> > 	post_restore_early()
> > 	post_restore()
> > 
> > Yes, it's a large list...  But it seems to be necessary for providing all 
> > the information drivers will need.
> 
> I think we may need yet another callback, executed before pre_snapshot()
> and before we shrink memory during the hibernation, to be used by drivers
> that need a lot of additional memory in pre_snapshot().

	pre_snapshot_early()

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 21:42                                                                                         ` Alan Stern
@ 2007-05-05 22:14                                                                                           ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 22:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Saturday, 5 May 2007 23:42, Alan Stern wrote:
> On Sat, 5 May 2007, Rafael J. Wysocki wrote:
> 
> > > > Yes, but that can be done in two different ways:
> > > > 
> > > > 1) "We have restored the hibernation image, but the BIOS state corresponds to
> > > > a fresh reboot, so please initialize everything from scratch."
> > > > 
> > > > 2) "We have restored the hibernation image and the ACPI S4 was used for
> > > > powering off (hint: you may try not to initialize everything from scratch)."
> > > > 
> > > > Of course, in the case 2) we are responsible for ensuring that the contents of
> > > > the hibernation image are consistent with the information preserved by the
> > > > BIOS.
> > > 
> > > Keep in mind also that before you can do either 1) or 2), the boot kernel 
> > > has already communicated with the BIOS, possibly changing some of the ACPI 
> > > state.
> > 
> > That's correct, but it follows from the ACPI spec that there is a way for the
> > boot kernel to distinguish 'normal' boot from 'S4 resume' boot.  If this
> > mechanism is used and the boot kernel states that it's doing a 'S4 resume',
> > it will be able to leave ACPI alone and restore the hibernation image.
> 
> Okay, good.  That means part of the resume-from-hibernation handling must
> be included in the standard startup code of the ACPI driver, because it 
> runs in the boot kernel rather than the restored kernel.  Does it work 
> that way now?  You'd think it must...

Well, I'm not sure, but I don't think so.  It looks like the ACPI code that we
use in the hibernation/suspend code paths is not in good shape in general.

IOW, we may want to implement that in the future, but I'd rather like to get
1) working reliably for everyone first.

> The restored kernel could do either 1) or 2), I don't see that it matters
> much which.  1) might be safer, because it's possible that external power
> was turned off at some point during the hibernation (and no battery power 
> was available).

I think that the 'ACPI S4' handling adds quite a lot of complexity to the
picture and should be added on top of a working infrastructure, as an
extension.

Currently, we don't handle the hibernation in accordance with the ACPI spec
anyway.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 21:43                                                                                       ` Alan Stern
@ 2007-05-05 22:16                                                                                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 22:16 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Saturday, 5 May 2007 23:43, Alan Stern wrote:
> On Sat, 5 May 2007, Rafael J. Wysocki wrote:
> 
> > > I'm also pointing out that the policy choice decided by the contents of 
> > > /sys/power/disk comes into play during steps 6-7 above, but not at all in 
> > > steps 1-5.  Hence any associated software structures should explicitly be 
> > > connected only with steps 6 and 7.
> > 
> > At present, this policy choice does affect the earlier steps too.
> 
> Isn't this then another aspect of hibernation needing to be fixed?  Or is 
> there some genuine reason I'm not aware of that the choice of shutdown 
> method should affect those steps?

Well, I think it should be fixed, but I'm afraid that'll take a *lot* of time.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 21:44                                                                                           ` Alan Stern
@ 2007-05-05 22:36                                                                                             ` Rafael J. Wysocki
  2007-05-06 22:01                                                                                               ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-05 22:36 UTC (permalink / raw)
  To: Alan Stern
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Saturday, 5 May 2007 23:44, Alan Stern wrote:
> On Sat, 5 May 2007, Rafael J. Wysocki wrote:
> 
> > > In addition we may want to have early/late variations on these (for use 
> > > after interrupts have been disabled), which would lead to:
> > > 
> > > 	pre_snapshot()
> > > 	pre_snapshot_late()
> > > 	post_snapshot_early()
> > > 	post_snapshot()
> > > 	pre_restore()
> > > 	pre_restore_late()
> > > 	post_restore_early()
> > > 	post_restore()
> > > 
> > > Yes, it's a large list...  But it seems to be necessary for providing all 
> > > the information drivers will need.
> > 
> > I think we may need yet another callback, executed before pre_snapshot()
> > and before we shrink memory during the hibernation, to be used by drivers
> > that need a lot of additional memory in pre_snapshot().
> 
> 	pre_snapshot_early()

OK

So, I think the hibernation code ordering should be like this (let's forget
about ACPI for now):

1) tasks are frozen
2) pre_snapshot_early()
3) memory is freed for the snapshot image
4) pre_snapshot()
5) nonboot CPUs are offlined
6) IRQs are disabled
7) pre_snapshot_late()
8) sysdev_pre_snapshot()
9) snapshot image is created
10) sysdev_post_snapshot()
11) post_snapshot_early()
12) IRQs are enabled
13) nonboot CPUs are enabled
14) post_snapshot()
15) snapshot image is saved
16) device_shutdown()
17) system is powered off

Apart from this, we may need notifiers for subsystems that should do something
before the freezing and after the thawing of tasks (like FUSE etc.).

Also, if there's an error, we have to be able to thaw tasks after
post_snapshot() and continue running.

The restore code, IMO, should be like this (again, let's ignore ACPI for now):

1) boot kernel is started, initrd is loaded etc.
2) tasks are frozen
3) snapshot image is loaded
4) pre_restore()
5) nonboot CPUs are offlined
6) IRQs are disabled
7) pre_restore_late()
8) sysdev_pre_restore()
9) boot kernel is replaced with the 'hibernated' kernel
10) sysdev_post_restore()
11) post_restore_early()
12) IRQs are enabled
13) nonboot CPUs are enabled
14) post_restore()
15) tasks are thawed
16) system is running

and we may need a notifier for subsystems that should do something after
tasks have been thawed.
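
The hibernation ordering above, together with the requirement to thaw
tasks and continue after an error, can be sketched as a table of paired
steps that is unwound in reverse.  This is a user-space simulation under
assumptions, not kernel code: the step table, trace_add(), unwind() and
hibernate_sequence() are hypothetical names, each step is only recorded
as a string, and pairing each step with the one that reverses it is an
interpretation of the list rather than something stated in it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct step {
	const char *enter; /* step taken on the way to the snapshot */
	const char *leave; /* step that reverses it, or NULL if none is listed */
};

static const struct step steps[] = {
	{ "freeze tasks",         "thaw tasks" },
	{ "pre_snapshot_early",   NULL },
	{ "free memory",          NULL },
	{ "pre_snapshot",         "post_snapshot" },
	{ "offline nonboot CPUs", "enable nonboot CPUs" },
	{ "disable IRQs",         "enable IRQs" },
	{ "pre_snapshot_late",    "post_snapshot_early" },
	{ "sysdev_pre_snapshot",  "sysdev_post_snapshot" },
};
#define NSTEPS (sizeof(steps) / sizeof(steps[0]))

/* Append one step name to the trace, separated by ';'. */
static void trace_add(char *trace, size_t len, const char *name)
{
	size_t used = strlen(trace);

	if (used < len)
		snprintf(trace + used, len - used, "%s%s", used ? ";" : "", name);
}

/* Run the 'leave' halves of steps [stop, from) in reverse order. */
static void unwind(size_t from, size_t stop, char *trace, size_t len)
{
	while (from-- > stop)
		if (steps[from].leave)
			trace_add(trace, len, steps[from].leave);
}

/*
 * fail_at is the index of the step that fails, NSTEPS if saving the
 * image fails, or -1 for no failure.  Returns 0 if the system would be
 * powered off, -1 if the attempt was abandoned and tasks were thawed.
 */
static int hibernate_sequence(int fail_at, char *trace, size_t len)
{
	size_t i;

	assert(len > 0);
	trace[0] = '\0';
	for (i = 0; i < NSTEPS; i++) {
		if ((int)i == fail_at) {
			/* Undo everything done so far, thawing tasks last. */
			unwind(i, 0, trace, len);
			return -1;
		}
		trace_add(trace, len, steps[i].enter);
	}
	trace_add(trace, len, "create snapshot");
	/* Reverse everything except the freeze; tasks stay frozen. */
	unwind(NSTEPS, 1, trace, len);
	if (fail_at == (int)NSTEPS) {
		/* Saving the image failed: thaw tasks and keep running. */
		unwind(1, 0, trace, len);
		return -1;
	}
	trace_add(trace, len, "save image");
	trace_add(trace, len, "device_shutdown");
	trace_add(trace, len, "power off");
	return 0;
}
```

Calling hibernate_sequence(-1, buf, sizeof(buf)) records the seventeen
steps in the order listed above, while a failing step index records only
the steps completed so far followed by their reversals, ending with the
thaw.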

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 22:36                                                                                             ` Rafael J. Wysocki
@ 2007-05-06 22:01                                                                                               ` Alan Stern
  2007-05-06 22:31                                                                                                 ` Rafael J. Wysocki
  2007-05-07  1:37                                                                                                 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell
  0 siblings, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-06 22:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Sun, 6 May 2007, Rafael J. Wysocki wrote:

> > > I think we may need yet another callback, executed before pre_snapshot()
> > > and before we shrink memory during the hibernation, to be used by drivers
> > > that need a lot of additional memory in pre_snapshot().
> > 
> > 	pre_snapshot_early()
> 
> OK

I changed my mind -- pre_hibernate() seems like a better name.  There
could be a matching post_hibernate(), if anyone finds it necessary.  I
considered pre_freeze(), but that's not such a good choice since the
freezer can be used for other things in addition to hibernation.

> So, I think the hibernation code ordering should be like this (let's forget
> about ACPI for now):
> 
> 1) tasks are frozen
> 2) pre_snapshot_early()

Or rather: 2) pre_hibernate()

> 3) memory is freed for the snapshot image
> 4) pre_snapshot()
> 5) nonboot CPUs are offlined
> 6) IRQs are disabled
> 7) pre_snapshot_late()
> 8) sysdev_pre_snapshot()
> 9) snapshot image is created
> 10) sysdev_post_snapshot()
> 11) post_snapshot_early()
> 12) IRQs are enabled
> 13) nonboot CPUs are enabled
> 14) post_snapshot()
> 15) snapshot image is saved
> 16) device_shutdown()
> 17) system is powered off
> 
> Apart from this, we may need notifiers for subsystems that should do something
> before the freezing and after the thawing of tasks (like FUSE etc.).

Quite so.

> Also, if there's an error, we have to be able to thaw tasks after
> post_snapshot() and continue running.
> 
> The restore code, IMO, should be like this (again, let's ignore ACPI for now):
> 
> 1) boot kernel is started, initrd is loaded etc.
> 2) tasks are frozen
> 3) snapshot image is loaded
> 4) pre_restore()
> 5) nonboot CPUs are offlined
> 6) IRQs are disabled
> 7) pre_restore_late()
> 8) sysdev_pre_restore()
> 9) boot kernel is replaced with the 'hibernated' kernel
> 10) sysdev_post_restore()
> 11) post_restore_early()
> 12) IRQs are enabled
> 13) nonboot CPUs are enabled
> 14) post_restore()
> 15) tasks are thawed
> 16) system is running
> 
> and we may need a notifier for subsystems that should do something after
> tasks have been thawed.

It sounds good to me.  Now if only it were possible to get rid of those
pesky sysdevs...

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-06 22:01                                                                                               ` Alan Stern
@ 2007-05-06 22:31                                                                                                 ` Rafael J. Wysocki
  2007-05-07  1:37                                                                                                 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell
  1 sibling, 0 replies; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-06 22:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Monday, 7 May 2007 00:01, Alan Stern wrote:
> On Sun, 6 May 2007, Rafael J. Wysocki wrote:
> 
> > > > I think we may need yet another callback, executed before pre_snapshot()
> > > > and before we shrink memory during the hibernation, to be used by drivers
> > > > that need a lot of additional memory in pre_snapshot().
> > > 
> > > 	pre_snapshot_early()
> > 
> > OK
> 
> I changed my mind -- pre_hibernate() seems like a better name.

OK

> There could be a matching post_hibernate(), if anyone finds it necessary.  I
> considered pre_freeze(), but that's not such a good choice since the
> freezer can be used for other things in addition to hibernation.

Agreed.

> > So, I think the hibernation code ordering should be like this (let's forget
> > about ACPI for now):
> > 
> > 1) tasks are frozen
> > 2) pre_snapshot_early()
> 
> Or rather: 2) pre_hibernate()

OK

> > 3) memory is freed for the snapshot image
> > 4) pre_snapshot()
> > 5) nonboot CPUs are offlined
> > 6) IRQs are disabled
> > 7) pre_snapshot_late()
> > 8) sysdev_pre_snapshot()
> > 9) snapshot image is created
> > 10) sysdev_post_snapshot()
> > 11) post_snapshot_early()
> > 12) IRQs are enabled
> > 13) nonboot CPUs are enabled
> > 14) post_snapshot()
> > 15) snapshot image is saved
> > 16) device_shutdown()
> > 17) system is powered off
> > 
> > Apart from this, we may need notifiers for subsystems that should do something
> > before the freezing and after the thawing of tasks (like FUSE etc.).
> 
> Quite so.
> 
> > Also, if there's an error, we have to be able to thaw tasks after
> > post_snapshot() and continue running.
> > 
> > The restore code, IMO, should be like this (again, let's ignore ACPI for now):
> > 
> > 1) boot kernel is started, initrd is loaded etc.
> > 2) tasks are frozen
> > 3) snapshot image is loaded
> > 4) pre_restore()
> > 5) nonboot CPUs are offlined
> > 6) IRQs are disabled
> > 7) pre_restore_late()
> > 8) sysdev_pre_restore()
> > 9) boot kernel is replaced with the 'hibernated' kernel
> > 10) sysdev_post_restore()
> > 11) post_restore_early()
> > 12) IRQs are enabled
> > 13) nonboot CPUs are enabled
> > 14) post_restore()
> > 15) tasks are thawed
> > 16) system is running
> > 
> > and we may need a notifier for subsystems that should do something after
> > tasks have been thawed.
> 
> It sounds good to me.  Now if only it were possible to get rid of those
> pesky sysdevs...

I think that will be possible over time.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-04 22:19                                                                                   ` Rafael J. Wysocki
@ 2007-05-07  1:05                                                                                     ` David Brownell
  0 siblings, 0 replies; 712+ messages in thread
From: David Brownell @ 2007-05-07  1:05 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Friday 04 May 2007, Rafael J. Wysocki wrote:
> On Friday, 4 May 2007 23:40, David Brownell wrote:
> > On Friday 04 May 2007, Alan Stern wrote:

> > > 	1. Freeze tasks
> > > 	2. Quiesce devices and drivers
> > > 	3. Create snapshot
> > > 	4. Reactivate devices and drivers
> > > 	5. Save snapshot to disk
> > > 	6. Prepare devices for wakeup
> > > 	7. Power down (ACPI S4 on systems which support it)
> > > 
> > > Leaving hibernation involves a similar sequence which I won't discuss.
> > > 
> > > Notice that steps 1-5 above are _completely_ independent of all issues 
> > > concerning wakeup devices and S4 vs. S5 vs. whatever.  They have to be 
> > > carried out for hibernation to work, no matter how the system ends up 
> > > getting shut down.
> > 
> > Not exactly.  Step 2 is supposed to be aware of the target state's
> > capabilities, including what's wakeup-capable.  ACPI uses target
> > device states to choose which _SxD methods to execute, etc.  (Or it
> > should ... though come to think of it, I don't think I ever saw a
> > hook whereby PCI could trigger that.)

The hook is there, but it's not yet implemented ... patch in the
works.  Whoever implemented pci_choose_state() botched it up.

 
> Still, step 4 effectively undoes at least some things we did in 2.  At least
> the GPEs should be enabled for normal operation so that we can save the image.

And for that matter, wakeup shouldn't be limited to wake-from-sleep;
runtime device PM should be able to use it.  ACPI doesn't use GPEs
very well at all, except maybe runtime GPEs.  Step 6 needs to know
the same info, so it can enable the GPEs that work from S4.

 

> But then there's the nice picture in 9.3.3 (OS loading) that shows how OSPM
> (that would be us) can verify that the hardware configuration hasn't changed.
> 
> In fact we don't do this, because we always go to the "Load OS Images" block
> and load the hibernation image from this newly loaded OS (aka the boot kernel).
> 
> Thus our resume is always different from the "ACPI wake up from S4".

Right ... "slower" being one consequence.


> > ACPI is allowed to distinguish between S4 and S5 in more ways
> > than just the power usage.  It'd be fair for the AML to store
> > state in something that retains power, and rely on that.  It'd
> > be better not to do things that are allowed to confuse ACPI.
> 
> As far as I understand the specification, OSPM (ie. we) can always discard
> the fact that the system has entered S4 and reinitialize everything from
> scratch.

At the price of making some things needlessly misbehave.  Devices
that can wake from D3cold will detect state being trashed if you
re-init, which is at least sub-optimal if not wrong.


> >	 Still, the point is that these systems are now
> > documented to work in a particular way, and there really ought to
> > be a good reason to invalidate user training and documentation.
> 
> That's a very important point, IMO.

So I just re-quoted it.  ;)


> > A "Soft OFF" should be S5 to conform to specs and
> > documentation.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-05 15:49                                                                           ` Alan Stern
@ 2007-05-07  1:10                                                                             ` David Brownell
  2007-05-07 18:46                                                                               ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07  1:10 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek,
	Nigel Cunningham

On Saturday 05 May 2007, Alan Stern wrote:

> But who says that hibernate has to use "Non-Volatile Sleep" and normal 
> shutdown has to use software-controlled "poweroff"?  Why shouldn't the 
> user be able to do it the other way 'round?

Well, the definition of NVS matches hibernation, and
the definition of soft-off matches poweroff.


> > > No, I'm suggesting that the user should be able to control whether Linux 
> > > uses S4 vs. S5 at poweroff time.  If the user selected always to use S4 
> > > then wakeup devices would function in both hibernation and normal 
> > > shutdown.  If the user selected always to use S5 then wakeup devices would 
> > > not function in either hibernation or normal shutdown.
> > 
> > That's a different suggestion, yes.  I'm not sure I see any
> > benefit of that flexibility for "soft off" states though,
> > especially if it made "off" consume more power.
> 
> The benefit is that it allows more devices to function as wakeup sources, 
> right?

With downsides of "more power consumed during 'off' states"
and "invalidating documentation, training, and expectations".

This is a case where the fact that something could technically
be done doesn't recommend it to me.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-05 15:52                                                                                             ` Alan Stern
@ 2007-05-07  1:16                                                                                               ` David Brownell
  2007-05-07 21:00                                                                                                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07  1:16 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Saturday 05 May 2007, Alan Stern wrote:

> Agreed, these all sound like problems in the ACPI driver's implementation 
> of suspend and resume.  Problems that are caused (at least in part) by the 
> fact that the PM core doesn't tell the driver whether it's doing
> suspend-to-RAM vs. hibernation.  Once that is straightened out, everything 
> else should become much simpler.

I'm not sure I agree with that diagnosis, but for the record:
updating drivers/pci/pci-acpi.c so that it can implement the
platform_pci_choose_state() hook requires ACPI to export that
information.

So for now I have drivers/acpi/sleep/main.c exporting

        s_state = acpi_get_target_sleep_state();

so that ACPI-aware code can know to call "_S3D" instead of
the "_S1D" or "_S4D" methods (and "_S3W" etc).  Of course
the $SUBJECT patch will finish borking that for S4.  :(
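
A small sketch of how the exported target state might be turned into the
name of the method to evaluate.  Everything here is an assumption made
for illustration: sleep_method_name() is a hypothetical helper, the
target state is passed in as a plain int instead of coming from
acpi_get_target_sleep_state(), and the accepted range is limited to S1
through S4.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build "_SxD" or "_SxW" for sleep state x into name[5].
 * Returns 0 on success, -1 if the state or the kind is not accepted.
 */
static int sleep_method_name(int s_state, char kind, char name[5])
{
	assert(name != NULL);
	if (s_state < 1 || s_state > 4)
		return -1;
	if (kind != 'D' && kind != 'W')
		return -1;
	name[0] = '_';
	name[1] = 'S';
	name[2] = (char)('0' + s_state);
	name[3] = kind;
	name[4] = '\0';
	return 0;
}
```

With a target state of 3 this yields "_S3D" or "_S3W"; a caller that
cannot obtain the target state has nothing to pass in, which is the
dependency described above.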

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-05 16:08                                                                                   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern
  2007-05-05 17:50                                                                                     ` Rafael J. Wysocki
@ 2007-05-07  1:31                                                                                     ` David Brownell
  2007-05-07 16:33                                                                                       ` Alan Stern
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07  1:31 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Saturday 05 May 2007, Alan Stern wrote:
> On Fri, 4 May 2007, David Brownell wrote:
> 
> Did you mean to say that Step _6_ is supposed to be aware of the target 
> state's capabilities?  I'll agree to that.

Yes ... but I don't see why it would be wrong for step 2 either.
If the device can't wake from S5, it wouldn't set up with the
assumption that was a possibility.


> > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C)
> > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state.  (Then
> > fuzzes the issue in 2.4, but those bits are less relevant here;
> > 2.2 also mentions G3 = "Mechanical OFF", which is the only state
> > in which machine disassembly/reassembly is expected to be safe.)
> 
> Sure.  But entering hibernation need not involve putting the system into a 
> "sleeping" state.  Going into G3 should also work for hibernation.

For some definitions of "should"; that's where specs get fuzzy.

Since disassembly is allowed in G3, if you swapped a disk that
should prevent the system from resuming ... it should force a
boot-from-scratch.  But if you just swapped a power supply it
would probably work OK.


> I'm also pointing out that the policy choice decided by the contents of 
> /sys/power/disk comes into play during steps 6-7 above, but not at all in 
> steps 1-5.  Hence any associated software structures should explicitly be 
> connected only with steps 6 and 7.

The difference between S4 and S5 could matter to step 2 though.
Perhaps it's not the most likely thing, but certainly avoiding
the work to set up wake-from-S4 is reasonable when going to S5.

 
> And since normal shutdown ought to have its own analog of steps 6 and 7, 
> the same software structures should be used there.  Hence naming them 
> "hibernation_ops" isn't a good idea.

That's something of a different stance.  And it's untrue for
step 6 too ... suspend() and shutdown() differ a lot.  Maybe
if I saw some details, that would make more sense to me.


> > I think it also assumes more intelligence on resume-from-S4
> > than Linux has just now, which may partly explain why it
> > takes so long for swsusp to finish its thing.
> 
> And it may explain some of the strange behavior people sometimes observe
> when they try to hibernate twice in a row.

There's all kinds of bizarreness there.  I kind of get the
feeling the ACPI folk were so deluged by IRQ and other resource
setup issues (the "C" in ACPI) that the power management bits
(the "P") didn't get that much attention.  As pointed out very
recently by Rafael.  :)

Plus there's the issue that while this thread has touched a lot
on ACPI issues and models, Linux must not assume ACPI.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..)
  2007-05-06 22:01                                                                                               ` Alan Stern
  2007-05-06 22:31                                                                                                 ` Rafael J. Wysocki
@ 2007-05-07  1:37                                                                                                 ` David Brownell
  2007-05-08  2:57                                                                                                   ` Greg KH
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07  1:37 UTC (permalink / raw)
  To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg

On Sunday 06 May 2007, Alan Stern wrote:
> It sounds good to me.  Now if only it were possible to get rid of those
> pesky sysdevs...

Other than lack of patches ... is there a reason??
I thought that sysdevs were no longer needed.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-05 18:49                                                                                         ` Rafael J. Wysocki
  2007-05-05 21:44                                                                                           ` Alan Stern
@ 2007-05-07  8:51                                                                                           ` Johannes Berg
  1 sibling, 0 replies; 712+ messages in thread
From: Johannes Berg @ 2007-05-07  8:51 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek


On Sat, 2007-05-05 at 20:49 +0200, Rafael J. Wysocki wrote:

> I think we may need yet another callback, executed before pre_snapshot()
> and before we shrink memory during the hibernation, to be used by drivers
> that need a lot of additional memory in pre_snapshot().

I'm not sure we really need a callback here for that, your suspend
memory allocation chain seemed good enough since most drivers won't
actually be using it and it's not a hard requirement. Not that I care
much.

johannes

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 10:21                                             ` Johannes Berg
                                                                 ` (5 preceding siblings ...)
  2007-05-07 12:29                                               ` Pavel Machek
@ 2007-05-07 12:29                                               ` Pavel Machek
  6 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-07 12:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel,
	Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel,
	Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	linux-pm

Hi!

> > > So, the "suspend" and "resume" for the functions being called for that are
> > > wrong, but then we call them with PMSG_FREEZE. ;-)  Still, we could add
> > > .freeze() and .thaw() callbacks for hibernation just fine.  This wouldn't even
> > > be that difficult ...
> > 
> > It would be an ugly big patch I'm afraid.
> 
> It'd be a lot of code churn, but well worth it. And most of the changes
> would be trivial too. You need to start looking beyond "this is ugly in
> the short term" and realise that it's much more maintainable in the long
> term if driver writers know what they're supposed to do as opposed to
> just hacking at it until it mostly works or just doing a full device
> down/up cycle including resetting full driver state.

I do not disagree with you. It will be an ugly big patch, but it is
probably worth it, so the patch will be welcome.
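
For illustration only, here is a minimal C sketch of what separate
.freeze()/.thaw() callbacks next to .suspend()/.resume() could look like.
Every name in it (sketch_device, sketch_pm_ops, sketch_pm_dispatch, the
demo_* callbacks) is hypothetical and made up for this example; it is not
the driver model's actual interface, just a picture of the idea that
"quiesce for the snapshot" and "enter a low-power state" become distinct
entry points.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not the kernel's driver model. */
struct sketch_device {
	int quiesced;	/* 1 while I/O is stopped */
	int low_power;	/* 1 while the hardware is in a low-power state */
};

enum sketch_pm_event { EV_SUSPEND, EV_RESUME, EV_FREEZE, EV_THAW };

struct sketch_pm_ops {
	int (*suspend)(struct sketch_device *dev); /* quiesce and power down */
	int (*resume)(struct sketch_device *dev);  /* power up and restart I/O */
	int (*freeze)(struct sketch_device *dev);  /* quiesce only (snapshot) */
	int (*thaw)(struct sketch_device *dev);    /* restart I/O after freeze */
};

static int demo_suspend(struct sketch_device *dev)
{
	dev->quiesced = 1;
	dev->low_power = 1;
	return 0;
}

static int demo_resume(struct sketch_device *dev)
{
	dev->low_power = 0;
	dev->quiesced = 0;
	return 0;
}

static int demo_freeze(struct sketch_device *dev)
{
	dev->quiesced = 1;	/* power state is deliberately left alone */
	return 0;
}

static int demo_thaw(struct sketch_device *dev)
{
	dev->quiesced = 0;
	return 0;
}

static const struct sketch_pm_ops demo_ops = {
	.suspend = demo_suspend,
	.resume  = demo_resume,
	.freeze  = demo_freeze,
	.thaw    = demo_thaw,
};

/* Call the callback matching the event; a missing callback is a no-op. */
static int sketch_pm_dispatch(const struct sketch_pm_ops *ops,
			      struct sketch_device *dev,
			      enum sketch_pm_event ev)
{
	int (*cb)(struct sketch_device *) = NULL;

	switch (ev) {
	case EV_SUSPEND: cb = ops->suspend; break;
	case EV_RESUME:  cb = ops->resume;  break;
	case EV_FREEZE:  cb = ops->freeze;  break;
	case EV_THAW:    cb = ops->thaw;    break;
	}
	return cb ? cb(dev) : 0;
}
```

The point of the split is that a driver author no longer has to inspect a
message argument inside one suspend() routine to guess which behaviour is
wanted; each callback has one job.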

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07  1:31                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
@ 2007-05-07 16:33                                                                                       ` Alan Stern
  2007-05-07 20:49                                                                                         ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-07 16:33 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek,
	Linux-pm mailing list, Johannes Berg

On Sun, 6 May 2007, David Brownell wrote:

> On Saturday 05 May 2007, Alan Stern wrote:
> > On Fri, 4 May 2007, David Brownell wrote:
> > 
> > Did you mean to say that Step _6_ is supposed to be aware of the target 
> > state's capabilities?  I'll agree to that.
> 
> Yes ... but I don't see why it would be wrong for step 2 either.

The principle of information hiding: If step 2 doesn't _need_ to know 
the final target state (which it shouldn't!) then we ought not to tell it.

> If the device can't wake from S5, it wouldn't set up with the
> assumption that was a possibility.

But step 2 doesn't set up devices' wakeup functions.  It merely quiesces
them so the snapshot can be made safely.  Then step 4 reactivates the
devices, and step 6 takes care of setting up the devices for the final
sleep state.


> > Sure.  But entering hibernation need not involve putting the system into a 
> > "sleeping" state.  Going into G3 should also work for hibernation.
> 
> For some definitions of "should"; that's where specs get fuzzy.
> 
> Since disassembly is allowed in G3, if you swapped a disk that
> should prevent the system from resuming ... it should force a
> boot-from-scratch.  But if you just swapped a power supply it
> would probably work OK.

Yep.  The problem isn't so much in the specs; it's that no one has ever
(so far as I know) given a precise definition of what Linux's "hibernate"  
is supposed to do.  Is it supposed to be safe to disassemble a hibernating
computer?  Is remote wakeup necessarily supported?  I've never seen
answers to these questions.

> > I'm also pointing out that the policy choice decided by the contents of 
> > /sys/power/disk comes into play during steps 6-7 above, but not at all in 
> > steps 1-5.  Hence any associated software structures should explicitly be 
> > connected only with steps 6 and 7.
> 
> The difference between S4 and S5 could matter to step 2 though.
> Perhaps it's not the most likely thing, but certainly avoiding
> the work to set up wake-from-S4 is reasonable when going to S5.

I don't understand.  Step 2 doesn't do the work to set up wake-from-S4;  
step 6 does.  So why should the knowledge of S4 vs. S5 matter to step 2?

> > And since normal shutdown ought to have its own analog of steps 6 and 7, 
> > the same software structures should be used there.  Hence naming them 
> > "hibernation_ops" isn't a good idea.
> 
> That's something of a different stance.  And it's untrue for
> step 6 too ... suspend() and shutdown() differ a lot.  Maybe
> if I saw some details, that would make more sense to me.

It is true that for G3 type shutdown, step 6 can be empty.  We don't need 
to do anything to the devices or drivers, we just turn off all the power.  
Still, the empty set _is_ a set.  :-)

Here's another way to express my ideas: We want to support at least two 
different kinds of powered-down states:

	(A) Remote wakeup may be enabled on some devices, there can be
	    a certain power drain on the batteries or power line, it may 
	    not be safe to disassemble the machine, etc.

	(B) Remote wakeup is completely disabled, there is no power
	    drain at all, it is safe to disassemble the machine provided
	    you don't switch components like disks, etc.

(With (B) it should always be _physically_ safe to switch disks and other
components.  Whether it is _logically_ safe depends on what happens the
next time you start the machine: Will you try to restore a saved memory
image or not?  This isn't directly related to the nature of the
powered-down state except for the obvious fact that you can't restore an
image if no image has been saved.)

I don't see any reason why (A) and (B) shouldn't both be allowed for 
hibernate, as in fact they are now by way of /sys/power/disk.  And I don't 
see any reason why they shouldn't both be allowed for normal non-hibernate 
shutdowns as well.

Furthermore, the choice of whether to use (A) or (B) shouldn't matter 
during steps 1-5 of the hibernate sequence.  It should matter during steps 
6-7 and during normal shutdown (which doesn't have steps 1-5 since it 
doesn't save a memory image).
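
A minimal C sketch of that layering, under stated assumptions: all names
below (final_state_ops, save_memory_image, power_down, sketch_hibernate,
sketch_shutdown, the demo_* functions) are hypothetical, and steps 1-5 are
collapsed into a single stub because only their independence from the final
state matters here.  The image-saving phase takes no ops argument, so it
cannot depend on the (A) vs. (B) choice; only the power-down phase sees it.

```c
#include <stddef.h>

/* Hypothetical structure: the policy picked for the final power-down. */
struct final_state_ops {
	int (*prepare_devices)(void);	/* step 6: set up devices, wakeup */
	int (*enter)(void);		/* step 7: enter the final state */
};

static int image_saved;		/* set once steps 1-5 have run */
static int devices_prepared;	/* set by the demo step-6 callback */
static int state_entered;	/* set by the demo step-7 callback */

/* Steps 1-5 (stub): no ops parameter, so no knowledge of the target. */
static int save_memory_image(void)
{
	image_saved = 1;
	return 0;
}

static int demo_prepare(void)
{
	devices_prepared = 1;
	return 0;
}

static int demo_enter(void)
{
	state_entered = 1;
	return 0;
}

static const struct final_state_ops demo_final_ops = {
	.prepare_devices = demo_prepare,
	.enter           = demo_enter,
};

/* Steps 6-7: the only place the chosen policy is consulted. */
static int power_down(const struct final_state_ops *ops)
{
	int err = ops->prepare_devices ? ops->prepare_devices() : 0;

	if (err)
		return err;
	return ops->enter ? ops->enter() : -1;
}

/* Hibernation: save the image first, then power down. */
static int sketch_hibernate(const struct final_state_ops *ops)
{
	int err = save_memory_image();

	return err ? err : power_down(ops);
}

/* Normal shutdown: same steps 6-7, but no image is saved. */
static int sketch_shutdown(const struct final_state_ops *ops)
{
	return power_down(ops);
}
```

Because both paths end in power_down(), the same ops structure serves
hibernation and ordinary shutdown, which is the reason a name tied to
hibernation alone would be misleading in this model.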

> Plus there's the issue that while this thread has touched a lot
> on ACPI issues and models, Linux must not assume ACPI.

Yes indeed.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07  1:10                                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell
@ 2007-05-07 18:46                                                                               ` Alan Stern
  2007-05-07 21:29                                                                                 ` Rafael J. Wysocki
  2007-05-07 21:43                                                                                 ` David Brownell
  0 siblings, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-07 18:46 UTC (permalink / raw)
  To: David Brownell
  Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek,
	Nigel Cunningham

On Sun, 6 May 2007, David Brownell wrote:

> On Saturday 05 May 2007, Alan Stern wrote:
> 
> > But who says that hibernate has to use "Non-Volatile Sleep" and normal 
> > shutdown has to use software-controlled "poweroff"?  Why shouldn't the 
> > user be able to do it the other way 'round?
> 
> Well, the definition of NVS matches hibernation, and
> the definition of soft-off matches poweroff.

Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec.  Here's the story
in a nutshell:

	G3 = "mechanical off" = no wakeup devices are enabled,
				safe to disassemble
	G2/S5 = "soft off" = wakeup may be enabled, not safe to
				disassemble
	S4 = "non-volatile sleep" = hibernation, memory image is saved
	S5 = "soft off" = almost the same as S4 except there is no
				memory image

The spec does not explicitly associate S4 with either G2 or G3, and in
fact it contains language suggesting very strongly that the system could
be in either one.  The spec also uses the same name for G2 and for S5, no 
doubt leading to extra levels of confusion.

So there's no question that S4 = NVS = hibernation.  But hibernation
can involve either G2 or G3.

And there's no question (in my mind at least) that normal shutdown should
be able to involve either G2/S5 or G3.  So although the spec doesn't put 
things quite this way, we could say:

	hibernation = S4 = G2/S4 or G3/S4,

	shutdown = S5 = G2/S5 or G3/S5.

Thus the choice between S4 vs. S5 is made at the very start, and steps 1-5 
are executed only for S4.  The choice between G2 vs. G3 can be (and should 
be!) deferred until steps 6-7.


> > > That's a different suggestion, yes.  I'm not sure I see any
> > > benefit of that flexibility for "soft off" states though,
> > > especially if it made "off" consume more power.
> > 
> > The benefit is that it allows more devices to function as wakeup sources, 
> > right?
> 
> With downsides of "more power consumed during 'off' states"
> and "invalidating documentation, training, and expectations".

Okay, let's clear up the confusion.  The additional flexibility I'm 
suggesting for "soft off" = G2 states is that we should allow both G2/S4 
and G2/S5.  They would consume the same amount of power since they are 
both G2 states; the difference is that G2/S4 involves saving and restoring 
a memory image and G2/S5 does not.

This does not invalidate any documentation or training so far as I know.  
And as for expectations...  That's a little harder.  What people _expect_ 
of Linux and what Linux actually _does_ don't always jibe well, owing to 
lack of sufficient documentation -- typical of Open Source projects.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 16:33                                                                                       ` Alan Stern
@ 2007-05-07 20:49                                                                                         ` Pavel Machek
  2007-05-07 21:38                                                                                           ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-07 20:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg

Hi!

> It is true that for G3 type shutdown, step 6 can be empty.  We don't need 
> to do anything to the devices or drivers, we just turn off all the power.  
> Still, the empty set _is_ a set.  :-)
> 
> Here's another way to express my ideas: We want to support at least two 
> different kinds of powered-down states:
> 
> 	(A) Remote wakeup may be enabled on some devices, there can be
> 	    a certain power drain on the batteries or power line, it may 
> 	    not be safe to disassemble the machine, etc.
> 
> 	(B) Remote wakeup is completely disabled, there is no power
> 	    drain at all, it is safe to disassemble the machine provided
> 	    you don't switch components like disks, etc.
> 
...
> 
> I don't see any reason why (A) and (B) shouldn't both be allowed for 
> hibernate, as in fact they are now by way of /sys/power/disk.  And I don't 
> see any reason why they shouldn't both be allowed for normal non-hibernate 
> shutdowns as well.

No, sorry, that does not work. Software can't select (A) vs. (B). Only
the user can, by physically flipping a real power switch or by unplugging
the machine.

And yes, there's documentation about expectations of swsusp, in
Doc*/power/swsusp.txt.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07  1:16                                                                                               ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
@ 2007-05-07 21:00                                                                                                 ` Rafael J. Wysocki
  2007-05-07 21:45                                                                                                   ` David Brownell
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-07 21:00 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Monday, 7 May 2007 03:16, David Brownell wrote:
> On Saturday 05 May 2007, Alan Stern wrote:
> 
> > Agreed, these all sound like problems in the ACPI driver's implementation 
> > of suspend and resume.  Problems that are caused (at least in part) by the 
> > fact that the PM core doesn't tell the driver whether it's doing
> > suspend-to-RAM vs. hibernation.  Once that is straightened out, everything 
> > else should become much simpler.
> 
> I'm not sure I agree with that diagnosis, but for the record:
> updating drivers/pci/pci-acpi.c so that it can implement the
> platform_pci_choose_state() hook requires ACPI to export that
> information.
> 
> So for now I have drivers/acpi/sleep/main.c exporting
> 
>         s_state = acpi_get_target_sleep_state();
> 
> so that ACPI-aware code can know to call "_S3D" instead of
> the "_S1D" or "_S4D" methods (and "_S3W" etc).  Of course
> the $SUBJECT patch will finish borking that for S4.  :(

Why exactly?

Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 18:46                                                                               ` Alan Stern
@ 2007-05-07 21:29                                                                                 ` Rafael J. Wysocki
  2007-05-07 22:22                                                                                   ` Alan Stern
  2007-05-07 21:43                                                                                 ` David Brownell
  1 sibling, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-07 21:29 UTC (permalink / raw)
  To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg

On Monday, 7 May 2007 20:46, Alan Stern wrote:
> On Sun, 6 May 2007, David Brownell wrote:
> 
> > On Saturday 05 May 2007, Alan Stern wrote:
> > 
> > > But who says that hibernate has to use "Non-Volatile Sleep" and normal 
> > > shutdown has to use software-controlled "poweroff"?  Why shouldn't the 
> > > user be able to do it the other way 'round?
> > 
> > Well, the definition of NVS matches hibernation, and
> > the definition of soft-off matches poweroff.
> 
> Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec.  Here's the story
> in a nutshell:
> 
> 	G3 = "mechanical off" = no wakeup devices are enabled,
> 				safe to disassemble
> 	G2/S5 = "soft off" = wakeup may be enabled, not safe to
> 				disassemble
> 	S4 = "non-volatile sleep" = hibernation, memory image is saved
> 	S5 = "soft off" = almost the same as S4 except there is no
> 				memory image
> 
> The spec does not explicitly associate S4 with either G2 or G3, and in
> fact it contains language suggesting very strongly that the system could
> be in either one.  The spec also uses the same name for G2 and for S5, no 
> doubt leading to extra levels of confusion.

Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1.
Moreover, it's reiterated several times in different places that
S5 Soft off = G2.

> So there's no question that S4 = NVS = hibernation.  But hibernation
> can involve either G2 or G3.

Not according to ACPI.

> And there's no question (in my mind at least) that normal shutdown should
> be able to involve either G2/S5 or G3.  So although the spec doesn't put 
> things quite this way, we could say:
> 
> 	hibernation = S4 = G2/S4 or G3/S4,
> 
> 	shutdown = S5 = G2/S5 or G3/S5.
> 
> Thus the choice between S4 vs. S5 is made at the very start, and steps 1-5 
> are executed only for S4.  The choice between G2 vs. G3 can be (and should 
> be!) deferred until steps 6-7.

The problem is that ACPI insists on treating S4 as a sleeping state.

Still, I agree that what we do in steps 1 - 5 should be independent of
whether or not we're going to enter S4.  Devices should not be
suspended before creating the image, because the system is not going to
enter any power state *at that time*.  There seems to be no reason whatsoever
for putting devices in low power states for creating the hibernation image.

> > > > That's a different suggestion, yes.  I'm not sure I see any
> > > > benefit of that flexibility for "soft off" states though,
> > > > especially if it made "off" consume more power.
> > > 
> > > The benefit is that it allows more devices to function as wakeup sources, 
> > > right?
> > 
> > With downsides of "more power consumed during 'off' states"
> > and "invalidating documentation, training, and expectations".
> 
> Okay, let's clear up the confusion.  The additional flexibility I'm 
> suggesting for "soft off" = G2 states is that we should allow both G2/S4 
> and G2/S5.  They would consume the same amount of power since they are 
> both G2 states; the difference is that G2/S4 involves saving and restoring 
> a memory image and G2/S5 does not.

There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to
avoid confusion.

That's why I said that what we want to call 'hibernation' is and will probably
always be different from an ACPI transition to S4 (at least until we make a
bootloader capable of reading suspend images and ACPI-aware).

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 20:49                                                                                         ` Pavel Machek
@ 2007-05-07 21:38                                                                                           ` Alan Stern
  2007-05-08  0:30                                                                                             ` Pavel Machek
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-07 21:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg

On Mon, 7 May 2007, Pavel Machek wrote:

> > 	(A) Remote wakeup may be enabled on some devices, there can be
> > 	    a certain power drain on the batteries or power line, it may 
> > 	    not be safe to disassemble the machine, etc.
> > 
> > 	(B) Remote wakeup is completely disabled, there is no power
> > 	    drain at all, it is safe to disassemble the machine provided
> > 	    you don't switch components like disks, etc.
> > 
> ...
> > 
> > I don't see any reason why (A) and (B) shouldn't both be allowed for 
> > hibernate, as in fact they are now by way of /sys/power/disk.  And I don't 
> > see any reason why they shouldn't both be allowed for normal non-hibernate 
> > shutdowns as well.
> 
> No, sorry, that does not work. Software can't select (A) vs. (B). Only
> the user can, by physically flipping a real power switch or by unplugging
> the machine.

Okay.  Then what exactly is the difference between the kind of poweroff we 
do during hibernate (say with "platform" in /sys/power/disk) and the kind 
of poweroff we do during a normal system shutdown?

> And yes, there's documentation about expectations of swsusp, in
> Doc*/power/swsusp.txt.

It says this near the start:

 * 		If you change
 * your hardware while system is suspended... well, it was not good idea;
 * but it will probably only crash.

with similar warnings elsewhere.

This appears to refer to confusion in the kernel after the image is 
restored; it doesn't seem to mean that you could damage equipment or 
electrocute yourself.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 18:46                                                                               ` Alan Stern
  2007-05-07 21:29                                                                                 ` Rafael J. Wysocki
@ 2007-05-07 21:43                                                                                 ` David Brownell
  2007-05-07 22:41                                                                                   ` Alan Stern
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07 21:43 UTC (permalink / raw)
  To: Alan Stern
  Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek,
	Nigel Cunningham

On Monday 07 May 2007, Alan Stern wrote:
> On Sun, 6 May 2007, David Brownell wrote:
> 
> > On Saturday 05 May 2007, Alan Stern wrote:
> > 
> > > But who says that hibernate has to use "Non-Volatile Sleep" and normal 
> > > shutdown has to use software-controlled "poweroff"?  Why shouldn't the 
> > > user be able to do it the other way 'round?
> > 
> > Well, the definition of NVS matches hibernation, and
> > the definition of soft-off matches poweroff.
> 
> Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec.  Here's the story
> in a nutshell:
> 
> 	G3 = "mechanical off" = no wakeup devices are enabled,
> 				safe to disassemble
> 	G2/S5 = "soft off" = wakeup may be enabled, not safe to
> 				disassemble
> 	S4 = "non-volatile sleep" = hibernation, memory image is saved
> 	S5 = "soft off" = almost the same as S4 except there is no
> 				memory image

This summary suggests there are two S5 states, which I believe
is incorrect.  G2 is just another name for S5.  See Fig 3-1;
the ACPI 2.0 spec has the same figure.

Also, section 2.2 highlights that after S5 the OS restarts,
which it doesn't do from S4 (table 2-1) ... although when it
describes S4/NVS it fuzzes that issue by saying the key issue
is whether an NVS state file is found and used, not the level
of power available.


> The spec does not explicitly associate S4 with either G2 or G3, and in
> fact it contains language suggesting very strongly that the system could
> be in either one.  The spec also uses the same name for G2 and for S5, no 
> doubt leading to extra levels of confusion.

Figure 3-1 seemed quite explicit to me ... S4 is one of the G1
states, S5 is the only G2 state, and G3 is a different beast.
Text elsewhere agrees with that.

What's confusing is how it describes NVS/hibernate.  It's very
explicitly a G1 state.  But leaving G2 or G3 can also trigger
a resume-from-NVS ... according to the text in 2.2 but not the
state diagrams, which don't show entering G3 even cleanly, much
less uncleanly (like a neighborhood power failure).  Bleech.

I think the implication is that going to either G2 or G3 "off"
states discards something that a G1 state preserves.  But I'd
have to search more deeply to see if that's clearly defined.
It's suggestive that there are no "_S5D" or "_S5W" methods;
such wake events would evidently be managed by BIOS not OSPM.


> So there's no question that S4 = NVS = hibernation.  But hibernation
> can involve either G2 or G3.

I suspect there's a reason this part of ACPI is so vague;
it may relate to the desire to allow direct BIOS handling
of the NVS state.

 
> And there's no question (in my mind at least) that normal shutdown should
> be able to involve either G2/S5 or G3.

G2/S5, yes ... that can be entered under software control.

But by definition, not G3 since it requires a mechanical/manual
power switch update.  ("Mechanical OFF", or in the spec's example
"movement of a large red switch".)


> So although the spec doesn't put  
> things quite this way, we could say:
> 
> 	hibernation = S4 = G2/S4 or G3/S4,
> 
> 	shutdown = S5 = G2/S5 or G3/S5.

No, you're missing the key "mechanical" red-switch-ish step in G3.

G3 *can't* be entered under software control.  By definition.  It's
there for among other things regulatory reasons ... the only power
consumed in G3 is from the on-board RTC battery.


> > > > That's a different suggestion, yes.  I'm not sure I see any
> > > > benefit of that flexibility for "soft off" states though,
> > > > especially if it made "off" consume more power.
> > > 
> > > The benefit is that it allows more devices to function as wakeup sources, 
> > > right?
> > 
> > With downsides of "more power consumed during 'off' states"
> > and "invalidating documentation, training, and expectations".
> 
> Okay, let's clear up the confusion.  The additional flexibility I'm 
> suggesting for "soft off" = G2 states is that we should allow both G2/S4 
> and G2/S5.  They would consume the same amount of power since they are 
> both G2 states; the difference is that G2/S4 involves saving and restoring 
> a memory image and G2/S5 does not.

There is no G2/S4 state; it's G1/S4 or G2/S5.  And S5 does not
involve an NVS file, or it'd be S4.  The ACPI spec is sadly
vague in those areas, however.

- Dave

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 21:00                                                                                                 ` Rafael J. Wysocki
@ 2007-05-07 21:45                                                                                                   ` David Brownell
  2007-05-07 22:16                                                                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-07 21:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Monday 07 May 2007, Rafael J. Wysocki wrote:
> On Monday, 7 May 2007 03:16, David Brownell wrote:

> > So for now I have drivers/acpi/sleep/main.c exporting
> > 
> >         s_state = acpi_get_target_sleep_state();
> > 
> > so that ACPI-aware code can know to call "_S3D" instead of
> > the "_S1D" or "_S4D" methods (and "_S3W" etc).  Of course
> > the $SUBJECT patch will finish borking that for S4.  :(
> 
> Why exactly?

Because it adds new code paths ... currently pm_ops methods
record the target state.  Fixable later.
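
A rough sketch of the idea, assuming a single recorded target state:
only the name acpi_get_target_sleep_state() comes from the text above;
its body, record_target_sleep_state() and sleep_method_name() are
hypothetical and not taken from drivers/acpi/sleep/main.c.

```c
#include <assert.h>

/* Hypothetical stand-in for the state that the pm_ops methods record. */
static int target_sleep_state;	/* 0 means no transition is in progress */

/* Hypothetical hook: every code path that starts a transition calls this. */
static void record_target_sleep_state(int s_state)
{
	assert(s_state >= 0 && s_state <= 5);
	target_sleep_state = s_state;
}

/* Accessor named in the mail; this body is an assumption. */
static int acpi_get_target_sleep_state(void)
{
	return target_sleep_state;
}

/* Fill buf with "_S3D", "_S3W", ... for the recorded state (S1..S4 only). */
static void sleep_method_name(char kind, char buf[5])
{
	int s = acpi_get_target_sleep_state();

	assert(s >= 1 && s <= 4);
	assert(kind == 'D' || kind == 'W');
	buf[0] = '_';
	buf[1] = 'S';
	buf[2] = (char)('0' + s);
	buf[3] = kind;
	buf[4] = '\0';
}
```

A code path that enters S4 without going through the recording step
would leave the accessor returning a stale value, which is the breakage
being described.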

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 21:45                                                                                                   ` David Brownell
@ 2007-05-07 22:16                                                                                                     ` Rafael J. Wysocki
  2007-05-09 19:23                                                                                                       ` David Brownell
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-07 22:16 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Monday, 7 May 2007 23:45, David Brownell wrote:
> On Monday 07 May 2007, Rafael J. Wysocki wrote:
> > On Monday, 7 May 2007 03:16, David Brownell wrote:
> 
> > > So for now I have drivers/acpi/sleep/main.c exporting
> > > 
> > >         s_state = acpi_get_target_sleep_state();
> > > 
> > > so that ACPI-aware code can know to call "_S3D" instead of
> > > the "_S1D" or "_S4D" methods (and "_S3W" etc).  Of course
> > > the $SUBJECT patch will finish borking that for S4.  :(
> > 
> > Why exactly?
> 
> Because it adds new code paths ... currently pm_ops methods
> record the target state.  Fixable later.

Hmm, I think hibernation_ops do the equivalent of what pm_ops did for
ACPI_STATE_S4 and the target state is still recorded (in
acpi_enter_sleep_state_prep()).  Isn't that correct?

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 21:29                                                                                 ` Rafael J. Wysocki
@ 2007-05-07 22:22                                                                                   ` Alan Stern
  2007-05-07 22:47                                                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 712+ messages in thread
From: Alan Stern @ 2007-05-07 22:22 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Mon, 7 May 2007, Rafael J. Wysocki wrote:

> > 	G3 = "mechanical off" = no wakeup devices are enabled,
> > 				safe to disassemble
> > 	G2/S5 = "soft off" = wakeup may be enabled, not safe to
> > 				disassemble
> > 	S4 = "non-volatile sleep" = hibernation, memory image is saved
> > 	S5 = "soft off" = almost the same as S4 except there is no
> > 				memory image
> > 
> > The spec does not explicitly associate S4 with either G2 or G3, and in
> > fact it contains language suggesting very strongly that the system could
> > be in either one.  The spec also uses the same name for G2 and for S5, no 
> > doubt leading to extra levels of confusion.
> 
> Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1.
> Moreover, it's reiterated several times in different places that
> S5 Soft off = G2.

More confusion in the spec...  It describes two different kinds of S4
states!

I was talking about "S4 Non-Volatile Sleep", defined on p.20 just above
Table 2-1.  The text says this:

	The machine will then enter the S4 state.  When the system
	leaves the Soft Off or Mechanical Off state,...

That's a pretty clear indication that S4-NVS involves G2 or G3.

You're talking about "S4 Sleeping State", defined on p.22, section 2.4.  
Evidently these two "S4" states are quite different.

> The problem is that ACPI insists on treating S4 as a sleeping state.

Section 2.4 is rather confusing.  What I gather is that S4 and S5 are 
essentially the same except for the presence or absence of a stored 
memory snapshot.  And yet S4 counts as a sleeping state while S5 doesn't.  
What's the explanation for that?

> Still, I agree that what we do in steps 1 - 5 should be independent of
> whether or not we're going to enter S4.  Devices should not be
> suspended before creating the image, because the system is not going to
> enter any power state *at that time*.  There seems to be no reason whatsoever
> for putting devices in low power states for creating the hibernation image.

Agreed.


> There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to
> avoid confusion.

Except for the text on p.20.

> That's why I said that what we want to call 'hibernation' is and will probably
> always be different from an ACPI transition to S4 (at least until we make a
> bootloader capable of reading suspend images and ACPI-aware).

In what sense is the boot kernel different from a "bootloader"?  It 
certainly is capable of reading suspend images and is ACPI-aware.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 21:43                                                                                 ` David Brownell
@ 2007-05-07 22:41                                                                                   ` Alan Stern
  0 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-07 22:41 UTC (permalink / raw)
  To: David Brownell
  Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek,
	Nigel Cunningham

On Mon, 7 May 2007, David Brownell wrote:

> This summary suggests there are two S5 states, which I believe
> is incorrect.  G2 is just another name for S5.  See Fig 3-1;
> the ACPI 2.0 spec has the same figure.
> 
> Also, section 2.2 highlights that after S5 the OS restarts,
> which it doesn't do from S4 (table 2-1) ... although when it
> describes S4/NVS it fuzzes that issue by saying the key issue
> is whether an NVS state file is found and used, not the level
> of power available.

It also says that the NVS state file is found and used when the system
leaves the Soft Off (G2) or Mechanical Off (G3) state.  How did it enter
either of those states in the first place if S4-NVS is a Sleeping (G1)
state?

I imagine that business about the OS not restarting from S4-NVS is 
intended to mean the OS continues from the restored image rather than 
starting over completely fresh.

> Figure 3-1 seemed quite explicit to me ... S4 is one of the G1
> states, S5 is the only G2 state, and G3 is a different beast.
> Text elsewhere agrees with that.

Yes, okay.

> What's confusing is how it describes NVS/hibernate.  It's very
> explicitly a G1 state.  But leaving G2 or G3 can also trigger
> a resume-from-NVS ... according to the text in 2.2 but not the
> state diagrams, which don't show entering G3 even cleanly, much
> less uncleanly (like a neighborhood power failure).  Bleech.

You can understand my confusion...

> I think the implication is that going to either G2 or G3 "off"
> states discards something that a G1 state preserves.  But I'd
> have to search more deeply to see if that's clearly defined.

Or what it is that gets discarded.  Especially since 2.4 lists only one 
difference between S5 and S4: whether or not there is a saved image.

> I suspect there's a reason this part of ACPI is so vague;
> it may relate to the desire to allow direct BIOS handling
> of the NVS state.

Could be.  I wish the spec was more upfront about its vagueness,
explaining what has been left out and why instead of just skipping over
some things and contradicting itself.

> G2/S5, yes ... that can be entered under software control.
> 
> But by definition, not G3 since it requires a mechanical/manual
> power switch update.  ("Mechanical OFF", or in the spec's example
> "movement of a large red switch".)

Okay, I understand that now.

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 22:22                                                                                   ` Alan Stern
@ 2007-05-07 22:47                                                                                     ` Rafael J. Wysocki
  2007-05-08 14:56                                                                                       ` Alan Stern
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-07 22:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Tuesday, 8 May 2007 00:22, Alan Stern wrote:
> On Mon, 7 May 2007, Rafael J. Wysocki wrote:
> 
> > > 	G3 = "mechanical off" = no wakeup devices are enabled,
> > > 				safe to disassemble
> > > 	G2/S5 = "soft off" = wakeup may be enabled, not safe to
> > > 				disassemble
> > > 	S4 = "non-volatile sleep" = hibernation, memory image is saved
> > > 	S5 = "soft off" = almost the same as S4 except there is no
> > > 				memory image
> > > 
> > > The spec does not explicitly associate S4 with either G2 or G3, and in
> > > fact it contains language suggesting very strongly that the system could
> > > be in either one.  The spec also uses the same name for G2 and for S5, no 
> > > doubt leading to extra levels of confusion.
> > 
> > Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1.
> > Moreover, it's reiterated several times in different places that
> > S5 Soft off = G2.
> 
> More confusion in the spec...  It describes two different kinds of S4
> states!
> 
> I was talking about "S4 Non-Volatile Sleep", defined on p.20 just above
> Table 2-1.  The text says this:
> 
> 	The machine will then enter the S4 state.  When the system
> 	leaves the Soft Off or Mechanical Off state,...
> 
> That's a pretty clear indication that S4-NVS involves G2 or G3.
> 
> You're talking about "S4 Sleeping State", defined on p.22, section 2.4.  
> Evidently these two "S4" states are quite different.
> 
> > The problem is that ACPI insists on treating S4 as a sleeping state.
> 
> Section 2.4 is rather confusing.  What I gather is that S4 and S5 are 
> essentially the same except for the presence or absence of a stored 
> memory snapshot.  And yet S4 counts as a sleeping state while S5 doesn't.  
> What's the explanation for that?

As far as I understand it, for S4 the platform provides a means for verifying
if the hardware wasn't changed too much while the system was "sleeping" (via
the NVS memory region).

> > Still, I agree that what we do in steps 1 - 5 should be independent of
> > whether or not we're going to enter S4.  Devices should not be
> > suspended before creating the image, because the system is not going to
> > enter any power state *at that time*.  There seems to be no reason whatsoever
> > for putting devices in low power states for creating the hibernation image.
> 
> Agreed.
> 
> 
> > There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to
> > avoid confusion.
> 
> Except for the text on p.20.

Yes, this is very confusing.  I think what they wanted to say there is that the
image restore could in principle happen when the system is started after being
in a "power off" state.  In that case, however, it wouldn't be known if it's
safe to restore the image and continue, because the hardware might have
changed.  For this reason, a special "sleeping" state is needed such that when
leaving it, the PM software can detect any (substantial) hardware changes
before even loading the entire image.

> > That's why I said that what we want to call 'hibernation' is and will probably
> > always be different from an ACPI transition to S4 (at least until we make a
> > bootloader capable of reading suspend images and ACPI-aware).
> 
> In what sense is the boot kernel different from a "bootloader"?  It 
> certainly is capable of reading suspend images and is ACPI-aware.

The boot loader uses the BIOS to read from disks and it can avoid initializing
ACPI.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 21:38                                                                                           ` Alan Stern
@ 2007-05-08  0:30                                                                                             ` Pavel Machek
  0 siblings, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-08  0:30 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg

Hi!

> It says this near the start:
> 
>  * 		If you change
>  * your hardware while system is suspended... well, it was not good idea;
>  * but it will probably only crash.
> 
> with similar warnings elsewhere.
> 
> This appears to refer to confusion in the kernel after the image is 
> restored; it doesn't seem to mean that you could damage equipment or 
> electrocute yourself.

For electrocuting, see product manual :-). Basically, you have to
unplug PC from AC power physically in order to open it. shutdown -h
now is _not_ enough. For notebooks, remove battery, too.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..)
  2007-05-07  1:37                                                                                                 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell
@ 2007-05-08  2:57                                                                                                   ` Greg KH
  0 siblings, 0 replies; 712+ messages in thread
From: Greg KH @ 2007-05-08  2:57 UTC (permalink / raw)
  To: David Brownell
  Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Johannes Berg, Pavel Machek

On Sun, May 06, 2007 at 06:37:36PM -0700, David Brownell wrote:
> On Sunday 06 May 2007, Alan Stern wrote:
> > It sounds good to me.  Now if only it were possible to get rid of those
> > pesky sysdevs...
> 
> Other than lack of patches ... is there a reason??
> I thought that sysdevs were no longer needed.

I would love to get rid of them, patches gladly accepted :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-07 22:47                                                                                     ` Rafael J. Wysocki
@ 2007-05-08 14:56                                                                                       ` Alan Stern
  2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
                                                                                                           ` (2 more replies)
  0 siblings, 3 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-08 14:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Tue, 8 May 2007, Rafael J. Wysocki wrote:

> As far as I understand it, for S4 the platform provides a means for verifying
> if the hardware wasn't changed too much while the system was "sleeping" (via
> the NVS memory region).

Rereading p.20, it appears to go the other way: The system checks for
hardware changes when booting from Soft Off.  Or perhaps it always checks.  
I guess there aren't supposed to be any hardware changes while in S4,
since then it's not safe to disassemble the machine.

Sounds a lot like USB's power sessions...

> Yes, this is very confusing.  I think what they wanted to say there is that the
> image restore could in principle happen when the system is started after being
> in a "power off" state.  In that case, however, it wouldn't be known if it's
> safe to restore the image and continue, because the hardware might have
> changed.  For this reason, a special "sleeping" state is needed such that when
> leaving it, the PM software can detect any (substantial) hardware changes
> before even loading the entire image.

And apparently the bootloader is not expected to restore the memory image
if the hardware has changed too much.

So here's the current state of my understanding of ACPI:

	S4 is the lowest-power Sleep state.  RAM is not powered, the OS
	has stored a non-volatile memory image somewhere, and some ACPI
	state is maintained.

	S5 is misnamed, in that it isn't really a Sleep state at all --
	it's an Off state.  In fact, it is the state the computer enters
	when you first plug it in (or insert the battery).

If the OS stores a memory image and then switches to S5, at reboot the
bootloader will probably try to restore it.  (That's what p.20 says.)  
And if the user unplugs the computer (removes the battery) while it is in
S4, then upon replugging the computer will enter S5.  Thus, when waking
from either S4 or S5 the bootloader will try to restore an image if one
can be found (and if the hardware hasn't changed too much and if the user
doesn't abort the restore).

I've never encountered any documentation saying that you shouldn't unplug 
the computer while it's in hibernation.  It doesn't look like you would 
lose much by doing so, except that perhaps not as many wakeup devices are 
functional in S5 as in S4.

Now as for how all this relates to Linux:

What we do for hibernation is not an exact match for either S4 or S5.  It
may be closest to S4, but we don't use a bootloader.  Instead the boot
kernel does some sort of ACPI reset and restores the memory image all by
itself.  Whatever ACPI state information may be saved in the image is not
accessible to the boot kernel.  Conversely, the information about whether
we booted from S4 or from S5 is lost when the image overwrites the boot
kernel.

As a result, hibernation is capable of using either S4 or S5 -- as it must
be, since the user could always unplug the computer while it's in S4 --
although perhaps when using S4 it manages to confuse ACPI somewhat through
not matching the spec's expectations.

What do the differences between S4 and S5 amount to?  As far as I can 
tell, they look like this:

	ACPI expects there to be a memory image in S4.  In S5 there
	may or may not be an image.

	ACPI expects that when resuming from S4, the kernel will
	continue using some preserved ACPI state.  It expects that 
	when starting from S5, the kernel will need to reinitialize
	pretty much all the ACPI state.

	S4 involves a larger power consumption and may allow for
	more wakeup devices than S5.

And how do these relate to Linux?

	In fact, ACPI has no way of knowing whether or not there is an
	image.  The kernel is perfectly free to do whatever it wants.

	The boot kernel can't make much use of the state preserved by
	ACPI because it doesn't have access to the image kernel's
	records.  It needs to reinitialize ACPI no matter what.
	Consequently the restored kernel cannot use any preserved ACPI
	state, since this state gets wiped out by the boot kernel.
	Information about hardware changes might be available to the
	boot kernel, which could in principle then decide not to restore 
	the image.  It's not clear that this would be a good idea.  In
	any case, ACPI is limited to knowledge about devices on the
	motherboard -- it knows nothing about hotplugged devices, which
	makes the information less useful.

	Hibernation allows the user to choose whether to go to S4 or S5
	by means of /sys/power/disk.  Therefore the user gets to decide
	how the power-consumption vs. wakeup-functionality tradeoff
	should be made.
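
	As an illustration of that choice, here is a small sketch.  It
	assumes that the "platform" mode of /sys/power/disk ends in S4
	on ACPI systems and that "shutdown" ends in an ordinary power
	off, i.e. S5; the helper name is hypothetical, and other modes
	such as "reboot" are reported as -1.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: map a /sys/power/disk mode string to the ACPI
 * state the machine is assumed to end up in.  Returns 4 for S4, 5 for
 * S5, and -1 for any other mode.
 */
static int sleep_state_for_disk_mode(const char *mode)
{
	assert(mode != NULL);
	if (strcmp(mode, "platform") == 0)
		return 4;
	if (strcmp(mode, "shutdown") == 0)
		return 5;
	return -1;
}
```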

In short, the boot kernel should do whatever it needs to in order to make
ACPI happy.  This might involve telling ACPI that it has successfully
resumed from S4, even though the boot kernel is unaware of system state at
the start of hibernation.  In fact, the boot kernel has to take care of
all this before it even knows whether a valid image exists in the swap
partition.

Putting this together, it says that there should be no impediment to doing
a fresh boot from S4; i.e., not restoring a memory image but simply
letting the boot kernel continue on with a normal startup.  The corollary
is that there should be no impediment to entering S4 during a normal
shutdown.

From the user's point of view, the differences between S4 and S5 amount to
just these: power consumption and availability of wakeup devices.  
(Perhaps also the presence of a blinking LED -- but in my experience the
blinking LED indicates STR, not hibernation.)  In the end, this is nothing 
more than the usual tradeoff between power usage and functionality.

We give the user a chance to decide how this tradeoff should go when 
entering hibernation.  Why not also give the user a chance to decide the 
tradeoff during normal shutdown?

Yes, it violates the spec in the sense that we would be entering S4 
without saving a memory image.  But we _already_ violate the spec by not 
using a bootloader to restore the image.  I don't see this as being any 
worse.


Finally, what about non-ACPI systems?  Basically this boils down to two 
choices:

	Should a memory image be stored?

	How much power/wakeup-functionality should the system
	consume/provide while it is down?

The first choice is decided by the user, by either entering hibernation or 
shutting down.  Why shouldn't the second also be decided by the user?

Alan Stern

^ permalink raw reply	[flat|nested] 712+ messages in thread

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-08 14:56                                                                                       ` Alan Stern
@ 2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
  2007-05-08 21:26                                                                                           ` Alan Stern
  2007-05-09  8:17                                                                                         ` Pavel Machek
  2007-05-09 19:35                                                                                         ` David Brownell
  2 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-08 19:59 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Tuesday, 8 May 2007 16:56, Alan Stern wrote:
> On Tue, 8 May 2007, Rafael J. Wysocki wrote:
> 
> > As far as I understand it, for S4 the platform provides a means for verifying
> > if the hardware wasn't changed too much while the system was "sleeping" (via
> > the NVS memory region).
> 
> Rereading p.20, it appears to go the other way: The system checks for
> hardware changes when booting from Soft Off.

Nope.  That's clarified later on.  Please read Section 15, "Waking and
Sleeping" (it's short ;-)), in particular 15.3.3.

> Or perhaps it always checks.  I guess there aren't supposed to be any
> hardware changes while in S4, since then it's not safe to disassemble the
> machine.

That's correct, and that's why the hardware signature in FACS is needed for S4
(according to the spec), while it's not needed for the wake up from "Soft Off"
(S5).

> Sounds a lot like USB's power sessions...

Well, not exactly that.  The hardware signature in FACS only covers some
"essential" hardware (I'm not sure what that is, probably depends on the
platform design).

> > Yes, this is very confusing.  I think what they wanted to say there is that the
> > image restore could in principle happen when the system is started after being
> > in a "power off" state.  In that case, however, it wouldn't be known if it's
> > safe to restore the image and continue, because the hardware might have
> > changed.  For this reason, a special "sleeping" state is needed such that when
> > leaving it, the PM software can detect any (substantial) hardware changes
> > before even loading the entire image.
> 
> And apparently the bootloader is not expected to restore the memory image
> if the hardware has changed too much.

Yes.

> So here's the current state of my understanding of ACPI:
> 
> 	S4 is the lowest-power Sleep state.  RAM is not powered, the OS
> 	has stored a non-volatile memory image somewhere, and some ACPI
> 	state is maintained.

That's correct, AFAICS.

> 	S5 is misnamed, in that it isn't really a Sleep state at all --
> 	it's an Off state.  In fact, it is the state the computer enters
> 	when you first plug it in (or insert the battery).

Yes.

> If the OS stores a memory image and then switches to S5, at reboot the
> bootloader will probably try to restore it.  (That's what p.20 says.)  

That may happen.  The bootloader will probably check if there's the image
and if it's there, it will try to compare the hardware signature in the image with
the one in FACS.  If the test is passed, it will attempt to restore the image
(this is illustrated in the picture in 15.3.3, BTW).
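
A minimal sketch of that check, with made-up type and function names;
the only detail taken from the spec is that the FACS carries a 32-bit
hardware signature.  How the signature is stored in the image is an
assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of what the loader knows about a saved image. */
struct saved_image {
	bool present;		/* was an image found at all? */
	uint32_t hw_signature;	/* signature recorded when it was written */
};

enum boot_action { BOOT_FRESH, BOOT_RESTORE_IMAGE };

/* Restore only if an image exists and its signature matches the FACS. */
static enum boot_action choose_boot_action(const struct saved_image *img,
					   uint32_t facs_signature)
{
	assert(img != NULL);
	if (!img->present)
		return BOOT_FRESH;
	if (img->hw_signature != facs_signature)
		return BOOT_FRESH;	/* hardware changed too much */
	return BOOT_RESTORE_IMAGE;
}
```

A real loader would presumably also let the user abort the restore,
which this sketch leaves out.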

> And if the user unplugs the computer (removes the battery) while it is in
> S4, then upon replugging the computer will enter S5.  Thus, when waking
> from either S4 or S5 the bootloader will try to restore an image if one
> can be found (and if the hardware hasn't changed too much and if the user
> doesn't abort the restore).

That's correct.

> I've never encountered any documentation saying that you shouldn't unplug 
> the computer while it's in hibernation.  It doesn't look like you would 
> lose much by doing so, except that perhaps not as many wakeup devices are 
> functional in S5 as in S4.
> 
> Now as for how all this relates to Linux:
> 
> What we do for hibernation is not an exact match for either S4 or S5.  It
> may be closest to S4, but we don't use a bootloader.  Instead the boot
> kernel does some sort of ACPI reset and restores the memory image all by
> itself.  Whatever ACPI state information may be saved in the image is not
> accessible to the boot kernel.

In principle, it could be, but we don't use it in the boot kernel.

> Conversely, the information about whether we booted from S4 or from S5
> is lost when the image overwrites the boot kernel.

Yes.

> As a result, hibernation is capable of using either S4 or S5 -- as it must
> be, since the user could always unplug the computer while it's in S4 --
> although perhaps when using S4 it manages to confuse ACPI somewhat through
> not matching the spec's expectations.
> 
> What do the differences between S4 and S5 amount to?  As far as I can 
> tell, they look like this:
> 
> 	ACPI expects there to be a memory image in S4.  In S5 there
> 	may or may not be an image.
> 
> 	ACPI expects that when resuming from S4, the kernel will
> 	continue using some preserved ACPI state.  It expects that 
> 	when starting from S5, the kernel will need to reinitialize
> 	pretty much all the ACPI state.
> 
> 	S4 involves a larger power consumption and may allow for
> 	more wakeup devices than S5.
> 
> And how do these relate to Linux?
> 
> 	In fact, ACPI has no way of knowing whether or not there is an
> 	image.  The kernel is perfectly free to do whatever it wants.
> 
> 	The boot kernel can't make much use of the state preserved by
> 	ACPI because it doesn't have access to the image kernel's
> 	records.  It needs to reinitialize ACPI no matter what.

To be precise, it usually needs to initialize ACPI to read the image (drivers
use ACPI to some extent).  In principle we could make it behave as though
ACPI were not compiled in and read the image while being in that state.
Then, it could use the ACPI state information contained in the image
(it would have to be pointed to by the image header, but that's easy).

> 	Consequently the restored kernel cannot use any preserved ACPI
> 	state, since this state gets wiped out by the boot kernel.
> 	Information about hardware changes might be available to the
> 	boot kernel, which could in principle then decide not to restore 
> 	the image.  It's not clear that this would be a good idea.  In
> 	any case, ACPI is limited to knowledge about devices on the
> 	motherboard -- it knows nothing about hotplugged devices, which
> 	makes the information less useful.
> 
> 	Hibernation allows the user to choose whether to go to S4 or S5
> 	by means of /sys/power/disk.  Therefore the user gets to decide
> 	how the power-consumption vs. wakeup-functionality tradeoff
> 	should be made.
> 
> In short, the boot kernel should do whatever it needs to in order to make
> ACPI happy.  This might involve telling ACPI that it has successfully
> resumed from S4, even though the boot kernel is unaware of system state at
> the start of hibernation.  In fact, the boot kernel has to take care of
> all this before it even knows whether a valid image exists in the swap
> partition.
> 
> Putting this together, it says that there should be no impediment to doing
> a fresh boot from S4; i.e., not restoring a memory image but simply
> letting the boot kernel continue on with a normal startup.  The corollary
> is that there should be no impediment to entering S4 during a normal
> shutdown.
> 
> From the user's point of view, the differences between S4 and S5 amount to
> just these: power consumption and availability of wakeup devices.  
> (Perhaps also the presence of a blinking LED -- but in my experience the
> blinking LED indicates STR, not hibernation.)  In the end, this is nothing 
> more than the usual tradeoff between power usage and functionality.
> 
> We give the user a chance to decide how this tradeoff should go when 
> entering hibernation.  Why not also give the user a chance to decide the 
> tradeoff during normal shutdown?
> 
> Yes, it violates the spec in the sense that we would be entering S4 
> without saving a memory image.  But we _already_ violate the spec by not 
> using a bootloader to restore the image.  I don't see this as being any 
> worse.
> 
> 
> Finally, what about non-ACPI systems?  Basically this boils down to two 
> choices:
> 
> 	Should a memory image be stored?
> 
> 	How much power/wakeup-functionality should the system
> 	consume/provide while it is down?
> 
> The first choice is decided by the user, by either entering hibernation or 
> shutting down.  Why shouldn't the second also be decided by the user?

I generally agree.

Moreover, it doesn't seem to be necessary to assume that the image should
be created and saved *after* we've put devices into low power states and
prepared ACPI for the power transition.  I think it's equally possible to
create and save the image *before* the power transition is initiated.

Greetings,
Rafael


> 
> Alan Stern
> 
> 
> 

-- 
If you don't have the time to read,
you don't have the time or the tools to write.
		- Stephen King

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
@ 2007-05-08 21:26                                                                                           ` Alan Stern
  0 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-08 21:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Tue, 8 May 2007, Rafael J. Wysocki wrote:

> On Tuesday, 8 May 2007 16:56, Alan Stern wrote:
> > On Tue, 8 May 2007, Rafael J. Wysocki wrote:
> > 
> > > As far as I understand it, for S4 the platform provides a means for verifying
> > > if the hardware wasn't changed too much while the system was "sleeping" (via
> > > the NVS memory region).
> > 
> > Rereading p.20, it appears to go the other way: The system checks for
> > hardware changes when booting from Soft Off.
> 
> Nope.  That's clarified later on.  Please read Section 15, "Waking and
> Sleeping" (it's short ;-)), in particular 15.3.3.

You're right.  It says specifically that when booting from an S4 state,
the bootloader compares the signature in the NVS image with hardware
signature in the BIOS's FACS table.  (Although Figure 15-5 makes no 
mention of different pathways for S4 and S5.)

Does the Linux boot kernel actually do the comparison?

Chapter 15 doesn't seem to take into account the possibility that the
computer might be unplugged after entering S4.  It talks about the next 
wakeup being a wake from S4 -- although the actions of the BIOS are 
supposed to be the same when waking from S4 or booting from S5.  In either 
case the BIOS runs the POST and initializes the ACPI tables.  Only the 
actions of the bootloader are different.

So how is the bootloader supposed to know whether it is booting from S4 or
S5?  Does it just assume that the presence of a valid NVS image indicates
an S4 boot, even though it may really be booting from S5?
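
To make the question concrete, here is a minimal sketch of the check that
section 15.3.3 seems to describe. It is only an illustration: the struct,
the enum and the function name are hypothetical, and the one thing taken
from the spec is that the FACS hardware signature is a 32-bit value. Note
that nothing in it asks whether the previous state was S4 or S5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what the resume path knows at boot time. */
struct resume_info {
	bool     image_valid;   /* a well-formed image header was found */
	uint32_t image_hw_sig;  /* signature recorded when the image was written */
	uint32_t facs_hw_sig;   /* signature the firmware reports on this boot */
};

enum boot_action { BOOT_FRESH, BOOT_RESUME };

/*
 * Decide between restoring the image and doing a normal startup.
 * The decision depends only on the image and the two signatures,
 * not on which sleep state the machine was in.
 */
static enum boot_action choose_boot_action(const struct resume_info *info)
{
	assert(info);

	if (!info->image_valid)
		return BOOT_FRESH;
	if (info->image_hw_sig != info->facs_hw_sig)
		return BOOT_FRESH;	/* "essential" hardware changed */
	return BOOT_RESUME;
}
```

Under these assumptions an image left over from an S4 entry would be
restored even if the machine had been unplugged in between, which is the
case the questions above are about.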

> > Sounds a lot like USB's power sessions...
> 
> Well, not exactly that.  The hardware signature in FACS only covers some
> "essential" hardware (I'm not sure what that is, probably depends on the
> platform design).

15.1.4.1 says:

	A change in hardware configuration is defined to be any change in
	the platform hardware that would cause the platform to fail when
	trying to restore the S4 context; this hardware is normally
	limited to boot devices.  For example, changing the graphics
	adapter or hard disk controller while in the S4 state should cause
	the hardware signature to change.  On the other hand, removing or
	adding a PC Card device from a PC Card slot should not cause the
	hardware signature to change.

Take it for what it's worth.


> > 	The boot kernel can't make much use of the state preserved by
> > 	ACPI because it doesn't have access to the image kernel's
> > 	records.  It needs to reinitialize ACPI no matter what.
> 
> To be precise, it usually needs to initialize ACPI to read the image (drivers
> use ACPI to some extent).  In principle we could make it behave as though
> ACPI were not compiled in and read the image while being in that state.
> Then, it could use the ACPI state information contained in the image
> (it would have to be pointed to by the image header, but that's easy).

In other words, make the boot kernel act as a bootloader.

Isn't this likely to cause problems?  There must be plenty of systems that
won't work properly without ACPI.  Certainly there are reported cases of
IRQ routing being wrong (and also cases where it is wrong only when ACPI
_is_ in use).


> I generally agree.
> 
> Moreover, it doesn't seem to be necessary to assume that the image should
> be created and saved *after* we've put devices into low power states and
> prepared ACPI for the power transition.  I think it's equally possible to
> create and save the image *before* the power transition is initiated.

Possible and desirable, both.

Okay, so the two of us are in agreement.  I don't know about anyone else, 
though...  :-)

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-08 14:56                                                                                       ` Alan Stern
  2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
@ 2007-05-09  8:17                                                                                         ` Pavel Machek
  2007-05-09 15:21                                                                                           ` Alan Stern
  2007-05-09 19:35                                                                                         ` David Brownell
  2 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-05-09  8:17 UTC (permalink / raw)
  To: Alan Stern; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg

Hi!

> We give the user a chance to decide how this tradeoff should go when 
> entering hibernation.  Why not also give the user a chance to decide the 
> tradeoff during normal shutdown?
> 
> Yes, it violates the spec in the sense that we would be entering S4 
> without saving a memory image.  

I think you already replied to yourself :-).

There are more reasons, like us getting useless code paths to
debug. So far you demonstrated that S4-on-shutdown is probably
possible, and while violating specs, it should probably work.

What do you expect now? Me jumping with joy and implementing
S4-on-shutdown because it should be possible?

Now... if you feel very strongly about S4-on-shutdown, you may try to
create a patch. If it is not-too-ugly, and if it is really good for
something, we may merge it.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-09  8:17                                                                                         ` Pavel Machek
@ 2007-05-09 15:21                                                                                           ` Alan Stern
  0 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-09 15:21 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg

On Wed, 9 May 2007, Pavel Machek wrote:

> Hi!
> 
> > We give the user a chance to decide how this tradeoff should go when 
> > entering hibernation.  Why not also give the user a chance to decide the 
> > tradeoff during normal shutdown?
> > 
> > Yes, it violates the spec in the sense that we would be entering S4 
> > without saving a memory image.  
> 
> I think you already replied to yourself :-).

Yes -- but going to S5 during hibernation (which is what "echo shutdown 
>/sys/power/disk" does, right?) also violates the spec.  So I don't feel 
too guilty about this.
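
For reference, the sequence in question can be sketched from user space
roughly as follows. This is an illustration only: the helper names are
hypothetical, and the directory is passed in as a parameter so the code
can be tried against a scratch directory instead of the real /sys/power.

```c
#include <assert.h>
#include <stdio.h>

/* Write a short string to one attribute file; 0 on success, -1 on error. */
static int write_attr(const char *path, const char *value)
{
	FILE *f;
	int err = 0;

	assert(path && value);

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF)
		err = -1;
	if (fclose(f) == EOF)
		err = -1;
	return err;
}

/*
 * Select how the machine goes down after the image is written
 * (the "disk" attribute), then start hibernation (the "state"
 * attribute).  power_dir would be "/sys/power" on a real system.
 */
static int hibernate_with_method(const char *power_dir, const char *method)
{
	char path[256];
	int n;

	assert(power_dir && method);

	n = snprintf(path, sizeof(path), "%s/disk", power_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (write_attr(path, method) != 0)
		return -1;

	n = snprintf(path, sizeof(path), "%s/state", power_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_attr(path, "disk");
}
```

Calling hibernate_with_method("/sys/power", "shutdown") as root would be
the S5 variant discussed here; "platform" should be the value that asks
for the S4 path instead. Both calls really do hibernate the machine, so
try it on a scratch directory first.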

> There are more reasons, like us getting useless code paths to
> debug. So far you demonstrated that S4-on-shutdown is probably
> possible, and while violating specs, it should probably work.
> 
> What do you expect now? Me jumping with joy and implementing
> S4-on-shutdown because it should be possible?

Actually all I wanted was someone to look over my reasoning and check that 
it was correct.  You and Rafael have now done so, thank you.

And when I first began contributing to this thread, the main purpose was 
to point out that hibernation_ops (or anything else related to the 
shutdown method) should not be involved in the steps responsible for 
creating and storing the snapshot image.
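
A minimal sketch of that separation, under stated assumptions: the struct
below is a hypothetical stand-in, not the real hibernation_ops, and the
stub functions only record the order in which they run. The point is that
the image is created and stored first, and that the shutdown method is not
consulted until afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shutdown method: how the machine goes down afterwards. */
struct shutdown_method {
	const char *name;
	int (*prepare)(void);	/* e.g. tell the platform what is coming */
	int (*enter)(void);	/* actually power down */
};

/* Trace of the steps taken, one character per step. */
static char trace[8];
static size_t trace_len;

static int record(char c)
{
	assert(trace_len + 1 < sizeof(trace));
	trace[trace_len++] = c;
	trace[trace_len] = '\0';
	return 0;
}

/* Stubs standing in for the real work. */
static int save_image_stub(void) { return record('I'); }
static int prepare_stub(void)    { return record('P'); }
static int enter_stub(void)      { return record('E'); }

/*
 * Create and store the image first; only then involve the shutdown
 * method.  save_image() never sees the method at all.
 */
static int hibernate(int (*save_image)(void), const struct shutdown_method *m)
{
	int err;

	assert(save_image && m && m->prepare && m->enter);

	err = save_image();
	if (err)
		return err;
	err = m->prepare();
	if (err)
		return err;
	return m->enter();
}
```

With this shape, choosing S4 or S5 (or anything else) is purely a matter
of which shutdown_method is passed in.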

> Now... if you feel very strongly about S4-on-shutdown, you may try to
> create a patch. If it is not-too-ugly, and if it is really good for
> something, we may merge it.

At some time I might just do that...

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)
  2007-05-07 22:16                                                                                                     ` Rafael J. Wysocki
@ 2007-05-09 19:23                                                                                                       ` David Brownell
  0 siblings, 0 replies; 712+ messages in thread
From: David Brownell @ 2007-05-09 19:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg,
	Linux-pm mailing list

On Monday 07 May 2007, Rafael J. Wysocki wrote:
> On Monday, 7 May 2007 23:45, David Brownell wrote:
> > On Monday 07 May 2007, Rafael J. Wysocki wrote:
> > > On Monday, 7 May 2007 03:16, David Brownell wrote:
> > 
> > > > So for now I have drivers/acpi/sleep/main.c exporting
> > > > 
> > > >         s_state = acpi_get_target_sleep_state();
> > > > 
> > > > so that ACPI-aware code can know to call "_S3D" instead of
> > > > the "_S1D" or "_S4D" methods (and "_S3W" etc).  Of course
> > > > the $SUBJECT patch will finish borking that for S4.  :(
> > > 
> > > Why exactly?
> > 
> > Because it adds new code paths ... currently pm_ops methods
> > record the target state.  Fixable later.
> 
> Hmm, I think hibernation_ops do the equivalent of what pm_ops did for
> ACPI_STATE_S4 and the target state is still recorded (in
> acpi_enter_sleep_state_prep()).  Isn't that correct?

I didn't use that method, because of information hiding.

See the patch I just posted.

- Dave
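
As an aside on the method names quoted above ("_S1D", "_S3D", "_S4D",
"_S3W"): they follow a regular pattern, so code that knows the target
sleep state can derive the name to evaluate. The helper below is a
hypothetical sketch, not anything from the posted patch, and it assumes
only states 1 through 4 and the 'D' and 'W' suffixes mentioned here.

```c
#include <assert.h>

/*
 * Build the ACPI method name for a target sleep state, e.g.
 * state 3 with suffix 'D' gives "_S3D".  Returns 0 on success,
 * -1 if the state or suffix is outside what this sketch handles.
 */
static int acpi_sx_method_name(int sleep_state, char suffix, char out[5])
{
	assert(out);

	if (sleep_state < 1 || sleep_state > 4)
		return -1;
	if (suffix != 'D' && suffix != 'W')
		return -1;

	out[0] = '_';
	out[1] = 'S';
	out[2] = (char)('0' + sleep_state);
	out[3] = suffix;
	out[4] = '\0';
	return 0;
}
```

The state number passed in would come from whatever records the target
state, such as the acpi_get_target_sleep_state() call quoted above.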

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-08 14:56                                                                                       ` Alan Stern
  2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
  2007-05-09  8:17                                                                                         ` Pavel Machek
@ 2007-05-09 19:35                                                                                         ` David Brownell
  2007-05-09 20:04                                                                                           ` Alan Stern
  2 siblings, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-09 19:35 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Tuesday 08 May 2007, Alan Stern wrote:

> So here's the current state of my understanding of ACPI:
> 
> 	S4 is the lowest-power Sleep state.  RAM is not powered, the OS
> 	has stored a non-volatile memory image somewhere, and some ACPI
> 	state is maintained.
> 
> 	S5 is misnamed, in that it isn't really a Sleep state at all --
> 	it's an Off state. 

It's called "Soft Off" ... :)

The reason it resembles a sleep state is that various events other
than power switches are allowed to wake systems in S5.  RTC alarms
and keyboard events come to mind as common examples.

Agreed that the distinction between S4 and S5 seems too much in the
category of "because we said so!" than because of real technical
differences (beyond presence/absence of a non-volatile image, and
a few additional wakeup event sources).


> 			In fact, it is the state the computer enters 
> 	when you first plug it in (or insert the battery).

No; again, you're missing the entire point of G3 "mechanical off".

When you first plug it in, it's going to be in G3.  Then you turn
on the power switch.  Then you press the "on/off" button.

From then on you can use only the "on/off" button, but the system
is vampiric ... when off/dead, it can choose to come alive, and is
always sucking power/blood at a low level.

But the "large red switch" option is available to put the system
into G3 ... driving a bloody stake through its heart, so it can't
re-activate itself at midnight, and preventing constant power drain.


> From the user's point of view, the differences between S4 and S5 amount to
> just these: power consumption and availability of wakeup devices.

And the fact that in S4 there's always a resumable OS image.

- Dave

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-09 19:35                                                                                         ` David Brownell
@ 2007-05-09 20:04                                                                                           ` Alan Stern
  2007-05-09 20:21                                                                                             ` David Brownell
  2007-05-09 21:07                                                                                             ` Pavel Machek
  0 siblings, 2 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-09 20:04 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Wed, 9 May 2007, David Brownell wrote:

> > 			In fact, it is the state the computer enters 
> > 	when you first plug it in (or insert the battery).
> 
> No; again, you're missing the entire point of G3 "mechanical off".
> 
> When you first plug it in, it's going to be in G3.  Then you turn
> on the power switch.  Then you press the "on/off" button.
> 
> From then on you can use only the "on/off" button, but the system
> is vampiric ... when off/dead, it can choose to come alive, and is
> always sucking power/blood at a low level.
> 
> But the "large red switch" option is available to put the system
> into G3 ... driving a bloody stake through its heart, so it can't
> re-activate itself at midnight, and preventing constant power drain.

Sorry.  What I meant to say was that S5 is the state the computer enters 
when you first plug it in and turn on the power switch -- before you press 
the on/off button.

> > From the user's point of view, the differences between S4 and S5 amount to
> > just these: power consumption and availability of wakeup devices.
> 
> And the fact that in S4 there's always a resumable OS image.

Are you sure?  What happens if the OSPM writes a defective, non-resumable 
OS image and then goes into S4?

What happens if the OS writes a resumable OS image and goes into S4, and 
then the user unplugs the computer, plugs it back in, and turns the power 
switch on?  At that point the system must be in S5 (by definition), but 
there's still a resumable image.

Alan Stern

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-09 20:04                                                                                           ` Alan Stern
@ 2007-05-09 20:21                                                                                             ` David Brownell
  2007-05-10 15:17                                                                                               ` Alan Stern
  2007-05-09 21:07                                                                                             ` Pavel Machek
  1 sibling, 1 reply; 712+ messages in thread
From: David Brownell @ 2007-05-09 20:21 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

> > > From the user's point of view, the differences between S4 and S5 amount to
> > > just these: power consumption and availability of wakeup devices.
> > 
> > And the fact that in S4 there's always a resumable OS image.
> 
> Are you sure?  What happens if the OSPM writes a defective, non-resumable 
> OS image and then goes into S4?

The ACPI spec omits all such error transitions.  As well as
a fair number of non-error ones ... like how to enter G3.


> What happens if the OS writes a resumable OS image and goes into S4, and 
> then the user unplugs the computer, plugs it back in, and turns the power 
> switch on?  At that point the system must be in S5 (by definition), but 
> there's still a resumable image.

As allowed by the chapter 2 text I pointed out earlier.
S4 *always* has such an image.

- Dave

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-09 20:04                                                                                           ` Alan Stern
  2007-05-09 20:21                                                                                             ` David Brownell
@ 2007-05-09 21:07                                                                                             ` Pavel Machek
  1 sibling, 0 replies; 712+ messages in thread
From: Pavel Machek @ 2007-05-09 21:07 UTC (permalink / raw)
  To: Alan Stern; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg

Hi!

> > > 			In fact, it is the state the computer enters 
> > > 	when you first plug it in (or insert the battery).
> > 
> > No; again, you're missing the entire point of G3 "mechanical off".
> > 
> > When you first plug it in, it's going to be in G3.  Then you turn
> > on the power switch.  Then you press the "on/off" button.
> > 
> > From then on you can use only the "on/off" button, but the system
> > is vampiric ... when off/dead, it can choose to come alive, and is
> > always sucking power/blood at a low level.
> > 
> > But the "large red switch" option is available to put the system
> > into G3 ... driving a bloody stake through its heart, so it can't
> > re-activate itself at midnight, and preventing constant power drain.
> 
> Sorry.  What I meant to say was that S5 is the state the computer enters 
> when you first plug it in and turn on the power switch -- before you press 
> the on/off button.

Actually... some machines just power on when you first plug them in,
and some others have it configurable in the BIOS.

For a server, you want it to power up after a power failure.

For a home desktop, you definitely want it to stay powered off after
a power failure.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...))
  2007-05-09 20:21                                                                                             ` David Brownell
@ 2007-05-10 15:17                                                                                               ` Alan Stern
  0 siblings, 0 replies; 712+ messages in thread
From: Alan Stern @ 2007-05-10 15:17 UTC (permalink / raw)
  To: David Brownell
  Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg

On Wed, 9 May 2007, David Brownell wrote:

> > > > From the user's point of view, the differences between S4 and S5 amount to
> > > > just these: power consumption and availability of wakeup devices.
> > > 
> > > And the fact that in S4 there's always a resumable OS image.

> > What happens if the OS writes a resumable OS image and goes into S4, and 
> > then the user unplugs the computer, plugs it back in, and turns the power 
> > switch on?  At that point the system must be in S5 (by definition), but 
> > there's still a resumable image.
> 
> As allowed by the chapter 2 text I pointed out earlier.
> S4 *always* has such an image.

So the correct statement is that S4 always has a resumable OS image and S5
may have a resumable image.  From a user's point of view that doesn't
sound like much of a difference, especially since the image can be 
successfully restored from either state.

Alan Stern

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:23                         ` Pavel Machek
                                             ` (3 preceding siblings ...)
  2007-04-25 19:43                           ` Kenneth Crudup
@ 2007-05-26 17:37                           ` Martin Steigerwald
  2007-05-26 20:35                             ` Rafael J. Wysocki
  4 siblings, 1 reply; 712+ messages in thread
From: Martin Steigerwald @ 2007-05-26 17:37 UTC (permalink / raw)
  To: suspend2-devel
  Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith,
	linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar,
	Andrew Morton, Arjan van de Ven

On Wednesday, 25 April 2007, Pavel Machek wrote:
> Hi!
>
> > This is why there's a lot to be said for
> >
> > 	echo mem > /sys/power/state
> >
> > and being able to follow the path through _one_ object (the kernel)
> > over trying to figure out the interaction between many different
> > parts with different versions.
>
> The 'promise' is 'if you can get echo disk > /sys/power/state working,
> uswsusp will work, too'. IOW it should be ok to debug the in-kernel
> parts, only.

Hello Nigel, Pavel, Rafael and everyone else who is involved,

I would like to ask what came out of the suspend2 merge discussion. Nigel
just said that suspend2 likely won't be merged anytime soon and that it's
business as usual:

---------------------------------------------------------------------
It's pretty much business as usual. Linus doesn't want another
implementation merged, and he wants the three of us (Pavel, Rafael and
myself) to agree on a way forward. He also believes that we're
approaching things from the wrong direction at the moment. Funnily
enough, this is the one area on which we do all agree.
---------------------------------------------------------------------
http://lists.suspend2.net/lurker/message/20070510.021641.fe306add.en.html


Has there been any further discussion and preferably agreement on the way 
to go forward?

Although you, Linus, as I gather from various mails, only use suspend to
RAM, there are many users out there who use suspend to disk daily.

I used in-kernel software suspend initially, and it worked quite nicely
starting from 2.6.10 or 2.6.11, whereas suspend2 didn't work for me before
2.6.14 with the hibernate script. But from then on suspend2 worked better
than in-kernel software suspend for me and my colleagues on:

- ThinkPad T23
- ThinkPad T42
- Possibly some other ThinkPads
- as well as various Dell workstations we have at work

It was faster and more reliable, yielding uptimes of up to 40 days on my
workstation recently (still with 2.6.17.7). And even that uptime was only
ended by booting a newly built kernel (2.6.21 with sws 2.2.9.13). For me,
in the role of a user, this is a really satisfying solution!

I tried userspace software suspend from time to time but then just got fed
up with it, because I could not get it to work within any sensible amount
of time, even with a bog-standard Debian kernel (I think it was some
2.6.18 one). Maybe I am dumb, but so be it; it should not be that
complicated to get it to work. Recently I didn't even bother to try
anymore. And I read in the suspend2 merge discussion that even in-kernel
suspend does not work reliably anymore.

As long as I cannot be convinced that the vanilla kernel contains a
suspend to disk solution that works as well as suspend2, I will patch
suspend2 into all of the desktop kernels I build.

That's quite bad IMHO, for exactly the same reason as having drivers
maintained outside the kernel tree. For the same reason I think swap
prefetch should go in as soon as possible. An out-of-tree patch will never
get the adoption and care of an in-kernel-tree solution.

I am convinced that a working suspend to RAM alone is not enough (and it
wasn't working correctly last time I checked). But I don't even bother
with suspend to RAM anymore. I can wait those additional seconds for
suspend to disk, and it allows me to run my laptops without batteries
most of the time and have workstations switched off completely so that
they do not consume standby power.

So please, pretty please, consider working together on a reliable, fast,
stable, easy to use and configurable in-kernel-tree snapshot solution!
As a user I couldn't care less about the implementation details, but as
someone who is interested in kernel technologies I would like it to be a
clean and well designed solution, too. ;-)

Maybe it would help if the Linux Foundation organized a meeting for Nigel,
Pavel, Rafael and other kernel developers interested in creating such a
solution. To me it seems such a concentrated meeting in a good atmosphere
could be more effective than endless mailing list discussions that do not
lead to a clear result. If it's not easy for the people involved to work
together, maybe a casual bystander who understands enough of the kernel
details should moderate the meeting and help find an agreement.

It would just be such a pity to miss the chance to have a nicely working
snapshot solution in the Linux kernel, one that may even be interesting
for virtualization (you could store a backup of a machine state
permanently, or even several of them, if that is not already available
through other technologies such as, well, suspend2 with filewriter).

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-05-26 17:37                           ` Martin Steigerwald
@ 2007-05-26 20:35                             ` Rafael J. Wysocki
  2007-05-26 22:23                               ` Martin Steigerwald
  0 siblings, 1 reply; 712+ messages in thread
From: Rafael J. Wysocki @ 2007-05-26 20:35 UTC (permalink / raw)
  To: Martin Steigerwald
  Cc: suspend2-devel, Pavel Machek, Linus Torvalds, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Pekka J Enberg

On Saturday, 26 May 2007 19:37, Martin Steigerwald wrote:
> On Wednesday, 25 April 2007, Pavel Machek wrote:
> > Hi!
> >
> > > This is why there's a lot to be said for
> > >
> > > 	echo mem > /sys/power/state
> > >
> > > and being able to follow the path through _one_ object (the kernel)
> > > over trying to figure out the interaction between many different
> > > parts with different versions.
> >
> > The 'promise' is 'if you can get echo disk > /sys/power/state working,
> > uswsusp will work, too'. IOW it should be ok to debug the in-kernel
> > parts, only.
> 
> Hello Nigel, Pavel, Rafael and everyone else who is involved,
> 
> I would like to ask what came out of the suspend2 merge discussion. Nigel
> just said that suspend2 likely won't be merged anytime soon and that it's
> business as usual:
> 
> ---------------------------------------------------------------------
> It's pretty much business as usual. Linus doesn't want another
> implementation merged, and he wants the three of us (Pavel, Rafael and
> myself) to agree on a way forward. He also believes that we're
> approaching things from the wrong direction at the moment. Funnily
> enough, this is the one area on which we do all agree.
> ---------------------------------------------------------------------
> http://lists.suspend2.net/lurker/message/20070510.021641.fe306add.en.html
> 
> 
> Has there been any further discussion and preferably agreement on the way 
> to go forward?

The outcome was, more-or-less, that we'll work on merging suspend2 or at least
some parts of it.

However, in the meantime there have been some discussions implying that we have
some important problems with suspend/hibernation that suspend2 doesn't solve
and that IMHO are more urgent than the merging of suspend2 right now.

So, as far as I'm concerned, the plan is to fix the more urgent problems first
and to work on merging suspend2 as far as there's time to do this.

The problem is there are only a few people working on it and there's a lot to
do, so I can only ask you to be patient. ;-)

Greetings,
Rafael

* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-05-26 20:35                             ` Rafael J. Wysocki
@ 2007-05-26 22:23                               ` Martin Steigerwald
  0 siblings, 0 replies; 712+ messages in thread
From: Martin Steigerwald @ 2007-05-26 22:23 UTC (permalink / raw)
  To: suspend2-devel
  Cc: Rafael J. Wysocki, Pavel Machek, Linus Torvalds, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	Ingo Molnar, Andrew Morton, Arjan van de Ven, Pekka J Enberg

On Saturday, 26 May 2007, Rafael J. Wysocki wrote:

Hi Rafael!

> The outcome was, more-or-less, that we'll work on merging suspend2 or
> at least some parts of it.
>
> However, in the meantime there have been some discussions implying that
> we have some important problems with suspend/hibernation that suspend2
> doesn't solve and that IMHO are more urgent than the merging of
> suspend2 right now.
>
> So, as far as I'm concerned, the plan is to fix the more urgent
> problems first and to work on merging suspend2 as far as there's time
> to do this.
>
> The problem is there are only a few people working on it and there's a
> lot to do, so I can only ask you to be patient. ;-)

That's fine with me, I understand that. I just thought that there had been
no outcome at all.

I will try to be patient as long as I have not dug into kernel hacking
deeply enough to be able to help with that. So far I have not done more
than combine two conflicting patches to compile my own kernels and
forward-port a patch for a sundance network card.  I can help with
testing once there is something testable, though.

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

* [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy))
  2007-04-26 13:45                             ` Johannes Berg
@ 2007-06-29 22:44                               ` Pavel Machek
  2007-06-30  0:06                                 ` Adrian Bunk
  0 siblings, 1 reply; 712+ messages in thread
From: Pavel Machek @ 2007-06-29 22:44 UTC (permalink / raw)
  To: Johannes Berg, Andrew Morton
  Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel,
	Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar,
	Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> By the way.
> 
> > diff --git a/kernel/power/power.h b/kernel/power/power.h
> > index eb461b8..dc13af5 100644
> > --- a/kernel/power/power.h
> > +++ b/kernel/power/power.h
>         ^^^^^^^^^^^^^^^^^^^^
> 
> Don't these definitions need to be exported to userspace? That
> definitely is not a header file for userspace.

Yes, they do. Does this look like a fix?
									Pavel

--- 

Split the user-interface part of power.h into a separate file.

Signed-off-by: Pavel Machek <pavel@suse.cz>


diff --git a/include/linux/power.h b/include/linux/power.h
new file mode 100644
index 0000000..37bf890
--- /dev/null
+++ b/include/linux/power.h
@@ -0,0 +1,31 @@
+#ifndef INCLUDE_LINUX_POWER_H
+#define INCLUDE_LINUX_POWER_H
+
+/*
+ * This structure is used to pass the values needed for the identification
+ * of the resume swap area from a user space to the kernel via the
+ * SNAPSHOT_SET_SWAP_AREA ioctl
+ */
+struct resume_swap_area {
+	u_int64_t offset;
+	u_int32_t dev;
+} __attribute__((packed));
+
+#define SNAPSHOT_IOC_MAGIC	'3'
+#define SNAPSHOT_FREEZE			_IO(SNAPSHOT_IOC_MAGIC, 1)
+#define SNAPSHOT_UNFREEZE		_IO(SNAPSHOT_IOC_MAGIC, 2)
+#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
+#define SNAPSHOT_ATOMIC_RESTORE		_IO(SNAPSHOT_IOC_MAGIC, 4)
+#define SNAPSHOT_FREE			_IO(SNAPSHOT_IOC_MAGIC, 5)
+#define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
+#define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
+#define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
+#define SNAPSHOT_FREE_SWAP_PAGES	_IO(SNAPSHOT_IOC_MAGIC, 9)
+#define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
+#define SNAPSHOT_S2RAM			_IO(SNAPSHOT_IOC_MAGIC, 11)
+#define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
+#define SNAPSHOT_SET_SWAP_AREA		_IOW(SNAPSHOT_IOC_MAGIC, 13, \
+							struct resume_swap_area)
+#define SNAPSHOT_IOC_MAXNR	13
+
+#endif
diff --git a/kernel/power/power.h b/kernel/power/power.h
index 41d33eb..e68352b 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -1,5 +1,9 @@
+#ifndef KERNEL_POWER_POWER_H
+#define KERNEL_POWER_POWER_H
+
 #include <linux/suspend.h>
 #include <linux/utsname.h>
+#include <linux/power.h>
 
 struct swsusp_info {
 	struct new_utsname	uts;
@@ -114,33 +118,6 @@ extern int snapshot_write_next(struct sn
 extern void snapshot_write_finalize(struct snapshot_handle *handle);
 extern int snapshot_image_loaded(struct snapshot_handle *handle);
 
-/*
- * This structure is used to pass the values needed for the identification
- * of the resume swap area from a user space to the kernel via the
- * SNAPSHOT_SET_SWAP_AREA ioctl
- */
-struct resume_swap_area {
-	u_int64_t offset;
-	u_int32_t dev;
-} __attribute__((packed));
-
-#define SNAPSHOT_IOC_MAGIC	'3'
-#define SNAPSHOT_FREEZE			_IO(SNAPSHOT_IOC_MAGIC, 1)
-#define SNAPSHOT_UNFREEZE		_IO(SNAPSHOT_IOC_MAGIC, 2)
-#define SNAPSHOT_ATOMIC_SNAPSHOT	_IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
-#define SNAPSHOT_ATOMIC_RESTORE		_IO(SNAPSHOT_IOC_MAGIC, 4)
-#define SNAPSHOT_FREE			_IO(SNAPSHOT_IOC_MAGIC, 5)
-#define SNAPSHOT_SET_IMAGE_SIZE		_IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
-#define SNAPSHOT_AVAIL_SWAP		_IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
-#define SNAPSHOT_GET_SWAP_PAGE		_IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
-#define SNAPSHOT_FREE_SWAP_PAGES	_IO(SNAPSHOT_IOC_MAGIC, 9)
-#define SNAPSHOT_SET_SWAP_FILE		_IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
-#define SNAPSHOT_S2RAM			_IO(SNAPSHOT_IOC_MAGIC, 11)
-#define SNAPSHOT_PMOPS			_IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
-#define SNAPSHOT_SET_SWAP_AREA		_IOW(SNAPSHOT_IOC_MAGIC, 13, \
-							struct resume_swap_area)
-#define SNAPSHOT_IOC_MAXNR	13
-
 #define PMOPS_PREPARE	1
 #define PMOPS_ENTER	2
 #define PMOPS_FINISH	3
@@ -165,3 +142,5 @@ extern int suspend_enter(suspend_state_t
 struct timeval;
 extern void swsusp_show_speed(struct timeval *, struct timeval *,
 				unsigned int, char *);
+
+#endif

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
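
For readers following the patch from userspace, here is a minimal sketch,
assuming a Linux system where <linux/ioctl.h> is available, of what the
SNAPSHOT_SET_SWAP_AREA definition above encodes. The struct and the macro
values are copied from the patch; the fixed-width uint64_t/uint32_t types
stand in for the patch's u_int64_t/u_int32_t, and print_swap_area_ioctl()
is a hypothetical helper, not part of any kernel header. Because the
struct is packed, it occupies 12 bytes rather than the 16 a naturally
aligned layout would take on most 64-bit ABIs, and that size is encoded
in the ioctl number itself.

```c
#include <stdint.h>
#include <stdio.h>
#include <linux/ioctl.h>	/* _IOW, _IOC_NR, _IOC_TYPE, _IOC_SIZE (Linux-only) */

/* Copied from the patch; uint64_t/uint32_t replace u_int64_t/u_int32_t. */
struct resume_swap_area {
	uint64_t offset;
	uint32_t dev;
} __attribute__((packed));

#define SNAPSHOT_IOC_MAGIC	'3'
#define SNAPSHOT_SET_SWAP_AREA	_IOW(SNAPSHOT_IOC_MAGIC, 13, \
					struct resume_swap_area)

/* Hypothetical helper: decode the fields packed into the ioctl number. */
static inline void print_swap_area_ioctl(void)
{
	printf("magic='%c' nr=%u payload=%u bytes\n",
	       (char)_IOC_TYPE(SNAPSHOT_SET_SWAP_AREA),
	       (unsigned int)_IOC_NR(SNAPSHOT_SET_SWAP_AREA),
	       (unsigned int)_IOC_SIZE(SNAPSHOT_SET_SWAP_AREA));
}
```

Calling print_swap_area_ioctl() from a main() function reports magic '3',
command number 13 and a 12-byte payload. A userspace tool would fill in a
struct resume_swap_area and pass its address as the third ioctl() argument.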


* Re: [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy))
  2007-06-29 22:44                               ` [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
@ 2007-06-30  0:06                                 ` Adrian Bunk
  0 siblings, 0 replies; 712+ messages in thread
From: Adrian Bunk @ 2007-06-30  0:06 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Johannes Berg, Andrew Morton, Linus Torvalds, Nick Piggin,
	Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas,
	suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven,
	Rafael J. Wysocki

On Sat, Jun 30, 2007 at 12:44:22AM +0200, Pavel Machek wrote:
> Hi!
> 
> > By the way.
> > 
> > > diff --git a/kernel/power/power.h b/kernel/power/power.h
> > > index eb461b8..dc13af5 100644
> > > --- a/kernel/power/power.h
> > > +++ b/kernel/power/power.h
> >         ^^^^^^^^^^^^^^^^^^^^
> > 
> > Don't these definitions need to be exported to userspace? That
> > definitely is not a header file for userspace.
> 
> Yes, they do. Does this look like a fix?
> 									Pavel
> 
> --- 
> 
> Split the user-interface part of power.h into a separate file.
>...

You should also add it to include/linux/Kbuild.
 
cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



end of thread, other threads:[~2007-06-30  0:06 UTC | newest]

Thread overview: 712+ messages
2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
2007-04-13 20:27 ` Bill Huey
2007-04-13 20:55   ` Ingo Molnar
2007-04-13 21:21     ` William Lee Irwin III
2007-04-13 21:35       ` Bill Huey
2007-04-13 21:39       ` Ingo Molnar
2007-04-13 21:50 ` Ingo Molnar
2007-04-13 21:57 ` Michal Piotrowski
2007-04-13 22:15 ` Daniel Walker
2007-04-13 22:30   ` Ingo Molnar
2007-04-13 22:37     ` Willy Tarreau
2007-04-13 23:59     ` Daniel Walker
2007-04-14 10:55       ` Ingo Molnar
2007-04-13 22:21 ` William Lee Irwin III
2007-04-13 22:52   ` Ingo Molnar
2007-04-13 23:30     ` William Lee Irwin III
2007-04-13 23:44       ` Ingo Molnar
2007-04-13 23:58         ` William Lee Irwin III
2007-04-14 22:38   ` Davide Libenzi
2007-04-14 23:26     ` Davide Libenzi
2007-04-15  4:01     ` William Lee Irwin III
2007-04-15  4:18       ` Davide Libenzi
2007-04-15 23:09     ` Pavel Pisa
2007-04-16  5:47       ` Davide Libenzi
2007-04-17  0:37         ` Pavel Pisa
2007-04-13 22:31 ` Willy Tarreau
2007-04-13 23:18   ` Ingo Molnar
2007-04-14 18:48     ` Bill Huey
2007-04-13 23:07 ` Gabriel C
2007-04-13 23:25   ` Ingo Molnar
2007-04-13 23:39     ` Gabriel C
2007-04-14  2:04 ` Nick Piggin
2007-04-14  6:32   ` Ingo Molnar
2007-04-14  6:43     ` Ingo Molnar
2007-04-14  8:08       ` Willy Tarreau
2007-04-14  8:36         ` Willy Tarreau
2007-04-14 10:53           ` Ingo Molnar
2007-04-14 13:01             ` Willy Tarreau
2007-04-14 13:27               ` Willy Tarreau
2007-04-14 14:45                 ` Willy Tarreau
2007-04-14 16:14                   ` Ingo Molnar
2007-04-14 16:19                 ` Ingo Molnar
2007-04-14 17:15                   ` Eric W. Biederman
2007-04-14 17:29                     ` Willy Tarreau
2007-04-14 17:44                       ` Eric W. Biederman
2007-04-14 17:54                         ` Ingo Molnar
2007-04-14 18:18                           ` Willy Tarreau
2007-04-14 18:40                             ` Eric W. Biederman
2007-04-14 19:01                               ` Willy Tarreau
2007-04-15 17:55                             ` Ingo Molnar
2007-04-15 18:06                               ` Willy Tarreau
2007-04-15 19:20                                 ` Ingo Molnar
2007-04-15 19:35                                   ` William Lee Irwin III
2007-04-15 19:57                                     ` Ingo Molnar
2007-04-15 23:54                                       ` William Lee Irwin III
2007-04-16 11:24                                         ` Ingo Molnar
2007-04-16 13:46                                           ` William Lee Irwin III
2007-04-15 19:37                                   ` Ingo Molnar
2007-04-14 17:50                       ` Linus Torvalds
2007-04-15  7:54               ` Mike Galbraith
2007-04-15  8:58                 ` Ingo Molnar
2007-04-15  9:11                   ` Mike Galbraith
2007-04-19  9:01               ` Ingo Molnar
2007-04-19 12:54                 ` Willy Tarreau
2007-04-19 15:18                   ` Ingo Molnar
2007-04-19 17:34                     ` Gene Heskett
2007-04-19 18:45                     ` Willy Tarreau
2007-04-21 10:31                       ` Ingo Molnar
2007-04-21 10:38                         ` Ingo Molnar
2007-04-21 10:45                         ` Ingo Molnar
2007-04-21 11:07                           ` Willy Tarreau
2007-04-21 11:29                             ` Björn Steinbrink
2007-04-21 11:51                               ` Willy Tarreau
2007-04-19 23:52                     ` Jan Knutar
2007-04-20  5:05                       ` Willy Tarreau
2007-04-19 17:32                 ` Gene Heskett
2007-04-14 15:17             ` Mark Lord
2007-04-14 19:48           ` William Lee Irwin III
2007-04-14 20:12             ` Willy Tarreau
2007-04-14 10:36         ` Ingo Molnar
2007-04-14 15:09 ` S.Çağlar Onur
2007-04-14 16:09   ` Ingo Molnar
2007-04-14 16:59     ` S.Çağlar Onur
2007-04-15 16:13       ` Kaffeine problem with CFS Ingo Molnar
2007-04-15 16:25         ` Ingo Molnar
2007-04-15 16:55           ` Christoph Pfister
2007-04-15 22:14             ` S.Çağlar Onur
2007-04-18  8:27             ` Ingo Molnar
2007-04-18  8:57               ` Ingo Molnar
2007-04-18  9:06                 ` Ingo Molnar
2007-04-18  8:57               ` Christoph Pfister
2007-04-18  9:01                 ` Ingo Molnar
2007-04-18  9:12                   ` Mike Galbraith
2007-04-18  9:13                   ` Christoph Pfister
2007-04-18  9:17                     ` Ingo Molnar
2007-04-18  9:25                       ` Christoph Pfister
2007-04-18  9:28                         ` Ingo Molnar
2007-04-18  9:52                           ` Christoph Pfister
2007-04-18 10:04                             ` Christoph Pfister
2007-04-18 10:17                             ` Ingo Molnar
2007-04-18 10:32                               ` Ingo Molnar
2007-04-18 10:37                                 ` Ingo Molnar
2007-04-18 10:49                                   ` Ingo Molnar
2007-04-18 10:53                                 ` Ingo Molnar
     [not found]             ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
     [not found]               ` <19a3b7a80704180555q4e0b26d5x54bbf34b4cd9d33e@mail.gmail.com>
2007-04-18 13:05                 ` S.Çağlar Onur
2007-04-18 13:21                 ` Christoph Pfister
2007-04-18 13:25                   ` S.Çağlar Onur
2007-04-18 15:48                     ` Ingo Molnar
2007-04-18 16:07                       ` William Lee Irwin III
2007-04-18 16:14                         ` Ingo Molnar
2007-04-18 21:08                       ` S.Çağlar Onur
2007-04-18 21:12                         ` Ingo Molnar
2007-04-20 19:31                         ` Bill Davidsen
2007-04-21  8:36                           ` Ingo Molnar
2007-04-18 15:08                   ` Ingo Molnar
2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
2007-04-15  5:16   ` Bill Huey
2007-04-15  8:44     ` Ingo Molnar
2007-04-15  9:51       ` Bill Huey
2007-04-15 10:39         ` Pekka Enberg
2007-04-15 12:45           ` Willy Tarreau
2007-04-15 13:08             ` Pekka J Enberg
2007-04-15 17:32               ` Mike Galbraith
2007-04-15 17:59                 ` Linus Torvalds
2007-04-15 19:00                   ` Jonathan Lundell
2007-04-15 22:52                     ` Con Kolivas
2007-04-16  2:28                       ` Nick Piggin
2007-04-16  3:15                         ` Con Kolivas
2007-04-16  3:34                           ` Nick Piggin
     [not found]                         ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>
2007-04-16  6:27                           ` [ck] " Nick Piggin
2007-04-15 15:26             ` William Lee Irwin III
2007-04-16 15:55               ` Chris Friesen
2007-04-16 16:13                 ` William Lee Irwin III
2007-04-17  0:04                 ` Peter Williams
2007-04-17 13:07                 ` James Bruce
2007-04-17 20:05                   ` William Lee Irwin III
2007-04-15 15:39             ` Ingo Molnar
2007-04-15 15:47               ` William Lee Irwin III
2007-04-16  5:27               ` Peter Williams
2007-04-16  6:23                 ` Peter Williams
2007-04-16  6:40                   ` Peter Williams
2007-04-16  7:32                     ` Ingo Molnar
2007-04-16  8:54                       ` Peter Williams
2007-04-15 15:16           ` Gene Heskett
2007-04-15 16:43             ` Con Kolivas
2007-04-15 16:58               ` Gene Heskett
2007-04-15 18:00                 ` Mike Galbraith
2007-04-16  0:18                   ` Gene Heskett
2007-04-15 16:11     ` Bernd Eckenfels
2007-04-15  6:43   ` Mike Galbraith
2007-04-15  8:36     ` Bill Huey
2007-04-15  8:45       ` Mike Galbraith
2007-04-15  9:06       ` Ingo Molnar
2007-04-16 10:00         ` Ingo Molnar
2007-04-15 16:25       ` Arjan van de Ven
2007-04-16  5:36         ` Bill Huey
2007-04-16  6:17           ` Nick Piggin
2007-04-17  0:06     ` Peter Williams
2007-04-17  2:29       ` Mike Galbraith
2007-04-17  3:40         ` Nick Piggin
2007-04-17  4:01           ` Mike Galbraith
2007-04-17  3:43             ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang
2007-04-17  4:14             ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin
2007-04-17  6:26               ` Peter Williams
2007-04-17  9:51               ` Ingo Molnar
2007-04-17 13:44                 ` Peter Williams
2007-04-17 23:00                   ` Michael K. Edwards
2007-04-17 23:07                     ` William Lee Irwin III
2007-04-17 23:52                       ` Michael K. Edwards
2007-04-18  0:36                         ` Bill Huey
2007-04-18  2:39                     ` Peter Williams
2007-04-20 20:47                 ` Bill Davidsen
2007-04-21  7:39                   ` Nick Piggin
2007-04-21  8:33                   ` Ingo Molnar
2007-04-20 20:36             ` Bill Davidsen
2007-04-17  4:17           ` Peter Williams
2007-04-17  4:29             ` Nick Piggin
2007-04-17  5:53               ` Willy Tarreau
2007-04-17  6:10                 ` Nick Piggin
2007-04-17  6:09               ` William Lee Irwin III
2007-04-17  6:15                 ` Nick Piggin
2007-04-17  6:26                   ` William Lee Irwin III
2007-04-17  7:01                     ` Nick Piggin
2007-04-17  8:23                       ` William Lee Irwin III
2007-04-17 22:23                         ` Davide Libenzi
2007-04-17 21:39                       ` Matt Mackall
2007-04-17 23:23                         ` Peter Williams
2007-04-17 23:19                           ` Matt Mackall
2007-04-18  3:15                         ` Nick Piggin
2007-04-18  3:45                           ` Mike Galbraith
2007-04-18  3:56                             ` Nick Piggin
2007-04-18  4:29                               ` Mike Galbraith
2007-04-18  4:38                           ` Matt Mackall
2007-04-18  5:00                             ` Nick Piggin
2007-04-18  5:55                               ` Matt Mackall
2007-04-18  6:37                                 ` Nick Piggin
2007-04-18  6:55                                   ` Matt Mackall
2007-04-18  7:24                                     ` Nick Piggin
2007-04-21 13:33                                     ` Bill Davidsen
2007-04-18 13:08                                 ` William Lee Irwin III
2007-04-18 19:48                                   ` Davide Libenzi
2007-04-18 14:48                                 ` Linus Torvalds
2007-04-18 15:23                                   ` Matt Mackall
2007-04-18 17:22                                     ` Linus Torvalds
2007-04-18 17:48                                       ` [ck] " Mark Glines
2007-04-18 19:27                                         ` Chris Friesen
2007-04-19  0:49                                           ` Peter Williams
2007-04-18 17:49                                       ` Ingo Molnar
2007-04-18 17:59                                         ` Ingo Molnar
2007-04-18 19:40                                           ` Linus Torvalds
2007-04-18 19:43                                             ` Ingo Molnar
2007-04-18 20:07                                             ` Davide Libenzi
2007-04-18 21:48                                               ` Ingo Molnar
2007-04-18 23:30                                                 ` Davide Libenzi
2007-04-19  8:00                                                   ` Ingo Molnar
2007-04-19 15:43                                                     ` Davide Libenzi
2007-04-21 14:09                                                     ` Bill Davidsen
2007-04-19 17:39                                                   ` Bernd Eckenfels
2007-04-19  6:52                                                 ` Mike Galbraith
2007-04-19  7:09                                                   ` Ingo Molnar
2007-04-19  7:32                                                     ` Mike Galbraith
2007-04-19 16:55                                                       ` Davide Libenzi
2007-04-20  5:16                                                         ` Mike Galbraith
2007-04-19  7:14                                                   ` Mike Galbraith
2007-04-18 21:04                                             ` Ingo Molnar
2007-04-18 19:23                                         ` Linus Torvalds
2007-04-18 19:56                                           ` Davide Libenzi
2007-04-18 20:11                                             ` Linus Torvalds
2007-04-19  0:22                                               ` Davide Libenzi
2007-04-19  0:30                                                 ` Linus Torvalds
2007-04-18 18:02                                       ` William Lee Irwin III
2007-04-18 18:12                                         ` Ingo Molnar
2007-04-18 18:36                                       ` Diego Calleja
2007-04-19  0:37                                       ` Peter Williams
2007-04-18 19:05                                     ` Davide Libenzi
2007-04-18 19:13                                     ` Michael K. Edwards
2007-04-19  3:18                                   ` Nick Piggin
2007-04-19  5:14                                     ` Andrew Morton
2007-04-19  6:38                                       ` Ingo Molnar
2007-04-19  7:57                                         ` William Lee Irwin III
2007-04-19 11:50                                           ` Peter Williams
2007-04-20  5:26                                             ` William Lee Irwin III
2007-04-20  6:16                                               ` Peter Williams
2007-04-19  8:33                                         ` Nick Piggin
2007-04-19 11:59                                         ` Renice X for cpu schedulers Con Kolivas
2007-04-19 12:42                                           ` Peter Williams
2007-04-19 13:20                                             ` Peter Williams
2007-04-19 14:22                                               ` Lee Revell
2007-04-20  1:32                                                 ` Michael K. Edwards
2007-04-20  5:25                                                   ` Bill Huey
2007-04-20  7:12                                                     ` Michael K. Edwards
2007-04-20  8:21                                                       ` Bill Huey
2007-04-19 13:17                                           ` Mark Lord
2007-04-19 15:10                                             ` Con Kolivas
2007-04-19 16:15                                               ` Mark Lord
2007-04-19 18:21                                                 ` Gene Heskett
2007-04-20  0:17                                                 ` Con Kolivas
2007-04-20  1:17                                                 ` Ed Tomlinson
2007-04-20  1:27                                                   ` Linus Torvalds
2007-04-20  3:57                                             ` Nick Piggin
2007-04-21 14:55                                               ` Mark Lord
2007-04-22 12:54                                                 ` Mark Lord
2007-04-22 12:58                                                   ` Con Kolivas
2007-04-19 18:16                                           ` Gene Heskett
2007-04-19 21:35                                             ` Michael K. Edwards
2007-04-19 22:47                                             ` Con Kolivas
2007-04-20  2:00                                               ` Gene Heskett
2007-04-20  2:01                                               ` Gene Heskett
2007-04-20  5:24                                               ` Mike Galbraith
2007-04-19 19:26                                           ` Ray Lee
2007-04-19 22:56                                             ` Con Kolivas
2007-04-20  0:20                                               ` Michael K. Edwards
2007-04-20  5:34                                                 ` Bill Huey
2007-04-20  0:56                                               ` Ray Lee
2007-04-20  4:09                                             ` Nick Piggin
2007-04-24 15:50                                               ` Ray Lee
2007-04-24 16:23                                                 ` Matt Mackall
2007-04-21 13:40                                   ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Bill Davidsen
2007-04-17  6:50                   ` Davide Libenzi
2007-04-17  7:09                     ` William Lee Irwin III
2007-04-17  7:22                       ` Peter Williams
2007-04-17  7:23                       ` Nick Piggin
2007-04-17  7:27                       ` Davide Libenzi
2007-04-17  7:33                         ` Nick Piggin
2007-04-17  7:33                       ` Ingo Molnar
2007-04-17  7:40                         ` Nick Piggin
2007-04-17  7:58                           ` Ingo Molnar
2007-04-17  9:05                         ` William Lee Irwin III
2007-04-17  9:24                           ` Ingo Molnar
2007-04-17  9:57                             ` William Lee Irwin III
2007-04-17 10:01                               ` Ingo Molnar
2007-04-17 11:31                               ` William Lee Irwin III
2007-04-17 22:08                             ` Matt Mackall
2007-04-17 22:32                               ` William Lee Irwin III
2007-04-17 22:39                                 ` Matt Mackall
2007-04-17 22:59                                   ` William Lee Irwin III
2007-04-17 22:57                                     ` Matt Mackall
2007-04-18  4:29                                       ` William Lee Irwin III
2007-04-18  4:42                                         ` Davide Libenzi
2007-04-18  7:29                                       ` James Bruce
2007-04-17  7:11                     ` Nick Piggin
2007-04-17  7:21                       ` Davide Libenzi
2007-04-17  6:23               ` Peter Williams
2007-04-17  6:44                 ` Nick Piggin
2007-04-17  7:48                   ` Peter Williams
2007-04-17  7:56                     ` Nick Piggin
2007-04-17 13:16                       ` Peter Williams
2007-04-18  4:46                         ` Nick Piggin
2007-04-17  8:44                 ` Ingo Molnar
2007-04-19  2:20                   ` Peter Williams
2007-04-15 15:05   ` Ingo Molnar
2007-04-15 20:05     ` Matt Mackall
2007-04-15 20:48       ` Ingo Molnar
2007-04-15 21:31         ` Matt Mackall
2007-04-16  3:03           ` Nick Piggin
2007-04-16 14:28             ` Matt Mackall
2007-04-17  3:31               ` Nick Piggin
2007-04-17 17:35                 ` Matt Mackall
2007-04-16 15:45           ` William Lee Irwin III
2007-04-15 23:39         ` William Lee Irwin III
2007-04-16  1:06           ` Peter Williams
2007-04-16  3:04             ` William Lee Irwin III
2007-04-16  5:09               ` Peter Williams
2007-04-16 11:04                 ` William Lee Irwin III
2007-04-16 12:55                   ` Peter Williams
2007-04-16 23:10                     ` Michael K. Edwards
2007-04-17  3:55                       ` Nick Piggin
2007-04-17  4:25                         ` Peter Williams
2007-04-17  4:34                           ` Nick Piggin
2007-04-17  6:03                             ` Peter Williams
2007-04-17  6:14                               ` William Lee Irwin III
2007-04-17  6:23                               ` Nick Piggin
2007-04-17  9:36                               ` Ingo Molnar
2007-04-17  8:24                         ` William Lee Irwin III
     [not found]                     ` <20070416135915.GK8915@holomorphy.com>
     [not found]                       ` <46241677.7060909@bigpond.net.au>
     [not found]                         ` <20070417025704.GM8915@holomorphy.com>
     [not found]                           ` <462445EC.1060306@bigpond.net.au>
     [not found]                             ` <20070417053147.GN8915@holomorphy.com>
     [not found]                               ` <46246A7C.8050501@bigpond.net.au>
     [not found]                                 ` <20070417064109.GP8915@holomorphy.com>
2007-04-17  8:00                                   ` Peter Williams
2007-04-17 10:41                                     ` William Lee Irwin III
2007-04-17 13:48                                       ` Peter Williams
2007-04-18  0:27                                         ` Peter Williams
2007-04-18  2:03                                           ` William Lee Irwin III
2007-04-18  2:31                                             ` Peter Williams
2007-04-16 17:22             ` Chris Friesen
2007-04-17  0:54               ` Peter Williams
2007-04-17 15:52                 ` Chris Friesen
2007-04-17 23:50                   ` Peter Williams
2007-04-18  5:43                     ` Chris Friesen
2007-04-18 13:00                       ` Peter Williams
2007-04-16  5:16     ` Con Kolivas
2007-04-16  5:48       ` Gene Heskett
2007-04-15 12:29 ` Esben Nielsen
2007-04-15 13:04   ` Ingo Molnar
2007-04-16  7:16     ` Esben Nielsen
2007-04-15 22:49 ` Ismail Dönmez
2007-04-15 23:23   ` Arjan van de Ven
2007-04-15 23:33     ` Ismail Dönmez
2007-04-16 11:58   ` Ingo Molnar
2007-04-16 12:02     ` Ismail Dönmez
2007-04-16 22:00 ` Andi Kleen
2007-04-16 21:05   ` Ingo Molnar
2007-04-16 21:21     ` Andi Kleen
2007-04-17  7:56 ` Andy Whitcroft
2007-04-17  9:32   ` Nick Piggin
2007-04-17  9:59     ` Ingo Molnar
2007-04-17 11:11       ` Nick Piggin
2007-04-18  8:55       ` Nick Piggin
2007-04-18  9:33         ` Con Kolivas
2007-04-18 12:14           ` Nick Piggin
2007-04-18 12:33             ` Con Kolivas
2007-04-18 21:49               ` Con Kolivas
2007-04-18  9:53         ` Ingo Molnar
2007-04-18 12:13           ` Nick Piggin
2007-04-18 12:49             ` Con Kolivas
2007-04-19  3:28               ` Nick Piggin
2007-04-18 10:22   ` Ingo Molnar
2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse
2007-04-18 16:46   ` Ingo Molnar
2007-04-18 20:45     ` CFS and suspend2: hang in atomic copy Christian Hesse
2007-04-18 21:16       ` Ingo Molnar
2007-04-18 21:57         ` Christian Hesse
2007-04-18 22:02           ` Ingo Molnar
2007-04-18 22:22             ` Christian Hesse
2007-04-19  1:37               ` [Suspend2-devel] " Nigel Cunningham
2007-04-18 22:56             ` Bob Picco
2007-04-19  1:43               ` [Suspend2-devel] " Nigel Cunningham
2007-04-19  6:29               ` Ingo Molnar
2007-04-19 11:10                 ` Bob Picco
2007-04-19  1:52             ` [Suspend2-devel] " Nigel Cunningham
2007-04-19  7:04               ` Ingo Molnar
2007-04-19  9:05                 ` Nigel Cunningham
2007-04-24 20:23                 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
2007-04-24 20:41                   ` Linus Torvalds
2007-04-24 20:51                     ` Hua Zhong
2007-04-24 20:54                     ` Ingo Molnar
2007-04-24 21:29                       ` Pavel Machek
2007-04-24 22:24                         ` Ray Lee
2007-04-25 21:41                         ` Matt Mackall
2007-04-26 11:27                           ` Pavel Machek
2007-04-26 19:04                           ` Bill Davidsen
2007-04-24 21:24                     ` Pavel Machek
2007-04-24 23:41                       ` Linus Torvalds
2007-04-25  1:06                         ` Olivier Galibert
2007-04-25  6:41                         ` Ingo Molnar
2007-04-25  7:29                           ` Pavel Machek
2007-04-25  7:48                             ` Dumitru Ciobarcianu
2007-04-25  8:10                               ` Pavel Machek
2007-04-25  8:22                                 ` Dumitru Ciobarcianu
2007-04-26 11:12                                 ` Pekka Enberg
2007-04-26 14:48                                   ` Rafael J. Wysocki
2007-04-26 16:10                                     ` Pekka Enberg
2007-04-26 19:28                                       ` Rafael J. Wysocki
2007-04-26 20:16                                         ` Nigel Cunningham
2007-04-26 20:37                                           ` Rafael J. Wysocki
2007-04-26 20:49                                             ` David Lang
2007-04-26 20:55                                             ` Nigel Cunningham
2007-04-26 21:22                                               ` Rafael J. Wysocki
2007-04-26 22:08                                                 ` Nigel Cunningham
2007-04-25  8:48                             ` Nigel Cunningham
2007-04-25 13:07                             ` Federico Heinz
2007-04-25 19:38                             ` Kenneth Crudup
2007-04-25  7:23                         ` Pavel Machek
2007-04-25  8:48                           ` Xavier Bestel
2007-04-25  8:50                             ` Nigel Cunningham
2007-04-25  9:07                               ` Xavier Bestel
2007-04-25  9:19                                 ` Nigel Cunningham
2007-04-26 18:18                                 ` Bill Davidsen
2007-04-25  9:02                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti
2007-04-25 19:16                             ` suspend2 merge Martin Steigerwald
2007-04-25 15:18                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk
2007-04-25 17:34                             ` Pavel Machek
2007-04-25 18:39                               ` Adrian Bunk
2007-04-25 18:50                                 ` Linus Torvalds
2007-04-25 19:02                                   ` Hua Zhong
2007-04-25 19:25                                   ` Adrian Bunk
2007-04-25 19:38                                     ` Linus Torvalds
2007-04-25 20:08                                       ` Pavel Machek
2007-04-25 20:33                                         ` Rafael J. Wysocki
2007-04-25 20:31                                           ` Pavel Machek
2007-04-27 10:21                                             ` driver power operations (was Re: suspend2 merge) Johannes Berg
2007-04-27 10:21                                             ` Johannes Berg
2007-04-27 12:06                                               ` Rafael J. Wysocki
2007-04-27 12:40                                                 ` Pavel Machek
2007-04-27 12:40                                                 ` Pavel Machek
2007-04-27 12:46                                                   ` Johannes Berg
2007-04-27 12:50                                                     ` Pavel Machek
2007-04-27 12:50                                                       ` Pavel Machek
2007-04-27 12:46                                                   ` Johannes Berg
2007-04-27 12:06                                               ` Rafael J. Wysocki
2007-04-27 14:34                                               ` [linux-pm] " Alan Stern
2007-04-27 14:34                                                 ` Alan Stern
2007-04-27 14:39                                                 ` [linux-pm] " Johannes Berg
2007-04-27 14:49                                                   ` Johannes Berg
2007-04-27 14:49                                                     ` Johannes Berg
2007-04-27 15:20                                                     ` [linux-pm] " Rafael J. Wysocki
2007-04-27 15:27                                                       ` Johannes Berg
2007-04-27 15:27                                                       ` Johannes Berg
2007-04-27 15:52                                                       ` [linux-pm] " Linus Torvalds
2007-04-27 15:52                                                         ` Linus Torvalds
2007-04-27 18:34                                                         ` Rafael J. Wysocki
2007-04-27 18:34                                                         ` [linux-pm] " Rafael J. Wysocki
2007-04-27 15:20                                                     ` Rafael J. Wysocki
2007-04-27 15:41                                                     ` [linux-pm] " Linus Torvalds
2007-04-27 15:41                                                       ` Linus Torvalds
2007-04-27 14:39                                                 ` Johannes Berg
2007-04-27 15:12                                                 ` [linux-pm] " Rafael J. Wysocki
2007-04-27 15:24                                                   ` Johannes Berg
2007-04-27 15:24                                                   ` Johannes Berg
2007-04-27 15:12                                                 ` Rafael J. Wysocki
2007-04-27 15:56                                               ` David Brownell
2007-04-27 15:56                                               ` [linux-pm] " David Brownell
2007-04-27 18:31                                                 ` Rafael J. Wysocki
2007-04-27 18:31                                                 ` Rafael J. Wysocki
2007-05-07 12:29                                               ` Pavel Machek
2007-05-07 12:29                                               ` Pavel Machek
2007-04-25 22:36                                         ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Manu Abraham
2007-04-25 20:20                                       ` Rafael J. Wysocki
2007-04-25 20:24                                         ` Linus Torvalds
2007-04-25 21:30                                           ` Pavel Machek
2007-04-25 21:40                                             ` Rafael J. Wysocki
2007-04-25 21:46                                               ` Pavel Machek
2007-04-25 22:22                                             ` Nigel Cunningham
2007-04-25 20:23                                       ` Adrian Bunk
2007-04-25 22:19                                         ` Kenneth Crudup
2007-04-27 12:36                                       ` suspend2 merge Martin Steigerwald
2007-04-25 19:41                                     ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Andrew Morton
2007-04-25 19:55                                     ` Pavel Machek
2007-04-25 22:13                                     ` Kenneth Crudup
2007-04-26  1:25                                     ` Antonino A. Daplas
2007-04-25 23:33                                   ` Olivier Galibert
2007-04-26  1:56                                     ` Nigel Cunningham
2007-04-26  7:27                                       ` David Lang
2007-04-26  9:45                                         ` Nigel Cunningham
2007-04-25 18:52                               ` Alon Bar-Lev
2007-04-25 22:11                               ` Kenneth Crudup
2007-04-25 19:43                           ` Kenneth Crudup
2007-04-25 20:08                             ` Linus Torvalds
2007-04-25 20:27                               ` Pavel Machek
2007-04-25 20:44                                 ` Linus Torvalds
2007-04-25 21:07                                   ` Rafael J. Wysocki
2007-04-25 21:44                                   ` Pavel Machek
2007-04-25 22:18                                     ` Linus Torvalds
2007-04-25 22:27                                       ` Nigel Cunningham
2007-04-25 22:55                                         ` Linus Torvalds
2007-04-25 23:13                                           ` Pavel Machek
2007-04-25 23:29                                             ` Linus Torvalds
2007-04-25 23:45                                               ` Pavel Machek
2007-04-26  1:48                                                 ` Nigel Cunningham
2007-04-26  1:40                                           ` Nigel Cunningham
2007-04-26  2:04                                             ` Linus Torvalds
2007-04-26  2:13                                               ` Nigel Cunningham
2007-04-26  3:03                                                 ` Linus Torvalds
2007-04-26  3:34                                                   ` Nigel Cunningham
2007-04-26  2:31                                               ` Nigel Cunningham
2007-04-26 10:39                                           ` Johannes Berg
2007-04-26 11:30                                             ` Pavel Machek
2007-04-26 11:41                                               ` Johannes Berg
2007-04-26 16:31                                               ` Johannes Berg
2007-04-26 16:31                                                 ` Johannes Berg
2007-04-26 18:40                                                 ` Rafael J. Wysocki
2007-04-26 18:40                                                   ` Rafael J. Wysocki
2007-04-26 18:40                                                   ` Johannes Berg
2007-04-26 19:02                                                     ` Rafael J. Wysocki
2007-04-27  9:41                                                       ` Johannes Berg
2007-04-27 10:09                                                         ` [linux-pm] " Johannes Berg
2007-04-27 10:09                                                         ` Johannes Berg
2007-04-27 10:18                                                         ` Rafael J. Wysocki
2007-04-27 10:18                                                         ` Rafael J. Wysocki
2007-04-27 10:19                                                           ` Johannes Berg
2007-04-27 10:19                                                           ` Johannes Berg
2007-04-27 12:09                                                             ` Rafael J. Wysocki
2007-04-27 12:07                                                               ` Johannes Berg
2007-04-27 12:07                                                               ` Johannes Berg
2007-04-27 12:09                                                             ` Rafael J. Wysocki
2007-04-27  9:41                                                       ` Johannes Berg
2007-04-26 19:02                                                     ` Rafael J. Wysocki
2007-04-26 18:40                                                   ` Johannes Berg
2007-04-29 12:48                                                 ` [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) R. J. Wysocki
2007-04-29 12:53                                                   ` Rafael J. Wysocki
2007-04-30  8:29                                                   ` Johannes Berg
2007-04-30 14:51                                                     ` Rafael J. Wysocki
2007-04-30 14:59                                                       ` Johannes Berg
2007-05-01 14:05                                                         ` Rafael J. Wysocki
2007-05-01 22:02                                                           ` Rafael J. Wysocki
2007-05-02  5:13                                                             ` Alexey Starikovskiy
2007-05-02 13:42                                                               ` Rafael J. Wysocki
2007-05-02 14:11                                                                 ` Alexey Starikovskiy
2007-05-02 19:26                                                                   ` ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) Rafael J. Wysocki
2007-05-02 19:26                                                                   ` Rafael J. Wysocki
2007-05-03 22:48                                                                     ` Pavel Machek
2007-05-03 22:48                                                                     ` Pavel Machek
2007-05-03 23:14                                                                       ` Rafael J. Wysocki
2007-05-03 23:14                                                                       ` Rafael J. Wysocki
2007-05-04 10:54                                                                       ` Johannes Berg
2007-05-04 12:08                                                                         ` Pavel Machek
2007-05-04 12:08                                                                         ` Pavel Machek
2007-05-04 12:29                                                                           ` Rafael J. Wysocki
2007-05-04 12:29                                                                           ` Rafael J. Wysocki
2007-05-04 10:54                                                                       ` Johannes Berg
2007-05-02  8:21                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
2007-05-02  9:02                                                               ` Rafael J. Wysocki
2007-05-02  9:16                                                               ` Pavel Machek
2007-05-02  9:25                                                                 ` Johannes Berg
2007-05-03 14:00                                                                   ` Alan Stern
2007-05-03 17:17                                                                     ` Rafael J. Wysocki
2007-05-03 18:33                                                                       ` Alan Stern
2007-05-03 19:47                                                                         ` Rafael J. Wysocki
2007-05-03 19:59                                                                           ` Alan Stern
2007-05-03 20:21                                                                             ` Rafael J. Wysocki
2007-05-04 14:40                                                                               ` Alan Stern
2007-05-04 20:20                                                                                 ` Rafael J. Wysocki
2007-05-04 20:21                                                                                   ` Johannes Berg
2007-05-04 20:55                                                                                     ` Pavel Machek
2007-05-04 21:08                                                                                       ` Johannes Berg
2007-05-04 21:15                                                                                         ` Pavel Machek
2007-05-04 21:53                                                                                           ` Rafael J. Wysocki
2007-05-04 21:53                                                                                             ` Johannes Berg
2007-05-04 22:25                                                                                               ` Rafael J. Wysocki
2007-05-05 15:52                                                                                             ` Alan Stern
2007-05-07  1:16                                                                                               ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
2007-05-07 21:00                                                                                                 ` Rafael J. Wysocki
2007-05-07 21:45                                                                                                   ` David Brownell
2007-05-07 22:16                                                                                                     ` Rafael J. Wysocki
2007-05-09 19:23                                                                                                       ` David Brownell
2007-05-04 21:06                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Rafael J. Wysocki
2007-05-04 20:58                                                                                 ` Pavel Machek
2007-05-04 21:24                                                                                   ` Rafael J. Wysocki
2007-05-05 16:19                                                                                     ` Alan Stern
2007-05-05 17:46                                                                                       ` Rafael J. Wysocki
2007-05-05 21:42                                                                                         ` Alan Stern
2007-05-05 22:14                                                                                           ` Rafael J. Wysocki
2007-05-04 21:40                                                                                 ` David Brownell
2007-05-04 22:19                                                                                   ` Rafael J. Wysocki
2007-05-07  1:05                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell
2007-05-05 16:08                                                                                   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern
2007-05-05 17:50                                                                                     ` Rafael J. Wysocki
2007-05-05 21:43                                                                                       ` Alan Stern
2007-05-05 22:16                                                                                         ` Rafael J. Wysocki
2007-05-07  1:31                                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell
2007-05-07 16:33                                                                                       ` Alan Stern
2007-05-07 20:49                                                                                         ` Pavel Machek
2007-05-07 21:38                                                                                           ` Alan Stern
2007-05-08  0:30                                                                                             ` Pavel Machek
2007-05-03 20:33                                                                         ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) David Brownell
2007-05-03 20:33                                                                     ` David Brownell
2007-05-03 20:51                                                                       ` Rafael J. Wysocki
2007-05-04 14:51                                                                       ` Alan Stern
2007-05-04 14:56                                                                         ` Johannes Berg
2007-05-04 20:27                                                                           ` Rafael J. Wysocki
2007-05-04 22:00                                                                         ` David Brownell
2007-05-05 15:49                                                                           ` Alan Stern
2007-05-07  1:10                                                                             ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell
2007-05-07 18:46                                                                               ` Alan Stern
2007-05-07 21:29                                                                                 ` Rafael J. Wysocki
2007-05-07 22:22                                                                                   ` Alan Stern
2007-05-07 22:47                                                                                     ` Rafael J. Wysocki
2007-05-08 14:56                                                                                       ` Alan Stern
2007-05-08 19:59                                                                                         ` Rafael J. Wysocki
2007-05-08 21:26                                                                                           ` Alan Stern
2007-05-09  8:17                                                                                         ` Pavel Machek
2007-05-09 15:21                                                                                           ` Alan Stern
2007-05-09 19:35                                                                                         ` David Brownell
2007-05-09 20:04                                                                                           ` Alan Stern
2007-05-09 20:21                                                                                             ` David Brownell
2007-05-10 15:17                                                                                               ` Alan Stern
2007-05-09 21:07                                                                                             ` Pavel Machek
2007-05-07 21:43                                                                                 ` David Brownell
2007-05-07 22:41                                                                                   ` Alan Stern
2007-05-03 22:18                                                                     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
2007-05-04 14:57                                                                       ` Alan Stern
2007-05-04 20:50                                                                         ` Rafael J. Wysocki
2007-05-04 20:49                                                                           ` Johannes Berg
2007-05-04 21:11                                                                             ` Rafael J. Wysocki
2007-05-04 21:23                                                                               ` Johannes Berg
2007-05-04 21:55                                                                                 ` Rafael J. Wysocki
2007-05-04 21:54                                                                                   ` Johannes Berg
2007-05-04 22:21                                                                                     ` Rafael J. Wysocki
2007-05-05 15:37                                                                                       ` Alan Stern
2007-05-05 18:49                                                                                         ` Rafael J. Wysocki
2007-05-05 21:44                                                                                           ` Alan Stern
2007-05-05 22:36                                                                                             ` Rafael J. Wysocki
2007-05-06 22:01                                                                                               ` Alan Stern
2007-05-06 22:31                                                                                                 ` Rafael J. Wysocki
2007-05-07  1:37                                                                                                 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell
2007-05-08  2:57                                                                                                   ` Greg KH
2007-05-07  8:51                                                                                           ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
2007-05-04 22:12                                                                                   ` David Brownell
2007-05-04 22:31                                                                                     ` Rafael J. Wysocki
2007-05-05 16:15                                                                                 ` Alan Stern
2007-05-02 13:43                                                                 ` Rafael J. Wysocki
2007-04-25 22:42                                       ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek
2007-04-25 22:58                                         ` Linus Torvalds
2007-04-25 22:43                                       ` Chuck Ebbert
2007-04-25 23:00                                         ` Linus Torvalds
2007-04-25 22:49                                       ` Pavel Machek
2007-04-25 23:10                                         ` Linus Torvalds
2007-04-25 23:28                                           ` Pavel Machek
2007-04-25 23:57                                             ` Linus Torvalds
2007-04-25 22:57                                       ` Alan Cox
2007-04-25 23:20                                         ` Linus Torvalds
2007-04-25 23:52                                           ` Pavel Machek
2007-04-26  0:05                                             ` Linus Torvalds
2007-04-26  0:14                                               ` Pavel Machek
2007-04-25 23:51                                                 ` David Lang
2007-04-26  0:38                                                 ` Linus Torvalds
2007-04-26  2:04                                                   ` H. Peter Anvin
2007-04-26  2:32                                                     ` Linus Torvalds
2007-04-26 13:14                                                       ` Alan Cox
2007-04-26 16:02                                                         ` Linus Torvalds
2007-04-26  0:34                                               ` Linus Torvalds
2007-04-26 20:12                                                 ` Rafael J. Wysocki
2007-04-26  0:24                                           ` Alan Cox
2007-04-26  1:10                                             ` Linus Torvalds
2007-04-26 14:04                                               ` Mark Lord
2007-04-26 16:10                                                 ` Linus Torvalds
2007-04-26 21:00                                                   ` Pavel Machek
2007-04-26  7:08                                             ` Andy Grover
2007-04-26  0:41                               ` Thomas Orgis
2007-05-26 17:37                           ` Martin Steigerwald
2007-05-26 20:35                             ` Rafael J. Wysocki
2007-05-26 22:23                               ` Martin Steigerwald
2007-04-26 10:17                       ` Johannes Berg
2007-04-26 10:30                         ` Pavel Machek
2007-04-26 10:40                           ` Pavel Machek
2007-04-26 11:11                             ` Johannes Berg
2007-04-26 11:16                               ` Pavel Machek
2007-04-26 11:27                                 ` Johannes Berg
2007-04-26 11:26                                   ` Pavel Machek
2007-04-26 11:35                                     ` Johannes Berg
2007-04-26 11:33                                       ` Pavel Machek
2007-04-26 16:14                                       ` Chris Friesen
2007-04-26 16:27                                         ` Linus Torvalds
2007-04-26 17:11                                         ` Johannes Berg
2007-04-26 15:56                                     ` Linus Torvalds
2007-04-26 21:06                                       ` Theodore Tso
2007-04-26 21:12                                         ` Nigel Cunningham
2007-04-26 13:45                             ` Johannes Berg
2007-06-29 22:44                               ` [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) Pavel Machek
2007-06-30  0:06                                 ` Adrian Bunk
2007-04-26 11:04                           ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg
2007-04-26 11:09                             ` Pavel Machek
2007-04-26 15:53                               ` Linus Torvalds
2007-04-26 18:21                               ` Olivier Galibert
2007-04-26 21:30                                 ` Pavel Machek
2007-04-26 11:35                         ` Christoph Hellwig
2007-04-26 12:15                           ` Ingo Molnar
2007-04-26 12:41                             ` Pavel Machek
2007-04-18 22:16           ` CFS and suspend2: hang in atomic copy Ingo Molnar
2007-04-18 23:12             ` Christian Hesse
2007-04-19  6:28               ` Ingo Molnar
2007-04-19 20:32                 ` Christian Hesse
2007-04-19  6:41             ` Ingo Molnar
2007-04-19  9:32     ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen
2007-04-19 10:11       ` Ingo Molnar
2007-04-19 10:18         ` Ingo Molnar
