* [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  15 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:21 UTC
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

i'm pleased to announce the first release of the "Modular Scheduler Core and Completely Fair Scheduler [CFS]" patchset:

   http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch

This project is a complete rewrite of the Linux task scheduler. My goal is to address various feature requests and to fix deficiencies in the vanilla scheduler that were suggested/found in the past few years, both for desktop scheduling and for server scheduling workloads.

[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The new scheduler will be active by default and all tasks will default to the new SCHED_FAIR interactive scheduling class. ]

Highlights are:

 - the introduction of Scheduling Classes: an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a replacement for the vanilla scheduler's SCHED_OTHER interactivity code.

   i'd like to give credit to Con Kolivas for the general approach here: he has proven via RSDL/SD that 'fair scheduling' is possible and that it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation from RSDL/SD. My goal was to make CFS's interactivity quality exceed that of RSDL/SD, which is a high standard to meet :-) Testing feedback is welcome to decide this one way or another.
   [ and, in any case, all of SD's logic could be added via a kernel/sched_sd.c module as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a time-ordered rbtree to build a 'timeline' of future task execution, and thus has no 'array switch' artifacts (by which both the vanilla scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of 'timeslices' and has no heuristics whatsoever. There is only one central tunable:

         /proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads. It defaults to a setting suitable for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.

   due to its design, the CFS scheduler is not prone to any of the 'attacks' that exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all work fine and do not impact interactivity and produce the expected behavior.

   the CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH: both types of workloads should be isolated much more aggressively than under the vanilla scheduler.

   ( another detail: due to nanosec accounting and timeline sorting, sched_yield() support is very simple under CFS, and in fact under CFS sched_yield() behaves much better than under any other scheduler i have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than the vanilla scheduler does. It uses 100 runqueues (for all 100 RT priority levels, instead of 140 in the vanilla scheduler) and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking assumptions are gone from the load-balancing code now, and iterators of the scheduling modules are used. The balancing code got quite a bit simpler as a result.
the core scheduler got smaller by more than 700 lines:

   kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
   1 file changed, 372 insertions(+), 1082 deletions(-)

and even adding all the scheduling modules, the total size impact is relatively small:

   18 files changed, 1454 insertions(+), 1133 deletions(-)

most of the increase is due to extensive comments. The kernel size impact is in fact a small negative:

   text    data     bss     dec     hex filename
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  24159    2705      56   26920    6928 kernel/sched.o.CFS

(this is mainly due to the benefit of getting rid of the expired array and its data structure overhead.)

thanks go to Thomas Gleixner and Arjan van de Ven for review of this patchset.

as usual, any sort of feedback, bugreports, fixes and suggestions are more than welcome,

	Ingo
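To make the 'timeline' idea from the announcement concrete, here is a minimal toy model. It is not the patch's code: it assumes each task is keyed by the CPU time it has received so far and that the entry with the smallest key runs next. A sorted Python list stands in for the rbtree, and `simulate_timeline` and its parameters are hypothetical names chosen for illustration.

```python
import bisect


def simulate_timeline(task_names, granularity_ns, total_ns):
    """Toy model: always run the task with the least accumulated runtime.

    Assumption (not from the patch): the sort key is the CPU time a task
    has received so far. A sorted list replaces the kernel rbtree here.
    """
    # (runtime_ns, name) pairs kept in ascending order of runtime.
    timeline = sorted((0, name) for name in task_names)
    clock = 0
    while clock < total_ns:
        runtime, name = timeline.pop(0)            # leftmost entry runs next
        runtime += granularity_ns                  # it runs for one granularity unit
        clock += granularity_ns
        bisect.insort(timeline, (runtime, name))   # re-insert at its new position
    return {name: runtime for runtime, name in timeline}


usage = simulate_timeline(["a", "b", "c"], granularity_ns=1_000_000,
                          total_ns=30_000_000)
print(usage)
```

In this model no task can fall more than one granularity unit behind another, and there is no separate 'expired' structure to swap in.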
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Bill Huey @ 2007-04-13 20:27 UTC
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
...
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach. ]

Ingo,

Con has been asking for module support for years if I understand your patch correctly. You'll also need this for -rt as well with regards to bandwidth scheduling. Good to see that you're moving in this direction.

bill
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Ingo Molnar @ 2007-04-13 20:55 UTC
To: Bill Huey
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Bill Huey <billh@gnuppy.monkey.org> wrote:

> Con has been asking for module support for years if I understand your
> patch correctly. [...]

Yeah. Note that there are some subtle but crucial differences between PlugSched (which Con used, and which i opposed in the past) and this approach.

PlugSched cuts the interfaces at a high level in a monolithic way and introduces kernel/scheduler.c that uses one pluggable scheduler (represented via the 'scheduler' global template) at a time.

while in this CFS patchset i'm using modularization ('scheduler classes') to simplify the _existing_ multi-policy implementation of the scheduler. These 'scheduler classes' are in a hierarchy and are stacked on top of each other. They are in use at once. Currently there's two of them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately the performance impact is minimal.

So scheduler classes are mainly a simplification of the design of the scheduler - not just a mere facility to select multiple schedulers. Their ability to also facilitate easier experimentation with schedulers is 'just' a happy side-effect. So all in all: it's a fairly different model from PlugSched (and that's why i didn't reuse PlugSched) - but there's indeed overlap.

> [...] You'll also need this for -rt as well with regards to bandwidth
> scheduling.

yeah. scheduler classes are also useful for other purposes like containers and virtualization, hierarchical/group scheduling, security encapsulation, etc.
- features that can be on-demand layered, and which we don't necessarily want to have enabled all the time.

> [...] Good to see that you're moving in this direction.

thanks! :)

	Ingo
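The stacking described in this reply (sched_ops_rt on top of sched_ops_fair, all classes in use at once) can be pictured with a small sketch. This is only an illustration under assumptions: the `SchedClass` type, its methods and the FIFO queue inside it are hypothetical and do not mirror the patch's actual interfaces.

```python
class SchedClass:
    """Hypothetical scheduling class: owns its tasks and its own pick policy."""

    def __init__(self, name):
        self.name = name
        self.runnable = []   # plain FIFO here; a real class keeps its own structure

    def enqueue(self, task):
        self.runnable.append(task)

    def pick_next(self):
        # Return the next task of this class, or None if it has nothing to run.
        return self.runnable.pop(0) if self.runnable else None


def pick_next_task(classes):
    """Core loop: ask each class in stacking order, highest first."""
    for cls in classes:
        task = cls.pick_next()
        if task is not None:
            return cls.name, task
    return None   # nothing runnable in any class


rt = SchedClass("rt")
fair = SchedClass("fair")
stack = [rt, fair]           # rt is stacked on top of fair

fair.enqueue("editor")
rt.enqueue("audio")
print(pick_next_task(stack))
print(pick_next_task(stack))
print(pick_next_task(stack))
```

The core only walks the stack and never looks inside a class, which is the contrast with a single global pluggable scheduler.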
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: William Lee Irwin III @ 2007-04-13 21:21 UTC
To: Ingo Molnar
Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> Yeah. Note that there are some subtle but crucial differences between
> PlugSched (which Con used, and which i opposed in the past) and this
> approach.
> PlugSched cuts the interfaces at a high level in a monolithic way and
> introduces kernel/scheduler.c that uses one pluggable scheduler
> (represented via the 'scheduler' global template) at a time.

What I originally did did so for a good reason, which was that it was intended to support far more radical reorganizations, for instance, things that changed the per-cpu runqueue affairs for gang scheduling. I wrote a top-level driver that did support scheduling classes in a similar fashion, though it didn't survive others maintaining the patches.

-- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Bill Huey @ 2007-04-13 21:35 UTC
To: William Lee Irwin III
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui)

On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote:
> On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote:
> > Yeah. Note that there are some subtle but crucial differences between
> > PlugSched (which Con used, and which i opposed in the past) and this
> > approach.
> > PlugSched cuts the interfaces at a high level in a monolithic way and
> > introduces kernel/scheduler.c that uses one pluggable scheduler
> > (represented via the 'scheduler' global template) at a time.
>
> What I originally did did so for a good reason, which was that it was
> intended to support far more radical reorganizations, for instance,
> things that changed the per-cpu runqueue affairs for gang scheduling.
> I wrote a top-level driver that did support scheduling classes in a
> similar fashion, though it didn't survive others maintaining the patches.

Also, gang scheduling is needed to solve virtualization issues regarding spinlocks in a guest image. You could potentially be spinning on a thread that isn't currently running which, needless to say, is very bad.

bill
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 21:21 ` William Lee Irwin III 2007-04-13 21:35 ` Bill Huey @ 2007-04-13 21:39 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-13 21:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > What I originally did did so for a good reason, which was that it was > intended to support far more radical reorganizations, for instance, > things that changed the per-cpu runqueue affairs for gang scheduling. > I wrote a top-level driver that did support scheduling classes in a > similar fashion, though it didn't survive others maintaining the > patches. yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Ingo Molnar @ 2007-04-13 21:50 UTC
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Ingo Molnar <mingo@elte.hu> wrote:

> and even adding all the scheduling modules, the total size impact is
> relatively small:
>
>    18 files changed, 1454 insertions(+), 1133 deletions(-)
>
> most of the increase is due to extensive comments. The kernel size
> impact is in fact a small negative:
>
>    text    data     bss     dec     hex filename
>   23366    4001      24   27391    6aff kernel/sched.o.vanilla
>   24159    2705      56   26920    6928 kernel/sched.o.CFS

update: these were older numbers, here are the stats redone with the latest patch:

   text    data     bss     dec     hex filename
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40
  23349    2705      24   26078    65de kernel/sched.o.cfs

so CFS is now a win both for text and for data size :)

	Ingo
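The updated size figures can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the numbers quoted in the message above; the helper names `check_row` and `delta_vs` are made up for this example.

```python
# Rows copied from the size output above: (text, data, bss, dec, hex).
ROWS = {
    "kernel/sched.o.vanilla": (23366, 4001, 24, 27391, "6aff"),
    "kernel/sched.o.sd.v40": (23671, 4548, 24, 28243, "6e53"),
    "kernel/sched.o.cfs": (23349, 2705, 24, 26078, "65de"),
}


def check_row(text, data, bss, dec, hex_str):
    """A row is consistent if text+data+bss equals dec and hex encodes dec."""
    return text + data + bss == dec and int(hex_str, 16) == dec


def delta_vs(base, other):
    """Size difference of `other` relative to `base`, in bytes."""
    b, o = ROWS[base], ROWS[other]
    return {"text": o[0] - b[0], "data": o[1] - b[1], "total": o[3] - b[3]}


for name, row in ROWS.items():
    print(name, "consistent" if check_row(*row) else "INCONSISTENT")

print("cfs vs vanilla:", delta_vs("kernel/sched.o.vanilla", "kernel/sched.o.cfs"))
print("sd  vs vanilla:", delta_vs("kernel/sched.o.vanilla", "kernel/sched.o.sd.v40"))
```

By these figures the CFS object is 17 bytes smaller in text and 1296 bytes smaller in data than vanilla, which matches the "win both for text and for data size" remark.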
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Michal Piotrowski @ 2007-04-13 21:57 UTC
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>

Friday the 13th, my lucky day :).

/mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers
make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'. Stop.
make[2]: *** [linux] Error 2
make[1]: *** [headers_check] Error 2
make: *** [vmlinux] Error 2

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)

Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>

--- linux-msc-cfs-clean/include/linux/Kbuild	2007-04-13 23:52:47.000000000 +0200
+++ linux-msc-cfs/include/linux/Kbuild	2007-04-13 23:49:41.000000000 +0200
@@ -133,6 +133,7 @@ header-y += quotaio_v1.h
 header-y += quotaio_v2.h
 header-y += radeonfb.h
 header-y += raw.h
+header-y += rbtree.h
 header-y += resource.h
 header-y += rose.h
 header-y += smbno.h
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Daniel Walker @ 2007-04-13 22:15 UTC
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
>
> This project is a complete rewrite of the Linux task scheduler. My goal
> is to address various feature requests and to fix deficiencies in the
> vanilla scheduler that were suggested/found in the past few years, both
> for desktop scheduling and for server scheduling workloads.

I'm not in love with the current or other schedulers, so I'm indifferent to this change. However, I was reviewing your release notes and the patch and found myself wondering what the logarithmic complexity of this new scheduler is .. I assumed it would also be constant time , but the __enqueue_task_fair doesn't appear to be constant time (rbtree insert complexity).. Maybe that's not a critical path , but I thought I would at least comment on it.

Daniel
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Ingo Molnar @ 2007-04-13 22:30 UTC
To: Daniel Walker
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Daniel Walker <dwalker@mvista.com> wrote:

> I'm not in love with the current or other schedulers, so I'm
> indifferent to this change. However, I was reviewing your release
> notes and the patch and found myself wondering what the logarithmic
> complexity of this new scheduler is .. I assumed it would also be
> constant time , but the __enqueue_task_fair doesn't appear to be
> constant time (rbtree insert complexity).. [...]

i've been worried about that myself and i've done extensive measurements before choosing this implementation. The rbtree turned out to be a quite compact data structure: we get it quite cheaply as part of the task structure cachemisses - which have to be touched anyway. For 1000 tasks it's a loop of ~10 - that's still very fast and bound in practice.

here's a test i did under CFS. Let's take some ridiculous load: 1000 infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted into the same rbtree), and let's run lat_ctx:

  neptune:~/l> uptime
   22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51

  neptune:~/l> ./lat_ctx -s 0 2
  "size=0k ovr=1.61
  2 1.41

let's stop the 1000 tasks and only have ~2 tasks in the runqueue:

  neptune:~/l> ./lat_ctx -s 0 2
  "size=0k ovr=1.70
  2 1.16

so the overhead is 0.25 usecs. Considering the load (1000 tasks trash the cache like crazy already), this is more than acceptable.

	Ingo
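The "loop of ~10" and the 0.25 usecs figure in this reply follow from simple arithmetic, spelled out below as a sketch. The worst-case height formula is the standard red-black tree bound rather than something stated in the thread, and the variable names are only illustrative.

```python
import math


def rbtree_height_bound(n):
    """Standard worst-case height of a red-black tree with n nodes."""
    return 2 * math.log2(n + 1)


n_tasks = 1000

# Depth of a perfectly balanced binary tree holding n_tasks entries.
balanced_depth = math.log2(n_tasks)

# Even a maximally skewed (but still valid) red-black tree stays shallow.
worst_case_depth = rbtree_height_bound(n_tasks)

# lat_ctx results quoted above, in microseconds.
loaded_usecs = 1.41   # with 1000 runnable SCHED_BATCH tasks
idle_usecs = 1.16     # with ~2 runnable tasks
overhead_usecs = round(loaded_usecs - idle_usecs, 2)

print(f"balanced depth for {n_tasks} tasks: {balanced_depth:.2f}")
print(f"worst-case rbtree height:        {worst_case_depth:.2f}")
print(f"context-switch overhead:         {overhead_usecs} usecs")
```

Growing the runqueue from ~2 to 1000 tasks adds roughly nine more levels to walk, which is consistent with the small latency difference measured above.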
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:30 ` Ingo Molnar @ 2007-04-13 22:37 ` Willy Tarreau 2007-04-13 23:59 ` Daniel Walker 1 sibling, 0 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-13 22:37 UTC (permalink / raw) To: Ingo Molnar Cc: Daniel Walker, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote: > > * Daniel Walker <dwalker@mvista.com> wrote: > > > I'm not in love with the current or other schedulers, so I'm > > indifferent to this change. However, I was reviewing your release > > notes and the patch and found myself wonder what the logarithmic > > complexity of this new scheduler is .. I assumed it would also be > > constant time , but the __enqueue_task_fair doesn't appear to be > > constant time (rbtree insert complexity).. [...] > > i've been worried about that myself and i've done extensive measurements > before choosing this implementation. The rbtree turned out to be a quite > compact data structure: we get it quite cheaply as part of the task > structure cachemisses - which have to be touched anyway. For 1000 tasks > it's a loop of ~10 - that's still very fast and bound in practice. I'm not worried at all by O(log(n)) algorithms, and generally prefer smart log(n) than dumb O(1). In a userland TCP stack I started to write 2 years ago, I used a comparable scheduler and could reach a sustained rate of 145000 connections/s at 4 millions of concurrent connections. And yes, each time a packet was sent or received, a task was queued/dequeued (so about 450k/s with 4 million tasks, on an athlon 1.5 GHz). So that seems much higher than what we currently need. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
From: Daniel Walker @ 2007-04-13 23:59 UTC
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

One other thing, what happens in the case of slow, frequency changing, and/or inaccurate clocks .. Is the old sched_clock behavior still tolerated?

Daniel
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:59 ` Daniel Walker @ 2007-04-14 10:55 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 10:55 UTC (permalink / raw) To: Daniel Walker Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Daniel Walker <dwalker@mvista.com> wrote: > One other thing, what happens in the case of slow, frequency changing, > are/or inaccurate clocks .. Is the old sched_clock behavior still > tolerated? yeah, good question. Yesterday i did a quick testboot with that too, and it seemed to behave pretty OK with the low-res [jiffies based] sched_clock() too. Although in that case things are much more of an approximation and rounding/arithmetics artifacts are possible. CFS works best with a high-resolution cycle counter. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
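The rounding artifacts mentioned for a jiffies-based sched_clock() can be illustrated with a toy clock that only advances once per tick. This is a sketch under assumptions: HZ=1000 is picked arbitrarily for the example, and `jiffies_clock` and `accounted_runtime` are hypothetical helpers, not kernel functions.

```python
NSEC_PER_SEC = 1_000_000_000


def jiffies_clock(now_ns, hz=1000):
    """Low-resolution clock: the true time rounded down to a whole tick."""
    tick_ns = NSEC_PER_SEC // hz
    return (now_ns // tick_ns) * tick_ns


def accounted_runtime(start_ns, end_ns, hz=1000):
    """Runtime a scheduler would charge if it only saw the tick-based clock."""
    return jiffies_clock(end_ns, hz) - jiffies_clock(start_ns, hz)


# Two bursts of identical true length (300 usecs each):
inside_one_tick = accounted_runtime(200_000, 500_000)     # no tick boundary crossed
across_boundary = accounted_runtime(900_000, 1_200_000)   # crosses the 1 ms boundary

print(inside_one_tick)
print(across_boundary)
```

With such a clock a short burst is charged either nothing or a whole tick, depending only on where it falls relative to the tick boundary. Nanosecond accounting then becomes an approximation, as noted above.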
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (3 preceding siblings ...) 2007-04-13 22:15 ` Daniel Walker @ 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar 2007-04-14 22:38 ` Davide Libenzi 2007-04-13 22:31 ` Willy Tarreau ` (9 subsequent siblings) 14 siblings, 2 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-13 22:21 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] A pleasant surprise, though I did see it coming. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > Highlights are: > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. 
It probably needs further clarification that they're things on the order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst the classes is furthermore assumed, and so on. They're not quite capable of being full-blown alternative policies, though quite a bit can be crammed into them.

There are issues with the per-scheduling-class data not being very well-abstracted. A union for per-class data might help, if not a dynamically allocated scheduling-class-private structure. Getting an alternative policy floating around that actually clashes a little with the stock data in the task structure would help clarify what's needed.

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> - sched_fair.c implements the 'CFS desktop scheduler': it is a
> replacement for the vanilla scheduler's SCHED_OTHER interactivity
> code.
> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!

Bob Mullens banged out a virtual deadline interactive task scheduler for Multics back in 1976 or thereabouts. ISTR the name Ferranti in connection with deadline task scheduling for UNIX in particular. I've largely seen deadline schedulers as a realtime topic, though. In any event, it's not so radical as to lack a fair number of precedents.

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> The CFS patch uses a completely different approach and implementation
> from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> that of RSDL/SD, which is a high standard to meet :-) Testing
> feedback is welcome to decide this one way or another. [ and, in any
> case, all of SD's logic could be added via a kernel/sched_sd.c module
> as well, if Con is interested in such an approach.
] > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). A binomial heap would likely serve your purposes better than rbtrees. It's faster to have the next item to dequeue at the root of the tree structure rather than a leaf, for one. There are, of course, other priority queue structures (e.g. van Emde Boas) able to exploit the limited precision of the priority key for faster asymptotics, though actual performance is an open question. Another advantage of heaps is that they support decreasing priorities directly, so that instead of removal and reinsertion, a less invasive movement within the tree is possible. This nets additional constant factor improvements beyond those for the next item to dequeue for the case where a task remains runnable, but is preempted and its priority decreased while it remains runnable. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > /proc/sys/kernel/sched_granularity_ns > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I like not relying on timeslices. Timeslices ultimately get you into a 2.4.x -like epoch expiry scenarios and introduce a number of RR-esque artifacts therefore. 
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> due to its design, the CFS scheduler is not prone to any of the
> 'attacks' that exist today against the heuristics of the stock
> scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
> work fine and do not impact interactivity and produce the expected
> behavior.

I'm always suspicious of these claims. A moderately formal regression test suite needs to be assembled and the testcases rather seriously cleaned up so they e.g. run for a deterministic period of time, have their parameters passable via command-line options instead of editing and recompiling, don't need Lindenting to be legible, and so on. With that in hand, a battery of regression tests can be run against scheduler modifications to verify their correctness and to detect any disturbance in scheduling semantics they might cause.

A very serious concern is that while a fresh scheduler may pass all these tests, later modifications may cause failures that go unnoticed because no one's doing the regression tests and there's no obvious test suite for testing types to latch onto. Another is that the testcases themselves may bitrot if they're not maintainable code.

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> the CFS scheduler has a much stronger handling of nice levels and
> SCHED_BATCH: both types of workloads should be isolated much more
> aggressively than under the vanilla scheduler.

Speaking of regression tests, let's please at least state intended nice semantics and get a regression test for CPU bandwidth distribution by nice levels going.

On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote:
> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )

And there's another one.
sched_yield() semantics need a regression test more transparent than VolanoMark or other macrobenchmarks. At some point we really need to decide what our sched_yield() is intended to do and get something out there to detect whether it's behaving as intended. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. The SMP load balancing class operations strike me as unusual and likely to trip over semantic issues in alternative scheduling classes. Getting some alternative scheduling classes out there to clarify the issues would help here, too. A more general question here is what you mean by "completely fair;" there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or inter-user fairness going on, though one might argue those are relatively obscure notions of fairness. Complete fairness arguably precludes static prioritization by nice levels, so there is also that. There is also the issue of what a fair CPU bandwidth distribution between tasks of varying desired in-isolation CPU utilization might be. I suppose my thorniest point is where the demonstration of fairness is as, say, a testcase. Perhaps it's fair now; when will we find out when that fairness has been disturbed? What these things mean when there are multiple CPU's to schedule across may also be of concern. I propose the following two testcases: (1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels Create a number of tasks at varying nice levels. Measure the CPU bandwidth allocated to each. Success depends on intent: we decide up-front that a given nice level should correspond to a given share of CPU bandwidth. 
Check to see how far from the intended distribution of CPU bandwidth according to those decided-up-front shares the actual distribution of CPU bandwidth is for the test. (2) CPU bandwidth distribution of tasks with varying CPU demands Create a number of tasks that would in isolation consume varying %cpu. Measure the CPU bandwidth allocated to each. Success depends on intent here, too. Decide up-front that a given %cpu that would be consumed in isolation should correspond to a given share of CPU bandwidth and check the actual distribution of CPU bandwidth vs. what was intended. Note that the shares need not linearly correspond to the %cpu; various sorts of things related to interactivity will make this nonlinear. A third testcase for sched_yield() should be brewed up. These testcases are oblivious to SMP. This will demand that a scheduling policy integrate with load balancing to the extent that load balancing occurs for the sake of distributing CPU bandwidth according to nice level. Some explicit decision should be made regarding that. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
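A minimal sketch of how testcase (1) above could be scored, assuming the harness has already collected the CPU time each task received. The name max_share_error_permille and the weight array are hypothetical; the mapping from nice level to weight is left to the caller, since deciding that mapping up front is exactly the open question raised above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: compare the measured CPU share of each task with
 * the share intended for it, and return the largest absolute gap in
 * parts per thousand. weight[i] encodes the decided-up-front share of
 * task i relative to the others. Values are assumed small enough that
 * multiplying by 1000 does not overflow unsigned long.
 */
long max_share_error_permille(const unsigned long *cpu_time,
                              const unsigned long *weight, size_t n)
{
	unsigned long total_time = 0, total_weight = 0;
	long worst = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		total_time += cpu_time[i];
		total_weight += weight[i];
	}
	assert(total_time > 0 && total_weight > 0);

	for (i = 0; i < n; i++) {
		long actual = (long)(cpu_time[i] * 1000 / total_time);
		long wanted = (long)(weight[i] * 1000 / total_weight);
		long diff = actual > wanted ? actual - wanted : wanted - actual;

		if (diff > worst)
			worst = diff;
	}
	return worst;
}
```

With measured times {500, 300, 200} and intended weights {6, 3, 1}, the worst error is 100 permille: the first task got 50% where 60% was intended. A regression test would then fail once this figure exceeds whatever tolerance is agreed on.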
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:52 ` Ingo Molnar 2007-04-13 23:30 ` William Lee Irwin III 2007-04-14 22:38 ` Davide Libenzi 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-13 22:52 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > > new scheduler will be active by default and all tasks will default > > to the new SCHED_FAIR interactive scheduling class. ] > > A pleasant surprise, though I did see it coming. hey ;) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > > scheduler modules. These modules encapsulate scheduling policy > > details and are handled by the scheduler core without the core > > code assuming about them too much. > > It probably needs further clarification that they're things on the > order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization > amongst the classes is furthermore assumed, and so on. [...] 
yep - they are linked via sched_ops->next pointer, with NULL delimiting the last one. > [...] They're not quite capable of being full-blown alternative > policies, though quite a bit can be crammed into them. yeah, they are not full-blown: i extended them on-demand, for the specific purposes of sched_fair.c and sched_rt.c. More can be done too. > There are issues with the per- scheduling class data not being very > well-abstracted. [...] yes. It's on my TODO list: i'll work more on extending the cleanups to those fields too. > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. i'm caching the leftmost leaf, which serves as an alternate, task-pick centric root in essence. > Another advantage of heaps is that they support decreasing priorities > directly, so that instead of removal and reinsertion, a less invasive > movement within the tree is possible. This nets additional constant > factor improvements beyond those for the next item to dequeue for the > case where a task remains runnable, but is preempted and its priority > decreased while it remains runnable. yeah. (Note that in CFS i'm not decreasing priorities anywhere though - all the priority levels in CFS stay constant, fairness is not achieved via rotating priorities or similar, it is achieved via the accounting code.) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > due to its design, the CFS scheduler is not prone to any of the > > 'attacks' that exist today against the heuristics of the stock > > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > > work fine and do not impact interactivity and produce the expected > > behavior. 
> > I'm always suspicious of these claims. [...] hey, sure - but please give it a go nevertheless, i _did_ test all these ;) > A moderately formal regression test suite needs to be assembled [...] by all means feel free! ;) > A more general question here is what you mean by "completely fair;" by that i mean the most common-sense definition: with N tasks running each gets 1/N CPU time if observed for a reasonable amount of time. Now extend this to arbitrary scheduling patterns, the end result should still be completely fair, according to the fundamental 1/N(time) rule individually applied to all the small scheduling patterns that the scheduling patterns give. (this assumes that the scheduling patterns are reasonably independent of each other - if they are not then there's no reasonable definition of fairness that makes sense, and we might as well use the 1/N rule for those cases too.) > there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or > inter-user fairness going on, though one might argue those are > relatively obscure notions of fairness. [...] sure, i mainly concentrated on what we have in Linux today. The things you mention are add-ons that i can see handling via new scheduling classes: all the CKRM and containers type of CPU time management facilities. > What these things mean when there are multiple CPU's to schedule > across may also be of concern. that is handled by the existing smp-nice load balancer, that logic is preserved under CFS. > These testcases are oblivious to SMP. This will demand that a > scheduling policy integrate with load balancing to the extent that > load balancing occurs for the sake of distributing CPU bandwidth > according to nice level. Some explicit decision should be made > regarding that. this should already work reasonably fine with CFS: try massive_intr.c on an SMP box. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-13 23:30 ` William Lee Irwin III 2007-04-13 23:44 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> A binomial heap would likely serve your purposes better than rbtrees. >> It's faster to have the next item to dequeue at the root of the tree >> structure rather than a leaf, for one. There are, of course, other >> priority queue structures (e.g. van Emde Boas) able to exploit the >> limited precision of the priority key for faster asymptotics, though >> actual performance is an open question. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > i'm caching the leftmost leaf, which serves as an alternate, task-pick > centric root in essence. I noticed that, yes. It seemed a better idea to me to use a data structure that has what's needed built-in, but I suppose it's not gospel. * William Lee Irwin III <wli@holomorphy.com> wrote: >> Another advantage of heaps is that they support decreasing priorities >> directly, so that instead of removal and reinsertion, a less invasive >> movement within the tree is possible. This nets additional constant >> factor improvements beyond those for the next item to dequeue for the >> case where a task remains runnable, but is preempted and its priority >> decreased while it remains runnable. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > yeah. (Note that in CFS i'm not decreasing priorities anywhere though - > all the priority levels in CFS stay constant, fairness is not achieved > via rotating priorities or similar, it is achieved via the accounting > code.) 
Sorry, "priority" here would be from the POV of the queue data structure. From the POV of the scheduler it would be resetting the deadline or whatever the nomenclature cooked up for things is, most obviously in requeue_task_fair() and task_tick_fair(). * William Lee Irwin III <wli@holomorphy.com> wrote: >> I'm always suspicious of these claims. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > hey, sure - but please give it a go nevertheless, i _did_ test all these > ;) The suspicion essentially centers around how long the state of affairs will hold up because comprehensive re-testing is not noticeably done upon updates to scheduling code or kernel point releases. * William Lee Irwin III <wli@holomorphy.com> wrote: >> A moderately formal regression test suite needs to be assembled [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by all means feel free! ;) I can only do so much, but I have done work to clean up other testcases going around. I'm mostly looking at testcases as I go over them or develop some interest in the subject and rewriting those that already exist or hammering out new ones as I need them. The main contribution toward this is that I've sort of made a mental note to stash the results of the effort somewhere and pass them along to those who do regular testing on kernels or otherwise import test suites into their collections. * William Lee Irwin III <wli@holomorphy.com> wrote: >> A more general question here is what you mean by "completely fair;" On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by that i mean the most common-sense definition: with N tasks running > each gets 1/N CPU time if observed for a reasonable amount of time. Now > extend this to arbitrary scheduling patterns, the end result should > still be completely fair, according to the fundamental 1/N(time) rule > individually applied to all the small scheduling patterns that the > scheduling patterns give. 
(this assumes that the scheduling patterns are > reasonably independent of each other - if they are not then there's no > reasonable definition of fairness that makes sense, and we might as well > use the 1/N rule for those cases too.) I'd start with identically-behaving CPU-bound tasks here. It's easy enough to hammer out a testcase that starts up N CPU-bound tasks, runs them for a few minutes, stops them, collects statistics on their runtime, and gives us an idea of whether 1/N came out properly. I'll get around to that at some point. Where it gets complex is when the behavior patterns vary, e.g. they're not entirely CPU-bound and their desired in-isolation CPU utilization varies, or when nice levels vary, or both vary. I went on about testcases for those in particular in the prior post, though not both at once. The nice level one in particular needs an up-front goal for distribution of CPU bandwidth in a mixture of competing tasks with varying nice levels. There are different ways to define fairness, but a uniform distribution of CPU bandwidth across a set of identical competing tasks is a good, testable definition. * William Lee Irwin III <wli@holomorphy.com> wrote: >> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or >> inter-user fairness going on, though one might argue those are >> relatively obscure notions of fairness. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > sure, i mainly concentrated on what we have in Linux today. The things > you mention are add-ons that i can see handling via new scheduling > classes: all the CKRM and containers type of CPU time management > facilities. At some point the CKRM and container people should be pinged to see what (if anything) they need to achieve these sorts of things. It's not clear to me that the specific cases I cited are considered relevant to anyone. I presume that if they are, someone will pipe up with a feature request. 
It was more a sort of catalogue of different notions of fairness that could arise than any sort of suggestion. * William Lee Irwin III <wli@holomorphy.com> wrote: >> What these things mean when there are multiple CPU's to schedule >> across may also be of concern. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > that is handled by the existing smp-nice load balancer, that logic is > preserved under CFS. Given the things going wrong, I'm curious as to whether that works, and if so, how well. I'll drop that into my list of testcases that should be arranged for, though I won't guarantee that I'll get to it myself in any sort of timely fashion. What this ultimately needs is specifying the semantics of nice levels so that we can say that a mixture of competing tasks with varying nice levels should have an ideal distribution of CPU bandwidth to check for. * William Lee Irwin III <wli@holomorphy.com> wrote: >> These testcases are oblivious to SMP. This will demand that a >> scheduling policy integrate with load balancing to the extent that >> load balancing occurs for the sake of distributing CPU bandwidth >> according to nice level. Some explicit decision should be made >> regarding that. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > this should already work reasonably fine with CFS: try massive_intr.c on > an SMP box. Where is massive_intr.c, BTW? -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:30 ` William Lee Irwin III
@ 2007-04-13 23:44   ` Ingo Molnar
  2007-04-13 23:58     ` William Lee Irwin III
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-13 23:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]

* William Lee Irwin III <wli@holomorphy.com> wrote:

> Where it gets complex is when the behavior patterns vary, e.g. they're
> not entirely CPU-bound and their desired in-isolation CPU utilization
> varies, or when nice levels vary, or both vary. [...]

yes. I tested things like 'massive_intr.c' (attached, written by Satoru
Takeuchi) which starts N tasks which each work for 8msec then sleep
1msec:

from its output, the second column is the CPU time each thread got, the
more even, the fairer the scheduling. On vanilla i get:

 mercury:~> ./massive_intr 10 10
 024873  00000150
 024874  00000123
 024870  00000069
 024868  00000068
 024866  00000051
 024875  00000206
 024872  00000093
 024869  00000138
 024867  00000078
 024871  00000223

on CFS i get:

 neptune:~> ./massive_intr 10 10
 002266  00000112
 002260  00000113
 002261  00000112
 002267  00000112
 002269  00000112
 002265  00000112
 002262  00000113
 002268  00000113
 002264  00000112
 002263  00000113

so it is quite a bit more even ;)

another related test-utility is one i wrote:

  http://people.redhat.com/mingo/scheduler-patches/ring-test.c

this is a ring of 100 tasks each doing work for 100 msecs and then
sleeping for 1 msec. I usually test this by also running a CPU hog in
parallel to it, and checking whether it gets ~50.0% of CPU time under
CFS.
(it does)

	Ingo

[-- Attachment #2: massive_intr.c --]
[-- Type: text/plain, Size: 9833 bytes --]

#if 0

Hi Ingo and all,

When I was executing massive interactive processes, I found that some of
them occupy CPU time and the others hardly run.

It seems that some of the processes which occupy CPU time always have max
effective prio (default+5) and the others have max - 1. What happens here
is...

 1. If there is a moderate number of max interactive processes, they can
    be re-inserted into the active queue again and again without their
    priority falling.
 2. In this case, the others seldom run, and can't get max effective
    priority at next exhausting because the scheduler considers them to
    sleep too long.
 3. Goto 1, OOPS!

Unfortunately I haven't been able to make the patch resolving this problem
yet. Any idea?

I also attach the test program which easily recreates this problem.

Test program flow:

 1. First process starts child processes and waits for 5 minutes.
 2. Each child process executes "work 8 msec and sleep 1 msec" loop
    continuously.
 3. After 3 minutes have passed, each child process prints the # of loops
    which it executed.

What expected:

 Each child process executes a nearly equal # of loops.

Test environment:

 - kernel: 2.6.20(*1)
 - # of CPUs: 1 or 2
 - # of child processes: 200 or 400
 - nice value: 0 or 20(*2)

 *1) I confirmed that 2.6.21-rc5 has no change regarding this problem.
 *2) If a process has nice 20, the scheduler never regards it as
     interactive.
Test results:

-----------+----------------+------+------------------------------------
 # of CPUs | # of processes | nice | result
-----------+----------------+------+------------------------------------
           |                |   20 | looks good
 1(i386)   |                +------+------------------------------------
           |                |    0 | 4 processes occupy 98% of CPU time
-----------+      200       +------+------------------------------------
           |                |   20 | looks good
           |                +------+------------------------------------
           |                |    0 | 8 processes occupy 72% of CPU time
 2(ia64)   +----------------+------+------------------------------------
           |      400       |   20 | looks good
           |                +------+------------------------------------
           |                |    0 | 8 processes occupy 98% of CPU time
-----------+----------------+------+------------------------------------

FYI. 2.6.21-rc3-mm1 (enabling RSDL scheduler) works fine in all the cases :-)

Thanks,

Satoru

-------------------------------------------------------------------------------
#endif
/*
 * massive_intr - run @nproc interactive processes and print the number of
 *                loops(*1) each process executes in @runtime secs.
 *
 *      *1) "work 8 msec and sleep 1msec" loop
 *
 * Usage: massive_intr <nproc> <runtime>
 *
 *        @nproc:   number of processes
 *        @runtime: execute time[sec]
 *
 *      ex) If you want to run 300 processes for 5 mins, issue the
 *          command as follows:
 *
 *              $ massive_intr 300 300
 *
 * How to build:
 *
 *      cc -o massive_intr massive_intr.c -lrt
 *
 *
 * Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, write to the Free Software Foundation, Inc., * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. * * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */ #include <sys/time.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/wait.h> #include <fcntl.h> #include <unistd.h> #include <semaphore.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <string.h> #include <errno.h> #include <err.h> #define WORK_MSECS 8 #define SLEEP_MSECS 1 #define MAX_PROC 1024 #define SAMPLE_COUNT 1000000000 #define USECS_PER_SEC 1000000 #define USECS_PER_MSEC 1000 #define NSECS_PER_MSEC 1000000 #define SHMEMSIZE 4096 static const char *shmname = "/sched_interactive_shmem"; static void *shmem; static sem_t *printsem; static int nproc; static int runtime; static int fd; static time_t *first; static pid_t pid[MAX_PROC]; static int return_code; static void cleanup_resources(void) { if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); if (close(fd) < 0) warn("close() failed"); } static void abnormal_exit(void) { if (kill(getppid(), SIGUSR2) < 0) err(EXIT_FAILURE, "kill() failed"); } static void sighandler(int signo) { } static void sighandler2(int signo) { return_code = EXIT_FAILURE; } static void loopfnc(int nloop) { int i; for (i = 0; i < nloop; i++) ; } static int loop_per_msec(void) { struct timeval tv[2]; int before, after; if (gettimeofday(&tv[0], NULL) < 0) return -1; loopfnc(SAMPLE_COUNT); if (gettimeofday(&tv[1], NULL) < 0) return -1; before = tv[0].tv_sec*USECS_PER_SEC+tv[0].tv_usec; after = tv[1].tv_sec*USECS_PER_SEC+tv[1].tv_usec; return SAMPLE_COUNT/(after - before)*USECS_PER_MSEC; } static void *test_job(void *arg) { int l = (int)arg; int count = 0; time_t current; sigset_t 
sigset; struct sigaction sa; struct timespec ts = { 0, NSECS_PER_MSEC*SLEEP_MSECS}; sa.sa_handler = sighandler; if (sigemptyset(&sa.sa_mask) < 0) { warn("sigemptyset() failed"); abnormal_exit(); } sa.sa_flags = 0; if (sigaction(SIGUSR1, &sa, NULL) < 0) { warn("sigaction() failed"); abnormal_exit(); } if (sigemptyset(&sigset) < 0) { warn("sigfillset() failed"); abnormal_exit(); } sigsuspend(&sigset); if (errno != EINTR) { warn("sigsuspend() failed"); abnormal_exit(); } /* main loop */ do { loopfnc(WORK_MSECS*l); if (nanosleep(&ts, NULL) < 0) { warn("nanosleep() failed"); abnormal_exit(); } count++; if (time(&current) == -1) { warn("time() failed"); abnormal_exit(); } } while (difftime(current, *first) < runtime); if (sem_wait(printsem) < 0) { warn("sem_wait() failed"); abnormal_exit(); } printf("%06d\t%08d\n", getpid(), count); if (sem_post(printsem) < 0) { warn("sem_post() failed"); abnormal_exit(); } exit(EXIT_SUCCESS); } static void usage(void) { fprintf(stderr, "Usage : massive_intr <nproc> <runtime>\n" "\t\tnproc : number of processes\n" "\t\truntime : execute time[sec]\n"); exit(EXIT_FAILURE); } int main(int argc, char **argv) { int i, j; int status; sigset_t sigset; struct sigaction sa; int c; if (argc != 3) usage(); nproc = strtol(argv[1], NULL, 10); if (errno || nproc < 1 || nproc > MAX_PROC) err(EXIT_FAILURE, "invalid multinum"); runtime = strtol(argv[2], NULL, 10); if (errno || runtime <= 0) err(EXIT_FAILURE, "invalid runtime"); sa.sa_handler = sighandler2; if (sigemptyset(&sa.sa_mask) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); sa.sa_flags = 0; if (sigaction(SIGUSR2, &sa, NULL) < 0) err(EXIT_FAILURE, "sigaction() failed"); if (sigemptyset(&sigset) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); if (sigaddset(&sigset, SIGUSR1) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if (sigaddset(&sigset, SIGUSR2) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if (sigprocmask(SIG_BLOCK, &sigset, NULL) < 0) err(EXIT_FAILURE, "sigprocmask() failed"); /* setup shared
memory */ if ((fd = shm_open(shmname, O_CREAT | O_RDWR, 0644)) < 0) err(EXIT_FAILURE, "shm_open() failed"); if (shm_unlink(shmname) < 0) { warn("shm_unlink() failed"); goto err_close; } if (ftruncate(fd, SHMEMSIZE) < 0) { warn("ftruncate() failed"); goto err_close; } shmem = mmap(NULL, SHMEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (shmem == (void *)-1) { warn("mmap() failed"); goto err_unmap; } printsem = shmem; first = shmem + sizeof(*printsem); /* initialize semaphore */ if ((sem_init(printsem, 1, 1)) < 0) { warn("sem_init() failed"); goto err_unmap; } if ((c = loop_per_msec()) < 0) { fprintf(stderr, "loop_per_msec() failed\n"); goto err_sem; } for (i = 0; i < nproc; i++) { pid[i] = fork(); if (pid[i] == -1) { warn("fork() failed\n"); for (j = 0; j < i; j++) if (kill(pid[j], SIGKILL) < 0) warn("kill() failed"); goto err_sem; } if (pid[i] == 0) test_job((void *)c); } if (sigemptyset(&sigset) < 0) { warn("sigemptyset() failed"); goto err_proc; } if (sigaddset(&sigset, SIGUSR2) < 0) { warn("sigaddset() failed"); goto err_proc; } if (sigprocmask(SIG_UNBLOCK, &sigset, NULL) < 0) { warn("sigprocmask() failed"); goto err_proc; } if (time(first) < 0) { warn("time() failed"); goto err_proc; } if ((kill(0, SIGUSR1)) == -1) { warn("kill() failed"); goto err_proc; } for (i = 0; i < nproc; i++) { if (wait(&status) < 0) { warn("wait() failed"); goto err_proc; } } cleanup_resources(); exit(return_code); err_proc: for (i = 0; i < nproc; i++) if (kill(pid[i], SIGKILL) < 0) if (errno != ESRCH) warn("kill() failed"); err_sem: if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); err_unmap: if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); err_close: if (close(fd) < 0) warn("close() failed"); exit(EXIT_FAILURE); } ^ permalink raw reply [flat|nested] 713+ messages in thread
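A small sketch of how the per-process loop counts printed by massive_intr could be reduced to one evenness figure for a regression harness. The helper name loop_count_spread is hypothetical and is not part of the attached program; it assumes the second output column has already been parsed into an array.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given the loop counts reported by each child of
 * massive_intr, return the gap between the busiest and the most starved
 * process. A smaller gap means a more even distribution of CPU time.
 */
unsigned int loop_count_spread(const unsigned int *counts, size_t n)
{
	unsigned int lo, hi;
	size_t i;

	assert(n > 0);
	lo = hi = counts[0];
	for (i = 1; i < n; i++) {
		if (counts[i] < lo)
			lo = counts[i];
		if (counts[i] > hi)
			hi = counts[i];
	}
	return hi - lo;
}
```

Applied to the two runs quoted earlier in this message, this gives 172 for the vanilla scheduler (223 vs. 51 loops) and 1 for CFS (113 vs. 112 loops).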
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:44 ` Ingo Molnar @ 2007-04-13 23:58 ` William Lee Irwin III 0 siblings, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:58 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Where it gets complex is when the behavior patterns vary, e.g. they're >> not entirely CPU-bound and their desired in-isolation CPU utilization >> varies, or when nice levels vary, or both vary. [...] On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote: > yes. I tested things like 'massive_intr.c' (attached, written by Satoru > Takeuchi) which starts N tasks which each work for 8msec then sleep > 1msec: [...] > another related test-utility is one i wrote: > http://people.redhat.com/mingo/scheduler-patches/ring-test.c > this is a ring of 100 tasks each doing work for 100 msecs and then > sleeping for 1 msec. I usually test this by also running a CPU hog in > parallel to it, and checking whether it gets ~50.0% of CPU time under > CFS. (it does) These are both tremendously useful. The code is also in rather good shape so only minimal modifications (for massive_intr.c I'm not even sure if any are needed at all) are needed to plug them into the test harness I'm aware of. I'll queue them both for me to adjust and send over to testers I don't want to burden with hacking on testcases I myself am asking them to add to their suites. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi ` (2 more replies) 1 sibling, 3 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-14 22:38 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > The CFS patch uses a completely different approach and implementation > > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > > that of RSDL/SD, which is a high standard to meet :-) Testing > > feedback is welcome to decide this one way or another. [ and, in any > > case, all of SD's logic could be added via a kernel/sched_sd.c module > > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > > time-ordered rbtree to build a 'timeline' of future task execution, > > and thus has no 'array switch' artifacts (by which both the vanilla > > scheduler and RSDL/SD are affected). > > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. Haven't looked at the scheduler code yet, but for a similar problem I use a time ring. 
The ring has Ns (a power of 2 is better) slots (where tasks are
queued - in my case they were some sort of timers), and it has a current
base index (Ib), a current base time (Tb) and a time granularity (Tg). It
also has a bitmap with bits telling you which slots contain queued tasks.
An item (task) that has to be scheduled at time T will be queued in the slot:

	S = Ib + min((T - Tb) / Tg, Ns - 1);

Items with T more than Ns*Tg past Tb will be scheduled in the relative last
slot (choosing a proper Ns and Tg can minimize this).
Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to
suit your needs.
This is a simple bench between time-ring (TR) and CFS queueing:

http://www.xmailserver.org/smart-queue.c

In my box (Dual Opteron 252):

davide@alien:~$ ./smart-queue -n 8
CFS = 142.21 cycles/loop
TR  = 72.33 cycles/loop
davide@alien:~$ ./smart-queue -n 16
CFS = 188.74 cycles/loop
TR  = 83.79 cycles/loop
davide@alien:~$ ./smart-queue -n 32
CFS = 221.36 cycles/loop
TR  = 75.93 cycles/loop
davide@alien:~$ ./smart-queue -n 64
CFS = 242.89 cycles/loop
TR  = 81.29 cycles/loop

- Davide
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 22:38 ` Davide Libenzi
@ 2007-04-14 23:26   ` Davide Libenzi
  2007-04-15  4:01   ` William Lee Irwin III
  2007-04-15 23:09   ` Pavel Pisa
  2 siblings, 0 replies; 713+ messages in thread
From: Davide Libenzi @ 2007-04-14 23:26 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, 14 Apr 2007, Davide Libenzi wrote:

> Haven't looked at the scheduler code yet, but for a similar problem I use
> a time ring. The ring has Ns (a power of 2 is better) slots (where tasks
> are queued - in my case they were some sort of timers), and it has a
> current base index (Ib), a current base time (Tb) and a time granularity
> (Tg). It also has a bitmap with bits telling you which slots contain
> queued tasks.
> An item (task) that has to be scheduled at time T will be queued in the slot:
>
> S = Ib + min((T - Tb) / Tg, Ns - 1);

... mod Ns, of course ;)

- Davide
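A minimal sketch of the slot computation described above, with the "mod Ns" correction folded in. The struct layout and the names time_ring and ring_slot are hypothetical (the linked smart-queue.c may be organised quite differently). Two assumptions beyond what the mail states: Ns is a power of two, so the modulo becomes a mask, and an item whose time T is already behind Tb goes into the base slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring descriptor using the symbols from the mail. */
struct time_ring {
	unsigned int ns;	/* Ns: number of slots, assumed power of two */
	unsigned int ib;	/* Ib: current base index */
	uint64_t tb;		/* Tb: current base time */
	uint64_t tg;		/* Tg: time granularity */
};

/* S = (Ib + min((T - Tb) / Tg, Ns - 1)) mod Ns */
unsigned int ring_slot(const struct time_ring *r, uint64_t t)
{
	uint64_t delta;

	assert(r->ns != 0 && (r->ns & (r->ns - 1)) == 0);
	assert(r->tg != 0);

	/* Assumption: overdue items are queued in the base slot. */
	delta = t > r->tb ? (t - r->tb) / r->tg : 0;

	/* Items too far in the future share the relative last slot. */
	if (delta > r->ns - 1)
		delta = r->ns - 1;

	return (r->ib + (unsigned int)delta) & (r->ns - 1);
}
```

For example, with Ns = 8, Ib = 6, Tb = 1000 and Tg = 10, an item due at T = 1025 lands in slot 0 (two slots past the base, wrapping around), and an item due at T = 5000 is clamped to the relative last slot, which is slot 5.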
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi @ 2007-04-15 4:01 ` William Lee Irwin III 2007-04-15 4:18 ` Davide Libenzi 2007-04-15 23:09 ` Pavel Pisa 2 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 4:01 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: >> A binomial heap would likely serve your purposes better than rbtrees. [...] On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the slot: > S = Ib + min((T - Tb) / Tg, Ns - 1); > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. I used a similar sort of queue in the virtual deadline scheduler I wrote in 2003 or thereabouts. CFS uses queue priorities with too high a precision to map directly to this (queue priorities are marked as "key" in the cfs code and should not be confused with task priorities). The elder virtual deadline scheduler used millisecond resolution and a rather different calculation for its equivalent of ->key, which explains how it coped with a limited priority space. The two basic attacks on such large priority spaces are the near future vs. 
far future subdivisions and subdividing the priority space into (most often regular) intervals. Subdividing the priority space into intervals is the most obvious; you simply use some O(lg(n)) priority queue as the bucket discipline in the "time ring," queue by the upper bits of the queue priority in the time ring, and by the lower bits in the O(lg(n)) bucket discipline. The near future vs. far future subdivision is maintaining the first N tasks in a low-constant-overhead structure like a sorted list and the remainder in some other sort of queue structure intended to handle large numbers of elements gracefully. The distribution of queue priorities strongly influences which of the methods is most potent, though it should be clear the methods can be used in combination. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 4:18 ` Davide Libenzi 0 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-15 4:18 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, 14 Apr 2007, William Lee Irwin III wrote: > The two basic attacks on such large priority spaces are the near future > vs. far future subdivisions and subdividing the priority space into > (most often regular) intervals. Subdividing the priority space into > intervals is the most obvious; you simply use some O(lg(n)) priority > queue as the bucket discipline in the "time ring," queue by the upper > bits of the queue priority in the time ring, and by the lower bits in > the O(lg(n)) bucket discipline. Sure. If you really need sub-millisecond precision, you can replace the bucket's list_head with an rb_root. It may be not necessary though for a cpu scheduler (still, didn't read Ingo's code yet). - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
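[ Editorial sketch: the interval subdivision discussed above (upper key bits select the ring slot, lower bits order items inside the bucket) might be illustrated as follows. This is a hedged example, not code from the thread: SLOT_SHIFT, NSLOTS, struct item, slot_of and bucket_insert are hypothetical names, and a sorted singly linked list stands in for the O(lg(n)) bucket discipline (such as an rb_root) purely to keep the example short. It also assumes that all queued keys lie within one revolution of the ring, so keys never alias into the same slot. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_SHIFT 20u   /* assumed: 2^20 ns (about 1 ms) of key range per slot */
#define NSLOTS     64u   /* assumed ring size, a power of two */

struct item {
    uint64_t key;        /* nanosecond-resolution queue priority */
    struct item *next;
};

static struct item *buckets[NSLOTS];

/* Upper bits of the key choose the slot in the ring. */
static unsigned slot_of(uint64_t key)
{
    return (unsigned)(key >> SLOT_SHIFT) & (NSLOTS - 1);
}

/*
 * Within a slot, items stay ordered by key. Given the one-revolution
 * assumption, the upper bits are equal inside a bucket, so this is the
 * same as ordering by the lower SLOT_SHIFT bits.
 */
static void bucket_insert(struct item *it)
{
    struct item **pp;

    assert(it != NULL);
    pp = &buckets[slot_of(it->key)];
    while (*pp && (*pp)->key <= it->key)
        pp = &(*pp)->next;
    it->next = *pp;
    *pp = it;
}
```

[ Swapping the list for a balanced tree per bucket changes only bucket_insert; the slot selection stays O(1). ]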
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 23:09 ` Pavel Pisa 2007-04-16 5:47 ` Davide Libenzi 2 siblings, 1 reply; 713+ messages in thread From: Pavel Pisa @ 2007-04-15 23:09 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007 00:38, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the > slot: > > S = Ib + min((T - Tb) / Tg, Ns - 1); > > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. > This is a simple bench between time-ring (TR) and CFS queueing: > > http://www.xmailserver.org/smart-queue.c > > In my box (Dual Opteron 252): > > davide@alien:~$ ./smart-queue -n 8 > CFS = 142.21 cycles/loop > TR = 72.33 cycles/loop > davide@alien:~$ ./smart-queue -n 16 > CFS = 188.74 cycles/loop > TR = 83.79 cycles/loop > davide@alien:~$ ./smart-queue -n 32 > CFS = 221.36 cycles/loop > TR = 75.93 cycles/loop > davide@alien:~$ ./smart-queue -n 64 > CFS = 242.89 cycles/loop > TR = 81.29 cycles/loop Hello all, I cannot help myself to not report results with GAVL tree algorithm there as an another race competitor. 
I believe that it is a better solution for large priority queues than RB-trees and even heap trees. On the other hand, it is debatable whether the scheduler needs such scalability. The AVL heritage guarantees a lower tree height, which results in shorter search times and could be useful for other uses in the kernel. The GAVL algorithm is AVL-tree based, so it does not suffer from the "infinite" priority granularity problem as TR does. It also supports a generalized case where the tree is not fully balanced, which allows cutting the first item without rebalancing. This degrades the tree by at most one level more than a non-degraded AVL tree, which is still considerably better than the RB-tree maximum. http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c The description behind the code is here: http://cmp.felk.cvut.cz/~pisa/ulan/gavl.pdf The code is part of the much broader uLUt library http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf http://sourceforge.net/project/showfiles.php?group_id=118937&package_id=130840 I have included all required GAVL code directly in smart-queue-v-gavl.c to make testing easy. These tests were run on my somewhat dated computer, a Duron 600 MHz. The tests are run twice to suppress run-order influence.
./smart-queue-v-gavl -n 1 -l 2000000 gavl_cfs = 55.66 cycles/loop CFS = 88.33 cycles/loop TR = 141.78 cycles/loop CFS = 90.45 cycles/loop gavl_cfs = 55.38 cycles/loop ./smart-queue-v-gavl -n 2 -l 2000000 gavl_cfs = 82.85 cycles/loop CFS = 104.18 cycles/loop TR = 145.21 cycles/loop CFS = 102.74 cycles/loop gavl_cfs = 82.05 cycles/loop ./smart-queue-v-gavl -n 4 -l 2000000 gavl_cfs = 137.45 cycles/loop CFS = 156.47 cycles/loop TR = 142.00 cycles/loop CFS = 152.65 cycles/loop gavl_cfs = 139.38 cycles/loop ./smart-queue-v-gavl -n 10 -l 2000000 gavl_cfs = 229.22 cycles/loop (WORSE) CFS = 206.26 cycles/loop TR = 140.81 cycles/loop CFS = 208.29 cycles/loop gavl_cfs = 223.62 cycles/loop (WORSE) ./smart-queue-v-gavl -n 100 -l 2000000 gavl_cfs = 257.66 cycles/loop CFS = 329.68 cycles/loop TR = 142.20 cycles/loop CFS = 319.34 cycles/loop gavl_cfs = 260.02 cycles/loop ./smart-queue-v-gavl -n 1000 -l 2000000 gavl_cfs = 258.41 cycles/loop CFS = 393.04 cycles/loop TR = 134.76 cycles/loop CFS = 392.20 cycles/loop gavl_cfs = 260.93 cycles/loop ./smart-queue-v-gavl -n 10000 -l 2000000 gavl_cfs = 259.45 cycles/loop CFS = 605.89 cycles/loop TR = 196.69 cycles/loop CFS = 622.60 cycles/loop gavl_cfs = 262.72 cycles/loop ./smart-queue-v-gavl -n 100000 -l 2000000 gavl_cfs = 258.21 cycles/loop CFS = 845.62 cycles/loop TR = 315.37 cycles/loop CFS = 860.21 cycles/loop gavl_cfs = 258.94 cycles/loop The GAVL code has not been tuned by any "likely"/"unlikely" constructs. It brings even some other overhead from it generic design which is not necessary for this use - it keeps permanently even pointer to the last element, ensures, that the insertion order is preserved for same key values etc. But it still proves much better scalability then kernel used RB-tree code. On the other hand, it does not encode color/height in one of the pointers and requires additional field for height. May it be, that difference is due some bug in my testing, then I would be interrested in correction. 
The test case is probably oversimplified. I have already run a number of different tests against the GAVL code in the past to compare it with different tree and queue implementations, and I have not found a case with real performance degradation. On the other hand, there are cases with small item counts where GAVL is sometimes a little worse than others (an array-based heap tree, for example). The GAVL code itself is used in several open-source and commercial projects, and we have noticed no problems since one small fix at the time of the first release in 2004. Best wishes Pavel Pisa e-mail: pisa@cmp.felk.cvut.cz www: http://cmp.felk.cvut.cz/~pisa work: http://www.pikron.com ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:09 ` Pavel Pisa @ 2007-04-16 5:47 ` Davide Libenzi 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-16 5:47 UTC (permalink / raw) To: Pavel Pisa Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, 16 Apr 2007, Pavel Pisa wrote: > I cannot help myself to not report results with GAVL > tree algorithm there as an another race competitor. > I believe, that it is better solution for large priority > queues than RB-tree and even heap trees. It could be > disputable if the scheduler needs such scalability on > the other hand. The AVL heritage guarantees lower height > which results in shorter search times which could > be profitable for other uses in kernel. > > GAVL algorithm is AVL tree based, so it does not suffer from > "infinite" priorities granularity there as TR does. It allows > use for generalized case where tree is not fully balanced. > This allows to cut the first item withour rebalancing. > This leads to the degradation of the tree by one more level > (than non degraded AVL gives) in maximum, which is still > considerably better than RB-trees maximum. 
> > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c Here are the results on my Opteron 252: Testing N=1 gavl_cfs = 187.20 cycles/loop CFS = 194.16 cycles/loop TR = 314.87 cycles/loop CFS = 194.15 cycles/loop gavl_cfs = 187.15 cycles/loop Testing N=2 gavl_cfs = 268.94 cycles/loop CFS = 305.53 cycles/loop TR = 313.78 cycles/loop CFS = 289.58 cycles/loop gavl_cfs = 266.02 cycles/loop Testing N=4 gavl_cfs = 452.13 cycles/loop CFS = 518.81 cycles/loop TR = 311.54 cycles/loop CFS = 516.23 cycles/loop gavl_cfs = 450.73 cycles/loop Testing N=8 gavl_cfs = 609.29 cycles/loop CFS = 644.65 cycles/loop TR = 308.11 cycles/loop CFS = 667.01 cycles/loop gavl_cfs = 592.89 cycles/loop Testing N=16 gavl_cfs = 686.30 cycles/loop CFS = 807.41 cycles/loop TR = 317.20 cycles/loop CFS = 810.24 cycles/loop gavl_cfs = 688.42 cycles/loop Testing N=32 gavl_cfs = 756.57 cycles/loop CFS = 852.14 cycles/loop TR = 301.22 cycles/loop CFS = 876.12 cycles/loop gavl_cfs = 758.46 cycles/loop Testing N=64 gavl_cfs = 831.97 cycles/loop CFS = 997.16 cycles/loop TR = 304.74 cycles/loop CFS = 1003.26 cycles/loop gavl_cfs = 832.83 cycles/loop Testing N=128 gavl_cfs = 897.33 cycles/loop CFS = 1030.36 cycles/loop TR = 295.65 cycles/loop CFS = 1035.29 cycles/loop gavl_cfs = 892.51 cycles/loop Testing N=256 gavl_cfs = 963.17 cycles/loop CFS = 1146.04 cycles/loop TR = 295.35 cycles/loop CFS = 1162.04 cycles/loop gavl_cfs = 966.31 cycles/loop Testing N=512 gavl_cfs = 1029.82 cycles/loop CFS = 1218.34 cycles/loop TR = 288.78 cycles/loop CFS = 1257.97 cycles/loop gavl_cfs = 1029.83 cycles/loop Testing N=1024 gavl_cfs = 1091.76 cycles/loop CFS = 1318.47 cycles/loop TR = 287.74 cycles/loop CFS = 1311.72 cycles/loop gavl_cfs = 1093.29 cycles/loop Testing N=2048 gavl_cfs = 1153.03 cycles/loop CFS = 1398.84 cycles/loop TR = 286.75 cycles/loop CFS = 1438.68 cycles/loop gavl_cfs = 1149.97 cycles/loop There seem to be some difference from your numbers. This is with: gcc version 4.1.2 and -O2. 
But then an Opteron can behave quite differently than a Duron on a bench like this ;) - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:47 ` Davide Libenzi @ 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 0 replies; 713+ messages in thread From: Pavel Pisa @ 2007-04-17 0:37 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 07:47, Davide Libenzi wrote: > On Mon, 16 Apr 2007, Pavel Pisa wrote: > > I cannot help myself to not report results with GAVL > > tree algorithm there as an another race competitor. > > I believe, that it is better solution for large priority > > queues than RB-tree and even heap trees. It could be > > disputable if the scheduler needs such scalability on > > the other hand. The AVL heritage guarantees lower height > > which results in shorter search times which could > > be profitable for other uses in kernel. > > > > GAVL algorithm is AVL tree based, so it does not suffer from > > "infinite" priorities granularity there as TR does. It allows > > use for generalized case where tree is not fully balanced. > > This allows to cut the first item withour rebalancing. > > This leads to the degradation of the tree by one more level > > (than non degraded AVL gives) in maximum, which is still > > considerably better than RB-trees maximum. 
> > > > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c > > Here are the results on my Opteron 252: > > Testing N=1 > gavl_cfs = 187.20 cycles/loop > CFS = 194.16 cycles/loop > TR = 314.87 cycles/loop > CFS = 194.15 cycles/loop > gavl_cfs = 187.15 cycles/loop > > Testing N=2 > gavl_cfs = 268.94 cycles/loop > CFS = 305.53 cycles/loop > TR = 313.78 cycles/loop > CFS = 289.58 cycles/loop > gavl_cfs = 266.02 cycles/loop > > Testing N=4 > gavl_cfs = 452.13 cycles/loop > CFS = 518.81 cycles/loop > TR = 311.54 cycles/loop > CFS = 516.23 cycles/loop > gavl_cfs = 450.73 cycles/loop > > Testing N=8 > gavl_cfs = 609.29 cycles/loop > CFS = 644.65 cycles/loop > TR = 308.11 cycles/loop > CFS = 667.01 cycles/loop > gavl_cfs = 592.89 cycles/loop > > Testing N=16 > gavl_cfs = 686.30 cycles/loop > CFS = 807.41 cycles/loop > TR = 317.20 cycles/loop > CFS = 810.24 cycles/loop > gavl_cfs = 688.42 cycles/loop > > Testing N=32 > gavl_cfs = 756.57 cycles/loop > CFS = 852.14 cycles/loop > TR = 301.22 cycles/loop > CFS = 876.12 cycles/loop > gavl_cfs = 758.46 cycles/loop > > Testing N=64 > gavl_cfs = 831.97 cycles/loop > CFS = 997.16 cycles/loop > TR = 304.74 cycles/loop > CFS = 1003.26 cycles/loop > gavl_cfs = 832.83 cycles/loop > > Testing N=128 > gavl_cfs = 897.33 cycles/loop > CFS = 1030.36 cycles/loop > TR = 295.65 cycles/loop > CFS = 1035.29 cycles/loop > gavl_cfs = 892.51 cycles/loop > > Testing N=256 > gavl_cfs = 963.17 cycles/loop > CFS = 1146.04 cycles/loop > TR = 295.35 cycles/loop > CFS = 1162.04 cycles/loop > gavl_cfs = 966.31 cycles/loop > > Testing N=512 > gavl_cfs = 1029.82 cycles/loop > CFS = 1218.34 cycles/loop > TR = 288.78 cycles/loop > CFS = 1257.97 cycles/loop > gavl_cfs = 1029.83 cycles/loop > > Testing N=1024 > gavl_cfs = 1091.76 cycles/loop > CFS = 1318.47 cycles/loop > TR = 287.74 cycles/loop > CFS = 1311.72 cycles/loop > gavl_cfs = 1093.29 cycles/loop > > Testing N=2048 > gavl_cfs = 1153.03 cycles/loop > CFS = 1398.84 cycles/loop > TR = 286.75 cycles/loop 
> CFS = 1438.68 cycles/loop > gavl_cfs = 1149.97 cycles/loop > > > There seem to be some difference from your numbers. This is with: > > gcc version 4.1.2 > > and -O2. But then an Opteron can behave quite differently than a Duron on > a bench like this ;) Thanks for testing; your numbers are more correct than my first report. My numbers seemed over-optimistic even to me; in fact, I was surprised that the difference was so large. It turns out that I tested a bad version of the code, without the GAVL_FAFTER option set. The code pushed to the web page was the correct one. I did not get to look into this until now because I had a busy day preparing some Linux-based labs at the university. Without the GAVL_FAFTER option, the insert operation fails if an item with the same key is already inserted (an intended feature of the code), and as a result not all items were inserted in the test. GAVL_FAFTER means find/insert after all items with the same key value. The default behavior is to operate on unique keys in the tree and reject duplicates. My results are even worse for GAVL than yours. It would be possible to tweak the code and optimize it further (likely/unlikely, not keeping the last pointer, etc.) for this particular usage. I may try this exercise, but I do not expect the result after tuning to be so much better that it would outweigh the redesign work. I can still see some advantages of AVL, but it has its own drawbacks: the need for a separate height field and slightly worse timing for deletes in the middle. So excuse me for the disturbance. I was only curious how the GAVL code would behave in comparison with other algorithms, and I did not keep my premature enthusiasm under lock.
Best wishes Pavel Pisa ./smart-queue-v-gavl -n 4 gavl_cfs = 279.02 cycles/loop CFS = 200.87 cycles/loop TR = 229.55 cycles/loop CFS = 201.23 cycles/loop gavl_cfs = 276.08 cycles/loop ./smart-queue-v-gavl -n 8 gavl_cfs = 310.92 cycles/loop CFS = 288.45 cycles/loop TR = 192.46 cycles/loop CFS = 284.94 cycles/loop gavl_cfs = 357.02 cycles/loop ./smart-queue-v-gavl -n 16 gavl_cfs = 350.45 cycles/loop CFS = 354.01 cycles/loop TR = 189.79 cycles/loop CFS = 320.08 cycles/loop gavl_cfs = 387.43 cycles/loop ./smart-queue-v-gavl -n 32 gavl_cfs = 419.23 cycles/loop CFS = 406.88 cycles/loop TR = 198.10 cycles/loop CFS = 398.15 cycles/loop gavl_cfs = 412.57 cycles/loop ./smart-queue-v-gavl -n 64 gavl_cfs = 442.81 cycles/loop CFS = 429.62 cycles/loop TR = 235.40 cycles/loop CFS = 389.54 cycles/loop gavl_cfs = 433.56 cycles/loop ./smart-queue-v-gavl -n 128 gavl_cfs = 358.20 cycles/loop CFS = 605.49 cycles/loop TR = 236.01 cycles/loop CFS = 458.50 cycles/loop gavl_cfs = 455.05 cycles/loop ./smart-queue-v-gavl -n 256 gavl_cfs = 529.72 cycles/loop CFS = 530.98 cycles/loop TR = 193.75 cycles/loop CFS = 533.75 cycles/loop gavl_cfs = 471.47 cycles/loop ./smart-queue-v-gavl -n 512 gavl_cfs = 525.80 cycles/loop CFS = 550.63 cycles/loop TR = 188.71 cycles/loop CFS = 549.81 cycles/loop gavl_cfs = 494.73 cycles/loop ./smart-queue-v-gavl -n 1024 gavl_cfs = 544.91 cycles/loop CFS = 561.68 cycles/loop TR = 230.97 cycles/loop CFS = 522.68 cycles/loop gavl_cfs = 542.40 cycles/loop ./smart-queue-v-gavl -n 2048 gavl_cfs = 567.46 cycles/loop CFS = 581.85 cycles/loop TR = 229.69 cycles/loop CFS = 585.41 cycles/loop gavl_cfs = 563.22 cycles/loop ^ permalink raw reply [flat|nested] 713+ messages in thread
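[ Editorial sketch: the difference between the default unique-key insert and the "insert after all items with the same key" behaviour described for GAVL_FAFTER can be illustrated on a plain sorted list. This is not the GAVL API; struct node, enum ins_mode, INS_UNIQUE, INS_AFTER_EQUAL and sorted_insert are hypothetical names chosen for the example, and a linked list replaces the tree only to keep the semantics visible. ]

```c
#include <assert.h>
#include <stddef.h>

struct node {
    long key;
    struct node *next;
};

/* Hypothetical modes mirroring the two behaviours described above. */
enum ins_mode { INS_UNIQUE, INS_AFTER_EQUAL };

/* Returns 0 on success, -1 if a duplicate key is rejected (INS_UNIQUE). */
static int sorted_insert(struct node **head, struct node *n, enum ins_mode mode)
{
    struct node **pp = head;

    assert(n != NULL);
    while (*pp && (*pp)->key < n->key)
        pp = &(*pp)->next;
    if (*pp && (*pp)->key == n->key) {
        if (mode == INS_UNIQUE)
            return -1;              /* duplicate rejected, list unchanged */
        while (*pp && (*pp)->key == n->key)
            pp = &(*pp)->next;      /* skip past all equal keys */
    }
    n->next = *pp;
    *pp = n;
    return 0;
}
```

[ In a benchmark that ignores the return value, the unique mode silently queues fewer items whenever keys collide, which matches the explanation given above for the over-optimistic first results. ]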
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (4 preceding siblings ...) 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:31 ` Willy Tarreau 2007-04-13 23:18 ` Ingo Molnar 2007-04-13 23:07 ` Gabriel C ` (8 subsequent siblings) 14 siblings, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-13 22:31 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] (...) > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). I have a high confidence this will work better : I've been using time-ordered trees in userland projects for several years, and never found anything better. To be honnest, I never understood the concept behind the array switch, but as I never felt brave enough to hack something in this kernel area, I simply preferred to shut up (not enough knowledge and not enough time). However, I have been using a very fast struct timeval-ordered RADIX tree. I found generic rbtree code to generally be slower, certainly because of the call to a function with arguments on every node. Both trees are O(log(n)), the rbtree being balanced and the radix tree being unbalanced. If you're interested, I can try to see how that would fit (but not this week-end). Also, I had spent much time in the past doing paper work on how to improve fairness between interactive tasks and batch tasks. 
I came up with the conclusion that for perfectness, tasks should not be ordered by their expected wakeup time, but by their expected completion time, which automatically takes account of their allocated and used timeslice. It would also allow both types of workloads to share equal CPU time with better responsiveness for the most interactive one through the reallocation of a "credit" for the tasks which have not consumed all of their timeslices. I remember we had discussed this with Mike about one year ago when he fixed lots of problems in mainline scheduler. The downside is that I never found how to make this algo fit in O(log(n)). I always ended in something like O(n.log(n)) IIRC. But maybe this is overkill for real life anyway. Given that a basic two arrays switch (which I never understood) was sufficient for many people, probably that a basic tree will be an order of magnitude better. > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I find this useful, but to be fair with Mike and Con, they both have proposed similar tuning knobs in the past and you said you did not want to add that complexity for admins. People can sometimes be demotivated by seeing their proposals finally used by people who first rejected them. And since both Mike and Con both have done a wonderful job in that area, we need their experience and continued active participation more than ever. 
> due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm very pleased to read this. Because as I have already said it, my major concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the basic problems for me to the point that I switched the default lilo entry on my notebook to 2.6 ! I hope that whatever the next scheduler will be, we'll definitely get rid of any heuristics. Heuristics are good in 95% of situations and extremely bad in the remaining 5%. I prefer something reasonably good in 100% of situations. > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. Will this have any impact on NUMA/HT/multi-core/etc... ? > the core scheduler got smaller by more than 700 lines: Well done ! Cheers, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:18 ` Ingo Molnar 2007-04-14 18:48 ` Bill Huey 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-13 23:18 UTC (permalink / raw) To: Willy Tarreau Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > central tunable: > > > > /proc/sys/kernel/sched_granularity_ns > > > > which can be used to tune the scheduler from 'desktop' (low > > latencies) to 'server' (good batching) workloads. It defaults to a > > setting suitable for desktop workloads. SCHED_BATCH is handled by the > > CFS scheduler module too. > > I find this useful, but to be fair with Mike and Con, they both have > proposed similar tuning knobs in the past and you said you did not > want to add that complexity for admins. [...] yeah. [ Note that what i opposed in the past was mostly the 'export all the zillion of sched.c knobs to /sys and let people mess with them' kind of patches which did exist and still exist. A _single_ knob, which represents basically the totality of parameters within sched_fair.c is less of a problem. I dont think i ever objected to this knob within staircase/SD. (If i did then i was dead wrong.) ] > [...] People can sometimes be demotivated by seeing their proposals > finally used by people who first rejected them. And since both Mike > and Con both have done a wonderful job in that area, we need their > experience and continued active participation more than ever. very much so! 
Both Con and Mike has contributed regularly to upstream sched.c: $ git-log kernel/sched.c | grep 'by: Con Kolivas' 1 | wc -l 19 $ git-log kernel/sched.c | grep 'by: Mike' | wc -l 6 and i'd very much like both counts to increase steadily in the future too :) > > - reworked/sanitized SMP load-balancing: the runqueue-walking > > assumptions are gone from the load-balancing code now, and > > iterators of the scheduling modules are used. The balancing code > > got quite a bit simpler as a result. > > Will this have any impact on NUMA/HT/multi-core/etc... ? it will inevitably have some sort of effect - and if it's negative, i'll try to fix it. I got rid of the explicit cache-hot tracking code and replaced it with a more natural pure 'pick the next-to-run task first, that is likely the most cache-cold one' logic. That just derives naturally from the rbtree approach. > > the core scheduler got smaller by more than 700 lines: > > Well done ! thanks :) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:18 ` Ingo Molnar @ 2007-04-14 18:48 ` Bill Huey 0 siblings, 0 replies; 713+ messages in thread From: Bill Huey @ 2007-04-14 18:48 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote: > very much so! Both Con and Mike has contributed regularly to upstream > sched.c: The problem here is that Con can get demotivated (and rather upset) when an idea gets proposed, like SchedPlug, only to have people be hostile to it and then suddenly turn around and adopt this idea. It gives the impression that you, in this specific case, were more interested in controlling a situation and the track of development instead of actually being inclusive of the development process with discussion and serious consideration, etc... This is how the Linux community can be perceived as elitist. The old guard would serve the community better if people were more mindful and sensitive to developer issues. There was a particular speech that I was turned off by at OLS 2006 that was pretty much pandering to the "old guard's" needs over newer developers. Since I'm a somewhat established engineer in -rt (being the only other person that mapped the lock hierarchy out for full preemptibility), I had the confidence to pretty much ignore it, while previously this could have really upset me and been highly discouraging to a relatively new developer. As Linux gets larger and larger this is going to be an increasing problem when folks come into the community with new ideas and the community will need to change if it intends to integrate these folks.
IMO, a lot of these flame wars wouldn't need to exist if folks listened to each other better and permitted co-ownership of code like the scheduler, since it needs multiple hands in it to adapt to new loads and situations, etc... I'm saying this nicely now since I can be nasty about it. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (5 preceding siblings ...) 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:07 ` Gabriel C 2007-04-13 23:25 ` Ingo Molnar 2007-04-14 2:04 ` Nick Piggin ` (7 subsequent siblings) 14 siblings, 1 reply; 713+ messages in thread From: Gabriel C @ 2007-04-13 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > [...] > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, > Compile error here. ... CC kernel/sched.o kernel/sched.c: In function '__rq_clock': kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c:219: warning: type defaults to 'int' in declaration of '__ret_warn_once' kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'rq_clock': kernel/sched.c:230: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'sched_init': kernel/sched.c:6013: warning: unused variable 'j' make[1]: *** [kernel/sched.o] Error 1 make: *** [kernel] Error 2 ==> ERROR: Build Failed. Aborting... ... There the config : http://frugalware.org/~crazy/other/kernel/config > Ingo > - > > Regards, Gabriel ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:07 ` Gabriel C @ 2007-04-13 23:25 ` Ingo Molnar 2007-04-13 23:39 ` Gabriel C 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-13 23:25 UTC (permalink / raw) To: Gabriel C Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Gabriel C <nix.or.die@googlemail.com> wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions > > are more than welcome, > > Compile error here. ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also updated the full patch at the cfs-scheduler URL) Ingo -----------------------> From: Ingo Molnar <mingo@elte.hu> Subject: [cfs] fix !CONFIG_SMP build fix the !CONFIG_SMP build error reported by Gabriel C Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl return rq->rq_clock; } -static inline unsigned long long rq_clock(struct rq *rq) -{ - int this_cpu = smp_processor_id(); - - if (this_cpu == rq->cpu) - return __rq_clock(rq); - - return rq->rq_clock; -} - static inline int cpu_of(struct rq *rq) { #ifdef CONFIG_SMP @@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline unsigned long long rq_clock(struct rq *rq) +{ + int this_cpu = smp_processor_id(); + + if (this_cpu == cpu_of(rq)) + return __rq_clock(rq); + + return rq->rq_clock; +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. ^ permalink raw reply [flat|nested] 713+ messages in thread
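[ Editorial sketch: the fix above works because the cpu member of struct rq exists only on CONFIG_SMP builds, so callers must go through cpu_of() instead of reading rq->cpu directly. The following user-space model illustrates that pattern; it is not the kernel source. The struct rq here is a two-field mock, and rq_is_local is a hypothetical helper standing in for the comparison inside rq_clock(). ]

```c
#include <assert.h>

/* #define CONFIG_SMP 1 */   /* uncomment to model an SMP build */

/* Mock of the runqueue: the cpu field only exists on SMP builds. */
struct rq {
    unsigned long long rq_clock;
#ifdef CONFIG_SMP
    int cpu;
#endif
};

/* Accessor that hides the conditional field from callers. */
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
    return rq->cpu;
#else
    (void)rq;
    return 0;                /* a uniprocessor build has only CPU 0 */
#endif
}

/* Hypothetical helper: compiles on both configurations via cpu_of(). */
static int rq_is_local(struct rq *rq, int this_cpu)
{
    assert(rq);
    return this_cpu == cpu_of(rq);
}
```

[ Writing "this_cpu == rq->cpu" in rq_is_local would reproduce the reported "'struct rq' has no member named 'cpu'" error whenever CONFIG_SMP is undefined. ]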
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:25 ` Ingo Molnar @ 2007-04-13 23:39 ` Gabriel C 0 siblings, 0 replies; 713+ messages in thread From: Gabriel C @ 2007-04-13 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Gabriel C <nix.or.die@googlemail.com> wrote: > > >>> as usual, any sort of feedback, bugreports, fixes and suggestions >>> are more than welcome, >>> >> Compile error here. >> > > ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also > updated the full patch at the cfs-scheduler URL) > Yes it does , thx :) , only the " warning: unused variable 'j' " left. > Ingo > Regards, Gabriel ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (6 preceding siblings ...) 2007-04-13 23:07 ` Gabriel C @ 2007-04-14 2:04 ` Nick Piggin 2007-04-14 6:32 ` Ingo Molnar 2007-04-14 15:09 ` S.Çağlar Onur ` (6 subsequent siblings) 14 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-14 2:04 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Always good to see another contender ;) > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] I don't know why there is such noise about fairness right now... I thought fairness was one of the fundamental properties of a good CPU scheduler, and my scheduler definitely always aims for that above most other things. Why not just keep SCHED_OTHER? > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. 
Don't really like this, but anyway... > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! I guess the 2.4 and earlier scheduler kind of did that as well. > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] Comment about the code: shouldn't you be requeueing the task in the rbtree wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick look so far). > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. Well, I guess there is still some mechanism to decide which process is most eligible to run? ;) Considering that question has no "right" answer for SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted they are obviously fairly elegant in contrast to the O(1) scheduler ;) > There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns Suppose you have 2 CPU hogs running, is sched_granularity_ns the frequency at which they will context switch? 
> ( another detail: due to nanosec accounting and timeline sorting,
> sched_yield() support is very simple under CFS, and in fact under
> CFS sched_yield() behaves much better than under any other
> scheduler i have tested so far. )

What is better behaviour for sched_yield?

Thanks,
Nick

^ permalink raw reply	[flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14  2:04 ` Nick Piggin
@ 2007-04-14  6:32   ` Ingo Molnar
  2007-04-14  6:43     ` Ingo Molnar
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-14 6:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Nick Piggin <npiggin@suse.de> wrote:

> > The CFS patch uses a completely different approach and implementation
> > from RSDL/SD. My goal was to make CFS's interactivity quality exceed
> > that of RSDL/SD, which is a high standard to meet :-) Testing
> > feedback is welcome to decide this one way or another. [ and, in any
> > case, all of SD's logic could be added via a kernel/sched_sd.c module
> > as well, if Con is interested in such an approach. ]
>
> Comment about the code: shouldn't you be requeueing the task in the
> rbtree wherever you change wait_runtime? eg. task_new_fair? [...]

yes: the task's position within the rbtree is updated wherever
wait_runtime is changed. task_new_fair is the method during new task
creation, but indeed i forgot to requeue the parent. I've fixed this in
my tree (see the delta patch below) - thanks!

	Ingo

----------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [cfs] fix parent's rbtree position

Nick noticed that upon fork we change parent->wait_runtime but we do not
requeue it within the rbtree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq,
 
 	p->wait_runtime = parent->wait_runtime/2;
 	parent->wait_runtime /= 2;
+	requeue_task_fair(rq, parent);
+
 	/*
 	 * For the first timeslice we allow child threads
 	 * to move their parent-inherited fairness back

^ permalink raw reply	[flat|nested] 713+ messages in thread
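Nick's point generalizes to any key-ordered structure: if a task's sort key changes while the task stays queued, the ordering is stale until the task is removed and reinserted. The sketch below illustrates this with a small sorted array rather than an rbtree, purely to keep it short. All toy_* names are hypothetical, and the ordering rule (larger wait_runtime runs first) is an assumption of the sketch, not something stated in the thread.

```c
#include <stddef.h>

#define TOY_MAX_TASKS 16

struct toy_task {
	int pid;
	long long wait_runtime;	/* sort key (assumed: larger runs sooner) */
};

struct toy_timeline {
	struct toy_task *slot[TOY_MAX_TASKS];	/* kept in descending key order */
	size_t nr;
};

/* Insert t at the position its current key demands. Returns 0, or -1 if full. */
static int toy_enqueue(struct toy_timeline *tl, struct toy_task *t)
{
	size_t i;

	if (tl->nr == TOY_MAX_TASKS)
		return -1;
	i = tl->nr;
	while (i > 0 && tl->slot[i - 1]->wait_runtime < t->wait_runtime) {
		tl->slot[i] = tl->slot[i - 1];	/* shift smaller keys down */
		i--;
	}
	tl->slot[i] = t;
	tl->nr++;
	return 0;
}

/* Remove t if it is queued; does nothing otherwise. */
static void toy_dequeue(struct toy_timeline *tl, const struct toy_task *t)
{
	size_t i, j;

	for (i = 0; i < tl->nr; i++) {
		if (tl->slot[i] != t)
			continue;
		for (j = i + 1; j < tl->nr; j++)
			tl->slot[j - 1] = tl->slot[j];
		tl->nr--;
		return;
	}
}

/* Call after changing t->wait_runtime so the position matches the new key. */
static int toy_requeue(struct toy_timeline *tl, struct toy_task *t)
{
	toy_dequeue(tl, t);
	return toy_enqueue(tl, t);
}

/* The head of the array is the next task to run, or NULL if empty. */
static struct toy_task *toy_pick_next(const struct toy_timeline *tl)
{
	return tl->nr ? tl->slot[0] : NULL;
}
```

Halving a queued task's wait_runtime without calling toy_requeue() leaves it at its old position, which mirrors the stale parent position described above.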
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:32 ` Ingo Molnar @ 2007-04-14 6:43 ` Ingo Molnar 2007-04-14 8:08 ` Willy Tarreau 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 6:43 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > Nick noticed that upon fork we change parent->wait_runtime but we do > not requeue it within the rbtree. this fix is not complete - because the child runqueue is locked here, not the parent's. I've fixed this properly in my tree and have uploaded a new sched-modular+cfs.patch. (the effects of the original bug are mostly harmless, the rbtree position gets corrected the first time the parent reschedules. The fix might improve heavy forker handling.) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:43 ` Ingo Molnar @ 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:36 ` Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 8:08 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > not requeue it within the rbtree. > > this fix is not complete - because the child runqueue is locked here, > not the parent's. I've fixed this properly in my tree and have uploaded > a new sched-modular+cfs.patch. (the effects of the original bug are > mostly harmless, the rbtree position gets corrected the first time the > parent reschedules. The fix might improve heavy forker handling.) It looks like it did not reach your public dir yet. BTW, I've given it a try. It seems pretty usable. I have also tried the usual meaningless "glxgears" test with 12 of them at the same time, and they rotate very smoothly, there is absolutely no pause in any of them. But they don't all run at same speed, and top reports their CPU load varying from 3.4 to 10.8%, with what looks like more CPU is assigned to the first processes, and less CPU for the last ones. But this is just a rough observation on a stupid test, I would not call that one scientific in any way (and X has its share in the test too). I'll perform other tests when I can rebuild with your fixed patch. Cheers, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau @ 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 2 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 8:36 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > > not requeue it within the rbtree. > > > > this fix is not complete - because the child runqueue is locked here, > > not the parent's. I've fixed this properly in my tree and have uploaded > > a new sched-modular+cfs.patch. (the effects of the original bug are > > mostly harmless, the rbtree position gets corrected the first time the > > parent reschedules. The fix might improve heavy forker handling.) > > It looks like it did not reach your public dir yet. > > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same time, > and they rotate very smoothly, there is absolutely no pause in any of > them. But they don't all run at same speed, and top reports their CPU > load varying from 3.4 to 10.8%, with what looks like more CPU is > assigned to the first processes, and less CPU for the last ones. But > this is just a rough observation on a stupid test, I would not call > that one scientific in any way (and X has its share in the test too). Follow-up: I think this is mostly X-related. I've started 100 scheddos, and all get the same CPU percentage. 
Interestingly, mpg123 in parallel does never skip at all because it needs quite less than 1% CPU and gets its fair share at a load of 112. Xterms are slow to respond to typing with the 12 gears and 100 scheddos, and expectedly it was X which was starving. renicing it to -5 restores normal feeling with very slow but smooth gear rotations. Leaving X niced at 0 and killing the gears also restores normal behaviour. All in all, it seems logical that processes which serve many others become a bottleneck for them. Forking becomes very slow above a load of 100 it seems. Sometimes, the shell takes 2 or 3 seconds to return to prompt after I run "scheddos &" Those are very promising results, I nearly observe the same responsiveness as I had on a solaris 10 with 10k running processes on a bigger machine. I would be curious what a mysql test result would look like now. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 15:17 ` Mark Lord 2007-04-14 19:48 ` William Lee Irwin III 1 sibling, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 10:53 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, the > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > &" this might be changed/impacted by the parent-requeue fix that is in the updated (for real, promise! ;) patch. Right now on CFS a forking parent shares its own run stats with the child 50%/50%. This means that heavy forkers are indeed penalized. Another logical choice would be 100%/0%: a child has to earn its own right. i kept the 50%/50% rule from the old scheduler, but maybe it's a more pristine (and smaller/faster) approach to just not give new children any stats history to begin with. I've implemented an add-on patch that implements this, you can find it at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch > Those are very promising results, I nearly observe the same > responsiveness as I had on a solaris 10 with 10k running processes on > a bigger machine. cool and thanks for the feedback! (Btw., as another test you could also try to renice "scheddos" to +19. While that does not push the scheduler nearly as hard as nice 0, it is perhaps more indicative of how a truly abusive many-tasks workload would be run in practice.) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
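The two fork policies discussed above can be written down in a few lines. This is a hedged sketch only: the 50%/50% branch follows the two assignments visible in the task_new_fair hunk earlier in the thread, while the other branch is one possible reading of "not give new children any stats history" (child starts at zero, parent keeps everything). The toy_* names are hypothetical and the contents of sched-fair-fork.patch are not shown in this thread.

```c
struct toy_stats {
	long long wait_runtime;
};

enum toy_fork_policy {
	TOY_FORK_SPLIT_50_50,		/* parent shares half with the child */
	TOY_FORK_PARENT_KEEPS_ALL	/* "100%/0%": child earns its own */
};

/* Distribute the parent's accumulated wait_runtime at fork time. */
static void toy_fork_share(struct toy_stats *parent, struct toy_stats *child,
			   enum toy_fork_policy policy)
{
	if (policy == TOY_FORK_SPLIT_50_50) {
		/* same two steps as in the task_new_fair hunk */
		child->wait_runtime = parent->wait_runtime / 2;
		parent->wait_runtime /= 2;
	} else {
		/* assumption: no inherited history, parent is untouched */
		child->wait_runtime = 0;
	}
}
```

Under the split policy every fork halves what the parent has accumulated, which is consistent with the observation that heavy forkers are penalized.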
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau ` (2 more replies) 2007-04-14 15:17 ` Mark Lord 1 sibling, 3 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 13:01 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Forking becomes very slow above a load of 100 it seems. Sometimes, the > > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > > &" > > this might be changed/impacted by the parent-requeue fix that is in the > updated (for real, promise! ;) patch. Right now on CFS a forking parent > shares its own run stats with the child 50%/50%. This means that heavy > forkers are indeed penalized. Another logical choice would be 100%/0%: a > child has to earn its own right. > > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch Not tried yet, it already looks better with the update and sched-fair-hog. Now xterm open "instantly" even with 1000 running processes. > > Those are very promising results, I nearly observe the same > > responsiveness as I had on a solaris 10 with 10k running processes on > > a bigger machine. > > cool and thanks for the feedback! (Btw., as another test you could also > try to renice "scheddos" to +19. While that does not push the scheduler > nearly as hard as nice 0, it is perhaps more indicative of how a truly > abusive many-tasks workload would be run in practice.) Good idea. 
The machine I'm typing from now has 1000 scheddos running at +19, and 12 gears
at nice 0. Top keeps reporting different cpu usages for all gears, but I'm
pretty sure that it's a top artifact now because the cumulated times are
roughly identical :

 14:33:13  up 13 min,  7 users,  load average: 900.30, 443.75, 177.70
1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped
CPU0 states:  56.0% user  43.0% system  23.0% nice   0.0% iowait   0.0% idle
CPU1 states:  94.0% user   5.0% system   0.0% nice   0.0% iowait   0.0% idle
Mem:  1034764k av,  223788k used,  810976k free,       0k shrd,    7192k buff
                    104400k active,              51904k inactive
Swap:  497972k av,       0k used,  497972k free                   68020k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1325 root      20   0 69240 9400  3740 R    27.6  0.9   4:46   1 X
 1412 willy     20   0  6284 2552  1740 R    14.2  0.2   1:09   1 glxgears
 1419 willy     20   0  6256 2384  1612 R    10.7  0.2   1:09   1 glxgears
 1409 willy     20   0  2824 1940   788 R     8.9  0.1   0:25   1 top
 1414 willy     20   0  6280 2544  1728 S     8.9  0.2   1:08   0 glxgears
 1415 willy     20   0  6256 2376  1600 R     8.9  0.2   1:07   1 glxgears
 1417 willy     20   0  6256 2384  1612 S     8.9  0.2   1:05   1 glxgears
 1420 willy     20   0  6284 2552  1740 R     8.9  0.2   1:07   1 glxgears
 1410 willy     20   0  6256 2372  1600 S     7.1  0.2   1:11   1 glxgears
 1413 willy     20   0  6260 2388  1612 S     7.1  0.2   1:08   0 glxgears
 1416 willy     20   0  6284 2544  1728 S     6.2  0.2   1:06   0 glxgears
 1418 willy     20   0  6252 2384  1612 S     6.2  0.2   1:09   0 glxgears
 1411 willy     20   0  6280 2548  1740 S     5.3  0.2   1:15   1 glxgears
 1421 willy     20   0  6280 2536  1728 R     5.3  0.2   1:05   1 glxgears

From time to time, one of the 12 aligned gears will quickly perform a full
quarter of round while others slowly turn by a few degrees. In fact, while
I don't know this process's CPU usage pattern, there's something useful in
it : it allows me to visually see when processes accelerate/decelerate. What
would be best would be just a clock requiring low X resources and eating
vast amounts of CPU between movements.
It will help visually monitor CPU distribution without being too much impacted by X. I've just added another 100 scheddos at nice 0, and the system is still amazingly usable. I just tried exchanging a 1-byte token between 188 "dd" processes which communicate through circular pipes. The context switch rate is rather high but this has no impact on the rest : willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd 
bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 1105 0 1 0 781108 8364 68180 0 0 0 12 5 82187 59 41 0 1114 0 1 0 781108 8364 68180 0 0 0 0 0 81528 58 42 0 1112 0 1 0 781108 8364 68180 0 0 0 0 1 80899 58 42 0 1113 0 1 0 781108 8364 68180 0 0 0 0 26 83466 58 42 0 1106 0 2 0 781108 8376 68168 0 0 0 8 91 83193 58 42 0 1107 0 1 0 781108 8376 68180 0 0 0 4 7 79951 58 42 0 1106 0 1 0 781108 8376 68180 0 0 0 0 46 80939 57 43 0 1114 0 1 0 781108 8376 68180 0 0 0 0 21 82019 56 44 0 1116 0 1 0 781108 8376 68180 0 0 0 0 16 85134 56 44 0 1114 0 3 0 781108 8388 68168 0 0 0 16 20 85871 56 44 0 1112 0 1 0 781108 8388 68168 0 0 0 0 15 80412 57 43 0 1112 0 1 0 781108 8388 68180 0 0 0 0 101 83002 58 42 0 1113 0 1 0 781108 8388 68180 0 0 0 0 25 82230 56 44 0 Playing with the sched_max_hog_history_ns does not seem to change anything. Maybe it's useful for other workloads. Anyway, I have nothing to complain about, because it's not common for me to be able to normally type a mail on a system with more than 1000 running processes ;-) Also, mixed with this load, I have started injecting HTTP requests between two local processes. The load is stable at 7700 req/s (11800 when alone), and what I was interested in is the response time. It's perfectly stable between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Those were varying a lot under stock scheduler, with some sessions sometimes pausing for seconds. (RSDL fixed this though). Well, I'll stop heating the room for now as I get out of ideas about how to defeat it. I'm convinced. I'm impatient to read about Mike's feedback with his workload which behaves strangely on RSDL. 
If it works OK here, it will be the proof that heuristics should not be needed. Congrats ! Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
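The dd pipeline in the message above passes a single byte around a ring of processes, so each hop costs roughly one context switch. For readers who want to reproduce that kind of load without typing the pipeline, here is a rough C sketch of the same idea. It is not Willy's test: the function name toy_token_ring and its parameters are hypothetical, the parent acts as one node of the ring, and error handling is kept to a minimum.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Close every pipe end except the two file descriptors this node uses. */
static void toy_keep_only(int (*p)[2], int n, int rd, int wr)
{
	int i;

	for (i = 0; i < n; i++) {
		if (p[i][0] != rd)
			close(p[i][0]);
		if (p[i][1] != wr)
			close(p[i][1]);
	}
}

/*
 * Pass a 1-byte token `laps` times around a ring of `n` processes (n >= 2).
 * Node i reads from pipe i and writes to pipe (i + 1) % n; the parent is
 * node 0. Returns the number of hops made, or -1 on setup failure.
 */
static long toy_token_ring(int n, int laps)
{
	int (*p)[2];
	int i, lap;
	long hops = 0;
	char tok = 'a';

	if (n < 2)
		return -1;
	p = malloc((size_t)n * sizeof(*p));
	if (!p)
		return -1;
	for (i = 0; i < n; i++)
		if (pipe(p[i]) < 0)
			return -1;	/* sketch: no cleanup on this path */

	for (i = 1; i < n; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;	/* sketch: no cleanup on this path */
		if (pid == 0) {
			int rd = p[i][0], wr = p[(i + 1) % n][1];
			char c;

			toy_keep_only(p, n, rd, wr);
			/* forward bytes until the upstream writer goes away */
			while (read(rd, &c, 1) == 1)
				if (write(wr, &c, 1) != 1)
					break;
			_exit(0);
		}
	}

	/* Parent is node 0: it injects the token and waits for it to return. */
	toy_keep_only(p, n, p[0][0], p[1][1]);
	for (lap = 0; lap < laps; lap++) {
		if (write(p[1][1], &tok, 1) != 1 || read(p[0][0], &tok, 1) != 1)
			break;
		hops += n;	/* one full lap is n hops */
	}

	close(p[1][1]);		/* EOF cascades around the ring, node by node */
	close(p[0][0]);
	while (wait(NULL) > 0)
		;		/* reap all ring nodes */
	free(p);
	return hops;
}
```

A call such as toy_token_ring(188, 100000) would approximate the ring size used above; the context switch rate can then be watched with vmstat as in the original message.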
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:19 ` Ingo Molnar 2007-04-15 7:54 ` Mike Galbraith 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 2 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 13:27 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. Ah, I found something nasty. If I start large batches of processes like this : $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done the ramp up slows down after 700-800 processes, but something very strange happens. If I'm under X, I can switch the focus to all xterms (the WM is still alive) but all xterms are frozen. On the console, after one moment I simply cannot switch to another VT anymore while I can still start commands locally. But "chvt 2" simply blocks. SysRq-K killed everything and restored full control. Dmesg shows lots of : SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. I wonder if part of the problem would be too many processes bound to the same tty :-/ I'll investigate a bit. Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:14 ` Ingo Molnar 2007-04-14 16:19 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 14:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while > I can still start commands locally. But "chvt 2" simply blocks. > SysRq-K killed everything and restored full control. Dmesg shows lots > of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ Does not seem easy to reproduce, it looks like some resource pools are kept pre-allocated after a first run, because if I kill scheddos during the ramp up then start it again, it can go further. The problem happens when the parent is forking. Also, I modified scheddos to close(0,1,2) and to perform the forks itself and it does not cause any problem, even with 4000 processes running. So I really suspect that the problem I encountered above was tty-related. BTW, I've tried your fork patch. 
It definitely helps forking because it takes below one second to create 4000 processes, then the load slowly increases. As you said, the children have to earn their share, and I find that it makes it easier to conserve control of the whole system's stability. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 14:45 ` Willy Tarreau
@ 2007-04-14 16:14   ` Ingo Molnar
  0 siblings, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:14 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Willy Tarreau <w@1wt.eu> wrote:

> BTW, I've tried your fork patch. It definitely helps forking because
> it takes below one second to create 4000 processes, then the load
> slowly increases. As you said, the children have to earn their share,
> and I find that it makes it easier to conserve control of the whole
> system's stability.

ok, thanks for testing this out, i think i'll integrate this one back
into the core. (I'm still unsure about the cpu-hog one.) And it saves
some code-size too:

   text    data     bss     dec     hex filename
  23349    2705      24   26078    65de kernel/sched.o.cfs-v1
  23189    2705      24   25918    653e kernel/sched.o.cfs-before
  23052    2705      24   25781    64b5 kernel/sched.o.cfs-after

  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40

	Ingo

^ permalink raw reply	[flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau @ 2007-04-14 16:19 ` Ingo Molnar 2007-04-14 17:15 ` Eric W. Biederman 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 16:19 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Eric W. Biederman, Jiri Slaby, Alan Cox * Willy Tarreau <w@1wt.eu> wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while I > can still start commands locally. But "chvt 2" simply blocks. SysRq-K > killed everything and restored full control. Dmesg shows lots of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), maybe this description rings a bell with them? Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 16:19 ` Ingo Molnar @ 2007-04-14 17:15 ` Eric W. Biederman 2007-04-14 17:29 ` Willy Tarreau 0 siblings, 1 reply; 713+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:15 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Ingo Molnar <mingo@elte.hu> writes: > * Willy Tarreau <w@1wt.eu> wrote: > >> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: >> > >> > Well, I'll stop heating the room for now as I get out of ideas about how >> > to defeat it. >> >> Ah, I found something nasty. >> If I start large batches of processes like this : >> >> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done >> >> the ramp up slows down after 700-800 processes, but something very >> strange happens. If I'm under X, I can switch the focus to all xterms >> (the WM is still alive) but all xterms are frozen. On the console, >> after one moment I simply cannot switch to another VT anymore while I >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> killed everything and restored full control. Dmesg shows lots of : > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. This. Yes. SAK is noisy and tells you everything it kills. >> I wonder if part of the problem would be too many processes bound to >> the same tty :-/ > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > maybe this description rings a bell with them? Is there any swapping going on? I'm inclined to suspect that it is a problem that has more to do with the number of processes and has nothing to do with ttys. Anyway you can easily rule out ttys by having your startup program detach from a controlling tty before you start everything. 
I'm more inclined to guess something is reading /proc a lot, or doing something that holds the tasklist lock, a lot or something like that, if the problem isn't that you are being kicked into swap. Eric ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:15 ` Eric W. Biederman @ 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:50 ` Linus Torvalds 0 siblings, 2 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 17:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Eric, [...] > >> the ramp up slows down after 700-800 processes, but something very > >> strange happens. If I'm under X, I can switch the focus to all xterms > >> (the WM is still alive) but all xterms are frozen. On the console, > >> after one moment I simply cannot switch to another VT anymore while I > >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K > >> killed everything and restored full control. Dmesg shows lots of : > > > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > This. Yes. SAK is noisy and tells you everything it kills. OK, that's what I suspected, but I did not know if the fact that it talked about the session was systematic or related to any particular state when it killed the task. > >> I wonder if part of the problem would be too many processes bound to > >> the same tty :-/ > > > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > > maybe this description rings a bell with them? > > Is there any swapping going on? Not at all. > I'm inclined to suspect that it is a problem that has more to do with the > number of processes and has nothing to do with ttys. It is clearly possible. What I found strange is that I could still fork processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. It first happened under X with frozen xterms but a perfectly usable WM, then I reproduced it on pure console to rule out any potential X problem. 
> Anyway you can easily rule out ttys by having your startup program > detach from a controlling tty before you start everything. > > I'm more inclined to guess something is reading /proc a lot, or doing > something that holds the tasklist lock, a lot or something like that, > if the problem isn't that you are being kicked into swap. Oh I'm sorry you were invited into the discussion without a first description of the context. I was giving a try to Ingo's new scheduler, and trying to reach corner cases with lots of processes competing for CPU. I simply used a "for" loop in bash to fork 1000 processes, and this problem happened between 700-800 children. The program only uses a busy loop and a pause. I then changed my program to close 0,1,2 and perform the fork itself, and the problem vanished. So there are two differences here : - bash not forking anymore - far less FDs on /dev/tty1 At first, I had around 2200 fds on /dev/tty1, reason why I suspected something in this area. I agree that this is not normal usage at all, I'm just trying to attack Ingo's scheduler to ensure it is more robust than the stock one. But sometimes brute force methods can make other sleeping problems pop up. Thinking about it, I don't know if there are calls to schedule() while switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2" simply blocked. It would have been possible that a schedule() call somewhere got starved due to the load, I don't know. Thanks, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau @ 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 1 reply; 713+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:44 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > Hi Eric, > > [...] >> >> the ramp up slows down after 700-800 processes, but something very >> >> strange happens. If I'm under X, I can switch the focus to all xterms >> >> (the WM is still alive) but all xterms are frozen. On the console, >> >> after one moment I simply cannot switch to another VT anymore while I >> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> >> killed everything and restored full control. Dmesg shows lots of : >> > >> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. >> >> This. Yes. SAK is noisy and tells you everything it kills. > > OK, that's what I suspected, but I did not know if the fact that it talked > about the session was systematic or related to any particular state when it > killed the task. > >> >> I wonder if part of the problem would be too many processes bound to >> >> the same tty :-/ >> > >> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), >> > maybe this description rings a bell with them? >> >> Is there any swapping going on? > > Not at all. > >> I'm inclined to suspect that it is a problem that has more to do with the >> number of processes and has nothing to do with ttys. > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. 
> It first happened under X with frozen xterms but a perfectly usable WM,
> then I reproduced it on pure console to rule out any potential X problem.
>
>> Anyway you can easily rule out ttys by having your startup program
>> detach from a controlling tty before you start everything.
>>
>> I'm more inclined to guess something is reading /proc a lot, or doing
>> something that holds the tasklist lock, a lot or something like that,
>> if the problem isn't that you are being kicked into swap.
>
> Oh I'm sorry you were invited into the discussion without a first description
> of the context. I was giving a try to Ingo's new scheduler, and trying to
> reach corner cases with lots of processes competing for CPU.
>
> I simply used a "for" loop in bash to fork 1000 processes, and this problem
> happened between 700-800 children. The program only uses a busy loop and a
> pause. I then changed my program to close 0,1,2 and perform the fork itself,
> and the problem vanished. So there are two differences here :
>
> - bash not forking anymore
> - far less FDs on /dev/tty1

Yes. But with /dev/tty1 being the controlling terminal in both cases, as you haven't dropped your session, or disassociated your tty. The bash problem may have something to do with setpgid or scheduling effects. Hmm. I just looked and setpgid does grab the tasklist lock for writing so we may possibly have some contention there.

> At first, I had around 2200 fds on /dev/tty1, reason why I suspected something
> in this area.
>
> I agree that this is not normal usage at all, I'm just trying to attack
> Ingo's scheduler to ensure it is more robust than the stock one. But
> sometimes brute force methods can make other sleeping problems pop up.

Yep. If we can narrow it down to one that would be interesting. Of course that also means when we start finding other possibly sleeping problems people are working in areas of code they don't normally touch, so we must investigate.
> Thinking about it, I don't know if there are calls to schedule() while > switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2" > simply blocked. It would have been possible that a schedule() call > somewhere got starved due to the load, I don't know. It looks like there is a call to schedule_work. There are two pieces of the path. If you are switching in and out of a tty controlled by something like X. User space has to grant permission before the operation happens. Where there isn't a gate keeper I know it is cheaper but I don't know by how much, I suspect there is still a schedule happening in there. Eric ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 18:18 ` Willy Tarreau 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 17:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Eric W. Biederman <ebiederm@xmission.com> wrote: > > Thinking about it, I don't know if there are calls to schedule() > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > "chvt 2" simply blocked. It would have been possible that a > > schedule() call somewhere got starved due to the load, I don't know. > > It looks like there is a call to schedule_work. so this goes over keventd, right? > There are two pieces of the path. If you are switching in and out of a > tty controlled by something like X. User space has to grant > permission before the operation happens. Where there isn't a gate > keeper I know it is cheaper but I don't know by how much, I suspect > there is still a schedule happening in there. Could keventd perhaps be starved? Willy, to exclude this possibility, could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then the command to set it to SCHED_FIFO:50 would be: chrt -f -p 50 5 but ... events/0 is reniced to -5 by default, so it should definitely not be starved. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
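For reference, the chrt invocation above can also be expressed as a direct system call. The following is only a minimal sketch, assuming a Linux system with glibc; the helper names fifo_param() and set_fifo() are made up for this illustration, and the call needs root (or CAP_SYS_NICE) to succeed.

```c
#include <assert.h>
#include <sched.h>
#include <sys/types.h>

/* Build a SCHED_FIFO parameter block, clamping the requested priority
 * to the range the running kernel reports for SCHED_FIFO.
 * (Hypothetical helper, not part of any patch in this thread.) */
struct sched_param fifo_param(int prio)
{
	struct sched_param sp = { 0 };
	int lo = sched_get_priority_min(SCHED_FIFO);
	int hi = sched_get_priority_max(SCHED_FIFO);

	/* Both calls return -1 only if the policy is unknown. */
	assert(lo != -1 && hi != -1);

	if (prio < lo)
		prio = lo;
	if (prio > hi)
		prio = hi;
	sp.sched_priority = prio;
	return sp;
}

/* Rough equivalent of "chrt -f -p <prio> <pid>".
 * Returns 0 on success, -1 on failure with errno set. */
int set_fifo(pid_t pid, int prio)
{
	struct sched_param sp = fifo_param(prio);

	return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

With these helpers, set_fifo(5, 50) would correspond to the command in the mail, under the same assumption that events/0 is PID 5.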
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:54 ` Ingo Molnar @ 2007-04-14 18:18 ` Willy Tarreau 2007-04-14 18:40 ` Eric W. Biederman 2007-04-15 17:55 ` Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 18:18 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > > * Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Thinking about it, I don't know if there are calls to schedule() > > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > > "chvt 2" simply blocked. It would have been possible that a > > > schedule() call somewhere got starved due to the load, I don't know. > > > > It looks like there is a call to schedule_work. > > so this goes over keventd, right? > > > There are two pieces of the path. If you are switching in and out of a > > tty controlled by something like X. User space has to grant > > permission before the operation happens. Where there isn't a gate > > keeper I know it is cheaper but I don't know by how much, I suspect > > there is still a schedule happening in there. > > Could keventd perhaps be starved? Willy, to exclude this possibility, > could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > the command to set it to SCHED_FIFO:50 would be: > > chrt -f -p 50 5 > > but ... events/0 is reniced to -5 by default, so it should definitely > not be starved. Well, since I merged the fair-fork patch, I cannot reproduce (in fact, bash forks 1000 processes, then progressively execs scheddos, but it takes some time). So I'm rebuilding right now. But I think that Linus has an interesting clue about GPM and notification before switching the terminal. I think it was enabled in console mode. 
I don't know how that translates to frozen xterms, but let's attack the problems one at a time. Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:18 ` Willy Tarreau @ 2007-04-14 18:40 ` Eric W. Biederman 2007-04-14 19:01 ` Willy Tarreau 2007-04-15 17:55 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Eric W. Biederman @ 2007-04-14 18:40 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: >> >> * Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> > > Thinking about it, I don't know if there are calls to schedule() >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and >> > > "chvt 2" simply blocked. It would have been possible that a >> > > schedule() call somewhere got starved due to the load, I don't know. >> > >> > It looks like there is a call to schedule_work. >> >> so this goes over keventd, right? >> >> > There are two pieces of the path. If you are switching in and out of a >> > tty controlled by something like X. User space has to grant >> > permission before the operation happens. Where there isn't a gate >> > keeper I know it is cheaper but I don't know by how much, I suspect >> > there is still a schedule happening in there. >> >> Could keventd perhaps be starved? Willy, to exclude this possibility, >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then >> the command to set it to SCHED_FIFO:50 would be: >> >> chrt -f -p 50 5 >> >> but ... events/0 is reniced to -5 by default, so it should definitely >> not be starved. > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > bash forks 1000 processes, then progressively execs scheddos, but it > takes some time). So I'm rebuilding right now. 
But I think that Linus > has an interesting clue about GPM and notification before switching > the terminal. I think it was enabled in console mode. I don't know > how that translates to frozen xterms, but let's attack the problems > one at a time. I think it is a good clue. However the intention of the mechanism is that only processes that change the video mode on a VT are supposed to use it. So I really don't think gpm is the culprit. However it easily could be something else that has similar characteristics. I just realized we do have proof that schedule_work is actually working because SAK works, and we can't sanely do SAK from interrupt context so we call schedule work. Eric ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:40 ` Eric W. Biederman @ 2007-04-14 19:01 ` Willy Tarreau 0 siblings, 0 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-14 19:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote: > Willy Tarreau <w@1wt.eu> writes: > > > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > >> > >> * Eric W. Biederman <ebiederm@xmission.com> wrote: > >> > >> > > Thinking about it, I don't know if there are calls to schedule() > >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > >> > > "chvt 2" simply blocked. It would have been possible that a > >> > > schedule() call somewhere got starved due to the load, I don't know. > >> > > >> > It looks like there is a call to schedule_work. > >> > >> so this goes over keventd, right? > >> > >> > There are two pieces of the path. If you are switching in and out of a > >> > tty controlled by something like X. User space has to grant > >> > permission before the operation happens. Where there isn't a gate > >> > keeper I know it is cheaper but I don't know by how much, I suspect > >> > there is still a schedule happening in there. > >> > >> Could keventd perhaps be starved? Willy, to exclude this possibility, > >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > >> the command to set it to SCHED_FIFO:50 would be: > >> > >> chrt -f -p 50 5 > >> > >> but ... events/0 is reniced to -5 by default, so it should definitely > >> not be starved. > > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. 
But I think that Linus
> > has an interesting clue about GPM and notification before switching
> > the terminal. I think it was enabled in console mode. I don't know
> > how that translates to frozen xterms, but let's attack the problems
> > one at a time.
>
> I think it is a good clue. However the intention of the mechanism is
> that only processes that change the video mode on a VT are supposed to
> use it. So I really don't think gpm is the culprit. However it easily could
> be something else that has similar characteristics.
>
> I just realized we do have proof that schedule_work is actually working
> because SAK works, and we can't sanely do SAK from interrupt context
> so we call schedule work.

Eric, I can say that Linus, Ingo and you all got on the right track.

I could reproduce, I got a hung tty around 1400 running processes. Fortunately, it was the one with the root shell which was reniced to -19. I could strace chvt 2 :

20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000>
20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002>
20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000>
20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...>

Then I applied Ingo's suggestion about changing keventd prio :

root@pcw:~# ps auxw|grep event
root         8  0.0  0.0      0     0 ?        SW<  20:31   0:00 [events/0]
root         9  0.0  0.0      0     0 ?        RW<  20:31   0:00 [events/1]
root@pcw:~# rtprio -s 1 -p 50 8 9   (I don't have chrt but it does the same)

My VT immediately switched as soon as I hit Enter. Everything's working fine again now. So the good news is that it's not a bug in the tty code, nor a deadlock.

Now, maybe keventd should get a higher prio ? It seems worrying to me that it may starve when it seems so sensitive.

Also, that may explain why I couldn't reproduce with the fork patch. Since all new processes got no runtime at first, their impact on existing ones must have been lower.
But I think that if I had waited longer, I would have had the problem again (though I did not see it even under a load of 7800). Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
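The two ioctls visible in the strace output are the core of a console switch. As a rough illustration, here is a minimal sketch assuming the Linux console ioctls from <linux/vt.h>; the function name switch_vt() is invented for this example and is not chvt's actual source.

```c
#include <assert.h>
#include <linux/vt.h>
#include <sys/ioctl.h>

/* Ask the console driver to switch to virtual terminal number vt and
 * then block until that VT is active. fd must refer to a console
 * device. Returns 0 on success, -1 on error with errno set. */
int switch_vt(int fd, int vt)
{
	/* VT numbers start at 1 (tty1). */
	assert(vt >= 1);

	/* Request the switch; this returned immediately in the trace. */
	if (ioctl(fd, VT_ACTIVATE, vt) < 0)
		return -1;

	/* Wait for the switch to complete; this is the call that was
	 * still unfinished in the trace. */
	if (ioctl(fd, VT_WAITACTIVE, vt) < 0)
		return -1;

	return 0;
}
```

A caller would open a console device such as /dev/tty first and pass the resulting descriptor; on a descriptor that is not a console the first ioctl simply fails.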
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:18 ` Willy Tarreau 2007-04-14 18:40 ` Eric W. Biederman @ 2007-04-15 17:55 ` Ingo Molnar 2007-04-15 18:06 ` Willy Tarreau 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 17:55 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Willy Tarreau <w@1wt.eu> wrote: > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > bash forks 1000 processes, then progressively execs scheddos, but it > takes some time). So I'm rebuilding right now. But I think that Linus > has an interesting clue about GPM and notification before switching > the terminal. I think it was enabled in console mode. I don't know how > that translates to frozen xterms, but let's attack the problems one at > a time. to debug this, could you try to apply this add-on as well: http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch with this patch applied you should have a /proc/sched_debug file that prints all runnable tasks and other interesting info from the runqueue. [ i've refreshed all the patches on the CFS webpage, so if this doesnt apply cleanly to your current tree then you'll probably have to refresh one of the patches.] 
The output should look like this:

 Sched Debug Version: v0.01
 now at 226761724575 nsecs

 cpu: 0
   .nr_running            : 3
   .raw_weighted_load     : 384
   .nr_switches           : 13666
   .nr_uninterruptible    : 0
   .next_balance          : 4294947416
   .curr->pid             : 2179
   .rq_clock              : 241337421233
   .fair_clock            : 7503791206
   .wait_runtime          : 2269918379

 runnable tasks:
            task |  PID |   tree-key |   -delta |  waiting | switches
 -----------------------------------------------------------------
 +           cat   2179   7501930066   -1861140    1861140          2
     loop_silent   2149   7503010354    -780852          0        911
     loop_silent   2148   7503510048    -281158     280753        918

now for your workload the list should be considerably larger. If there's starvation going on then the 'switches' field (number of context switches) of one of the tasks would never increase while you have this 'cannot switch consoles' problem.

maybe you'll have to unapply the fair-fork patch to make it trigger again. (fair-fork does not fix anything, so it probably just hides a real bug.)

(i'm meanwhile busy running your scheddos utilities to reproduce it locally as well :)

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
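One way to apply the 'switches' check described above is to take two snapshots of the task table some seconds apart and look for tasks whose count did not move. The following is a minimal sketch, assuming the rows have already been parsed into the hypothetical struct task_sample below; parsing /proc/sched_debug itself is left out, since its layout may differ between versions of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the "runnable tasks" table, reduced to the fields used
 * here. (Hypothetical type, made up for this illustration.) */
struct task_sample {
	int pid;
	unsigned long switches;	/* value of the 'switches' column */
};

/* Compare two snapshots that list the same tasks in the same order.
 * Stores the PIDs whose switch count did not change into stalled[]
 * (which must have room for n entries) and returns how many there
 * were. */
size_t find_stalled(const struct task_sample *before,
		    const struct task_sample *after,
		    size_t n, int *stalled)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		/* The sketch does not handle tasks appearing or exiting. */
		assert(before[i].pid == after[i].pid);
		if (after[i].switches == before[i].switches)
			stalled[count++] = before[i].pid;
	}
	return count;
}
```

An unchanged count is only a hint: a task that kept the CPU for the whole interval (for example the one matching .curr->pid) would also show no new context switches, so the result still needs a look by hand.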
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:55 ` Ingo Molnar @ 2007-04-15 18:06 ` Willy Tarreau 2007-04-15 19:20 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-15 18:06 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Ingo, On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. But I think that Linus > > has an interesting clue about GPM and notification before switching > > the terminal. I think it was enabled in console mode. I don't know how > > that translates to frozen xterms, but let's attack the problems one at > > a time. > > to debug this, could you try to apply this add-on as well: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch > > with this patch applied you should have a /proc/sched_debug file that > prints all runnable tasks and other interesting info from the runqueue. I don't know if you have seen my mail from yesterday evening (here). I found that changing keventd prio fixed the problem. You may be interested in the description. I sent it at 21:01 (+200). > [ i've refreshed all the patches on the CFS webpage, so if this doesnt > apply cleanly to your current tree then you'll probably have to > refresh one of the patches.] Fine, I'll have a look. I already had to rediff the sched-fair-fork patch last time. 
> The output should look like this:
>
>  Sched Debug Version: v0.01
>  now at 226761724575 nsecs
>
>  cpu: 0
>    .nr_running            : 3
>    .raw_weighted_load     : 384
>    .nr_switches           : 13666
>    .nr_uninterruptible    : 0
>    .next_balance          : 4294947416
>    .curr->pid             : 2179
>    .rq_clock              : 241337421233
>    .fair_clock            : 7503791206
>    .wait_runtime          : 2269918379
>
>  runnable tasks:
>             task |  PID |   tree-key |   -delta |  waiting | switches
>  -----------------------------------------------------------------
>  +           cat   2179   7501930066   -1861140    1861140          2
>      loop_silent   2149   7503010354    -780852          0        911
>      loop_silent   2148   7503510048    -281158     280753        918

Nice.

> now for your workload the list should be considerably larger. If there's
> starvation going on then the 'switches' field (number of context
> switches) of one of the tasks would never increase while you have this
> 'cannot switch consoles' problem.
>
> maybe you'll have to unapply the fair-fork patch to make it trigger
> again. (fair-fork does not fix anything, so it probably just hides a
> real bug.)
>
> (i'm meanwhile busy running your scheddos utilities to reproduce it
> locally as well :)

I discovered I had the frame-buffer enabled (I did not notice it first because I do not have the logo and the resolution is the same as text). It's matroxfb with a G400, if that can help. It may be possible that it needs some CPU that it cannot get to clear the display before switching, I don't know.

However I won't try this right now, I'm deep in userland at the moment.

Regards,
Willy

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 18:06 ` Willy Tarreau @ 2007-04-15 19:20 ` Ingo Molnar 2007-04-15 19:35 ` William Lee Irwin III 2007-04-15 19:37 ` Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 19:20 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Willy Tarreau <w@1wt.eu> wrote: > > to debug this, could you try to apply this add-on as well: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch > > > > with this patch applied you should have a /proc/sched_debug file > > that prints all runnable tasks and other interesting info from the > > runqueue. > > I don't know if you have seen my mail from yesterday evening (here). I > found that changing keventd prio fixed the problem. You may be > interested in the description. I sent it at 21:01 (+200). ah, indeed i missed that mail - the response to the patches was quite overwhelming (and i naively thought people dont do Linux hacking over the weekends anymore ;). so Linus was right: this was caused by scheduler starvation. I can see one immediate problem already: the 'nice offset' is not divided by nr_running as it should. The patch below should fix this but i have yet to test it accurately, this change might as well render nice levels unacceptably ineffective under high loads. 
	Ingo

--------->
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / rq->nr_running;
 
 	p->fair_key = key;
 

^ permalink raw reply [flat|nested] 713+ messages in thread
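To see what the change does numerically, the two key computations can be compared side by side in userspace. This is only a sketch: struct rq_sketch and struct task_sketch are reduced stand-ins invented here, not the kernel's structures, and the function names are hypothetical.

```c
#include <assert.h>

/* Reduced stand-ins for the fields the patch touches. */
struct rq_sketch {
	long long fair_clock;
	unsigned long nr_running;
};

struct task_sketch {
	long long wait_runtime;
	long long nice_offset;
};

/* Key as computed before the patch: the full nice offset is added
 * regardless of how many tasks are runnable. */
long long fair_key_old(const struct rq_sketch *rq,
		       const struct task_sketch *p)
{
	return rq->fair_clock - p->wait_runtime + p->nice_offset;
}

/* Key as computed after the patch: the nice offset is divided by the
 * number of runnable tasks before it is added. */
long long fair_key_new(const struct rq_sketch *rq,
		       const struct task_sketch *p)
{
	long long key = rq->fair_clock - p->wait_runtime;

	if (p->nice_offset) {
		/* A task being enqueued implies a non-empty runqueue
		 * in this sketch; guard the division anyway. */
		assert(rq->nr_running > 0);
		key += p->nice_offset / (long long)rq->nr_running;
	}
	return key;
}
```

Taking fair_clock = 7503791206 and wait_runtime = 1861140 from the sample debug output, with a made-up nice_offset of 5000000 and 1000 runnable tasks, the old formula moves the key by the full 5000000 while the new one moves it by 5000. That shrinking offset is the effect the mail warns may make nice levels too weak under high load.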
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:20 ` Ingo Molnar @ 2007-04-15 19:35 ` William Lee Irwin III 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 19:37 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 19:35 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > so Linus was right: this was caused by scheduler starvation. I can see > one immediate problem already: the 'nice offset' is not divided by > nr_running as it should. The patch below should fix this but i have yet > to test it accurately, this change might as well render nice levels > unacceptably ineffective under high loads. I've been suggesting testing CPU bandwidth allocation as influenced by nice numbers for a while now for a reason. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:35 ` William Lee Irwin III @ 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 23:54 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 19:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > > so Linus was right: this was caused by scheduler starvation. I can > > see one immediate problem already: the 'nice offset' is not divided > > by nr_running as it should. The patch below should fix this but i > > have yet to test it accurately, this change might as well render > > nice levels unacceptably ineffective under high loads. > > I've been suggesting testing CPU bandwidth allocation as influenced by > nice numbers for a while now for a reason. Oh I was very much testing "CPU bandwidth allocation as influenced by nice numbers" - it's one of the basic things i do when modifying the scheduler. An automated tool, while nice (all automation is nice) wouldnt necessarily show such bugs though, because here too it needed thousands of running tasks to trigger in practice. Any volunteers? ;) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:57 ` Ingo Molnar @ 2007-04-15 23:54 ` William Lee Irwin III 2007-04-16 11:24 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> I've been suggesting testing CPU bandwidth allocation as influenced by >> nice numbers for a while now for a reason. On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > Oh I was very much testing "CPU bandwidth allocation as influenced by > nice numbers" - it's one of the basic things i do when modifying the > scheduler. An automated tool, while nice (all automation is nice) > wouldnt necessarily show such bugs though, because here too it needed > thousands of running tasks to trigger in practice. Any volunteers? ;) Worse comes to worse I might actually get around to doing it myself. Any more detailed descriptions of the test for a rainy day? -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:54 ` William Lee Irwin III @ 2007-04-16 11:24 ` Ingo Molnar 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-16 11:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > > Oh I was very much testing "CPU bandwidth allocation as influenced by > > nice numbers" - it's one of the basic things i do when modifying the > > scheduler. An automated tool, while nice (all automation is nice) > > wouldnt necessarily show such bugs though, because here too it needed > > thousands of running tasks to trigger in practice. Any volunteers? ;) > > Worse comes to worse I might actually get around to doing it myself. > Any more detailed descriptions of the test for a rainy day? the main complication here is that the handling of nice levels is still typically a 2nd or 3rd degree design factor when writing schedulers. The reason isnt carelessness, the reason is simply that users typically only care about a single nice level: the one that all tasks run under by default. Also, often there's just one or two good ways to attack the problem within a given scheduler approach and the quality of nice levels often suffers under other, more important design factors like performance. This means that for example for the vanilla scheduler the distribution of CPU power depends on HZ, on the number of tasks and on the scheduling pattern. The distribution of CPU power amongst nice levels is basically a function of _everything_. That makes any automated test pretty challenging. 
Both with SD and with CFS there's a good chance to actually formalize the meaning of nice levels, but i'd not go as far as to mandate any particular behavior by rigidly saying "pass this automated tool, else ...", other than "make nice levels reasonable". All the other more formal CPU resource limitation techniques are then a matter of CKRM-alike patches, which offer much more fine-grained mechanisms than pure nice levels anyway.

so to answer your question: it's pretty much freely defined. Make up your mind about it and figure out the ways how people use nice levels and think about which aspects of that experience are worth testing for intelligently.

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:24 ` Ingo Molnar @ 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-16 13:46 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> Worse comes to worse I might actually get around to doing it myself. >> Any more detailed descriptions of the test for a rainy day? On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > the main complication here is that the handling of nice levels is still > typically a 2nd or 3rd degree design factor when writing schedulers. The > reason isnt carelessness, the reason is simply that users typically only > care about a single nice level: the one that all tasks run under by > default. I'm a bit unconvinced here. Support for prioritization is a major scheduler feature IMHO. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > Also, often there's just one or two good ways to attack the problem > within a given scheduler approach and the quality of nice levels often > suffers under other, more important design factors like performance. > This means that for example for the vanilla scheduler the distribution > of CPU power depends on HZ, on the number of tasks and on the scheduling > pattern. The distribution of CPU power amongst nice levels is basically > a function of _everything_. That makes any automated test pretty > challenging. Both with SD and with CFS there's a good chance to actually > formalize the meaning of nice levels, but i'd not go as far as to > mandate any particular behavior by rigidly saying "pass this automated > tool, else ...", other than "make nice levels resonable". 
All the other > more formal CPU resource limitation techniques are then a matter of > CKRM-like patches, which offer much more fine-grained mechanisms than > pure nice levels anyway. Some of the issues with respect to the number of tasks and scheduling patterns can be made part of the test; one can furthermore insist that the system be quiescent in a variety of ways. I'm not convinced that formalization of nice levels is a bad idea. They're the standard UNIX prioritization facility, and they should work with some definite value of "work." Even supposing one doesn't care to bolt down the semantics of nice levels, there should at least be some awareness of what those semantics are and when and how they're changing. So in that respect a test for CPU bandwidth distribution according to nice level remains valuable even supposing that the semantics aren't required to be rigidly fixed. As far as CKRM goes, I'm wild about it. I wish things would get in shape to be merged (if they're not already) and merged ASAP on that front. I think with so much agreement in concept we can work with changing out implementations as needed with it sitting in mainline once the user API/ABI is decided upon, and I think it already is. I'm not entirely convinced CKRM answers this, though. If the scheduler can't support nice levels, how is it supposed to support prioritization or CPU bandwidth allocation according to CKRM configurations? I'm relatively certain schedulers must be able to support prioritization with deterministic CPU bandwidth as essential functionality. This is, of course, not to say my certainty about things sets the standards for what testcases are considered meaningful and valid. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > so to answer your question: it's pretty much freely defined. Make up > your mind about it and figure out how people use nice levels > and think about which aspects of that experience are worth testing for > intelligently.
Looking for usage cases is a good idea; I'll do that before coding any testcase for nice semantics. -- wli
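A test for CPU bandwidth distribution according to nice level, as discussed above, could start from something like the following. This is only a rough sketch under stated assumptions and not a tool anyone in the thread wrote: the names `burn_at_nice`, `cpu_seconds` and `measure_nice_share` are hypothetical, both tasks are pinned to CPU 0 on a best-effort basis so that they actually compete, and the result is merely reported, since no particular ratio is mandated anywhere in this discussion.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic wall-clock time in seconds. */
double mono_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Child body (hypothetical helper): pin to CPU 0, set the nice level,
 * burn CPU for the given wall-clock duration, then exit.
 * Both setup calls are best effort; failures are ignored. */
void burn_at_nice(int nice_level, int seconds)
{
    cpu_set_t cpus;
    double end;

    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);
    (void)sched_setaffinity(0, sizeof(cpus), &cpus);
    (void)setpriority(PRIO_PROCESS, 0, nice_level);

    end = mono_seconds() + seconds;
    while (mono_seconds() < end)
        ;
    _exit(0);
}

/* User + system CPU time of a reaped child, in seconds. */
double cpu_seconds(const struct rusage *ru)
{
    return ru->ru_utime.tv_sec + ru->ru_utime.tv_usec / 1e6 +
           ru->ru_stime.tv_sec + ru->ru_stime.tv_usec / 1e6;
}

/* Run two CPU burners at nice_a and nice_b for 'seconds' and return the
 * fraction of their combined CPU time that the first one received,
 * or -1.0 on error. */
double measure_nice_share(int nice_a, int nice_b, int seconds)
{
    struct rusage ru_a, ru_b;
    double ta, tb;
    pid_t a, b;

    a = fork();
    if (a < 0)
        return -1.0;
    if (a == 0)
        burn_at_nice(nice_a, seconds);

    b = fork();
    if (b < 0) {
        kill(a, SIGKILL);
        waitpid(a, NULL, 0);
        return -1.0;
    }
    if (b == 0)
        burn_at_nice(nice_b, seconds);

    if (wait4(a, NULL, 0, &ru_a) < 0 || wait4(b, NULL, 0, &ru_b) < 0)
        return -1.0;

    ta = cpu_seconds(&ru_a);
    tb = cpu_seconds(&ru_b);
    if (ta + tb <= 0.0)
        return -1.0;
    return ta / (ta + tb);
}
```

Calling `measure_nice_share(0, 10, 5)` on an otherwise quiescent box gives one data point; repeating it across nice pairs and task counts would show how the distribution changes between scheduler versions. If the CPU pinning fails (for example inside a restricted container) on a multi-core machine, the two tasks do not compete and the share tends towards one half.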
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:20 ` Ingo Molnar 2007-04-15 19:35 ` William Lee Irwin III @ 2007-04-15 19:37 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 19:37 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Ingo Molnar <mingo@elte.hu> wrote: > so Linus was right: this was caused by scheduler starvation. I can see > one immediate problem already: the 'nice offset' is not divided by > nr_running as it should. The patch below should fix this but i have > yet to test it accurately, this change might as well render nice > levels unacceptably ineffective under high loads. erm, rather the updated patch below if you want to use this on a 32-bit system. But ... i think you should wait until i have all this re-tested. Ingo

---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {

 	s64 wait_runtime;
 	u64 exec_runtime, fair_key;
-	s64 nice_offset, hog_limit;
+	s32 nice_offset, hog_limit;

 	unsigned long policy;
 	cpumask_t cpus_allowed;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;

-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / (rq->nr_running + 1);

 	p->fair_key = key;

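The effect of the patch above on the rbtree key can be sketched in plain userspace C. This is only an illustration: the function names `fair_key_unscaled` and `fair_key_scaled` are made up for the example, the kernel types are replaced by `<stdint.h>` ones, and the division is done in signed 64-bit arithmetic, which is an assumption of this sketch rather than something the patch states.

```c
#include <stdint.h>

/* Key as computed before the patch: the full nice offset is always added. */
int64_t fair_key_unscaled(uint64_t fair_clock, int64_t wait_runtime,
                          int32_t nice_offset)
{
    return (int64_t)fair_clock - wait_runtime + nice_offset;
}

/* Key as computed with the patch: the nice offset is divided by the
 * number of running tasks plus one before it is added. */
int64_t fair_key_scaled(uint64_t fair_clock, int64_t wait_runtime,
                        int32_t nice_offset, unsigned long nr_running)
{
    int64_t key = (int64_t)fair_clock - wait_runtime;

    if (nice_offset)
        key += nice_offset / (int64_t)(nr_running + 1);
    return key;
}
```

With `fair_clock = 1000`, `wait_runtime = 200` and `nice_offset = 90`, the unscaled key is 890, while with two running tasks the scaled key is 800 + 90/3 = 830. The more tasks are running, the less the offset shifts a task along the timeline, which matches the concern that nice levels might become ineffective under high load.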
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-14 17:50 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Ingo Molnar, Nick Piggin, linux-kernel, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, 14 Apr 2007, Willy Tarreau wrote: > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. Considering the patches in question, it's almost definitely just a CPU scheduling problem with starvation. The VT switching is obviously done by the kernel, but the kernel will signal and wait for the "controlling process" for the VT. The most obvious case of that is X, of course, but even in text mode I think gpm will have taken control of the VT's it runs on (all of them), which means that when you initiate a VT switch, the kernel will actually signal the controlling process (gpm), and wait for it to acknowledge the switch. If gpm doesn't get a timeslice for some reason (and it sounds like there may be some serious unfairness after "fork()"), your behaviour is explainable. (NOTE! I've never actually looked at gpm sources or what it really does, so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and something else is going on, but quite frankly, it sounds like the obvious candidate for this bug. Explaining it with some non-scheduler-related thing sounds unlikely, considering the patch in question). Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
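The signal-and-acknowledge handshake described above corresponds to the `VT_PROCESS` mode of the Linux console ioctl interface. The following is a minimal sketch of that interface only; it is not taken from gpm or X (and, as noted above, whether gpm actually does this is unverified). The function names are hypothetical and the choice of SIGUSR1/SIGUSR2 is arbitrary.

```c
#include <linux/vt.h>
#include <signal.h>
#include <sys/ioctl.h>

/* Build the mode structure a process hands to VT_SETMODE in order to be
 * asked before the kernel switches away from (or back to) its VT. */
struct vt_mode process_controlled_mode(int release_sig, int acquire_sig)
{
    struct vt_mode m = {0};

    m.mode = VT_PROCESS;            /* switches are negotiated with us */
    m.relsig = (short)release_sig;  /* sent when a switch away is requested */
    m.acqsig = (short)acquire_sig;  /* sent when the VT is handed back */
    return m;
}

/* Take control of the VT open on fd; returns the ioctl() result. */
int take_vt_control(int fd)
{
    struct vt_mode m = process_controlled_mode(SIGUSR1, SIGUSR2);

    return ioctl(fd, VT_SETMODE, &m);
}

/* To be called from the main loop after the release signal arrived:
 * tells the kernel that the switch may proceed. */
int acknowledge_release(int fd)
{
    return ioctl(fd, VT_RELDISP, 1);
}

/* To be called after the acquire signal arrived. */
int acknowledge_acquire(int fd)
{
    return ioctl(fd, VT_RELDISP, VT_ACKACQ);
}
```

Until the controlling process gets scheduled and calls `acknowledge_release()`, the requested switch does not complete, which is why a starved controlling process would make VT switching appear hung while fork() and friends still work.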
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-15 7:54 ` Mike Galbraith 2007-04-15 8:58 ` Ingo Molnar 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 7:54 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. I'm convinced. I'm impatient to read about Mike's feedback > with his workload which behaves strangely on RSDL. If it works OK here, > it will be the proof that heuristics should not be needed. You mean the X + mp3 player + audio visualization test? X+Gforce visualization have problems getting half of my box in the presence of two other heavy cpu using tasks. Behavior is _much_ better than RSDL/SD, but the synchronous nature of X/client seems to be a problem. With this scheduler, renicing X/client does cure it, whereas with SD it did not help one bit. (I know a trivial way to cure that, and this framework makes that possible without dorking up fairness as a general policy.) -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-15 8:58 ` Ingo Molnar 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 8:58 UTC (permalink / raw) To: Mike Galbraith Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner * Mike Galbraith <efault@gmx.de> wrote: > On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > > > Well, I'll stop heating the room for now as I get out of ideas about > > how to defeat it. I'm convinced. I'm impatient to read about Mike's > > feedback with his workload which behaves strangely on RSDL. If it > > works OK here, it will be the proof that heuristics should not be > > needed. > > You mean the X + mp3 player + audio visualization test? X+Gforce > visualization have problems getting half of my box in the presence of > two other heavy cpu using tasks. Behavior is _much_ better than > RSDL/SD, but the synchronous nature of X/client seems to be a problem. > > With this scheduler, renicing X/client does cure it, whereas with SD > it did not help one bit. [...] thanks for testing it! I was quite worried about your setup - two tasks using up 50%/50% of CPU time, pitted against a kernel rebuild workload seems to be a hard workload to get right. > [...] (I know a trivial way to cure that, and this framework makes > that possible without dorking up fairness as a general policy.) great! Please send patches so i can add them (once you are happy with the solution) - i think your workload isnt special in any way and could hit other people too. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:58 ` Ingo Molnar @ 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 9:11 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > [...] (I know a trivial way to cure that, and this framework makes > > that possible without dorking up fairness as a general policy.) > > great! Please send patches so i can add them (once you are happy with > the solution) - i think your workload isnt special in any way and could > hit other people too. I'll give it a shot. (have to read and actually understand your new code first though, then see if it's really viable) -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 17:32 ` Gene Heskett 2 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 9:01 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Good idea. The machine I'm typing from now has 1000 scheddos running > at +19, and 12 gears at nice 0. [...] > From time to time, one of the 12 aligned gears will quickly perform a > full quarter of round while others slowly turn by a few degrees. In > fact, while I don't know this process's CPU usage pattern, there's > something useful in it : it allows me to visually see when process > accelerate/decelerate. [...] cool idea - i have just tried this and it rocks - you can easily see the 'nature' of CPU time distribution just via visual feedback. (Is there any easy way to start up 12 glxgears fully aligned, or does one always have to mouse around to get them into proper position?) btw., i am using another method to quickly judge X's behavior: i started the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth opengl-rendered snow fall on the desktop background. That gives me an idea about how well X is scheduling under various workloads, without having to instrument it explicitly. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar @ 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:32 ` Gene Heskett 1 sibling, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-19 12:54 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Thu, Apr 19, 2007 at 11:01:44AM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Good idea. The machine I'm typing from now has 1000 scheddos running > > at +19, and 12 gears at nice 0. [...] > > > From time to time, one of the 12 aligned gears will quickly perform a > > full quarter of round while others slowly turn by a few degrees. In > > fact, while I don't know this process's CPU usage pattern, there's > > something useful in it : it allows me to visually see when process > > accelerate/decelerate. [...] > > cool idea - i have just tried this and it rocks - you can easily see the > 'nature' of CPU time distribution just via visual feedback. (Is there > any easy way to start up 12 glxgears fully aligned, or does one always > have to mouse around to get them into proper position?) -- Replying quickly, I'm short in time -- You can certainly script it with -geometry. But it is the wrong application for this matter, because you benchmark X more than glxgears itself. What would be better is something like a line rotating 360 degrees and doing some short stuff between each degree, so that X is not much sollicitated, but the CPU would be spent more on the processes themselves. Benchmarking interactions between X and multiple clients is a completely different test IMHO. Glxgears is between those two, making it inappropriate for scheduler tuning. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 15:18 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > You can certainly script it with -geometry. But it is the wrong > application for this matter, because you benchmark X more than > glxgears itself. What would be better is something like a line > rotating 360 degrees and doing some short stuff between each degree, > so that X is not much sollicitated, but the CPU would be spent more on > the processes themselves. at least on my setup glxgears goes via DRI/DRM so there's no X scheduling inbetween at all, and the visual appearance of glxgears is a direct function of its scheduling. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar @ 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau 2007-04-19 23:52 ` Jan Knutar 2 siblings, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-19 17:34 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> You can certainly script it with -geometry. But it is the wrong >> application for this matter, because you benchmark X more than >> glxgears itself. What would be better is something like a line >> rotating 360 degrees and doing some short stuff between each degree, >> so that X is not much sollicitated, but the CPU would be spent more on >> the processes themselves. > >at least on my setup glxgears goes via DRI/DRM so there's no X >scheduling inbetween at all, and the visual appearance of glxgears is a >direct function of its scheduling. > > Ingo That doesn't appear to be the case here Ingo. Even when I know the rest of the system is lagged, glxgears continues to show very smooth and steady movement. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yow! I just went below the poverty line! ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett @ 2007-04-19 18:45 ` Willy Tarreau 2007-04-21 10:31 ` Ingo Molnar 2007-04-19 23:52 ` Jan Knutar 2 siblings, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-19 18:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 05:18:03PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each degree, > > so that X is not much sollicitated, but the CPU would be spent more on > > the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is a > direct function of its scheduling.
OK, I thought that something looking like a clock would be useful, especially if we could tune the amount of CPU spent per task instead of being limited by graphics drivers. I searched freshmeat for a clock and found "orbitclock" by Jeremy Weatherford, which was exactly what I was looking for :
  - small
  - C only
  - X11 only
  - needed less than 5 minutes and no knowledge of X11 for the complete hack !
=> Kudos to its author, sincerely !
I hacked it a bit to make it accept two parameters :
  -R <run_time_in_microsecond> : time spent burning CPU cycles at each round
  -S <sleep_time_in_microsecond> : time spent getting a rest
It now advances what it thinks is a second at each iteration, so that it makes it easy to compare its progress with other instances (there are seconds, minutes and hours, so it's easy to visually count up to around 43200).
The modified code is here : http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz What is interesting to note is that it's easy to make X work a lot (99%) by using 0 as the sleeping time, and it's easy to make the process work a lot by using large values for the running time associated with very low values (or 0) for the sleep time. Ah, and it supports -geometry ;-) It could become a useful scheduler benchmark ! Have fun ! Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
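The run/sleep behaviour selected by -R and -S can be pictured as one round of "burn, then rest". The sketch below is an assumption about the general shape of such a loop, not the actual orbitclock/ocbench source; `now_us` and `run_then_sleep` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Monotonic time in microseconds. */
long long now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* One round: spin for run_us microseconds, then sleep for sleep_us
 * microseconds. Returns the number of spin iterations completed, which
 * gives a rough idea of how much CPU the task really got. */
unsigned long run_then_sleep(long run_us, long sleep_us)
{
    long long deadline = now_us() + run_us;
    volatile unsigned long spins = 0;
    struct timespec pause;

    while (now_us() < deadline)
        spins++;

    if (sleep_us > 0) {
        pause.tv_sec = sleep_us / 1000000;
        pause.tv_nsec = (sleep_us % 1000000) * 1000L;
        nanosleep(&pause, NULL);
    }
    return spins;
}
```

A sleep time of 0 makes each round return straight to drawing, so most of the load lands on X; a large run time with little or no sleep makes the process itself the CPU hog.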
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-21 10:31 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Willy Tarreau <w@1wt.eu> wrote: > I hacked it a bit to make it accept two parameters : > -R <run_time_in_microsecond> : time spent burning CPU cycles at each round > -S <sleep_time_in_microsecond> : time spent getting a rest > > It now advances what it thinks is a second at each iteration, so that > it makes it easy to compare its progress with other instances (there > are seconds, minutes and hours, so it's easy to visually count up to > around 43200). > > The modified code is here : > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > What is interesting to note is that it's easy to make X work a lot > (99%) by using 0 as the sleeping time, and it's easy to make the > process work a lot by using large values for the running time > associated with very low values (or 0) for the sleep time. > > Ah, and it supports -geometry ;-) > > It could become a useful scheduler benchmark ! i just tried ocbench-0.3, and it is indeed very nice! Would it make sense perhaps to (optionally?) also log some sort of periodic text feedback to stdout, about the quality of scheduling? Maybe even a 'run this many seconds' option plus a summary text output at the end (which would output measured runtime, observed longest/smallest latency and standard deviation of latencies maybe)? That would make it directly usable both as a 'consistency of X app scheduling' visual test and as an easily shareable benchmark with an objective numeric result as well. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
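The summary suggested above (shortest and longest latency plus standard deviation) is simple to compute once per-iteration latencies are recorded. A minimal sketch follows; it is not part of ocbench, the names `latency_summary`, `summarize_latencies` and `sqrt_newton` are hypothetical, and a small Newton iteration stands in for `sqrt()` only so that the example builds without linking libm.

```c
#include <stddef.h>

/* Summary of a set of latency samples, all in microseconds. */
struct latency_summary {
    double min_us;
    double max_us;
    double mean_us;
    double stddev_us;   /* population standard deviation */
};

/* Square root by Newton's method, adequate for reporting purposes. */
double sqrt_newton(double v)
{
    double x = v;
    int i;

    if (v <= 0.0)
        return 0.0;
    for (i = 0; i < 60; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Compute min, max, mean and standard deviation of n samples.
 * Returns an all-zero summary when n is 0. */
struct latency_summary summarize_latencies(const double *samples_us, size_t n)
{
    struct latency_summary s = {0.0, 0.0, 0.0, 0.0};
    double sum = 0.0, sq = 0.0;
    size_t i;

    if (n == 0)
        return s;

    s.min_us = s.max_us = samples_us[0];
    for (i = 0; i < n; i++) {
        if (samples_us[i] < s.min_us)
            s.min_us = samples_us[i];
        if (samples_us[i] > s.max_us)
            s.max_us = samples_us[i];
        sum += samples_us[i];
    }
    s.mean_us = sum / (double)n;

    for (i = 0; i < n; i++) {
        double d = samples_us[i] - s.mean_us;
        sq += d * d;
    }
    s.stddev_us = sqrt_newton(sq / (double)n);
    return s;
}
```

For the samples 2, 4, 4, 4, 5, 5, 7, 9 this yields min 2, max 9, mean 5 and a standard deviation of 2.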
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar @ 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-21 10:38 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > The modified code is here : > > > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > > > What is interesting to note is that it's easy to make X work a lot > > (99%) by using 0 as the sleeping time, and it's easy to make the > > process work a lot by using large values for the running time > > associated with very low values (or 0) for the sleep time. > > > > Ah, and it supports -geometry ;-) > > > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i just noticed: when starting up lots of ocbench tasks (say -x 6 -y 6) then they (naturally) get started up with an already visible offset. It's nice to observe the startup behavior, but after that it would be useful if it were possible to 'resync' all those ocbench tasks so that they start at the same offset. [ Maybe a "killall -SIGUSR1 ocbench" could serve this purpose, without having to synchronize the tasks explicitly? ] Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
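The "killall -SIGUSR1" resync idea needs no explicit synchronization between tasks: each process only has to reset its own position when the signal arrives. A minimal sketch of that pattern, assuming a per-process progress counter; `tick`, `install_resync_handler` and `advance_tick` are hypothetical names and this is not the ocbench implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the signal handler, consumed by the main loop. */
static volatile sig_atomic_t resync_requested;

/* Hypothetical per-process progress counter (the clock position). */
static unsigned long tick;

static void on_sigusr1(int signo)
{
    (void)signo;
    resync_requested = 1;   /* only set a flag; the main loop does the work */
}

/* Install the SIGUSR1 handler; returns 0 on success, -1 on error. */
int install_resync_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Called once per iteration of the main loop: jump back to position 0 if
 * a resync was requested, otherwise advance by one. Returns the new
 * position. */
unsigned long advance_tick(void)
{
    if (resync_requested) {
        resync_requested = 0;
        tick = 0;
    } else {
        tick++;
    }
    return tick;
}
```

Since every instance restarts from position 0 on its next iteration after the signal, any offset that builds up afterwards reflects scheduling rather than startup order.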
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar @ 2007-04-21 10:45 ` Ingo Molnar 2007-04-21 11:07 ` Willy Tarreau 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-21 10:45 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i noticed: when using a -y larger than 1, then the window title (at least on Metacity) overlaps and thus the ocbench tasks have different X overhead and get scheduled a bit asymmetrically as well. Is there any way to start them up title-less perhaps? Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:45 ` Ingo Molnar @ 2007-04-21 11:07 ` Willy Tarreau 2007-04-21 11:29 ` Björn Steinbrink 0 siblings, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-21 11:07 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Hi Ingo, I'm replying to your 3 mails at once. On Sat, Apr 21, 2007 at 12:45:22PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > It could become a useful scheduler benchmark ! > > > > i just tried ocbench-0.3, and it is indeed very nice! So as you've noticed just one minute after I put it there, I've updated the tool and renamed it ocbench. For others, it's here : http://linux.1wt.eu/sched/ Useful new features are proper positioning, automatic forking, and more visible progress with smaller windows, which eat fewer X resources. Now about your idea of making it report information on stdout, I don't know if it would be that useful. There are many other command line tools for this purpose. This one's goal is to eat CPU with a visual control of CPU distribution only. Concerning your idea of using a signal to resync every process, I agree with you. Running at 8x8 shows a noticeable offset. I've just uploaded v0.4 which supports your idea of sending USR1. > another thing i noticed: when using a -y larger than 1, then the window > title (at least on Metacity) overlaps and thus the ocbench tasks have > different X overhead and get scheduled a bit asymmetrically as well. Is > there any way to start them up title-less perhaps? It has annoyed me a bit too, but I'm no X developer at all, so I don't know at all if it's possible nor how to do this. I know that my window manager even adds title bars to xeyes, so I'm not sure we can do this. Right now, I've added a "-B <border size>" argument so that you can skip the size of your title bar.
It's dirty but it's not my main job :-) Thanks for your feedback Willy
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:07 ` Willy Tarreau @ 2007-04-21 11:29 ` Björn Steinbrink 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 1 reply; 713+ messages in thread From: Björn Steinbrink @ 2007-04-21 11:29 UTC (permalink / raw) To: Willy Tarreau; +Cc: Ingo Molnar, linux-kernel Hi, On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > another thing i noticed: when using a -y larger than 1, then the window > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > different X overhead and get scheduled a bit asymmetrically as well. Is > > there any way to start them up title-less perhaps? > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > know at all if it's possible nor how to do this. I know that my window > manager even adds title bars to xeyes, so I'm not sure we can do this. > > Right now, I've added a "-B <border size>" argument so that you can > skip the size of your title bar. It's dirty but it's not my main job :-) Here's a small patch that makes the windows unmanaged, which also causes ocbench to start up quite a bit faster on my box with a larger number of windows, so it probably avoids some window manager overhead, which is a nice side-effect.
Björn
--
diff -u ocbench-0.4/ocbench.c ocbench-0.4.1/ocbench.c
--- ocbench-0.4/ocbench.c	2007-04-21 13:05:55.000000000 +0200
+++ ocbench-0.4.1/ocbench.c	2007-04-21 13:24:01.000000000 +0200
@@ -213,6 +213,7 @@ int main(int argc, char *argv[])
 {
 	Window root;
 	XGCValues gc_setup;
+	XSetWindowAttributes swa;
 	int c, index, proc_x, proc_y, pid;
 	int *pcount[] = {&HOUR, &MIN, &SEC};
 	char *p, *q;
@@ -342,8 +343,11 @@
 	alloc_color(fg, &orange);
 	alloc_color(fg2, &blue);

-	win = XCreateSimpleWindow(dpy, root, X, Y, width, height, 0,
-				  black.pixel, black.pixel);
+	swa.override_redirect = 1;
+
+	win = XCreateWindow(dpy, root, X, Y, width, height, 0,
+			    CopyFromParent, InputOutput, CopyFromParent,
+			    CWOverrideRedirect, &swa);

 	XStoreName(dpy, win, "ocbench");
 	XSelectInput(dpy, win, ExposureMask | StructureNotifyMask);
Only in ocbench-0.4.1/: .README.swp
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:29 ` Björn Steinbrink @ 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 0 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-21 11:51 UTC (permalink / raw) To: Björn Steinbrink, Ingo Molnar, linux-kernel Hi Björn, On Sat, Apr 21, 2007 at 01:29:41PM +0200, Björn Steinbrink wrote: > Hi, > > On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > > another thing i noticed: when using a -y larger than 1, then the window > > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > > different X overhead and get scheduled a bit asymmetrically as well. Is > > > there any way to start them up title-less perhaps? > > > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > > know at all if it's possible nor how to do this. I know that my window > > manager even adds title bars to xeyes, so I'm not sure we can do this. > > > > Right now, I've added a "-B <border size>" argument so that you can > > skip the size of your title bar. It's dirty but it's not my main job :-) > > Here's a small patch that makes the windows unmanaged, which also causes > ocbench to start up quite a bit faster on my box with a larger number of > windows, so it probably avoids some window manager overhead, which is a > nice side-effect. Excellent ! I've just merged it but made it conditional on a "-u" argument so that we can keep the previous behaviour (moving the windows is useful especially when there are few of them). So the new version 0.5 is available there : http://linux.1wt.eu/sched/ I believe it's the last one for today as I'm late on some work. Thanks ! Willy
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-19 23:52 ` Jan Knutar 2007-04-20 5:05 ` Willy Tarreau 2 siblings, 1 reply; 713+ messages in thread From: Jan Knutar @ 2007-04-19 23:52 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Willy Tarreau, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each > > degree, so that X is not much sollicitated, but the CPU would be > > spent more on the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is > a direct function of its scheduling. How much of the subjective interactiveness-feel of the desktop is at the mercy of the X server's scheduling and not the cpu scheduler? I've noticed that video playback is significantly smoother and resistant to other load, when using MPlayer's opengl output, especially if "heavy" programs are running at the same time. Especially firefox and ksysguard seem to have found a way to cause video through Xv to look annoyingly jittery. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 23:52 ` Jan Knutar @ 2007-04-20 5:05 ` Willy Tarreau 0 siblings, 0 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-20 5:05 UTC (permalink / raw) To: Jan Knutar Cc: linux-kernel, Ingo Molnar, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 20, 2007 at 02:52:38AM +0300, Jan Knutar wrote: > On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > > application for this matter, because you benchmark X more than > > > glxgears itself. What would be better is something like a line > > > rotating 360 degrees and doing some short stuff between each > > > degree, so that X is not much sollicitated, but the CPU would be > > > spent more on the processes themselves. > > > > at least on my setup glxgears goes via DRI/DRM so there's no X > > scheduling inbetween at all, and the visual appearance of glxgears is > > a direct function of its scheduling. > > How much of the subjective interactiveness-feel of the desktop is at the > mercy of the X server's scheduling and not the cpu scheduler? probably a lot. Hence the reason why I wanted something visually noticeable but using far less X resources than glxgears. The modified orbitclock is perfect IMHO. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 17:32 ` Gene Heskett 1 sibling, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-19 17:32 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> Good idea. The machine I'm typing from now has 1000 scheddos running >> at +19, and 12 gears at nice 0. [...] >> >> From time to time, one of the 12 aligned gears will quickly perform a >> full quarter of round while others slowly turn by a few degrees. In >> fact, while I don't know this process's CPU usage pattern, there's >> something useful in it : it allows me to visually see when process >> accelerate/decelerate. [...] > >cool idea - i have just tried this and it rocks - you can easily see the >'nature' of CPU time distribution just via visual feedback. (Is there >any easy way to start up 12 glxgears fully aligned, or does one always >have to mouse around to get them into proper position?) > >btw., i am using another method to quickly judge X's behavior: i started >the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth >opengl-rendered snow fall on the desktop background. That gives me an >idea about how well X is scheduling under various workloads, without >having to instrument it explicitly. > yes, its a cute idea, till you switch away from that screen to check progress on something else, like to compose this message. 
===========
5913 frames in 5.0 seconds = 1182.499 FPS
6238 frames in 5.0 seconds = 1247.556 FPS
11380 frames in 5.0 seconds = 2275.905 FPS
10691 frames in 5.0 seconds = 2138.173 FPS
8707 frames in 5.0 seconds = 1741.305 FPS
10669 frames in 5.0 seconds = 2133.708 FPS
11392 frames in 5.0 seconds = 2278.037 FPS
11379 frames in 5.0 seconds = 2275.711 FPS
11310 frames in 5.0 seconds = 2261.861 FPS
11386 frames in 5.0 seconds = 2277.081 FPS
11292 frames in 5.0 seconds = 2258.353 FPS
11352 frames in 5.0 seconds = 2270.297 FPS
11415 frames in 5.0 seconds = 2282.886 FPS
11406 frames in 5.0 seconds = 2281.037 FPS
11483 frames in 5.0 seconds = 2296.533 FPS
11510 frames in 5.0 seconds = 2301.883 FPS
11123 frames in 5.0 seconds = 2224.266 FPS
8980 frames in 5.0 seconds = 1795.861 FPS
=======

The over 2000fps reports were while I was either looking at htop, or starting
this message, both on different screens. htop said it was using 95+ % of the
cpu even when its display was going to /dev/null. So 'Kewl' doesn't seem to
get us apples to apples numbers we can go to the window and bet
win-place-show based on them alone.

FWIW, running the nvidia-9755 drivers here.

So if we are going to use that as a judgement operator, it obviously needs
some intelligently applied scaling before they are worth more than a
subjective feel is.

> Ingo
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The confusion of a staff member is measured by the length of his memos.
		-- New York Times, Jan. 20, 1981

^ permalink raw reply [flat|nested] 713+ messages in thread
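Gene's point about "intelligently applied scaling" can be illustrated with a small sketch. The following is only an assumption about how one might post-process such output, not something anyone in the thread proposed: it parses glxgears report lines in the format shown above and expresses a sample relative to a baseline run, so that numbers taken with the window visible or hidden are not compared as raw FPS. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse one glxgears report line such as
 * "5913 frames in 5.0 seconds = 1182.499 FPS".
 * Returns 1 and stores the reported FPS on success, 0 otherwise. */
int parse_glxgears_line(const char *line, double *fps)
{
    long frames;
    double seconds, reported;

    if (sscanf(line, "%ld frames in %lf seconds = %lf FPS",
               &frames, &seconds, &reported) != 3)
        return 0;
    if (frames <= 0 || seconds <= 0.0)
        return 0;
    *fps = reported;
    return 1;
}

/* Arithmetic mean of n samples (n must be greater than zero). */
double mean_fps(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Express a sample as a fraction of a baseline measurement, so runs taken
 * under different display conditions are compared on a relative scale.
 * Which run counts as the baseline is left to the person measuring. */
double relative_to_baseline(double fps, double baseline_fps)
{
    assert(baseline_fps > 0.0);
    return fps / baseline_fps;
}
```

A caller would feed each output line to parse_glxgears_line(), average the samples of an unloaded run with mean_fps(), and then report loaded runs through relative_to_baseline(). This does not remove the X and driver effects discussed above; it only avoids reading raw FPS as a scheduler metric.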
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 15:17 ` Mark Lord 1 sibling, 0 replies; 713+ messages in thread From: Mark Lord @ 2007-04-14 15:17 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper) with the new CFS for much of this morning now, with the odd switch back to the stock scheduler for comparison. Here, CFS really works and feels better than the stock scheduler. Even with a "make -j2" kernel rebuild happening (no manual renice, either!) things "just work" about as smoothly as ever. That's something which RSDL never achieved for me, though I have not retested RSDL beyond v0.34 or so. Well done, Ingo! I *want* this as my default scheduler. Things seemed slightly less smooth when I had the CPU hogs and fair-fork extension patches both applied. I'm going to try again now with just the fair-fork added on. Cheers Mark ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 20:12 ` Willy Tarreau 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-14 19:48 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, > the shell takes 2 or 3 seconds to return to prompt after I run > "scheddos &" > Those are very promising results, I nearly observe the same responsiveness > as I had on a solaris 10 with 10k running processes on a bigger machine. > I would be curious what a mysql test result would look like now. Where is scheddos? -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 19:48 ` William Lee Irwin III
@ 2007-04-14 20:12 ` Willy Tarreau
  0 siblings, 0 replies; 713+ messages in thread
From: Willy Tarreau @ 2007-04-14 20:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote:
> > Forking becomes very slow above a load of 100 it seems. Sometimes,
> > the shell takes 2 or 3 seconds to return to prompt after I run
> > "scheddos &"
> > Those are very promising results, I nearly observe the same responsiveness
> > as I had on a solaris 10 with 10k running processes on a bigger machine.
> > I would be curious what a mysql test result would look like now.
>
> Where is scheddos?

I will send it to you off-list. I have avoided publishing it for a long
time because the stock scheduler was *very* sensitive to trivial attacks
(freezes longer than 30s, impossible to log in). It's very basic, and I
have no problem sending it to anyone who requests it. It's just that as
long as some distros ship early 2.6 kernels, I do not want it to appear in
mailing list archives for anyone to grab and annoy their admins for free.

Cheers,
Willy

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-14 10:36 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > this fix is not complete - because the child runqueue is locked > > here, not the parent's. I've fixed this properly in my tree and have > > uploaded a new sched-modular+cfs.patch. (the effects of the original > > bug are mostly harmless, the rbtree position gets corrected the > > first time the parent reschedules. The fix might improve heavy > > forker handling.) > > It looks like it did not reach your public dir yet. oops, forgot to do the last step - should be fixed now. > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same > time, and they rotate very smoothly, there is absolutely no pause in > any of them. But they don't all run at same speed, and top reports > their CPU load varying from 3.4 to 10.8%, with what looks like more > CPU is assigned to the first processes, and less CPU for the last > ones. But this is just a rough observation on a stupid test, I would > not call that one scientific in any way (and X has its share in the > test too). ok, i'll try that too - there should be nothing particularly special about glxgears. there's another tweak you could try: echo 500000 > /proc/sys/kernel/sched_granularity_ns note that this causes preemption to be done as fast as the scheduler can do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the best results a CONFIG_HZ of 1000 is useful.) 
plus there's an add-on to CFS at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch this makes the 'CPU usage history cutoff' configurable and sets it to a default of 100 msecs. This means that CPU hogs (tasks which actively kept other tasks from running) will be remembered, for up to 100 msecs of their 'hogness'. Setting this limit back to 0 gives the 'vanilla' CFS scheduler's behavior: echo 0 > /proc/sys/kernel/sched_max_hog_history_ns (So when trying this you dont have to reboot with this patch applied/unapplied, just set this value.) > I'll perform other tests when I can rebuild with your fixed patch. cool, thanks! Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
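The units in the two tunables above are easy to get wrong, so here is a minimal arithmetic sketch. It assumes nothing about the patch internals: the helper names are hypothetical, and the "larger of granularity and tick" rule is a deliberate simplification of Ingo's remark that preemption is "mainly driven by CONFIG_HZ", not the scheduler's actual logic.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000ULL
#define NSEC_PER_MSEC 1000000ULL

/* Length of one timer tick in nanoseconds for a given CONFIG_HZ value. */
uint64_t tick_ns(unsigned int hz)
{
    assert(hz > 0);
    return NSEC_PER_SEC / hz;
}

/* Convert milliseconds to the nanosecond unit used by the tunables,
 * e.g. the 100 msec default of sched_max_hog_history_ns. */
uint64_t msec_to_ns(uint64_t msec)
{
    return msec * NSEC_PER_MSEC;
}

/* Simplifying assumption (not taken from the patch): tick-driven
 * preemption cannot fire more often than once per tick, so the larger
 * of the configured granularity and the tick length is what one sees. */
uint64_t effective_granularity_ns(uint64_t granularity_ns, unsigned int hz)
{
    uint64_t tick = tick_ns(hz);

    return granularity_ns > tick ? granularity_ns : tick;
}
```

Under this simplification, a granularity of 500000 ns with CONFIG_HZ=1000 is bounded by the 1000000 ns tick, and with CONFIG_HZ=100 by a 10000000 ns tick, which is why a CONFIG_HZ of 1000 is suggested for this experiment.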
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar
                   ` (7 preceding siblings ...)
  2007-04-14  2:04 ` Nick Piggin
@ 2007-04-14 15:09 ` S.Çağlar Onur
  2007-04-14 16:09   ` Ingo Molnar
  2007-04-15  3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 713+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 15:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

On Friday 13 April 2007, Ingo Molnar wrote:
> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler
> [CFS]
>
> i'm pleased to announce the first release of the "Modular Scheduler Core
> and Completely Fair Scheduler [CFS]" patchset:

I have been running Linus's current git + your extra patches + CFS for a
while. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I seek
forward/backward while it is playing a video under some workload (checking
out SVN repositories, compiling something). Stopping the other processes
didn't help Kaffeine, so it stays frozen until I kill it.

I'm not sure whether it's a xine-lib or a Kaffeine bug (mplayer doesn't have
that problem), but I can't reproduce this with mainline or mainline +
sd-0.39.

[1] http://cekirdek.pardus.org.tr/~caglar/psaux
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 15:09 ` S.Çağlar Onur
@ 2007-04-14 16:09   ` Ingo Molnar
  2007-04-14 16:59     ` S.Çağlar Onur
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:09 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> > i'm pleased to announce the first release of the "Modular Scheduler
> > Core and Completely Fair Scheduler [CFS]" patchset:
>
> Currently I'm using Linus's current git + your extra patches + CFS for
> a while. Kaffeine constantly freezes (and uses 80%+ CPU time) [1] if I
> seek forward/backward while it's playing a video with some workload
> (checking out SVN repositories, compiling something). Stopping other
> processes didn't help Kaffeine, so it stays frozen until I kill
> it.

hm, could you try to strace it and/or attach gdb to it and figure out
what's wrong? (perhaps involving the Kaffeine developers too?) As long
as it's not a kernel level crash i cannot see how the scheduler could
directly cause this - other than by accidentally creating a scheduling
pattern that triggers a user-space bug more often than with other
schedulers.

> [1] http://cekirdek.pardus.org.tr/~caglar/psaux

looks quite weird!

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-14 16:09   ` Ingo Molnar
@ 2007-04-14 16:59     ` S.Çağlar Onur
  2007-04-15 16:13       ` Kaffeine problem with CFS Ingo Molnar
  0 siblings, 1 reply; 713+ messages in thread
From: S.Çağlar Onur @ 2007-04-14 16:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]

On Saturday 14 April 2007, Ingo Molnar wrote:
> hm, could you try to strace it and/or attach gdb to it and figure out
> what's wrong? (perhaps involving the Kaffeine developers too?) As long
> as it's not a kernel level crash i cannot see how the scheduler could
> directly cause this - other than by accident creating a scheduling
> pattern that triggers a user-space bug more often than with other
> schedulers.

...
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
--- SIGINT (Interrupt) @ 0 (0) ---
+++ killed by SIGINT +++

is where the freeze occurs. The full log can be found at [1].

> > [1] http://cekirdek.pardus.org.tr/~caglar/psaux
>
> looks quite weird!

:)

[1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-14 16:59     ` S.Çağlar Onur
@ 2007-04-15 16:13       ` Ingo Molnar
  2007-04-15 16:25         ` Ingo Molnar
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:13 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Michael Lothian, Christophe Thommeret, Christoph Pfister, Jurgen Kofler

* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> > hm, could you try to strace it and/or attach gdb to it and figure
> > out what's wrong? (perhaps involving the Kaffeine developers too?)
> > As long as it's not a kernel level crash i cannot see how the
> > scheduler could directly cause this - other than by accident
> > creating a scheduling pattern that triggers a user-space bug more
> > often than with other schedulers.
>
> ...
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
> futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call)
> --- SIGINT (Interrupt) @ 0 (0) ---
> +++ killed by SIGINT +++
>
> is where freeze occurs. Full log can be found at [1]

> [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine

thanks. This does have the appearance of a userspace race condition of
some sort. Can you trigger this hang with the patch below applied to
the vanilla tree as well? (with no CFS patch applied)

if yes then this would suggest that Kaffeine has some sort of
child-runs-first problem. (which CFS changes to parent-runs-first.
Kaffeine starts a couple of threads and the futex calls are a sign of
thread<->thread communication.)

[ i have also Cc:-ed the Kaffeine folks - maybe your strace gives them
  an idea about what the problem could be :) ]

	Ingo

---
 kernel/sched.c |   70 ++-------------------------------------------------------
 1 file changed, 3 insertions(+), 67 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,13 @@ void fastcall sched_fork(struct task_str
  */
 void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 {
-	struct rq *rq, *this_rq;
 	unsigned long flags;
-	int this_cpu, cpu;
+	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
-	this_cpu = smp_processor_id();
-	cpu = task_cpu(p);
-
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive. The parent
-	 * (current) is done further down, under its lock.
-	 */
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->prio = effective_prio(p);
-
-	if (likely(cpu == this_cpu)) {
-		if (!(clone_flags & CLONE_VM)) {
-			/*
-			 * The VM isn't cloned, so we're in a good position to
-			 * do child-runs-first in anticipation of an exec. This
-			 * usually avoids a lot of COW overhead.
-			 */
-			if (unlikely(!current->array))
-				__activate_task(p, rq);
-			else {
-				p->prio = current->prio;
-				p->normal_prio = current->normal_prio;
-				list_add_tail(&p->run_list, &current->run_list);
-				p->array = current->array;
-				p->array->nr_active++;
-				inc_nr_running(p, rq);
-			}
-			set_need_resched();
-		} else
-			/* Run child last */
-			__activate_task(p, rq);
-		/*
-		 * We skip the following code due to cpu == this_cpu
-		 *
-		 *   task_rq_unlock(rq, &flags);
-		 *   this_rq = task_rq_lock(current, &flags);
-		 */
-		this_rq = rq;
-	} else {
-		this_rq = cpu_rq(this_cpu);
-
-		/*
-		 * Not the local CPU - must adjust timestamp. This should
-		 * get optimised away in the !CONFIG_SMP case.
-		 */
-		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
-					+ rq->most_recent_timestamp;
-		__activate_task(p, rq);
-		if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
-
-		/*
-		 * Parent and child are on different CPUs, now get the
-		 * parent runqueue to update the parent's ->sleep_avg:
-		 */
-		task_rq_unlock(rq, &flags);
-		this_rq = task_rq_lock(current, &flags);
-	}
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-	task_rq_unlock(this_rq, &flags);
+	__activate_task(p, rq);
+	task_rq_unlock(rq, &flags);
 }
 
 /*

^ permalink raw reply [flat|nested] 713+ messages in thread
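To make the "child-runs-first problem" hypothesis concrete: user-space code is only safe across such a scheduler change if it never depends on whether the creator or the new thread runs first. The following is a minimal sketch of an order-independent start-up handshake using standard POSIX threads. It is an illustration under that assumption only, not code from Kaffeine or xine-lib, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Shared state for a start-up handshake between a creator and a worker. */
struct handshake {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int config_ready;   /* set by the creator once 'config' is valid */
    int config;         /* value the worker must not read too early */
    int seen;           /* what the worker actually observed */
};

/* Worker: blocks until the creator has published the configuration. */
static void *handshake_worker(void *arg)
{
    struct handshake *h = arg;

    pthread_mutex_lock(&h->lock);
    while (!h->config_ready)            /* re-check after every wakeup */
        pthread_cond_wait(&h->cond, &h->lock);
    h->seen = h->config;
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* Creator: correct whether the new thread runs before or after the
 * lines following pthread_create(). Returns the value the worker saw,
 * or -1 if the thread could not be created. */
int run_handshake(int config)
{
    struct handshake h;
    pthread_t tid;

    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.cond, NULL);
    h.config_ready = 0;
    h.config = 0;
    h.seen = -1;

    if (pthread_create(&tid, NULL, handshake_worker, &h) != 0)
        return -1;

    /* Publish the data under the lock, then wake the worker. */
    pthread_mutex_lock(&h.lock);
    h.config = config;
    h.config_ready = 1;
    pthread_cond_signal(&h.cond);
    pthread_mutex_unlock(&h.lock);

    pthread_join(tid, NULL);
    assert(h.seen == config);           /* holds under either run order */

    pthread_cond_destroy(&h.cond);
    pthread_mutex_destroy(&h.lock);
    return h.seen;
}
```

If the worker instead read `config` without waiting for `config_ready`, the result would depend on which side the scheduler happens to run first, which is the kind of latent bug a switch from child-runs-first to parent-runs-first could expose.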
* Re: Kaffeine problem with CFS
  2007-04-15 16:13       ` Kaffeine problem with CFS Ingo Molnar
@ 2007-04-15 16:25         ` Ingo Molnar
  2007-04-15 16:55           ` Christoph Pfister
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-15 16:25 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Michael Lothian, Christophe Thommeret, Christoph Pfister, Jurgen Kofler

* Ingo Molnar <mingo@elte.hu> wrote:

> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
>
> thanks. This does have the appearance of a userspace race condition of
> some sort. Can you trigger this hang with the patch below applied to
> the vanilla tree as well? (with no CFS patch applied)

oops, please use the patch below instead.

	Ingo

---
 kernel/sched.c |   69 ++++-----------------------------------------------------
 1 file changed, 5 insertions(+), 64 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1653,77 +1653,18 @@ void fastcall sched_fork(struct task_str
  */
 void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 {
-	struct rq *rq, *this_rq;
 	unsigned long flags;
-	int this_cpu, cpu;
+	struct rq *rq;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
-	this_cpu = smp_processor_id();
-	cpu = task_cpu(p);
-
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive. The parent
-	 * (current) is done further down, under its lock.
-	 */
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
 
 	p->prio = effective_prio(p);
+	__activate_task(p, rq);
+	if (TASK_PREEMPTS_CURR(p, rq))
+		resched_task(rq->curr);
 
-	if (likely(cpu == this_cpu)) {
-		if (!(clone_flags & CLONE_VM)) {
-			/*
-			 * The VM isn't cloned, so we're in a good position to
-			 * do child-runs-first in anticipation of an exec. This
-			 * usually avoids a lot of COW overhead.
-			 */
-			if (unlikely(!current->array))
-				__activate_task(p, rq);
-			else {
-				p->prio = current->prio;
-				p->normal_prio = current->normal_prio;
-				list_add_tail(&p->run_list, &current->run_list);
-				p->array = current->array;
-				p->array->nr_active++;
-				inc_nr_running(p, rq);
-			}
-			set_need_resched();
-		} else
-			/* Run child last */
-			__activate_task(p, rq);
-		/*
-		 * We skip the following code due to cpu == this_cpu
-		 *
-		 *   task_rq_unlock(rq, &flags);
-		 *   this_rq = task_rq_lock(current, &flags);
-		 */
-		this_rq = rq;
-	} else {
-		this_rq = cpu_rq(this_cpu);
-
-		/*
-		 * Not the local CPU - must adjust timestamp. This should
-		 * get optimised away in the !CONFIG_SMP case.
-		 */
-		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
-					+ rq->most_recent_timestamp;
-		__activate_task(p, rq);
-		if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
-
-		/*
-		 * Parent and child are on different CPUs, now get the
-		 * parent runqueue to update the parent's ->sleep_avg:
-		 */
-		task_rq_unlock(rq, &flags);
-		this_rq = task_rq_lock(current, &flags);
-	}
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-	task_rq_unlock(this_rq, &flags);
+	task_rq_unlock(rq, &flags);
 }
 
 /*

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-15 16:25         ` Ingo Molnar
@ 2007-04-15 16:55           ` Christoph Pfister
  2007-04-15 22:14             ` S.Çağlar Onur
                               ` (2 more replies)
  0 siblings, 3 replies; 713+ messages in thread
From: Christoph Pfister @ 2007-04-15 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler

Hi,

2007/4/15, Ingo Molnar <mingo@elte.hu>:
>
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine

Could you try xine-ui or gxine? I rather suspect xine-lib for the
freezing issues. In any case, I think a gdb backtrace would be much
nicer - but if you can't reproduce the freeze with other xine-based
players and want to run kaffeine in gdb, you need to execute
"gdb --args kaffeine --nofork".

> > thanks. This does have the appearance of a userspace race condition of
> > some sort. Can you trigger this hang with the patch below applied to
> > the vanilla tree as well? (with no CFS patch applied)
>
> oops, please use the patch below instead.
>
>         Ingo

<snip>

Christoph

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-15 16:55           ` Christoph Pfister
@ 2007-04-15 22:14             ` S.Çağlar Onur
  2007-04-18  8:27             ` Ingo Molnar
       [not found]             ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
  2 siblings, 0 replies; 713+ messages in thread
From: S.Çağlar Onur @ 2007-04-15 22:14 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler

[-- Attachment #1: Type: text/plain, Size: 1063 bytes --]

On Sunday 15 April 2007, Christoph Pfister wrote:
> Could you try xine-ui or gxine? Because I suspect rather xine-lib for
> freezing issues. In any way I think a gdb backtrace would be much
> nicer - but if you can't reproduce the freeze issue with other xine
> based players and want to run kaffeine in gdb, you need to execute
> "gdb --args kaffeine --nofork".

I just tested xine-ui and I can easily reproduce the exact same problem
with it as well, so you are right: it seems to be a xine-lib problem
triggered by the CFS changes.

> > > thanks. This does has the appearance of a userspace race condition of
> > > some sorts. Can you trigger this hang with the patch below applied to
> > > the vanilla tree as well? (with no CFS patch applied)
> >
> > oops, please use the patch below instead.

Tomorrow I'll test that patch and also try to get a backtrace.

Cheers
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-15 16:55           ` Christoph Pfister
  2007-04-15 22:14             ` S.Çağlar Onur
@ 2007-04-18  8:27             ` Ingo Molnar
  2007-04-18  8:57               ` Ingo Molnar
  2007-04-18  8:57               ` Christoph Pfister
       [not found]             ` <19a3b7a80704180534w3688af87x78ee68cc1c330a5c@mail.gmail.com>
  2 siblings, 2 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 8:27 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

[ i've Cc:-ed Ulrich Drepper, this CFS-triggered hang seems to have some
  futex and pthread_cond_wait() relevance. ]

* Christoph Pfister <christophpfister@gmail.com> wrote:

> >> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine
>
> Could you try xine-ui or gxine? Because I suspect rather xine-lib for
> freezing issues. In any way I think a gdb backtrace would be much
> nicer - but if you can't reproduce the freeze issue with other xine
> based players and want to run kaffeine in gdb, you need to execute
> "gdb --args kaffeine --nofork".

update: i've reproduced one kind of a hang but i'm not sure it's the
same hang Ismail is seeing. It was quite hard to trigger it under CFS,
i had to do wild forward/backward button seeks on a real DVD and i
mixed it with CPU-intense workloads on the same box. Here are the
straces and gdb backtraces:

kaffeine thread PID 9303, waiting for other threads to do something,
stuck in pthread_mutex_lock():

 futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...>

backtrace:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
#3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
#5  0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so
#6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
#7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#8  0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#9  0x4b55526e in QApplication::notify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
#11 0x4b4dd5de in QETWidget::translateWheelEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#12 0x4b4eb41d in QETWidget::translateMouseEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#13 0x4b4e9766 in QApplication::x11ProcessEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#14 0x4b4fb38b in QEventLoop::processEvents () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#15 0x4b56ce30 in QEventLoop::enterLoop () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#18 0x0806fc1a in QWidget::setUpdatesEnabled ()
#19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
#20 0x0806f7e1 in QWidget::setUpdatesEnabled ()

Kaffeine thread 9324, seems to be in an infinite pthread_cond_wait()
loop that does:

 futex(0xb0740b78, FUTEX_WAIT, 3559, NULL) = 0
 futex(0xb0740b5c, FUTEX_WAKE, 1)        = 0
 munmap(0xaacb1000, 1662976)             = 0
 mmap2(NULL, 1662976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xaacb1000
 gettimeofday({1176891363, 347259}, NULL) = 0
 munmap(0xab309000, 1662976)             = 0

backtrace:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#3  0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#5  0x4a05820e in clone () from /lib/libc.so.6

Kaffeine thread 9325 does a loop of short pthread_cond_wait() futex
sleeps:

 1176891721.419314 futex(0xb07527e8, FUTEX_WAIT, 8537, NULL) = 0 <0.011710>
 1176891721.431068 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000006>
 1176891721.431429 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000008>
 1176891721.431458 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000012>
 1176891721.431489 futex(0xb07527e8, FUTEX_WAIT, 8539, NULL) = 0 <0.007339>
 1176891721.439008 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000052>
 1176891721.439510 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000055>
 1176891721.439636 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000089>
 1176891721.439789 futex(0xb07527e8, FUTEX_WAIT, 8541, NULL) = 0 <0.007045>
 1176891721.447017 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000054>
 1176891721.447682 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000065>

backtrace:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#3  0xb7a04079 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#5  0x4a05820e in clone () from /lib/libc.so.6

library versions:

 xine-lib-1.1.5-1.fc7
 xine-plugin-1.0-3.fc7
 glibc-headers-2.5.90-21
 glibc-common-2.5.90-21
 glibc-2.5.90-21
 glibc-devel-2.5.90-21
 gxine-0.5.11-3.fc7
 kaffeine-0.8.3-4.fc7
 xine-0.99.4-11.lvn7
 xine-lib-extras-1.1.5-1.fc7
 gxine-mozplugin-0.5.11-3.fc7

what's weird is that all threads are in a pthread op and seem to be kind
of busy-looping. Maybe xine-lib has some buggy use of pthread condvars
that CFS happens to trigger? (If CFS broke futexes in general i think
we'd be seeing far more widespread breakage.)

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-18  8:27             ` Ingo Molnar
@ 2007-04-18  8:57               ` Ingo Molnar
  2007-04-18  9:06                 ` Ingo Molnar
  2007-04-18  8:57               ` Christoph Pfister
  1 sibling, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 8:57 UTC (permalink / raw)
  To: Christoph Pfister
  Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Ingo Molnar <mingo@elte.hu> wrote:

> update: i've reproduced one kind of a hang but i'm not sure it's the
> same hang Ismail is seeing. It was quite hard to trigger it under CFS,
> i had to do wild forward/backward button seeks on a real DVD and i
> mixed it with CPU-intense workloads on the same box. Here are the
> straces and gdb backtraces:

these were only the threads that showed up in htop. Here's a full
analysis about what all threads are doing:

 Process 9303: stuck in xine_play()/pthread_mutex_lock()
 Process 9319: stuck in pthread_cond_timedwait()
 Process 9320: stuck in pthread_cond_timedwait()
 Process 9321: loop of ~3 msec nanosleeps
 Process 9322: loop of poll() calls every 335 msecs
 Process 9323: stuck in pthread_cond_timedwait()
 Process 9324: stuck in a loop of 1-second futex-waits + mmap/munmap (malloc)
 Process 9325: stuck in pthread_cond_timedwait()
 Process 9326: stuck in pthread_cond_timedwait()
 Process 9327: stuck in pthread_cond_timedwait()

now here's a weird thing: occasionally, when i strace one of the
threads, i can get a single frame refreshed in the Kaffeine window - but
the general picture does not change, the same 'stuck' state is still
there.

most threads are sitting in:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a25134c in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xb79f9a05 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#3  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#4  0x4a05820e in clone () from /lib/libc.so.6

9324 is looping around this place, apparently in the opengl video output
driver, but the backtrace is not always this one:

(gdb) bt
#0  0x49ff7257 in memset () from /lib/libc.so.6
#1  0x49ff1877 in calloc () from /lib/libc.so.6
#2  0xb7a224d6 in xine_xmalloc_aligned () from /usr/lib/libxine.so.1
#3  0xb708c8f6 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/xineplug_vo_out_opengl.so
#4  0xb7a0525a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#5  0xb78944e4 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/post/xineplug_post_tvtime.so
#6  0xb7895234 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/post/xineplug_post_tvtime.so
#7  0xad4e5439 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/xineplug_decode_mpeg2.so
#8  0xad4fa8e1 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/xineplug_decode_mpeg2.so
#9  0xb7a032d6 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#10 0x4a24d2db in start_thread () from /lib/libpthread.so.0
#11 0x4a05820e in clone () from /lib/libc.so.6

9321 is sitting in:

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2544a6 in nanosleep () from /lib/libpthread.so.0
#2  0xb7a222fa in xine_usec_sleep () from /usr/lib/libxine.so.1
#3  0xb7a073bb in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
#4  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#5  0x4a05820e in clone () from /lib/libc.so.6

9322 is in poll():

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a04e533 in poll () from /lib/libc.so.6
#2  0xb12e1f75 in QWidget::setUpdatesEnabled () from /usr/lib/xine/plugins/1.1.5/xineplug_ao_out_alsa.so
#3  0x4a24d2db in start_thread () from /lib/libpthread.so.0
#4  0x4a05820e in clone () from /lib/libc.so.6

9303 is stuck in xine_play(), pthread_mutex_lock():

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
#3  0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1
#5  0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so
#6  0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so
#7  0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#8  0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#9  0x4b55526e in QApplication::notify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4
#11 0x4b4dd5de in QETWidget::translateWheelEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#12 0x4b4eb41d in QETWidget::translateMouseEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#13 0x4b4e9766 in QApplication::x11ProcessEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#14 0x4b4fb38b in QEventLoop::processEvents () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#15 0x4b56ce30 in QEventLoop::enterLoop () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#18 0x0806fc1a in QWidget::setUpdatesEnabled ()
#19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6
#20 0x0806f7e1 in QWidget::setUpdatesEnabled ()

library versions:

 xine-lib-1.1.5-1.fc7
 xine-plugin-1.0-3.fc7
 glibc-headers-2.5.90-21
 glibc-common-2.5.90-21
 glibc-2.5.90-21
 glibc-devel-2.5.90-21
 gxine-0.5.11-3.fc7
 kaffeine-0.8.3-4.fc7
 xine-0.99.4-11.lvn7
 xine-lib-extras-1.1.5-1.fc7
 gxine-mozplugin-0.5.11-3.fc7

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-18 8:57 ` Ingo Molnar
@ 2007-04-18 9:06 ` Ingo Molnar
  0 siblings, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 9:06 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Ingo Molnar <mingo@elte.hu> wrote:

> these were only the threads that showed up in htop. Here's a full
> analysis about what all threads are doing:
>
>  Process 9303: stuck in xine_play()/pthread_mutex_lock()
>  Process 9319: stuck in pthread_cond_timedwait()
>  Process 9320: stuck in pthread_cond_timedwait()
>  Process 9321: loop of ~3 msec nanosleeps
>  Process 9322: loop of poll() calls every 335 msecs
>  Process 9323: stuck in pthread_cond_timedwait()
>  Process 9324: stuck in a loop of 1-second futex-waits + mmap/munmap (malloc)
>  Process 9325: stuck in pthread_cond_timedwait()
>  Process 9326: stuck in pthread_cond_timedwait()
>  Process 9327: stuck in pthread_cond_timedwait()

and here's a top snapshot:

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 9324 mingo  20   0  300m  59m  17m R 96.4  6.8  21:00.55 kaffeine
 9325 mingo  20   0  300m  59m  17m S  2.0  6.8   0:15.57 kaffeine
 9327 mingo  20   0  300m  59m  17m S  2.0  6.8   0:20.10 kaffeine

so 9324 doing the mpeg decoding seems to be stuck somehow?

	Ingo
* Re: Kaffeine problem with CFS 2007-04-18 8:27 ` Ingo Molnar 2007-04-18 8:57 ` Ingo Molnar @ 2007-04-18 8:57 ` Christoph Pfister 2007-04-18 9:01 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 8:57 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper Hi, 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > [ i've Cc:-ed Ulrich Drepper, this CFS-triggered hang seems to have some > futex and pthread_cond_wait() relevance. ] > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > >> > [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine > > > > Could you try xine-ui or gxine? Because I suspect rather xine-lib for > > freezing issues. In any way I think a gdb backtrace would be much > > nicer - but if you can't reproduce the freeze issue with other xine > > based players and want to run kaffeine in gdb, you need to execute > > "gdb --args kaffeine --nofork". > > update: i've reproduced one kind of a hang but i'm not sure it's the > same hang Ismail is seeing. It was quite hard to trigger it under CFS, i > had to do wild forward/backward button seeks on a real DVD and i mixed > it with CPU-intense workloads on the same box. 
Here are the straces and > gdb backtraces: > > kaffeine thread PID 9303, waiting for other threads to do something, > stuck in pthread_mutex_lock(): > > futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...> > > backtrace: > > #0 0xffffe410 in __kernel_vsyscall () > #1 0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0 > #2 0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0 > #3 0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0 > #4 0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1 > #5 0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so > #6 0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so > #7 0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #8 0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #9 0x4b55526e in QApplication::notify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4 > #11 0x4b4dd5de in QETWidget::translateWheelEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #12 0x4b4eb41d in QETWidget::translateMouseEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #13 0x4b4e9766 in QApplication::x11ProcessEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #14 0x4b4fb38b in QEventLoop::processEvents () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #15 0x4b56ce30 in QEventLoop::enterLoop () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #18 0x0806fc1a in QWidget::setUpdatesEnabled () > #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6 > #20 0x0806f7e1 in QWidget::setUpdatesEnabled () > > Kaffeine thread 9324, seems to be in an infinite pthread_cond_wait() > loop that does: > > futex(0xb0740b78, FUTEX_WAIT, 3559, NULL) = 0 > futex(0xb0740b5c, FUTEX_WAKE, 1) = 0 > munmap(0xaacb1000, 1662976) = 
0 > mmap2(NULL, 1662976, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xaacb1000 > gettimeofday({1176891363, 347259}, NULL) = 0 > munmap(0xab309000, 1662976) = 0 > > backtrace: > > #0 0xffffe410 in __kernel_vsyscall () > #1 0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 > #2 0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #3 0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #4 0x4a24d2db in start_thread () from /lib/libpthread.so.0 > #5 0x4a05820e in clone () from /lib/libc.so.6 This backtrace is useless - QWidget::setUpdatesEnabled() is certainly _not_ defined in libxine. So the function names in #2 and #3 are wrong because the addresses seem to belong to libxine. > Kaffine thread 9325 does a loop of short pthread_cond_wait() futex > sleeps: > > 1176891721.419314 futex(0xb07527e8, FUTEX_WAIT, 8537, NULL) = 0 <0.011710> > 1176891721.431068 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000006> > 1176891721.431429 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000008> > 1176891721.431458 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000012> > 1176891721.431489 futex(0xb07527e8, FUTEX_WAIT, 8539, NULL) = 0 <0.007339> > 1176891721.439008 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000052> > 1176891721.439510 futex(0xb0740c04, 0x5 /* FUTEX_??? */, 1) = 1 <0.000055> > 1176891721.439636 futex(0xb0740be8, FUTEX_WAKE, 1) = 1 <0.000089> > 1176891721.439789 futex(0xb07527e8, FUTEX_WAIT, 8541, NULL) = 0 <0.007045> > 1176891721.447017 futex(0xb07527cc, FUTEX_WAKE, 1) = 0 <0.000054> > 1176891721.447682 futex(0xb0740c04, 0x5 /* FUTEX_??? 
*/, 1) = 1 <0.000065> > > backtrace: > > #0 0xffffe410 in __kernel_vsyscall () > #1 0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 > #2 0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #3 0xb7a04079 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #4 0x4a24d2db in start_thread () from /lib/libpthread.so.0 > #5 0x4a05820e in clone () from /lib/libc.so.6 Dito. > library versions: > > xine-lib-1.1.5-1.fc7 > xine-plugin-1.0-3.fc7 > glibc-headers-2.5.90-21 > glibc-common-2.5.90-21 > glibc-2.5.90-21 > glibc-devel-2.5.90-21 > gxine-0.5.11-3.fc7 > kaffeine-0.8.3-4.fc7 > xine-0.99.4-11.lvn7 > xine-lib-extras-1.1.5-1.fc7 > gxine-mozplugin-0.5.11-3.fc7 > > what's weird is that all threads are in a pthread op and seem to be kind > of busy-looping. Maybe xine-lib has some buggy use of pthread condvars > that CFS happens to trigger? (If CFS broke futexes in general i think > we'd be seeing far more widespread breakage.) > > Ingo Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
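For reference, the question above about "buggy use of pthread condvars" is usually about one rule: the waiter must re-check a predicate, protected by the mutex, after every wakeup. Below is a minimal sketch of that rule. It is not xine-lib code, and struct frame_gate, gate_wait() and gate_post() are hypothetical names chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate: one thread waits until another reports a frame. */
struct frame_gate {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            frame_ready;   /* the predicate, guarded by lock */
};

#define FRAME_GATE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Block until frame_ready is set, then consume it. */
void gate_wait(struct frame_gate *g)
{
    int rc = pthread_mutex_lock(&g->lock);
    assert(rc == 0);
    (void)rc;
    /* Re-check after every wakeup: pthread_cond_wait() may return
     * spuriously, and a wakeup may be stale by the time we run. */
    while (!g->frame_ready)
        pthread_cond_wait(&g->cond, &g->lock);
    g->frame_ready = false;
    pthread_mutex_unlock(&g->lock);
}

/* Set the predicate under the lock, then wake one waiter. */
void gate_post(struct frame_gate *g)
{
    int rc = pthread_mutex_lock(&g->lock);
    assert(rc == 0);
    (void)rc;
    g->frame_ready = true;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}
```

A waiter that blocks on the condvar without such a predicate can miss a signal that was sent before it started waiting. A waiter whose predicate is never set by anyone will loop through wakeups without making progress. Whether either applies to xine-lib is not established anywhere in this thread.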
* Re: Kaffeine problem with CFS
  2007-04-18 8:57 ` Christoph Pfister
@ 2007-04-18 9:01 ` Ingo Molnar
  2007-04-18 9:12 ` Mike Galbraith
  2007-04-18 9:13 ` Christoph Pfister
  0 siblings, 2 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 9:01 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Christoph Pfister <christophpfister@gmail.com> wrote:

> >backtrace:
> >
> > #0 0xffffe410 in __kernel_vsyscall ()
> > #1 0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> > #2 0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > #3 0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > #4 0x4a24d2db in start_thread () from /lib/libpthread.so.0
> > #5 0x4a05820e in clone () from /lib/libc.so.6
>
> This backtrace is useless - QWidget::setUpdatesEnabled() is certainly
> _not_ defined in libxine. So the function names in #2 and #3 are wrong
> because the addresses seem to belong to libxine.

are the updated backtraces in the followup mail i just sent more useful?
(I still have that stuck session running so i can do whatever debugging
you'd like to see done.)

	Ingo
* Re: Kaffeine problem with CFS
  2007-04-18 9:01 ` Ingo Molnar
@ 2007-04-18 9:12 ` Mike Galbraith
  2007-04-18 9:13 ` Christoph Pfister
  1 sibling, 0 replies; 713+ messages in thread
From: Mike Galbraith @ 2007-04-18 9:12 UTC
To: Ingo Molnar
Cc: Christoph Pfister, S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

On Wed, 2007-04-18 at 11:01 +0200, Ingo Molnar wrote:
> * Christoph Pfister <christophpfister@gmail.com> wrote:
>
> > >backtrace:
> > >
> > > #0 0xffffe410 in __kernel_vsyscall ()
> > > #1 0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
> > > #2 0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #3 0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
> > > #4 0x4a24d2db in start_thread () from /lib/libpthread.so.0
> > > #5 0x4a05820e in clone () from /lib/libc.so.6
> >
> > This backtrace is useless - QWidget::setUpdatesEnabled() is certainly
> > _not_ defined in libxine. So the function names in #2 and #3 are wrong
> > because the addresses seem to belong to libxine.
>
> are the updated backtraces in the followup mail i just sent more useful?
> (I still have that stuck session running so i can whatever debugging
> you'd like to see done.)

The xine website release note says there are problems with playback with
xine-lib version 1.1.5, so people encountering this may want to check to
see if they're running 1.1.5, and either upgrade to the latest, or
downgrade to 1.1.4.

<snippet from xine website>
18.04.2007 xine-lib 1.1.6
A new xine-lib version is now available. This is mainly a bug-fix
release; 1.1.5 has CD audio and DVD playback problems and a couple of
X-related build problems.

	-Mike
* Re: Kaffeine problem with CFS 2007-04-18 9:01 ` Ingo Molnar 2007-04-18 9:12 ` Mike Galbraith @ 2007-04-18 9:13 ` Christoph Pfister 2007-04-18 9:17 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 9:13 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > >backtrace: > > > > > > #0 0xffffe410 in __kernel_vsyscall () > > > #1 0x4a2510c6 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/libpthread.so.0 > > > #2 0xb79fd1a8 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > > > #3 0xb7a030ab in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > > > #4 0x4a24d2db in start_thread () from /lib/libpthread.so.0 > > > #5 0x4a05820e in clone () from /lib/libc.so.6 > > > > This backtrace is useless - QWidget::setUpdatesEnabled() is certainly > > _not_ defined in libxine. So the function names in #2 and #3 are wrong > > because the addresses seem to belong to libxine. > > are the updated backtraces in the followup mail i just sent more useful? > (I still have that stuck session running so i can whatever debugging > you'd like to see done.) QWidget::setUpdatesEnabled() is (wrongly) present in every thread except the main. So I'm afraid there's nothing which can be done :-/ Btw the main thread is waiting for the first frame being displayed after the seek. > Ingo Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-18 9:13 ` Christoph Pfister
@ 2007-04-18 9:17 ` Ingo Molnar
  2007-04-18 9:25 ` Christoph Pfister
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 9:17 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Christoph Pfister <christophpfister@gmail.com> wrote:

> >are the updated backtraces in the followup mail i just sent more
> >useful? (I still have that stuck session running so i can whatever
> >debugging you'd like to see done.)
>
> QWidget::setUpdatesEnabled() is (wrongly) present in every thread
> except the main. So I'm afraid there's nothing which can be done :-/
> Btw the main thread is waiting for the first frame being displayed
> after the seek.

i didn't have all the debuginfo packages installed. I installed some (but
not all yet), here's an updated backtrace:

 (gdb) bt
 #0 0xffffe410 in __kernel_vsyscall ()
 #1 0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0
 #2 0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0
 #3 0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0
 #4 0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0
 #5 0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #6 0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570
 #7 0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #8 0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #9 0x4a05820e in clone () from /lib/libc.so.6

at least the dvd_plugin_free_buffer() call has been resolved now. (I'll
hunt for the other debuginfo packages too.)

which thread would be the most interesting to you - 9324?

	Ingo
* Re: Kaffeine problem with CFS 2007-04-18 9:17 ` Ingo Molnar @ 2007-04-18 9:25 ` Christoph Pfister 2007-04-18 9:28 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 9:25 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > >are the updated backtraces in the followup mail i just sent more > > >useful? (I still have that stuck session running so i can whatever > > >debugging you'd like to see done.) > > > > QWidget::setUpdatesEnabled() is (wrongly) present in every thread > > except the main. So I'm afraid there's nothing which can be done :-/ > > Btw the main thread is waiting for the first frame being displayed > > after the seek. > > i didnt have all the debuginfo packages installed. I installed some (but > not all yet), here's an updated backtrace: > > (gdb) bt > #0 0xffffe410 in __kernel_vsyscall () > #1 0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0 > #2 0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0 > #3 0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0 > #4 0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0 > #5 0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #6 0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570 > #7 0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1 > #8 0x4a24d2db in start_thread () from /lib/libpthread.so.0 > #9 0x4a05820e in clone () from /lib/libc.so.6 > > at least the dvd_plugin_free_buffer() call has been resolved now. (I'll > hunt for the other debuginfo packages too.) > > which thread would be the most interesting to you - 9324? The thread which should wake the main thread - but hmm ... 9303 seems to be rather dead-locked than doing pthread_cond_timedwait() ? 
> Ingo Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 9:25 ` Christoph Pfister @ 2007-04-18 9:28 ` Ingo Molnar 2007-04-18 9:52 ` Christoph Pfister 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 9:28 UTC (permalink / raw) To: Christoph Pfister Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper * Christoph Pfister <christophpfister@gmail.com> wrote: > >which thread would be the most interesting to you - 9324? > > The thread which should wake the main thread - but hmm ... 9303 seems > to be rather dead-locked than doing pthread_cond_timedwait() ? ok, here it is, 9303 with better symbol names: #0 0xffffe410 in __kernel_vsyscall () #1 0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0 #2 0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0 #3 0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0 #4 0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1 #5 0xb7a9b0fb in KXineWidget::slotSeekToPosition () from /usr/lib/kde3/libxinepart.so #6 0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so #7 0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #8 0x4b55353b in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #9 0x4b55526e in QApplication::notify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4 #11 0x4b4dd5de in QETWidget::translateWheelEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #12 0x4b4eb41d in QETWidget::translateMouseEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #13 0x4b4e9766 in QApplication::x11ProcessEvent () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #14 0x4b4fb38b in QEventLoop::processEvents () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #15 0x4b56ce30 in QEventLoop::enterLoop () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 #17 0x4b55317f in QApplication::exec () from 
/usr/lib/qt-3.3/lib/libqt-mt.so.3 #18 0x0806fc1a in QWidget::setUpdatesEnabled () #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6 #20 0x0806f7e1 in QWidget::setUpdatesEnabled () does this make more sense to you? Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 9:28 ` Ingo Molnar @ 2007-04-18 9:52 ` Christoph Pfister 2007-04-18 10:04 ` Christoph Pfister 2007-04-18 10:17 ` Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 9:52 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > >which thread would be the most interesting to you - 9324? > > > > The thread which should wake the main thread - but hmm ... 9303 seems > > to be rather dead-locked than doing pthread_cond_timedwait() ? > > ok, here it is, 9303 with better symbol names: > > #0 0xffffe410 in __kernel_vsyscall () > #1 0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0 > #2 0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0 > #3 0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0 > #4 0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1 > #5 0xb7a9b0fb in KXineWidget::slotSeekToPosition () > from /usr/lib/kde3/libxinepart.so > #6 0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so > #7 0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #8 0x4b55353b in QApplication::internalNotify () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #9 0x4b55526e in QApplication::notify () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4 > #11 0x4b4dd5de in QETWidget::translateWheelEvent () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #12 0x4b4eb41d in QETWidget::translateMouseEvent () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #13 0x4b4e9766 in QApplication::x11ProcessEvent () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #14 0x4b4fb38b in QEventLoop::processEvents () > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #15 0x4b56ce30 in QEventLoop::enterLoop () > from 
/usr/lib/qt-3.3/lib/libqt-mt.so.3 > #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > #18 0x0806fc1a in QWidget::setUpdatesEnabled () > #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6 > #20 0x0806f7e1 in QWidget::setUpdatesEnabled () > > does this make more sense to you? It's nearly impossible for me to find out which mutex is deadlocking. There are 4 mutexs locked / released during xine_play (or one of the possibly inlined functions) and to be honest I have little idea which other thread is also involved in the deadlock (maybe some xine-lib junkie could help you more with that). It would be great if you could reproduce the same problem with a xine-lib which has been compiled with debug support (so you'd get line numbers in the back trace - that makes life _a lot_ easier and maybe I could identify the problem that way) and the least optimization possible ... :-) > Ingo Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 9:52 ` Christoph Pfister @ 2007-04-18 10:04 ` Christoph Pfister 2007-04-18 10:17 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 10:04 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper 2007/4/18, Christoph Pfister <christophpfister@gmail.com>: > 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > > > >which thread would be the most interesting to you - 9324? > > > > > > The thread which should wake the main thread - but hmm ... 9303 seems > > > to be rather dead-locked than doing pthread_cond_timedwait() ? > > > > ok, here it is, 9303 with better symbol names: > > > > #0 0xffffe410 in __kernel_vsyscall () > > #1 0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0 > > #2 0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0 > > #3 0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0 > > #4 0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1 > > #5 0xb7a9b0fb in KXineWidget::slotSeekToPosition () > > from /usr/lib/kde3/libxinepart.so > > #6 0xb7a9b3bc in KXineWidget::wheelEvent () from /usr/lib/kde3/libxinepart.so > > #7 0x4b5f9150 in QWidget::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #8 0x4b55353b in QApplication::internalNotify () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #9 0x4b55526e in QApplication::notify () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #10 0x4a72065e in KApplication::notify () from /usr/lib/libkdecore.so.4 > > #11 0x4b4dd5de in QETWidget::translateWheelEvent () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #12 0x4b4eb41d in QETWidget::translateMouseEvent () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #13 0x4b4e9766 in QApplication::x11ProcessEvent () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #14 0x4b4fb38b in QEventLoop::processEvents () > > from 
/usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #15 0x4b56ce30 in QEventLoop::enterLoop () > > from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #16 0x4b56cce6 in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #17 0x4b55317f in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3 > > #18 0x0806fc1a in QWidget::setUpdatesEnabled () > > #19 0x49f9df10 in __libc_start_main () from /lib/libc.so.6 > > #20 0x0806f7e1 in QWidget::setUpdatesEnabled () > > > > does this make more sense to you? > > It's nearly impossible for me to find out which mutex is deadlocking. > There are 4 mutexs locked / released during xine_play (or one of the > possibly inlined functions) and to be honest I have little idea which > other thread is also involved in the deadlock (maybe some xine-lib > junkie could help you more with that). > It would be great if you could reproduce the same problem with a > xine-lib which has been compiled with debug support (so you'd get line > numbers in the back trace - that makes life _a lot_ easier and maybe I > could identify the problem that way) and the least optimization > possible ... :-) > > > Ingo Or I could try playing around a bit with your patchset and trying to reproduce it over here. Because I already have debug builds for xine-lib and compiling a new kernel can take place in the background it wouldn't be much effort for me. Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS
  2007-04-18 9:52 ` Christoph Pfister
  2007-04-18 10:04 ` Christoph Pfister
@ 2007-04-18 10:17 ` Ingo Molnar
  2007-04-18 10:32 ` Ingo Molnar
  1 sibling, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:17 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Christoph Pfister <christophpfister@gmail.com> wrote:

> It's nearly impossible for me to find out which mutex is deadlocking.

i've disassembled the xine_play function, and here are the function
calls in it:

 <unresolved widget call?>
 pthread_mutex_lock()
 xine_log()
 <unresolved widget call?>
 function pointer call
 right after it: pthread_mutex_lock()

this second pthread_mutex_lock() in question is the one that deadlocks.
It comes right after that function pointer call, maybe that identifies
it?

[some time passes]

i rebuilt the library from source and while the installed library is
different from it, looking at the disassembly i'm quite sure it's this
pthread_mutex_lock() in xine_play_internal():

 pthread_mutex_lock( &stream->demux_lock );

src/xine-engine/xine.c:1201

the function pointer call was:

 stream->xine->port_ticket->acquire(stream->xine->port_ticket, 1);

right before the pthread_mutex_lock() call.

> It would be great if you could reproduce the same problem with a
> xine-lib which has been compiled with debug support (so you'd get line
> numbers in the back trace - that makes life _a lot_ easier and maybe I
> could identify the problem that way) and the least optimization
> possible ... :-)

ok, i'll try that too (but it will take some more time), but given how
hard it was for me to trigger it, i wanted to get maximum info out of it
before having to kill the threads.

	Ingo
* Re: Kaffeine problem with CFS
  2007-04-18 10:17 ` Ingo Molnar
@ 2007-04-18 10:32 ` Ingo Molnar
  2007-04-18 10:37 ` Ingo Molnar
  2007-04-18 10:53 ` Ingo Molnar
  0 siblings, 2 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:32 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c does
this:

 pthread_mutex_unlock( &stream->demux_lock );

 lprintf ("sched_yield\n");

 sched_yield();
 pthread_mutex_lock( &stream->demux_lock );

why is this done? CFS has definitely changed the yield implementation so
there could be some connection.

OTOH, in the 'hung' state none of the straces suggests any yield() call.

	Ingo
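To make the concern about that unlock / sched_yield() / lock sequence concrete, here is a minimal standalone sketch, not taken from xine-lib. One thread holds a lock except for a short yield window per round, while a second thread blocks in pthread_mutex_lock(). Default POSIX mutexes make no fairness promise, so how many rounds pass before the waiter gets in depends on the scheduler and the libc. All names here (demux_thread, seeker_thread, count_rounds_before_handoff) are hypothetical, and the loop is bounded so the program always terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for stream->demux_lock. */
static pthread_mutex_t demux_lock = PTHREAD_MUTEX_INITIALIZER;
/* Set by the seeker once it has managed to take demux_lock. */
static atomic_int seeker_got_lock;

/* Holds the lock except for a brief unlock/yield/relock window per round. */
static void *demux_thread(void *arg)
{
    long *rounds = arg;          /* in: round limit, out: rounds done */
    long limit = *rounds, done = 0;

    pthread_mutex_lock(&demux_lock);
    while (done < limit && !atomic_load(&seeker_got_lock)) {
        pthread_mutex_unlock(&demux_lock);
        sched_yield();           /* the seeker's only chance to get in */
        pthread_mutex_lock(&demux_lock);
        done++;
    }
    pthread_mutex_unlock(&demux_lock);
    *rounds = done;
    return NULL;
}

/* Plays the role of the seek path blocked in pthread_mutex_lock(). */
static void *seeker_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demux_lock);
    atomic_store(&seeker_got_lock, 1);
    pthread_mutex_unlock(&demux_lock);
    return NULL;
}

/* Returns how many yield rounds ran before the demux loop stopped,
 * either because the seeker got the lock or the limit was reached. */
long count_rounds_before_handoff(long max_rounds)
{
    pthread_t demux, seeker;
    long rounds = max_rounds;
    int rc;

    atomic_store(&seeker_got_lock, 0);
    rc = pthread_create(&demux, NULL, demux_thread, &rounds);
    assert(rc == 0);
    rc = pthread_create(&seeker, NULL, seeker_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(demux, NULL);
    pthread_join(seeker, NULL);
    return rounds;
}
```

Calling count_rounds_before_handoff(100000) from main() and printing the result under different schedulers gives a rough feel for how generous the yield window is. A result equal to the limit means the waiter never got in while the loop was running. The number varies from run to run, and this sketch does not claim to reproduce the Kaffeine hang.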
* Re: Kaffeine problem with CFS
  2007-04-18 10:32 ` Ingo Molnar
@ 2007-04-18 10:37 ` Ingo Molnar
  2007-04-18 10:49 ` Ingo Molnar
  2007-04-18 10:53 ` Ingo Molnar
  1 sibling, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:37 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Ingo Molnar <mingo@elte.hu> wrote:

> hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c
> does this:

plus it does this too:

 pthread_mutex_unlock( &stream->demux_lock );
 xine_usec_sleep(100000);
 pthread_mutex_lock( &stream->demux_lock );

this would explain the nanosleep() strace entries. But the task stuck on
demux_lock never gets the unlock event. Weird.

	Ingo
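The earlier backtrace of thread 9321 shows xine_usec_sleep() sitting in nanosleep(), which is why such a call would show up as nanosleep() in strace. As a rough illustration of that mapping, here is a sketch of a microsecond sleep helper built on nanosleep(). It is an assumed stand-in, not xine-lib's implementation, and usec_sleep() and timed_usec_sleep() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical stand-in for xine_usec_sleep(); not the library's code. */
void usec_sleep(unsigned long usec)
{
    struct timespec req = {
        .tv_sec  = (time_t)(usec / 1000000UL),
        .tv_nsec = (long)(usec % 1000000UL) * 1000L,
    };

    /* Restart with the remaining time if a signal interrupts the sleep. */
    while (nanosleep(&req, &req) == -1 && errno == EINTR)
        ;
}

/* Sleep for usec and return the measured elapsed time in microseconds. */
long timed_usec_sleep(unsigned long usec)
{
    struct timespec t0, t1;
    int rc;

    rc = clock_gettime(CLOCK_MONOTONIC, &t0);
    assert(rc == 0);
    usec_sleep(usec);
    rc = clock_gettime(CLOCK_MONOTONIC, &t1);
    assert(rc == 0);
    (void)rc;

    return (long)(t1.tv_sec - t0.tv_sec) * 1000000L
         + (t1.tv_nsec - t0.tv_nsec) / 1000L;
}
```

With a 100000 usec argument, as in the quoted demux.c snippet, a waiter on demux_lock gets a window of at least 100 msec per round to acquire the mutex, far wider than the sched_yield() window.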
* Re: Kaffeine problem with CFS
  2007-04-18 10:37 ` Ingo Molnar
@ 2007-04-18 10:49 ` Ingo Molnar
  0 siblings, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:49 UTC
To: Christoph Pfister
Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* Ingo Molnar <mingo@elte.hu> wrote:

> > hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c
> > does this:
>
> plus it does this too:
>
>  pthread_mutex_unlock( &stream->demux_lock );
>  xine_usec_sleep(100000);
>  pthread_mutex_lock( &stream->demux_lock );
>
> this would explain the nanosleep() strace entries. But the task stuck
> on demux_lock never gets the unlock event. Weird.

9303 is stuck here on demux_lock:

 #0 0xffffe410 in __kernel_vsyscall ()
 #1 0x4a2538ce in __lll_mutex_lock_wait () from /lib/libpthread.so.0
 #2 0x4a24f71c in _L_mutex_lock_79 () from /lib/libpthread.so.0
 #3 0x4a24f24d in pthread_mutex_lock () from /lib/libpthread.so.0
 #4 0xb79f64f9 in xine_play () from /usr/lib/libxine.so.1

the futex behind that mutex is at address 0xb07409e0, but the only sign
in the strace of that futex being touched is:

 9303 futex(0xb07409e0, FUTEX_WAIT, 2, NULL <unfinished ...>

no other event ever happens on futex 0xb07409e0. Other threads don't
touch it. Maybe thread 9324 is the owner of that mutex, and it's looping
somewhere that does xine_xmalloc_aligned(), with the lock held? It did
this:

 #0 0xffffe410 in __kernel_vsyscall ()
 #1 0x4a2539e1 in __lll_mutex_unlock_wake () from /lib/libpthread.so.0
 #2 0x4a2506f9 in _L_mutex_unlock_99 () from /lib/libpthread.so.0
 #3 0x4a250370 in __pthread_mutex_unlock_usercnt () from /lib/libpthread.so.0
 #4 0x4a2506f0 in pthread_mutex_unlock () from /lib/libpthread.so.0
 #5 0xb79fce5a in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #6 0xb7a4b90b in dvd_plugin_free_buffer (buf=0xb0745470) at input_dvd.c:570
 #7 0xb7a030a2 in QWidget::setUpdatesEnabled () from /usr/lib/libxine.so.1
 #8 0x4a24d2db in start_thread () from /lib/libpthread.so.0
 #9 0x4a05820e in clone () from /lib/libc.so.6

	Ingo
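One way to test the "maybe 9324 is the owner" guess is to look at the owner field glibc keeps inside the mutex. The sketch below relies on a loud assumption: it reads __data.__owner, an internal, undocumented member of glibc's NPTL pthread_mutex_t that is not part of any public API. It may not exist on other C libraries and may not be filled in for every mutex type or glibc version. mutex_owner_tid() and current_tid() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID of the caller, i.e. the number top and strace show. */
pid_t current_tid(void)
{
    long tid = syscall(SYS_gettid);
    assert(tid > 0);
    return (pid_t)tid;
}

/*
 * ASSUMPTION: glibc/NPTL layout. __data.__owner is an internal field that
 * records the TID of the thread holding the mutex, or 0 when it is free.
 * The read is unsynchronized, so treat the result as a debugging hint only.
 */
pid_t mutex_owner_tid(pthread_mutex_t *m)
{
    return (pid_t)m->__data.__owner;
}
```

On a live hung process the same field can be read without recompiling by printing __data.__owner of the mutex from gdb, provided glibc debug information is available. The value can then be compared against the thread IDs in the list above (9303, 9324, and so on).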
* Re: Kaffeine problem with CFS 2007-04-18 10:32 ` Ingo Molnar 2007-04-18 10:37 ` Ingo Molnar @ 2007-04-18 10:53 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 10:53 UTC (permalink / raw) To: Christoph Pfister Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper * Ingo Molnar <mingo@elte.hu> wrote: > hm. I've reviewed all uses of demux_lock. ./src/xine-engine/demux.c > does this: > > pthread_mutex_unlock( &stream->demux_lock ); > > lprintf ("sched_yield\n"); > > sched_yield(); > pthread_mutex_lock( &stream->demux_lock ); > > why is this done? CFS has definitely changed the yield implementation > so there could be some connection. > > OTOH, in the 'hung' state none of the straces suggests any yield() > call. yeah, there's no yield() call in any of the straces so i'd exclude this as a possibility . Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS [not found] ` <19a3b7a80704180555q4e0b26d5x54bbf34b4cd9d33e@mail.gmail.com> @ 2007-04-18 13:05 ` S.Çağlar Onur 2007-04-18 13:21 ` Christoph Pfister 1 sibling, 0 replies; 713+ messages in thread From: S.Çağlar Onur @ 2007-04-18 13:05 UTC (permalink / raw) To: Christoph Pfister Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper On Wed, 18 Apr 2007, Christoph Pfister wrote: > > Okay - so here are some results (it's strange that gdb goes nuts > > inside the xine_play call). I have three bts (seems to be fairly easy > > to reproduce that behaviour over here): Twice while playing an audio > > cd and once while playing a normal file. The hang usually ends if you > > wait long enough (something around 30 secs over here). I can confirm this: the freeze ends after a wait of ~20-30 secs if Kaffeine is the only active process. I didn't notice that earlier, most probably because the CPU was busy compiling a kernel at the time... Now i'm testing Ingo's msleep patch + xine-lib-1.1.6... Cheers -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS [not found] ` <19a3b7a80704180555q4e0b26d5x54bbf34b4cd9d33e@mail.gmail.com> 2007-04-18 13:05 ` S.Çağlar Onur @ 2007-04-18 13:21 ` Christoph Pfister 2007-04-18 13:25 ` S.Çağlar Onur 2007-04-18 15:08 ` Ingo Molnar 1 sibling, 2 replies; 713+ messages in thread From: Christoph Pfister @ 2007-04-18 13:21 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper 2007/4/18, Christoph Pfister <christophpfister@gmail.com>: > [ Sorry for accidentally dropping CCs ] > > 2007/4/18, Christoph Pfister <christophpfister@gmail.com>: > > 2007/4/18, Ingo Molnar <mingo@elte.hu>: > > > > > > * Christoph Pfister <christophpfister@gmail.com> wrote: > > > > > > > Or I could try playing around a bit with your patchset and trying to > > > > reproduce it over here. Because I already have debug builds for > > > > xine-lib and compiling a new kernel can take place in the background > > > > it wouldn't be much effort for me. > > > > > > that would be great :) Here are the URLs for it. CFS is based on > > > v2.6.21-rc7: > > > > > > http://kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.21-rc7.tar.bz2 > > > > > > And the CFS patch is at: > > > > > > http://people.redhat.com/mingo/cfs-scheduler/sched-cfs-v2.patch > > > > > > rebuild your kernel as usual and boot into it. No extra configuration is > > > needed, you'll get CFS by default. > > > > > > if this kernel builds/boots fine for you then you might also want to > > > send me a quick note about how it feels, interactivity-wise. And of > > > course i'm interested in any sort of feedback about problems as well. > > > I'd like to make CFS as media-playback friendly as possible, so if > > > there's any problem in that area it would be nice for me to know about > > > it as soon as possible. > > > > > > Ingo > > > > Okay - so here are some results (it's strange that gdb goes nuts > > inside the xine_play call). 
I have three bts (seems to be fairly easy > > to reproduce that behaviour over here): Twice while playing an audio > > cd and once while playing a normal file. The hang usually ends if you > > wait long enough (something around 30 secs over here). <big snip> > > Christoph > > > > > > PS: Haven't analyzed them yet - but doing so now :-) > > Ok - one nice thing: In all those bts demux_loop is at demux.c:285 - > meaning that demux_lock is held and xine_play is waiting for it ... > The lock should be temporarily released with a sched_yield so that > the main thread can access it. As you wrote the implementation of this > function seems to have changed a bit - so I'll replace it with a short > sleep and try again ... > > Christoph Replacing the sched_yield in demux.c with an usleep(10) stopped those seeking hangs here (at least I was able to pull the slider back and forth during 2 mins without trouble compared to the few secs I need earlier to get a hang). Christoph ^ permalink raw reply [flat|nested] 713+ messages in thread
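[ Editor's illustration, not part of the original message. ] The workaround reported above swaps the sched_yield() between the unlock and the lock for a short sleep, so that a thread waiting on the mutex gets a real window to take it. Below is a minimal user-space sketch of that hand-off pattern. The helper name release_lock_briefly() and the idea of passing the window length in microseconds are assumptions made for this example; this is not code from xine-lib.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper (not from xine-lib): give other threads a window to
 * take `lock` by releasing it, sleeping for `usec` microseconds, and then
 * re-acquiring it. The caller must hold `lock` on entry and holds it again
 * when the function returns 0.
 */
static int release_lock_briefly(pthread_mutex_t *lock, long usec)
{
	struct timespec ts;
	int err;

	assert(lock != NULL && usec >= 0);
	ts.tv_sec = usec / 1000000L;
	ts.tv_nsec = (usec % 1000000L) * 1000L;

	err = pthread_mutex_unlock(lock);
	if (err)
		return err;

	/* Sleep instead of sched_yield(); resume if a signal cuts the sleep short. */
	while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
		;

	return pthread_mutex_lock(lock);
}
```

In a demuxer-style loop the call would sit where the unlock / sched_yield() / lock sequence used to be. A fixed sleep trades a little latency for a dependable hand-off; it does not depend on how a given scheduler treats a yielding task.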
* Re: Kaffeine problem with CFS 2007-04-18 13:21 ` Christoph Pfister @ 2007-04-18 13:25 ` S.Çağlar Onur 2007-04-18 15:48 ` Ingo Molnar 2007-04-18 15:08 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: S.Çağlar Onur @ 2007-04-18 13:25 UTC (permalink / raw) To: Christoph Pfister Cc: Ingo Molnar, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

On Wed, 18 Apr 2007, Christoph Pfister wrote:
> Replacing the sched_yield in demux.c with an usleep(10) stopped those
> seeking hangs here (at least I was able to pull the slider back and
> forth during 2 mins without trouble compared to the few secs I need
> earlier to get a hang).

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3785,7 +3785,7 @@ asmlinkage long sys_sched_yield(void)
 	_raw_spin_unlock(&rq->lock);
 	preempt_enable_no_resched();
 
-	schedule();
+	msleep(1);
 
 	return 0;
 }

which Ingo sends me to try also has the same effect on me. I cannot reproduce hangs anymore with that patch applied top of CFS while one console checks out SVN repos and other one compiles a small test software.

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 13:25 ` S.Çağlar Onur @ 2007-04-18 15:48 ` Ingo Molnar 2007-04-18 16:07 ` William Lee Irwin III 2007-04-18 21:08 ` S.Çağlar Onur 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 15:48 UTC (permalink / raw) To: S.Çağlar Onur Cc: Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> -	schedule();
> +	msleep(1);

> which Ingo sends me to try also has the same effect on me. I cannot
> reproduce hangs anymore with that patch applied top of CFS while one
> console checks out SVN repos and other one compiles a small test
> software.

great! Could you please unapply the hack above and try the proper fix below, does this one solve the hangs too?

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq
 
 /*
  * sched_yield() support is very simple via the rbtree, we just
- * dequeue and enqueue the task, which causes the task to
- * roundrobin to the end of the tree:
+ * dequeue the task and move it to the rightmost position, which
+ * causes the task to roundrobin to the end of the tree.
  */
 static void requeue_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
-	enqueue_task_fair(rq, p);
+	/*
+	 * Temporarily insert at the last position of the tree:
+	 */
+	p->fair_key = LLONG_MAX;
+	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
+
+	/*
+	 * Update the key to the real value, so that when all other
+	 * tasks from before the rightmost position have executed,
+	 * this task is picked up again:
+	 */
+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
@@ -380,7 +391,10 @@ static void task_tick_fair(struct rq *rq
 		 * Dequeue and enqueue the task to update its
 		 * position within the tree:
 		 */
-		requeue_task_fair(rq, curr);
+		dequeue_task_fair(rq, curr);
+		curr->on_rq = 0;
+		enqueue_task_fair(rq, curr);
+		curr->on_rq = 1;
 
 		/*
 		 * Reschedule if another task tops the current one.

^ permalink raw reply [flat|nested] 713+ messages in thread
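[ Editor's illustration, not part of the original message. ] The patch above makes a yielding task land at the rightmost position of the key-ordered tree by enqueueing it with the largest possible key, and then writes the real key back without moving the node. The sketch below shows the same trick in plain user-space C. It is a simplified model under stated assumptions: a sorted singly linked list stands in for the rbtree, the types and function names are hypothetical, and the real key is simply saved and restored rather than recomputed from fair_clock, wait_runtime and nice_offset as the patch does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task and runqueue types. */
struct task {
	long long key;		/* sort key; the smallest key runs first */
	struct task *next;
};

struct queue {
	struct task *head;	/* leftmost task, i.e. the next one to run */
};

/* Insert p behind every queued task whose key is <= p->key. */
static void enqueue(struct queue *q, struct task *p)
{
	struct task **link = &q->head;

	while (*link && (*link)->key <= p->key)
		link = &(*link)->next;
	p->next = *link;
	*link = p;
}

/* Unlink p from the queue if it is queued. */
static void dequeue(struct queue *q, struct task *p)
{
	struct task **link = &q->head;

	while (*link && *link != p)
		link = &(*link)->next;
	if (*link)
		*link = p->next;
	p->next = NULL;
}

/*
 * Yield: park p at the far end by enqueueing it with the largest
 * possible key, then restore the real key without moving the node.
 */
static void yield_task(struct queue *q, struct task *p)
{
	long long real_key = p->key;

	assert(q != NULL && p != NULL);
	dequeue(q, p);
	p->key = LLONG_MAX;
	enqueue(q, p);
	p->key = real_key;
}
```

After yield_task() the queue is deliberately no longer sorted by key at the yielder's position: every task that was queued ahead of it runs first, which is the round-robin effect the patch comment describes.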
* Re: Kaffeine problem with CFS 2007-04-18 15:48 ` Ingo Molnar @ 2007-04-18 16:07 ` William Lee Irwin III 2007-04-18 16:14 ` Ingo Molnar 2007-04-18 21:08 ` S.Çağlar Onur 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-18 16:07 UTC (permalink / raw) To: Ingo Molnar Cc: S.Çağlar Onur, Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

On Wed, Apr 18, 2007 at 05:48:11PM +0200, Ingo Molnar wrote:
>  static void requeue_task_fair(struct rq *rq, struct task_struct *p)
>  {
>  	dequeue_task_fair(rq, p);
>  	p->on_rq = 0;
> -	enqueue_task_fair(rq, p);
> +	/*
> +	 * Temporarily insert at the last position of the tree:
> +	 */
> +	p->fair_key = LLONG_MAX;
> +	__enqueue_task_fair(rq, p);
>  	p->on_rq = 1;
> +
> +	/*
> +	 * Update the key to the real value, so that when all other
> +	 * tasks from before the rightmost position have executed,
> +	 * this task is picked up again:
> +	 */
> +	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;

At this point you might as well call the requeue operation something having to do with yield. I also suspect what goes on during the timer tick may eventually become something different from requeueing the current task, and furthermore class-dependent.

-- wli

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 16:07 ` William Lee Irwin III @ 2007-04-18 16:14 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 16:14 UTC (permalink / raw) To: William Lee Irwin III Cc: S.Çağlar Onur, Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper

* William Lee Irwin III <wli@holomorphy.com> wrote:

> At this point you might as well call the requeue operation something
> having to do with yield. [...]

agreed - i've just done a requeue_task -> yield_task rename in my tree. (patch below)

> [...] I also suspect what goes on during the timer tick may eventually
> become something different from requeueing the current task, and
> furthermore class-dependent.

it already is, scheduler tick processing is done in class->task_tick().

	Ingo

---
 include/linux/sched.h |    2 +-
 kernel/sched.c        |    7 +------
 kernel/sched_fair.c   |    4 ++--
 kernel/sched_rt.c     |    2 +-
 4 files changed, 5 insertions(+), 10 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -796,7 +796,7 @@ struct sched_class {
 
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p);
-	void (*requeue_task) (struct rq *rq, struct task_struct *p);
+	void (*yield_task) (struct rq *rq, struct task_struct *p);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -560,11 +560,6 @@ static void dequeue_task(struct rq *rq,
 	p->on_rq = 0;
 }
 
-static void requeue_task(struct rq *rq, struct task_struct *p)
-{
-	p->sched_class->requeue_task(rq, p);
-}
-
 /*
  * __normal_prio - return the priority that is based on the static prio
  */
@@ -3773,7 +3768,7 @@ asmlinkage long sys_sched_yield(void)
 	schedstat_inc(rq, yld_cnt);
 	if (rq->nr_running == 1)
 		schedstat_inc(rq, yld_act_empty);
-	requeue_task(rq, current);
+	current->sched_class->yield_task(rq, current);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -273,7 +273,7 @@ static void dequeue_task_fair(struct rq
  * dequeue the task and move it to the rightmost position, which
  * causes the task to roundrobin to the end of the tree.
  */
-static void requeue_task_fair(struct rq *rq, struct task_struct *p)
+static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
@@ -509,7 +509,7 @@ static void task_init_fair(struct rq *rq
 struct sched_class fair_sched_class __read_mostly = {
 	.enqueue_task = enqueue_task_fair,
 	.dequeue_task = dequeue_task_fair,
-	.requeue_task = requeue_task_fair,
+	.yield_task = yield_task_fair,
 
 	.check_preempt_curr = check_preempt_curr_fair,
 
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -165,7 +165,7 @@ static void task_init_rt(struct rq *rq,
 static struct sched_class rt_sched_class __read_mostly = {
 	.enqueue_task = enqueue_task_rt,
 	.dequeue_task = dequeue_task_rt,
-	.requeue_task = requeue_task_rt,
+	.yield_task = requeue_task_rt,
 
 	.check_preempt_curr = check_preempt_curr_rt,
 

^ permalink raw reply [flat|nested] 713+ messages in thread
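[ Editor's illustration, not part of the original message. ] The rename patch above replaces the requeue_task() wrapper in the core with a direct call through the task's class, current->sched_class->yield_task(rq, current). The sketch below models that dispatch in user-space C. The struct layouts, the per-class counters and the do_yield() name are assumptions for this example; only the idea of a per-policy table with a yield_task hook comes from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct rq;
struct task;

/* Hypothetical, cut-down model of a per-policy operations table. */
struct sched_class {
	const char *name;
	void (*yield_task)(struct rq *rq, struct task *p);
};

struct task {
	const struct sched_class *sched_class;
};

struct rq {
	unsigned int fair_yields;	/* bookkeeping for this sketch only */
	unsigned int rt_yields;
};

/* Each policy supplies its own yield behaviour; here they only count calls. */
static void yield_task_fair(struct rq *rq, struct task *p)
{
	(void)p;
	rq->fair_yields++;
}

static void yield_task_rt(struct rq *rq, struct task *p)
{
	(void)p;
	rq->rt_yields++;
}

static const struct sched_class fair_class = {
	.name = "fair",
	.yield_task = yield_task_fair,
};

static const struct sched_class rt_class = {
	.name = "rt",
	.yield_task = yield_task_rt,
};

/* Core code: no policy knowledge, just dispatch through the task's class. */
static void do_yield(struct rq *rq, struct task *current_task)
{
	assert(current_task->sched_class != NULL);
	current_task->sched_class->yield_task(rq, current_task);
}
```

The point of the indirection is that the core never needs to know how a given policy implements yielding; adding a policy means adding another table, not another branch in the core.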
* Re: Kaffeine problem with CFS 2007-04-18 15:48 ` Ingo Molnar 2007-04-18 16:07 ` William Lee Irwin III @ 2007-04-18 21:08 ` S.Çağlar Onur 2007-04-18 21:12 ` Ingo Molnar 2007-04-20 19:31 ` Bill Davidsen 1 sibling, 2 replies; 713+ messages in thread From: S.Çağlar Onur @ 2007-04-18 21:08 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper On Wed, 18 Apr 2007, Ingo Molnar wrote: > * S.Çağlar Onur <caglar@pardus.org.tr> wrote: > > - schedule(); > > + msleep(1); > > > > which Ingo sends me to try also has the same effect on me. I cannot > > reproduce hangs anymore with that patch applied top of CFS while one > > console checks out SVN repos and other one compiles a small test > > software. > > great! Could you please unapply the hack above and try the proper fix > below, does this one solve the hangs too? Instead of that one, i tried CFSv3 and i cannot reproduce the hang anymore, Thanks!... Cheers -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 21:08 ` S.Çağlar Onur @ 2007-04-18 21:12 ` Ingo Molnar 2007-04-20 19:31 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 21:12 UTC (permalink / raw) To: S.Çağlar Onur Cc: Christoph Pfister, linux-kernel, Michael Lothian, Linus Torvalds, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper * S.Çağlar Onur <caglar@pardus.org.tr> wrote: > > great! Could you please unapply the hack above and try the proper > > fix below, does this one solve the hangs too? > > Instead of that one, i tried CFSv3 and i cannot reproduce the hang > anymore, Thanks!... cool, thanks for the quick turnaround! Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 21:08 ` S.Çağlar Onur 2007-04-18 21:12 ` Ingo Molnar @ 2007-04-20 19:31 ` Bill Davidsen 2007-04-21 8:36 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Bill Davidsen @ 2007-04-20 19:31 UTC (permalink / raw) To: caglar Cc: Ingo Molnar, Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper S.Çağlar Onur wrote: > On Wed, 18 Apr 2007, Ingo Molnar wrote: >> * S.Çağlar Onur <caglar@pardus.org.tr> wrote: >>> - schedule(); >>> + msleep(1); >>> >>> which Ingo sends me to try also has the same effect on me. I cannot >>> reproduce hangs anymore with that patch applied top of CFS while one >>> console checks out SVN repos and other one compiles a small test >>> software. >> great! Could you please unapply the hack above and try the proper fix >> below, does this one solve the hangs too? > > Instead of that one, i tried CFSv3 and i cannot reproduce the hang anymore, > Thanks!... > And that explains why CFS-v3 on 21-rc7-git3 wouldn't show me the hang. As a matter of fact, nothing I did showed any bad behavior! Note that I was doing actual badly behaved things which do sometimes glitch the standard scheduler, not running benchmarks. This scheduler is boring, everything works. I am going to try some tests on a uniprocessor, though, I have been running everything on either SMP or HT CPUs. But so far it looks fine. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-20 19:31 ` Bill Davidsen @ 2007-04-21 8:36 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-21 8:36 UTC (permalink / raw) To: Bill Davidsen Cc: caglar, Christoph Pfister, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper * Bill Davidsen <davidsen@tmr.com> wrote: > > Instead of that one, i tried CFSv3 and i cannot reproduce the hang > > anymore, Thanks!... > > And that explains why CFS-v3 on 21-rc7-git3 wouldn't show me the hang. > As a matter of fact, nothing I did showed any bad behavior! Note that > I was doing actual badly behaved things which do sometimes glitch the > standard scheduler, not running benchmarks. > > This scheduler is boring, everything works. [...] hehe :) Having a 'boring' scheduler in the end is the main goal :) > [...] I am going to try some tests on a uniprocessor, though, I have > been running everything on either SMP or HT CPUs. But so far it looks > fine. yeah, please do - while i do test UP frequently, most of my CFS testing was on SMP. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Kaffeine problem with CFS 2007-04-18 13:21 ` Christoph Pfister 2007-04-18 13:25 ` S.Çağlar Onur @ 2007-04-18 15:08 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 15:08 UTC (permalink / raw) To: Christoph Pfister Cc: S.Çağlar Onur, linux-kernel, Michael Lothian, Christophe Thommeret, Jurgen Kofler, Ulrich Drepper * Christoph Pfister <christophpfister@gmail.com> wrote: > Replacing the sched_yield in demux.c with an usleep(10) stopped those > seeking hangs here (at least I was able to pull the slider back and > forth during 2 mins without trouble compared to the few secs I need > earlier to get a hang). great - thanks for figuring it out! Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (8 preceding siblings ...) 2007-04-14 15:09 ` S.Çağlar Onur @ 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey ` (2 more replies) 2007-04-15 12:29 ` Esben Nielsen ` (4 subsequent siblings) 14 siblings, 3 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-15 3:27 UTC (permalink / raw) To: Ingo Molnar, ck list, Peter Williams, Bill Huey Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. The casual observer will be completely confused by what on earth has happened here so let me try to demystify things for them. 1. I tried in vain some time ago to push a working extensible pluggable cpu scheduler framework (based on wli's work) for the linux kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as being absolutely the wrong approach and that we should never do that. Oddly enough the linux-kernel-mailing list was -dead- at the time and the discussion did not make it to the mailing list. Every time I've tried to forward it to the mailing list the spam filter decided to drop it so most people have not even seen this original veto-forever discussion. 2.
Since then I've been thinking/working on a cpu scheduler design that takes away all the guesswork out of scheduling and gives very predictable, as fair as possible, cpu distribution and latency while preserving as solid interactivity as possible within those confines. For weeks now, Ingo has said that the interactivity regressions were showstoppers and we should address them, never mind the fact that the so-called regressions were purely "it slows down linearly with load" which to me is perfectly desirable behaviour. While this was not perma-vetoed, I predicted pretty accurately your intent was to veto it based on this. People kept claiming scheduling problems were few and far between but what was really happening is users were terrified of lkml and instead used 1. windows and 2. 2.4 kernels. The problems were there. So where are we now? Here is where your latest patch comes in. As a solution to the many scheduling problems we finally all agree exist, you propose a patch that adds 1. a limited pluggable framework and 2. a fairness based cpu scheduler policy... o_O So I should be happy at last now that the things I was promoting you are also promoting, right? Well I'll fill in the rest of the gaps and let other people decide how I should feel. > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and having trips in and out of hospitals for my condition. I appreciate greatly the sympathy and patience from people in this regard. However at one stage I virtually begged for support with my attempts and help with the code. Dmitry Adamushko is the only person who actually helped me with the code in the interim, while others poked sticks at it. Sure the sticks helped at times but the sticks always seemed to have their ends kerosene doused and flaming for reasons I still don't get. No other help was forthcoming. 
Now that you're agreeing my direction was correct you've done the usual Linux kernel thing - ignore all my previous code and write your own version. Oh well, that I've come to expect; at least you get a copyright notice in the bootup and somewhere in the comments give me credit for proving it's possible. Let's give some other credit here too. William Lee Irwin provided the major architecture behind plugsched at my request and I simply finished the work and got it working. He is also responsible for many IRC discussions I've had about cpu scheduling fairness, designs, programming history and code help. Even though he did not contribute code directly to SD, his comments have been invaluable. So let's look at the code. kernel/sched.c kernel/sched_fair.c kernel/sched_rt.c It turns out this is not a pluggable cpu scheduler framework at all, and I guess you didn't really promote it as such. It's a "modular scheduler core". Which means you moved code from sched.c into sched_fair.c and sched_rt.c. This abstracts out each _scheduling policy's_ functions into struct sched_class and allows each scheduling policy's functions to be in a separate file etc. Ok so what it means is that instead of whole cpu schedulers being able to be plugged into this framework we can plug in only cpu scheduling policies.... hrm... So let's look on -#define SCHED_NORMAL 0 Ok once upon a time we rename SCHED_OTHER which every other unix calls the standard policy 99.9% of applications used into a more meaningful name, SCHED_NORMAL. That's fine since all it did was change the description internally for those reading the code. Let's see what you've done now: +#define SCHED_FAIR 0 You've renamed it again. This is, I don't know what exactly to call it, but an interesting way of making it look like there is now more choice. Well, whatever you call it, everything in linux spawned from init without specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed SCHED_NORMAL and now SCHED_FAIR. 
You encouraged me to create a sched_sd.c to add onto your design as well. Well, what do I do with that? I need to create another scheduling policy for that code to even be used. A separate scheduling policy requires a userspace change to even benefit from it. Even if I make that sched_sd.c patch, people cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to read SCHED_SD 0 or similar. The same goes for original staircase cpusched, nicksched, zaphod, spa_ws, ebs and so on. So what you've achieved with your patch is - replaced the current scheduler with another one and moved it into another file. There is no choice, and no pluggability, just code trumping. Do I support this? In this form.... no. It's not that I don't like your new scheduler. Heck it's beautiful like most of your _serious_ code. It even comes with a catchy name that's bound to give people hard-ons (even though many schedulers aim to be completely fair, yours has been named that for maximum selling power). The complaint I have is that you are not providing quite what you advertise (on the modular front), or perhaps you're advertising it as such to make it look more appealing; I'm not sure. Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything different, and that this is anything other than your NIH (Not Invented Here) cpu scheduling policy rewrite which will probably end up taking its position in mainline after yet another truckload of regression/performance tests and so on. I haven't seen an awful lot of comparisons with SD yet, just people jumping on your bandwagon which is fine I guess. Maybe a few tiny tests showing less than 5% variation in their fairness from what I can see. Either way, I already feel you've killed off SD... like pretty much everything else I've done lately. At least I no longer have to try and support my code mostly by myself. In the interest of putting aside any ego concerns since this is about linux and not me... Because...
You are a hair's breadth away from producing something that I would support, which _does_ do what you say and produces the pluggability we're all begging for with only tiny changes to the code you've already done. Make Kconfig let you choose which sched_*.c gets built into the kernel, and make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig and even choose one of the alternative built in ones with boot parameters; your code has more clout than mine will (ie do exactly what plugsched does). Then we can have 7 schedulers in linux kernel within a few weeks. Oh no! This is the very thing Linus didn't want in specialisation with the cpu schedulers! Does this mean this idea will be vetoed yet again? In all likelihood, yes. I guess I have lots to put into -ck still... sigh. > Ingo -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
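[ Editor's illustration, not part of the original message. ] The proposal above has two parts: a build-time choice of the default scheduler, and a boot parameter that can select one of the other built-in schedulers. A minimal user-space sketch of that selection logic follows. Everything named here is hypothetical: the CONFIG_DEFAULT_CPUSCHED macro, the list of built-in names and select_sched() are assumptions for this example and are taken neither from plugsched nor from the CFS patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build-time default; in a real tree Kconfig would set this. */
#ifndef CONFIG_DEFAULT_CPUSCHED
#define CONFIG_DEFAULT_CPUSCHED "fair"
#endif

struct cpusched {
	const char *name;
};

/* Schedulers assumed to be built in for this sketch. */
static const struct cpusched builtin_scheds[] = {
	{ "fair" },
	{ "sd" },
	{ "staircase" },
};

/* Look up a built-in scheduler by name; NULL if it is not built in. */
static const struct cpusched *find_sched(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(builtin_scheds) / sizeof(builtin_scheds[0]); i++)
		if (strcmp(builtin_scheds[i].name, name) == 0)
			return &builtin_scheds[i];
	return NULL;
}

/*
 * Pick the scheduler named by a boot parameter value; fall back to the
 * build-time default when the parameter is absent or names an unknown one.
 */
static const struct cpusched *select_sched(const char *boot_arg)
{
	const struct cpusched *s = boot_arg ? find_sched(boot_arg) : NULL;

	if (!s)
		s = find_sched(CONFIG_DEFAULT_CPUSCHED);
	assert(s != NULL);	/* the default must be one of the built-ins */
	return s;
}
```

With this shape, whichever entry select_sched() returns would back policy 0, so unmodified userspace gets the chosen scheduler without asking for a new policy number.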
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas @ 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 16:11 ` Bernd Eckenfels 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 713+ messages in thread From: Bill Huey @ 2007-04-15 5:16 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote: ... > Now that you're agreeing my direction was correct you've done the usual Linux > kernel thing - ignore all my previous code and write your own version. Oh > well, that I've come to expect; at least you get a copyright notice in the > bootup and somewhere in the comments give me credit for proving it's > possible. Let's give some other credit here too. William Lee Irwin provided > the major architecture behind plugsched at my request and I simply finished > the work and got it working. He is also responsible for many IRC discussions > I've had about cpu scheduling fairness, designs, programming history and code > help. Even though he did not contribute code directly to SD, his comments > have been invaluable. Hello folks, I think the main failure I see here is that Con wasn't included in this design or privately in review process. There could have been better co-ownership of the code. This could also have been done openly on lkml (since this is kind of what this medium is about to significant degree) so that consensus can happen (Con can be reasoned with). It would have achieved the same thing but probably more smoothly if folks just listened, considered an idea and then, in this case, created something that would allow for experimentation from outsiders in a fluid fashion. 
If these issues aren't fixed, you're going to be stuck with the same kind of creeping elitism that has gradually killed the FreeBSD project and other BSDs. I can't comment on the code implementation. I'm focused on other things now that I'm at NetApp and I can't help out as much as I could. Being former BSDi, I had a first hand account of these issues as they played out. A development process like this is likely to exclude smart people from wanting to contribute to Linux and folks should be conscious about these issues. It's basically a lot of code and concept that at least two individuals have worked on (wli and con) only to have it be rejected and then suddenly replaced by code from a community gatekeeper. In this case, this results in both Con and Bill Irwin being woefully underutilized. If I were one of these people, I'd be mighty pissed. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 9:51 ` Bill Huey 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 8:44 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > Hello folks, > > I think the main failure I see here is that Con wasn't included in > this design or privately in review process. There could have been > better co-ownership of the code. This could also have been done openly > on lkml [...] Bill, you come from a BSD background and you are still relatively new to Linux development, so i dont at all fault you for misunderstanding this situation, and fortunately i have a really easy resolution for your worries: i did exactly that! :) i wrote the first line of code of the CFS patch this week, 8am Wednesday morning, and released it to lkml 62 hours later, 10pm on Friday. (I've listed the file timestamps of my backup patches further below, for all the fine details.) I prefer such early releases to lkml _a lot_ more than any private review process. I released the CFS code about 6 hours after i thought "okay, this looks pretty good" and i spent those final 6 hours on testing it (making sure it doesnt blow up on your box, etc.), in the final 2 hours i showed it to two folks i could reach on IRC (Arjan and Thomas) and on various finishing touches. It doesnt get much faster than that and i definitely didnt want to sit on it even one day longer because i very much thought that Con and others should definitely see this work! 
And i very much credited (and still credit) Con for the whole fairness angle:

|| i'd like to give credit to Con Kolivas for the general approach here:
|| he has proven via RSDL/SD that 'fair scheduling' is possible and that
|| it results in better desktop scheduling. Kudos Con!

the 'design consultation' phase you are talking about is _NOW_! :)

I got the v1 code out to Con, to Mike and to many others ASAP. That's how you are able to comment on this thread and be part of the development process to begin with, in a 'private consultation' setup you'd not have had any opportunity to see _any_ of this.

In the BSD space there seem to be more 'political' mechanisms for development, but Linux is truly about doing things out in the open, and doing it immediately.

Okay? ;-)

Here's the timestamps of all my backups of the patch, from its humble 4K beginnings to the 100K first-cut v1 result:

-rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch
-rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch

i hope this helps :) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 9:51 ` Bill Huey 2007-04-15 10:39 ` Pekka Enberg 0 siblings, 1 reply; 713+ messages in thread From: Bill Huey @ 2007-04-15 9:51 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote: > I prefer such early releases to lkml _alot_ more than any private review > process. I released the CFS code about 6 hours after i thought "okay, > this looks pretty good" and i spent those final 6 hours on testing it > (making sure it doesnt blow up on your box, etc.), in the final 2 hours > i showed it to two folks i could reach on IRC (Arjan and Thomas) and on > various finishing touches. It doesnt get much faster than that and i > definitely didnt want to sit on it even one day longer because i very > much thought that Con and others should definitely see this work! > > And i very much credited (and still credit) Con for the whole fairness > angle: > > || i'd like to give credit to Con Kolivas for the general approach here: > || he has proven via RSDL/SD that 'fair scheduling' is possible and that > || it results in better desktop scheduling. Kudos Con! > > the 'design consultation' phase you are talking about is _NOW_! :) > > I got the v1 code out to Con, to Mike and to many others ASAP. That's > how you are able to comment on this thread and be part of the > development process to begin with, in a 'private consultation' setup > you'd not have had any opportunity to see _any_ of this. > > In the BSD space there seem to be more 'political' mechanisms for > development, but Linux is truly about doing things out in the open, and > doing it immediately. I can't even begin to talk about how screwed up BSD development is. Maybe another time privately. 
Ok, Linux development and inclusiveness can be improved. I'm not trying to "call you out" (slang for accusing you with the sole intention to call you crazy in a highly confrontative manner). This is discussed publicly here to bring this issue to light and to open a communication channel as a means to resolve it. > Okay? ;-) It's cool. We're still getting to know each other professionally and it's okay to a certain degree to have a communication disconnect but only as long as it clears. Your productivity is amazing BTW. But here's the problem, there's this perception that NIH is the default mentality here in Linux. Con feels that this kind of action is intentional and has a malicious quality to it as a means of "churn squatting" sections of the kernel tree. The perception here is that there is this expectation that sections of the Linux kernel are intentionally "churn squatted" to prevent any other ideas from creeping in other than those of the owner of that subsystem (VM, scheduling, etc...) because of lack of modularity in the kernel. This isn't an API question but a question possibly of general code quality and how maintenance () of it can . This was predicted by folks and then this perception was *realized* when you wrote the equivalent kind of code that has technical overlap with SDL (this is just one dry example). To a person that is writing new code for Linux, having one of the old guard write equivalent code to that of a newcomer has the effect of displacing that person both with regard to the code and the responsibility for it. When this happens over and over again and folks get annoyed by it, Linux development starts to seem elitist. I know this because I heard (read) Con's IRC chats about these matters all of the time. This is not just his view but a view of other kernel folks that differing views as to. The closing talk at OLS 2006 was highly disturbing in many ways. 
It went "Christoph" is right, everybody else is wrong, which sends a highly negative message to new kernel developers that, say, don't work for RH directly or any of the other mainstream Linux companies. After a while, it starts seeming like this kind of behavior is completely intentional and that Linux is full of arrogant bastards. What I would have done here was to contact Peter Williams, Bill Irwin and Con about what you're doing and reach a common consensus about how to create something that would be inclusive of all of their ideas. Discussions can get technically heated but that's ok, the discussion is happening and it brings down the wall of this perception. Bill and Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra. It might be very useful, it might not be. Folks are all stubborn about their ideas and hold on to them for dear life. Effective leaders can deconstruct this hostility and animosity. I don't claim to be one. Because of past hostility to something like schedplugin, the hostility and terseness of responses can be perceived simply as "I'm right, you're wrong" which is condescending. This affects discussion and outright destroys a constructive process if this happens continually since it reinforces that view of "You're an outsider, we don't care about you". Nobody is listening to each other at that point, folks get pissed. Then they think about "I'm going to NIH this person with patch X because he/she did the same here" which is dysfunctional. Oddly enough, sometimes you're the best person to get a new idea into the tree. What's not happening here is communication. That takes sensitivity, careful listening which is a difficult skill, and then understanding of the characters involved to unify creative energies. That's a very difficult thing to do for folks that are used to working solo. It takes time to develop trust in those relationships so that a true collaboration can happen. 
I know that there is a lot of creativity in folks like Con and Bill. It would be wise to develop a dialog with them to see if they can offload some of your work for you (we all know you're really busy) yet have you be a key facilitator of their and your ideas. That's a really tough thing to do and it requires practice. Just imagine (assuming they can follow through) what could have positively happened if their collective knowledge was leveraged better. It's not all clear and rosy, but I think these people are more on your side than you might realize and it might be a good thing to discover that. This is tough because I know the personalities involved and I know kind of how people function and malfunction in this discussion on a personal basis. [We can continue privately. This is not just about you but is applicable to open source development in general] The tone of this email is intellectually critical (not meant as a personal attack) and calm. If I'm otherwise, then I'm a bastard. :) bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:51 ` Bill Huey @ 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 15:16 ` Gene Heskett 0 siblings, 2 replies; 713+ messages in thread From: Pekka Enberg @ 2007-04-15 10:39 UTC (permalink / raw) To: hui Bill Huey Cc: Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > The perception here is that there is this expectation that > sections of the Linux kernel are intentionally "churn squatted" to prevent > any other ideas from creeping in other than of the owner of that subsystem Strangely enough, my perception is that Ingo is simply trying to address the issues Mike's testing discovered in RSDL and SD. It's not surprising Ingo made it a separate patch set as Con has repeatedly stated that the "problems" are in fact by design and won't be fixed. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg @ 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg ` (2 more replies) 2007-04-15 15:16 ` Gene Heskett 1 sibling, 3 replies; 713+ messages in thread From: Willy Tarreau @ 2007-04-15 12:45 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote: > On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >The perception here is that there is this expectation that > >sections of the Linux kernel are intentionally "churn squatted" to prevent > >any other ideas from creeping in other than of the owner of that subsystem > > Strangely enough, my perception is that Ingo is simply trying to > address the issues Mike's testing discovered in RSDL and SD. It's not > surprising Ingo made it a separate patch set as Con has repeatedly > stated that the "problems" are in fact by design and won't be fixed. That's not exactly the problem. There are people who work very hard to try to improve some areas of the kernel. They progress slowly, and acquire more and more skills. Sometimes they feel like they need to change some concepts and propose those changes which are required for them to go further, or to develop faster. Those are rejected. So they are constrained to work in a delimited perimeter from which it is difficult for them to escape. Then, the same person who rejected their changes comes with something shiny new, better and which took him far less time. But he sort of broke the rules because what was forbidden to the first persons is suddenly permitted. Maybe for very good reasons, I'm not discussing that. The good reason should have been valid the first time too. 
The fact is that when changes are rejected, we should not simply say "no", but explain why and define what would be acceptable. Some people here have excellent teaching skills for this, but most others do not. Anyway, the rules should be the same for everybody. Also, there is what can be perceived as marketing here. Con worked on his idea with convictions, he took time to write some generous documentation, but he hit a wall where his concept was suboptimal on a given workload. But at least, all the work was oriented on a technical basis : design + code + doc. Then, Ingo comes in with something looking amazingly better, with virtually no documentation, an appealing announcement, and a shiny advertising at boot. All this implemented without the constraints other people had to respect. It already looks like definitive work which will be merged as-is without many changes except a few bugfixes. If those were two companies, the first one would simply have accused the second one of not having respected contracts and having employed heavy marketing to take the first place. People here do not code for a living, they do it at least because they believe in what they are doing, and some of them want a bit of gratitude for their work. I've met people who were proud to say they implement this or that feature in the kernel, so it is something important for them. And being cited in an email is nothing compared to advertising at boot time. When the discussion was blocked between Con and Mike concerning the design problems, it is where a new discussion should have taken place. Ingo could have publicly spoken with them about his ideas of killing the O(1) scheduler and replacing it with an rbtree-based one, and using part of Bill's work to speed up development. It is far easier to resign when people explain what concepts are wrong and how they think they will do than when they suddenly present something out of nowhere which is already better. 
And it's not specific to Ingo (though I think his ability to work that fast alone makes him tend to practise this more often than others). Imagine if Con had worked another full week on his scheduler with better results on Mike's workload, but still not as good as Ingo's, and they both published at the same time. You certainly can imagine he would have preferred to be informed first that it was pointless to continue in that direction. Now I hope he and Bill will get over this and accept to work on improving this scheduler, because I really find it smarter than a dumb O(1). I even agree with Mike that we now have a solid basis for future work. But for this, maybe a good starting point would be to remove the selfish printk at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR comes to mind) and improve the documentation a bit so that people can work together on the new design, without feeling like their work will only serve to promote X or Y. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 15:26 ` William Lee Irwin III 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 713+ messages in thread From: Pekka J Enberg @ 2007-04-15 13:08 UTC (permalink / raw) To: Willy Tarreau Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Willy Tarreau wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, and using > part of Bill's work to speed up development. He did exactly that and he did it with a patch. Nothing new here. This is how development on LKML proceeds when you have two or more competing designs. There's absolutely no need to get upset or hurt your feelings over it. It's not malicious, it's how we do Linux development. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 17:59 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 17:32 UTC (permalink / raw) To: Pekka J Enberg Cc: Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > On Sun, 15 Apr 2007, Willy Tarreau wrote: > > Ingo could have publicly spoken with them about his ideas of killing > > the O(1) scheduler and replacing it with an rbtree-based one, and using > > part of Bill's work to speed up development. > > He did exactly that and he did it with a patch. Nothing new here. This is > how development on LKML proceeds when you have two or more competing > designs. There's absolutely no need to get upset or hurt your feelings > over it. It's not malicious, it's how we do Linux development. Yes. Exactly. This is what it's all about, this is what makes it work. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:32 ` Mike Galbraith @ 2007-04-15 17:59 ` Linus Torvalds 2007-04-15 19:00 ` Jonathan Lundell 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-15 17:59 UTC (permalink / raw) To: Mike Galbraith Cc: Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Mike Galbraith wrote: > On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > > > > He did exactly that and he did it with a patch. Nothing new here. This is > > how development on LKML proceeds when you have two or more competing > > designs. There's absolutely no need to get upset or hurt your feelings > > over it. It's not malicious, it's how we do Linux development. > > Yes. Exactly. This is what it's all about, this is what makes it work. I obviously agree, but I will also add that one of the most motivating things there *is* in open source is "personal pride". It's a really good thing, and it means that if somebody shows that your code is flawed in some way (by, for example, making a patch that people claim gets better behaviour or numbers), any *good* programmer that actually cares about his code will obviously suddenly be very motivated to out-do the out-doer! Does this mean that there will be tension and rivalry? Hell yes. But that's kind of the point. Life is a game, and if you aren't in it to win, what the heck are you still doing here? As long as it's reasonably civil (I'm not personally a huge believer in being too polite or "politically correct", so I think the "reasonably" is more important than the "civil" part!), and as long as the end result is judged on TECHNICAL MERIT, it's all good. We don't want to play politics. But encouraging people's competitive feelings? Oh, yes. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:59 ` Linus Torvalds @ 2007-04-15 19:00 ` Jonathan Lundell 2007-04-15 22:52 ` Con Kolivas 0 siblings, 1 reply; 713+ messages in thread From: Jonathan Lundell @ 2007-04-15 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > It's a really good thing, and it means that if somebody shows that > your > code is flawed in some way (by, for example, making a patch that > people > claim gets better behaviour or numbers), any *good* programmer that > actually cares about his code will obviously suddenly be very > motivated to > out-do the out-doer! "No one who cannot rejoice in the discovery of his own mistakes deserves to be called a scholar." --Don Foster, "literary sleuth", on retracting his attribution of "A Funerall Elegye" to Shakespeare (it's more likely John Ford's work). ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:00 ` Jonathan Lundell @ 2007-04-15 22:52 ` Con Kolivas 2007-04-16 2:28 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-15 22:52 UTC (permalink / raw) To: Jonathan Lundell Cc: Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > It's a really good thing, and it means that if somebody shows that > > your > > code is flawed in some way (by, for example, making a patch that > > people > > claim gets better behaviour or numbers), any *good* programmer that > > actually cares about his code will obviously suddenly be very > > motivated to > > out-do the out-doer! > > "No one who cannot rejoice in the discovery of his own mistakes > deserves to be called a scholar." Lovely comment. I realise this is not truly directed at me but clearly in the context it has been said people will assume it is directed my way, so while we're all spinning lkml quality rhetoric, let me have a right of reply. One thing I have never tried to do was to ignore bug reports. I'm forever joking that I keep pulling code out of my arse to improve what I've done. RSDL/SD was no exception; heck it had 40 iterations. The reason I could not reply to bug report A with "Oh that is problem B so I'll fix it with code C" was, as I've said many many times over, health related. I did indeed try to fix many of them without spending hours replying to sometimes unpleasant emails. If health wasn't an issue there might have been 1000 iterations of SD. There was only ever _one_ thing that I was absolutely steadfast on as a concept that I refused to fix that people might claim was "a mistake I did not rejoice in to be a scholar". 
That was that the _correct_ behaviour for a scheduler is to be fair such that proportional slowdown with load is (using that awful pun) a feature, not a bug. Now there are people who will still disagree violently with me on that. SD attempted to be a fairness first virtual-deadline design. If I failed on that front, then so be it (and at least one person certainly has said in lovely warm fuzzy friendly communication that I'm a global failure on all fronts with SD). But let me point out now that Ingo's shiny new scheduler is a fairness-first virtual-deadline design which will have proportional slowdown with load. So it will have a very similar feature. I dare anyone to claim that proportional slowdown with load is a bug, because I will no longer feel like I'm standing alone with a BFG9000 trying to defend my standpoint. Others can take up the post at last. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
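[ Editorial illustration, not part of the original message: the "proportional slowdown with load" Con describes can be sketched with a toy model. The model assumes a single CPU, equally weighted CPU-bound tasks, and an ideally fair scheduler; the function names fair_share and wall_time are hypothetical and come from neither SD nor CFS. ]

```python
def fair_share(nr_running: int) -> float:
    """CPU fraction each task receives when nr_running equal-weight,
    CPU-bound tasks compete for one CPU under an ideally fair scheduler."""
    if nr_running < 1:
        raise ValueError("need at least one runnable task")
    return 1.0 / nr_running


def wall_time(cpu_seconds: float, nr_running: int) -> float:
    """Wall-clock time needed to finish cpu_seconds of work while the
    load stays constant at nr_running runnable tasks."""
    return cpu_seconds / fair_share(nr_running)


if __name__ == "__main__":
    # A job needing 10 CPU-seconds slows down in proportion to the load.
    for load in (1, 2, 4, 8):
        print(f"load={load} share={fair_share(load):.3f} "
              f"wall={wall_time(10.0, load):.1f}s")
```

[ In this simplified model no task can shorten its own wall-clock time through its behaviour; doubling the load doubles everyone's completion time. Real schedulers also have to account for nice levels, sleeping tasks and multiple CPUs, which the sketch deliberately ignores. ]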
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:52 ` Con Kolivas @ 2007-04-16 2:28 ` Nick Piggin 2007-04-16 3:15 ` Con Kolivas [not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com> 0 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-16 2:28 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 08:52:33AM +1000, Con Kolivas wrote: > On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > > It's a really good thing, and it means that if somebody shows that > > > your > > > code is flawed in some way (by, for example, making a patch that > > > people > > > claim gets better behaviour or numbers), any *good* programmer that > > > actually cares about his code will obviously suddenly be very > > > motivated to > > > out-do the out-doer! > > > > "No one who cannot rejoice in the discovery of his own mistakes > > deserves to be called a scholar." > > Lovely comment. I realise this is not truly directed at me but clearly in the > context it has been said people will assume it is directed my way, so while > we're all spinning lkml quality rhetoric, let me have a right of reply. > > One thing I have never tried to do was to ignore bug reports. I'm forever > joking that I keep pulling code out of my arse to improve what I've done. > RSDL/SD was no exception; heck it had 40 iterations. The reason I could not > reply to bug report A with "Oh that is problem B so I'll fix it with code C" > was, as I've said many many times over, health related. I did indeed try to > fix many of them without spending hours replying to sometimes unpleasant > emails. If health wasn't an issue there might have been 1000 iterations of > SD. 
Well what matters is the code and development. I don't think Ingo's scheduler is the final word, although I worry that Linus might jump the gun and merge something "just to give it a test", which we then get stuck with :P I don't know how anybody can think Ingo's new scheduler is anything but a good thing (so long as it has to compete before being merged). And that's coming from someone who wants *their* scheduler to get merged... I think mine can compete ;) and if it can't, then I'd rather be using the scheduler that beats it. > There was only ever _one_ thing that I was absolutely steadfast on as a > concept that I refused to fix that people might claim was "a mistake I did > not rejoice in to be a scholar". That was that the _correct_ behaviour for a > scheduler is to be fair such that proportional slowdown with load is (using > that awful pun) a feature, not a bug. If something is using more than a fair share of CPU time, over some macro period, in order to be interactive, then definitely it should get throttled. I've always maintained (since starting scheduler work) that the 2.6 scheduler is horrible because it allows these cases where some things can get more CPU time just by how they behave. Glad people are starting to come around on that point. So, on to something productive, we have 3 candidates for a new scheduler so far. How do we decide which way to go? (and yes, I still think switchable schedulers is wrong and a copout) This is one area where it is virtually impossible to discount any decent design on correctness/performance/etc. and even testing in -mm isn't really enough. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 2:28 ` Nick Piggin @ 2007-04-16 3:15 ` Con Kolivas 2007-04-16 3:34 ` Nick Piggin [not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com> 1 sibling, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-16 3:15 UTC (permalink / raw) To: Nick Piggin Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 12:28, Nick Piggin wrote: > So, on to something productive, we have 3 candidates for a new scheduler so > far. How do we decide which way to go? (and yes, I still think switchable > schedulers is wrong and a copout) This is one area where it is virtually > impossible to discount any decent design on correctness/performance/etc. > and even testing in -mm isn't really enough. We're in agreement! YAY! Actually this is simpler than that. I'm taking SD out of the picture. It has served its purpose of proving that we need to seriously address all the scheduling issues and did more than a half decent job at it. Unfortunately I also cannot sit around supporting it forever by myself. My own life is more important, so consider SD not even running the race any more. I'm off to continue maintaining permanent-out-of-tree leisurely code at my own pace. What's more is, I think I'll just stick to staircase Gen I version blah and shelve SD and try to have fond memories of SD as an intellectual prompting exercise only. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:15 ` Con Kolivas @ 2007-04-16 3:34 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-16 3:34 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 01:15:27PM +1000, Con Kolivas wrote: > On Monday 16 April 2007 12:28, Nick Piggin wrote: > > So, on to something productive, we have 3 candidates for a new scheduler so > > far. How do we decide which way to go? (and yes, I still think switchable > > schedulers is wrong and a copout) This is one area where it is virtually > > impossible to discount any decent design on correctness/performance/etc. > > and even testing in -mm isn't really enough. > > We're in agreement! YAY! > > Actually this is simpler than that. I'm taking SD out of the picture. It has > served its purpose of proving that we need to seriously address all the > scheduling issues and did more than a half decent job at it. Unfortunately I > also cannot sit around supporting it forever by myself. My own life is more > important, so consider SD not even running the race any more. > > I'm off to continue maintaining permanent-out-of-tree leisurely code at my own > pace. What's more is, I think I'll just stick to staircase Gen I version blah > and shelve SD and try to have fond memories of SD as an intellectual > prompting exercise only. Well I would hope that _if_ we decide to switch schedulers, then you get a chance to field something (and I hope you will decide to and have time to), and I hope we don't rush into the decision. 
We've had the current scheduler for so many years now that it is much more important to make sure we take the time to do the right thing rather than absolutely have to merge a new scheduler right now ;) ^ permalink raw reply [flat|nested] 713+ messages in thread
[parent not found: <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com>]
* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] [not found] ` <b21f8390704152257v1d879cc3te0cfee5bf5d2bbf3@mail.gmail.com> @ 2007-04-16 6:27 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-16 6:27 UTC (permalink / raw) To: Matthew Hawkins; +Cc: linux-kernel, ck list On Mon, Apr 16, 2007 at 03:57:54PM +1000, Matthew Hawkins wrote: > On 4/16/07, Nick Piggin <npiggin@suse.de> wrote: > > > >So, on to something productive, we have 3 candidates for a new scheduler > >so > >far. How do we decide which way to go? (and yes, I still think switchable > >schedulers is wrong and a copout) > > > I'm with you on that one. It sounds good as a concept however there's > various kernel structures etc that simply cannot be altered at runtime, > which throws away the only advantage I can see of plugsched - a test/debug > framework. > > I think the best way is for those working on this stuff to keep producing > their separate patches against mainline and people being encouraged to > test. THEN > (and here comes the fun part) subsystem maintainers have to be prepared to > accept code that is not their own or that of their IRC buddies. I'm > noticing this disturbing trend that Linux kernel development is becoming > more and more like BSD where only the elite few ever get anywhere. Con > Kolivas, having a medical not CS degree, bruises the egos of those with CS > degrees when he comes up with fairly clean, working, and widely-tested > implementations of things like the staircase scheduler, R(SD)L, SCHED_ISO, > swap prefetch, etc. when they can't. We should be encouraging guys like The thing is, it is really hard for anybody to change anything in page reclaim or CPU scheduler. A few people saying a change is good for them doesn't really mean anything because of the huge amount of diversity in usages. I've got my own CPU scheduler for 4 years and I and a few others think it is better than mainline. 
I've tried to make many many VM changes that haven't gone in. Add to that, I don't actually know or care what sort of education most kernel hackers have. I do know at least one of the more brilliant ones does not have a CS degree, and I was able to get quite a few things in before I had a degree (eg. rewrote IO scheduler and multiprocessor CPU scheduler). > It's all about the patches, baby I don't know what would give anyone the idea that it isn't... patches and numbers. Nick ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 15:26 ` William Lee Irwin III 2007-04-16 15:55 ` Chris Friesen 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:26 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote: > Now I hope he and Bill will get over this and accept to work on improving > this scheduler, because I really find it smarter than a dumb O(1). I even > agree with Mike that we now have a solid basis for future work. But for > this, maybe a good starting point would be to remove the selfish printk > at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR come to mind) > and improve the documentation a bit so that people can work together on > the new design, without feeling like their work will only serve to > promote X or Y. While I appreciate people coming to my defense, or at least the good intentions behind such, my only actual interest in pointing out 4-year-old work is getting some acknowledgment of having done something relevant at all. Sometimes it has "I told you so" value. At other times it's merely clarifying what went on when people refer to it since in a number of cases the patches are no longer extant, so they can't actually look at it to get an idea of what was or wasn't done. At other times I'm miffed about not being credited, whether I should've been or whether dead and buried code has an implementation of the same idea resurfacing without the author(s) having any knowledge of my prior work. 
One should note that in this case, the first work of mine this trips over (scheduling classes) was never publicly posted as it was only a part of the original plugsched (an alternate scheduler implementation devised to demonstrate plugsched's flexibility with respect to scheduling policies), and a part that was dropped by subsequent maintainers. The second work of mine this trips over, a virtual deadline scheduler named "vdls," was also never publicly posted. Both are from around the same time period, which makes them approximately 4 years dead. Neither of the codebases are extant, having been lost in a transition between employers, though various people recall having been sent them privately, and plugsched survives in a mutated form as maintained by Peter Williams, who's been very good about acknowledging my original contribution. If I care to become a direct participant in scheduler work, I can do so easily enough. I'm not entirely sure what this is about a basis for future work. By and large one should alter the API's and data structures to fit the policy being implemented. While the array swapping was nice for algorithmically improving 2.4.x -style epoch expiry, most algorithms not based on the 2.4.x scheduler (in however mutated a form) should use a different queue structure, in fact, one designed around their policy's specific algorithmic needs. IOW, when one alters the scheduler, one should also alter the queue data structure appropriately. I'd not expect the priority queue implementation in cfs to continue to be used unaltered as it matures, nor would I expect any significant modification of the scheduler to necessarily use a similar one. By and large I've been mystified as to why there is such a penchant for preserving the existing queue structures in the various scheduler patches floating around. I am now every bit as mystified at the point of view that seems to be emerging that a change of queue structure is particularly significant. 
These are all largely internal changes to sched.c, and as such, rather small changes in and of themselves. While they do tend to have user-visible effects, from this point of view even changing out every line of sched.c is effectively a micropatch. Something more significant might be altering the schedule() API to take a mandatory description of the intention of the call to it, or breaking up schedule() into several different functions to distinguish between different sorts of uses of it to which one would then respond differently. Also more significant would be adding a new state beyond TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE users to use the new state and handle those fatal signals. While not quite as ostentatious in their user-visible effects as SCHED_OTHER policy affairs, they are tremendously more work than switching out the implementation of a single C file, and so somewhat more respectable. Even as scheduling semantics go, these are micropatches. So SCHED_OTHER changes a little. Where are the gang schedulers? Where are the batch schedulers (SCHED_BATCH is not truly such)? Where are the isochronous (frame) schedulers? I suppose there is some CKRM work that actually has a semantic impact despite being largely devoted to SCHED_OTHER, and there's some spufs gang scheduling going on, though not all that much. And to reiterate a point from other threads, even as SCHED_OTHER patches go, I see precious little verification that things like the semantics of nice numbers or other sorts of CPU bandwidth allocation between competing tasks of various natures are staying the same while other things are changed, or at least being consciously modified in such a fashion as to improve them. 
I've literally only seen one or two tests (and rather inflexible ones with respect to sleep and running time mixtures) with any sort of quantification of how CPU bandwidth is distributed get run on all this. So from my point of view, there's a lot of churn and craziness going on in one tiny corner of the kernel and people don't seem to have a very solid grip on what effects their changes have or how they might potentially break userspace. So I've developed a sudden interest in regression testing of the scheduler in order to ensure that various sorts of semantics on which userspace relies are not broken, and am trying to spark more interest in general in nailing down scheduling semantics and verifying that those semantics are honored and remain honored by whatever future scheduler implementations might be merged. Thus far, the laundry list of semantics I'd like to have nailed down are specifically: (1) CPU bandwidth allocation according to nice numbers (2) CPU bandwidth allocation among mixtures of tasks with varying sleep/wakeup behavior e.g. that consume some percentage of cpu in isolation, perhaps also varying the granularity of their sleep/wakeup patterns (3) sched_yield(), so multitier userspace locking doesn't go haywire (4) How these work with SMP; most people agree it should be mostly the same as it works on UP, but it's not being verified, as most testcases are barely SMP-aware if at all, and corner cases where proportionality breaks down aren't considered The sorts of like explicit decisions I'd like to be made for these are: (1) In a mixture of tasks with varying nice numbers, a given nice number corresponds to some share of CPU bandwidth. Implementations should not have the freedom to change this arbitrarily according to some intention. (2) A given scheduler _implementation_ intends to distribute CPU bandwidth among mixtures of tasks that would each consume some percentage of the CPU in isolation varying across tasks in some particular pattern. 
For example, maybe some scheduler implementation assigns a share of 1/%cpu to a task that would consume %cpu in isolation, for a CPU bandwidth allocation of (1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks (this is not to say that such a policy makes sense). (3) sched_yield() is intended to result in some particular scheduling pattern in a given scheduler implementation. For instance, an implementation may intend that a set of CPU hogs calling sched_yield() between repeatedly printf()'ing their pid's will see their printf()'s come out in an approximately consistent order as the scheduler cycles between them. (4) What an implementation intends to do with respect to SMP CPU bandwidth allocation when precise emulation of UP behavior is impossible, considering sched_yield() scheduling patterns when possible as well. For instance, perhaps an implementation intends to ensure equal CPU bandwidth among competing CPU-bound tasks of equal priority at all costs, and so triggers migration and/or load balancing to make it so. Or perhaps an implementation intends to ensure precise sched_yield() ordering at all costs even on SMP. Some sort of specification of the intention, then a verification that the intention is carried out in a testcase. Also, if there's a semantic issue to be resolved, I want it to have something describing it and verifying it. For instance, characterizing whatever sort of scheduling artifacts queue-swapping causes in the mainline scheduler and then a testcase to demonstrate the artifact and its resolution in a given scheduler rewrite would be a good design statement and verification. For instance, if someone wants to go back to queue-swapping or other epoch expiry semantics, it would make them (and hopefully everyone else) conscious of the semantic issue the change raises, or possibly serve as a demonstration that the artifacts can be mitigated in some implementation retaining epoch expiry semantics. 
As I become aware of more potential issues I'll add more to my laundry list, and I'll hammer out testcases as I go. My concern with the scheduler is that this sort of basic functionality may be significantly disturbed with no one noticing at all until a distro issues a prerelease and benchmarks go haywire, and furthermore that changes to this kind of basic behavior may be signs of things going awry, particularly as more churn happens. So now that I've clarified my role in all this to date and my point of view on it, it should be clear that accepting something and working on some particular scheduler implementation don't make sense as suggestions to me. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
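As a hedged illustration of item (2) in the list above, the purely hypothetical "1/%cpu" policy that wli uses as an example can be turned into an expected-share calculation that a regression test could compare measurements against. This is only a sketch: the function names expected_share() and share_ok(), and the choice of an absolute tolerance, are assumptions made for illustration and do not come from any posted patch or test suite.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expected CPU share of task i under the hypothetical "1/%cpu" policy:
 *     share_i = (1 / cpu[i]) / sum over t of (1 / cpu[t])
 * cpu[t] is the fraction of a CPU, in (0, 1], that task t would consume
 * when run in isolation.
 */
double expected_share(const double *cpu, size_t n, size_t i)
{
    double sum = 0.0;

    assert(i < n);
    for (size_t t = 0; t < n; t++) {
        assert(cpu[t] > 0.0 && cpu[t] <= 1.0);
        sum += 1.0 / cpu[t];
    }
    return (1.0 / cpu[i]) / sum;
}

/* Returns 1 if a measured share is within an absolute tolerance of the
 * expected share, 0 otherwise. The tolerance is the tester's choice. */
int share_ok(double measured, double expected, double tol)
{
    double diff = measured - expected;

    if (diff < 0.0)
        diff = -diff;
    return diff <= tol;
}
```

For three competing tasks that would use 100%, 50% and 25% of a CPU in isolation, the inverse weights are 1, 2 and 4, so the expected shares under this policy are 1/7, 2/7 and 4/7.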
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Chris Friesen @ 2007-04-16 15:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > The sorts of like explicit decisions I'd like to be made for these are: > (1) In a mixture of tasks with varying nice numbers, a given nice number > corresponds to some share of CPU bandwidth. Implementations > should not have the freedom to change this arbitrarily according > to some intention. The first question that comes to my mind is whether nice levels should be linear or not. I would lean towards nonlinear as it allows a wider range (although of course at the expense of precision). Maybe something like "each nice level gives X times the cpu of the previous"? I think a value of X somewhere between 1.15 and 1.25 might be reasonable. What about also having something that looks at latency, and how latency changes with niceness? What about specifying the timeframe over which the cpu bandwidth is measured? I currently have a system where the application designers would like it to be totally fair over a period of 1 second. As you can imagine, mainline doesn't do very well in this case. Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen @ 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-16 16:13 UTC (permalink / raw) To: Chris Friesen Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. On Mon, Apr 16, 2007 at 09:55:14AM -0600, Chris Friesen wrote: > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > What about also having something that looks at latency, and how latency > changes with niceness? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. It's unclear how latency enters the picture as the semantics of nice levels relevant to such are essentially priority preemption, which is not particularly easy to mess up. I suppose tests to ensure priority preemption occurs properly are in order. 
I don't really have a preference regarding specific semantics for nice numbers, just that they should be deterministic and specified somewhere. It's not really for us to decide what those semantics are as it's more of a userspace ABI/API issue. The timeframe is also relevant, but I suspect it's more of a performance metric than a strict requirement. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III @ 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-17 0:04 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. No. That squishes one end of the table too much. It needs to be (approximately) piecewise linear around nice == 0. Here's the mapping I use in my entitlement based schedulers: #define NICE_TO_LP(nice) (((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice))) It has the (good) feature that a nice == 19 task has 1/20th the entitlement of a nice == 0 task and a nice == -20 task has 21 times the entitlement of a nice == 0 task. It's not strictly linear for negative nice values but is very cheap to calculate and quite easy to invert if necessary. > I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > > What about also having something that looks at latency, and how latency > changes with niceness? 
> > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. Have you tried the spa_ebs scheduler? The half life is no longer a run time configurable parameter (as making it highly adjustable results in less efficient code) but it could be adjusted to be approximately equivalent to 0.5 seconds by changing some constants in the code. > As you can > imagine, mainline doesn't do very well in this case. You should look back through the plugsched patches where many of these ideas have been experimented with. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
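The NICE_TO_LP mapping quoted in the message above can be checked with a few lines of C. What follows is a minimal sketch, not code from any of the schedulers discussed: nice_to_lp() restates the macro as a function, and lp_to_nice() is one assumed way of doing the inversion that the message calls "quite easy"; both names are made up for illustration.

```c
#include <assert.h>

/* Same arithmetic as the NICE_TO_LP macro in the message, written as a
 * function. nice is expected to be in the usual range [-20, 19]. */
int nice_to_lp(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return (nice >= 0) ? (20 - nice) : (20 + nice * nice);
}

/* Hypothetical inverse (an assumption, not from the message): maps an
 * entitlement value back to a nice level. Values above 20 that are not
 * of the form 20 + k*k round towards nice 0. */
int lp_to_nice(int lp)
{
    int n = 0;

    assert(lp >= 1 && lp <= 420);
    if (lp <= 20)
        return 20 - lp;
    /* integer square root of (lp - 20) */
    while ((n + 1) * (n + 1) <= lp - 20)
        n++;
    return -n;
}
```

With this mapping nice 0 gives 20, nice 19 gives 1 (1/20th of the nice 0 value) and nice -20 gives 420 (21 times the nice 0 value), which matches the ratios stated in the message.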
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams @ 2007-04-17 13:07 ` James Bruce 2007-04-17 20:05 ` William Lee Irwin III 2 siblings, 1 reply; 713+ messages in thread From: James Bruce @ 2007-04-17 13:07 UTC (permalink / raw) To: linux-kernel; +Cc: ck Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 That value has the property that a nice=10 task gets 1/10th the cpu of a nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that would be fairly easy to explain to admins and users so that they can know what to expect from nicing tasks. > What about also having something that looks at latency, and how latency > changes with niceness? I think this would be a lot harder to pin down, since it's a function of all the other tasks running and their nice levels. Do you have any of the RT-derived analysis models in mind? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. 
It might be easier to specify the maximum deviation from the ideal bandwidth over a certain period. I.e. something like "over a period of one second, each task receives within 10% of the expected bandwidth". - Jim Bruce ^ permalink raw reply [flat|nested] 713+ messages in thread
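The geometric nice scale and the "within 10% over one second" criterion from the message above can be sketched as follows. This is only an illustration under stated assumptions: the helper names nice_weight(), ideal_share() and within_tolerance() are hypothetical, the constant is 10^(1/10) written out to double precision, and the range check allows nice 20 only because the message uses it as an example.

```c
#include <assert.h>

/* X = exp(ln(10)/10) = 10^(1/10): ten nice levels are a factor of 10. */
#define NICE_STEP 1.2589254117941673

/* Relative CPU weight of a task, normalised so that nice 0 has weight 1. */
double nice_weight(int nice)
{
    double w = 1.0;

    assert(nice >= -20 && nice <= 20);
    for (int k = 0; k < nice; k++)   /* positive nice: lighter weight */
        w /= NICE_STEP;
    for (int k = 0; k > nice; k--)   /* negative nice: heavier weight */
        w *= NICE_STEP;
    return w;
}

/* Ideal share of task i among n competing CPU-bound tasks. */
double ideal_share(const int *nice, int n, int i)
{
    double sum = 0.0;

    assert(i >= 0 && i < n);
    for (int t = 0; t < n; t++)
        sum += nice_weight(nice[t]);
    return nice_weight(nice[i]) / sum;
}

/* Returns 1 if the share measured over some period is within rel_tol
 * (for example 0.10 for 10%) of the ideal share, 0 otherwise. */
int within_tolerance(double measured, double ideal, double rel_tol)
{
    double diff = measured - ideal;

    if (diff < 0.0)
        diff = -diff;
    return diff <= rel_tol * ideal;
}
```

A test harness would measure each task's CPU time over the chosen period (one second in the example from the message) and pass the resulting fraction to within_tolerance().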
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:07 ` James Bruce @ 2007-04-17 20:05 ` William Lee Irwin III 0 siblings, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 20:05 UTC (permalink / raw) To: James Bruce Cc: Chris Friesen, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. [...additional good commentary trimmed...] Lots of good ideas here. I'll follow them. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 2 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 15:39 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, [...] yes, that's precisely what i did, via a patchset :) [ I can even tell you when it all started: i was thinking about Mike's throttling patches while watching Manchester United beat the crap out of AS Roma (7 to 1 end result), Tuesday evening. I started coding it Wednesday morning and sent the patch Friday evening. I very much believe in low-latency when it comes to development too ;) ] (if this had been done via a committee then today we'd probably still be trying to find a suitable timeslot for the initial conference call where we'd discuss the election of a chair who would be tasked with writing up an initial document of feature requests, on which we'd take a vote, possibly this year already, because the matter is really urgent you know ;-) > [...] and using part of Bill's work to speed up development. ok, let me make this absolutely clear: i didn't use any bit of plugsched - in fact the most difficult bits of the modularization was for areas of sched.c that plugsched never even touched AFAIK. (the load-balancer for example.) 
Plugsched simply does something else: i modularized scheduling policies in essence that have to cooperate with each other, while plugsched modularized complete schedulers which are compile-time or boot-time selected, with no runtime cooperation between them. (one has to be selected at a time) (and i have no trouble at all with crediting Will's work either: a few years ago i used Will's PID rework concepts for an NPTL related speedup and Will is very much credited for it in today's kernel/pid.c and he continued to contribute to it later on.) (the tree walking bits of sched_fair.c were in fact derived from kernel/hrtimer.c, the rbtree code written by Thomas and me :-) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
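The distinction drawn in the message above, policy modules that coexist and cooperate at run time versus one complete scheduler selected at compile or boot time, can be illustrated with a small sketch. Everything here is hypothetical: struct policy, enqueue(), fifo_pick() and core_pick_next() are invented names for illustration and are not the interface of the posted patch. The sketch only assumes that the core asks each policy module in a fixed order for its next task and runs the first one offered.

```c
#include <assert.h>
#include <stddef.h>

struct task { int pid; };

#define QUEUE_MAX 8

/* A trivial FIFO of runnable tasks owned by one policy module. */
struct queue {
    struct task *tasks[QUEUE_MAX];
    size_t head, tail;              /* head == tail means empty */
};

/* One scheduling policy module (hypothetical layout). */
struct policy {
    const char *name;
    struct queue q;
    struct task *(*pick_next)(struct queue *q);
};

/* Simplest possible policy: run tasks in arrival order. */
struct task *fifo_pick(struct queue *q)
{
    if (q->head == q->tail)
        return NULL;
    return q->tasks[q->head++];
}

/* Adds a task to a policy's queue; returns -1 if the queue is full. */
int enqueue(struct policy *p, struct task *t)
{
    assert(p != NULL && t != NULL);
    if (p->q.tail == QUEUE_MAX)
        return -1;
    p->q.tasks[p->q.tail++] = t;
    return 0;
}

/* The core walks the policies in priority order (index 0 first) and runs
 * the first task offered, so all policies are live at the same time. */
struct task *core_pick_next(struct policy *classes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct task *t = classes[i].pick_next(&classes[i].q);

        if (t != NULL)
            return t;
    }
    return NULL;
}
```

In this arrangement a task queued on the higher-priority policy is always chosen before any task on a lower one, whereas under a plugsched-style design only one of the two structures would exist in a running kernel.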
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar @ 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 1 sibling, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:47 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: >> [...] and using part of Bill's work to speed up development. On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote: > ok, let me make this absolutely clear: i didn't use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) The extant plugsched patches have nothing to do with cfs; I suspect what everyone else is going on about is terminological confusion. The 4-year-old sample policy with scheduling classes for the original plugsched is something you had no way of knowing about, as it was never publicly posted. 
There isn't really anything all that interesting going on here, apart from pointing out that it's been done before. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III @ 2007-04-16 5:27 ` Peter Williams 2007-04-16 6:23 ` Peter Williams 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-16 5:27 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > >> Ingo could have publicly spoken with them about his ideas of killing >> the O(1) scheduler and replacing it with an rbtree-based one, [...] > > yes, that's precisely what i did, via a patchset :) > > [ I can even tell you when it all started: i was thinking about Mike's > throttling patches while watching Manchester United beat the crap out > of AS Roma (7 to 1 end result), Tuesday evening. I started coding it > Wednesday morning and sent the patch Friday evening. I very much > believe in low-latency when it comes to development too ;) ] > > (if this had been done via a committee then today we'd probably still be > trying to find a suitable timeslot for the initial conference call where > we'd discuss the election of a chair who would be tasked with writing up > an initial document of feature requests, on which we'd take a vote, > possibly this year already, because the matter is really urgent you know > ;-) > >> [...] and using part of Bill's work to speed up development. > > ok, let me make this absolutely clear: i didn't use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) This sounds like your new scheduler intends to increase the coupling between scheduling and load balancing. 
I think that this would be a mistake and lead (down the track) to spiralling complexity as you make changes to the code to address the corner conditions that it will create. > > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) You can't really have more than one scheduler operating in the same priority range on the same CPU as they will be fighting each other trying to achieve their separate and not necessarily compatible (in fact highly likely to be incompatible) aims. Multiple schedulers on the same CPU have to have a pecking order just like SCHED_OTHER and real time policies. It wouldn't be hard to prove that SCHED_RR and SCHED_FIFO is a problem in waiting if ever someone tried to use them both on a highly real time system. > > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) > > Ingo Are your new patches available somewhere for easy download or do I have to try to dig them out of the mailing list archive? Or could you mail them to me separately? I'm keen to see how your new scheduler proposal works. Thanks Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:27 ` Peter Williams @ 2007-04-16 6:23 ` Peter Williams 2007-04-16 6:40 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-16 6:23 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > > Are your new patches available somewhere for easy download or do I have > to try to dig them out of the mailing list archive? Or could you mail > them to me separately? I'm keen to see how you new scheduler proposal > works. Forget about this. I found the patch. After a quick look, I like a lot of what I see especially the removal of the dual arrays in the run queue. Some minor suggestions: 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to initialize the task structure in init_task.h. 2. the on_rq field in the task structure is unnecessary as many years of experience with ingosched in plugsched indicates that !list_empty(&(p)->run_list does the job provided list_del_init() is used when dequeueing and there is no noticeable overhead incurred so there's no gain by caching the result. Also it removes the possibility of errors creeping in due the value of on_rq being inconsistent with the task's actual state. 3. having modular load balancing is a good idea but it should be decoupled form the scheduler and provided as a separate interface. This would enable different schedulers to use the same load balancer if they desired. 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be fair(ish) anyway. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
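Suggestion 2 above can be sketched in userspace C. This is only an illustration under stated assumptions: the list helpers are minimal stand-ins written for this sketch that mirror the semantics of the kernel's include/linux/list.h, and struct task, task_queued(), enqueue_task() and dequeue_task() are hypothetical names, not code from the patch. It shows why list_del_init() matters: it leaves the removed node pointing at itself, so !list_empty() on the task's own node doubles as an "is it queued?" test without a separate on_rq flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An initialised node (or empty list head) points at itself. */
static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink the entry and re-initialise it so list_empty(entry) is true again. */
static void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    INIT_LIST_HEAD(entry);
}

/* Hypothetical, cut-down task: only the field the check needs. */
struct task {
    struct list_head run_list;
};

/* The test proposed in place of a cached on_rq field. */
static bool task_queued(const struct task *p)
{
    return !list_empty(&p->run_list);
}

static void enqueue_task(struct task *p, struct list_head *rq)
{
    list_add_tail(&p->run_list, rq);
}

static void dequeue_task(struct task *p)
{
    list_del_init(&p->run_list);
}
```

The check only holds if every task's run_list is initialised before first use and every dequeue goes through list_del_init(). Whether the same trick carries over to a tree-based queue is not something this sketch addresses.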
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:23 ` Peter Williams @ 2007-04-16 6:40 ` Peter Williams 2007-04-16 7:32 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-16 6:40 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Peter Williams wrote: >> >> Are your new patches available somewhere for easy download or do I >> have to try to dig them out of the mailing list archive? Or could you >> mail them to me separately? I'm keen to see how you new scheduler >> proposal works. > > Forget about this. I found the patch. > > After a quick look, I like a lot of what I see especially the removal of > the dual arrays in the run queue. > > Some minor suggestions: > > 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to > initialize the task structure in init_task.h. > 2. the on_rq field in the task structure is unnecessary as many years of > experience with ingosched in plugsched indicates that > !list_empty(&(p)->run_list does the job provided list_del_init() is used > when dequeueing and there is no noticeable overhead incurred so there's > no gain by caching the result. Also it removes the possibility of > errors creeping in due the value of on_rq being inconsistent with the > task's actual state. > 3. having modular load balancing is a good idea but it should be > decoupled form the scheduler and provided as a separate interface. This > would enable different schedulers to use the same load balancer if they > desired. > 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be > fair(ish) anyway. One more quick comment. 
The claim that there is no concept of time slice in the new scheduler is only true in the sense of the rather arcane implementation of time slices extant in the O(1) scheduler. Your new parameter sched_granularity_ns is equivalent to the concept of time slice in most other kernels that I've peeked inside and computing literature in general (going back over several decades e.g. the magic garden). Welcome to the mainstream, Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:40 ` Peter Williams @ 2007-04-16 7:32 ` Ingo Molnar 2007-04-16 8:54 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-16 7:32 UTC (permalink / raw) To: Peter Williams Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > One more quick comment. The claim that there is no concept of time > slice in the new scheduler is only true in the sense of the rather > arcane implementation of time slices extant in the O(1) scheduler. yeah. AFAIK most other mainstream OSs also still often use similarly 'arcane' concepts (i'm here ignoring literature, you can find everything and its opposite suggested in literature) so i felt the need to point out the difference ;) After all Linux is about doing a better mainstream OS, it is not about beating the OS literature at lunacy ;-) The precise statement would be: "there's no concept of giving out a time-slice to a task and sticking to it unless a higher-prio task comes along, nor is there a concept of having a low-res granularity ->time_slice thing. There is accurate accounting of how much CPU time a task used up, and there is a granularity setting that together gives the current task a fairness advantage of a given amount of nanoseconds - which has similar [but not equivalent] effects to traditional timeslices that most mainstream OSs use". > Your new parameter sched_granularity_ns is equivalent to the concept > of time slice in most other kernels that I've peeked inside and > computing literature in general (going back over several decades e.g. > the magic garden). note that you can set it to 0 and the box still functions - so sched_granularity_ns, while useful for performance/bandwidth workloads, isnt truly inherent to the design. 
So in the announcement i just opted for a short sentence: "there's no concept of timeslices", albeit like most short sentences it's not a technically 100% accurate statement - but still it conveyed the intended information more effectively to the interested lkml reader than the longer version could ever have =B-) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
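The "fairness advantage" wording above can be illustrated with a toy model in C. It is a sketch under assumptions, not the patch's code: struct toy_task, least_used() and simulate() are hypothetical names, all tasks are assumed to be permanently runnable, and a linear scan stands in for the time-ordered rbtree. The current task keeps the CPU until it is more than granularity_ns ahead of the least-served task, so a granularity of 0 still works but switches far more often.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task state: only the CPU time consumed so far, in ns. */
struct toy_task {
    uint64_t used_ns;
};

/* Index of the task that has used the least CPU time (ties: lowest index). */
static size_t least_used(const struct toy_task *t, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (t[i].used_ns < t[best].used_ns)
            best = i;
    return best;
}

/*
 * Run n always-runnable tasks for `ticks` accounting steps of tick_ns each.
 * The current task is preempted only once its usage exceeds that of the
 * least-served task by more than granularity_ns.
 * Returns the number of context switches that occurred.
 */
static unsigned long simulate(struct toy_task *t, size_t n, uint64_t tick_ns,
                              uint64_t granularity_ns, unsigned long ticks)
{
    size_t cur = least_used(t, n);
    unsigned long switches = 0;

    for (unsigned long i = 0; i < ticks; i++) {
        t[cur].used_ns += tick_ns;          /* exact usage accounting */
        size_t next = least_used(t, n);
        if (next != cur &&
            t[cur].used_ns > t[next].used_ns + granularity_ns) {
            cur = next;                     /* advantage used up: switch */
            switches++;
        }
    }
    return switches;
}
```

In this model no task ever gets further ahead of another than granularity_ns plus one tick, which is the sense in which the setting resembles a timeslice without being one: nothing is handed out in advance, the lead is simply measured after the fact.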
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 7:32 ` Ingo Molnar @ 2007-04-16 8:54 ` Peter Williams 0 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-16 8:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Peter Williams <pwil3058@bigpond.net.au> wrote: > >> One more quick comment. The claim that there is no concept of time >> slice in the new scheduler is only true in the sense of the rather >> arcane implementation of time slices extant in the O(1) scheduler. > > yeah. AFAIK most other mainstream OSs also still often use similarly > 'arcane' concepts (i'm here ignoring literature, you can find everything > and its opposite suggested in literature) so i felt the need to point > out the difference ;) After all Linux is about doing a better mainstream > OS, it is not about beating the OS literature at lunacy ;-) > > The precise statement would be: "there's no concept of giving out a > time-slice to a task and sticking to it unless a higher-prio task comes > along, I would have said "no concept of using time slices to implement nice" which always seemed strange to me. If it really does what you just said then a (malicious or otherwise) CPU intensive task that never sleeps, once it got the CPU, would completely hog the CPU. > nor is there a concept of having a low-res granularity > ->time_slice thing. There is accurate accounting of how much CPU time a > task used up, and there is a granularity setting that together gives the > current task a fairness advantage of a given amount of nanoseconds - > which has similar [but not equivalent] effects to traditional timeslices > that most mainstream OSs use". Most traditional OSes have more or less fixed time slices and do the scheduling by fiddling the dynamic priority.
Using total CPU used will also come to grief when used for long running tasks. Eventually, even very low bandwidth tasks will accumulate enough total CPU to look busy. The CPU bandwidth the task is using is what needs to be controlled. Or have I not looked closely enough at what sched_granularity_ns does? Is it really a control for the decay rate of a CPU usage bandwidth metric? > >> Your new parameter sched_granularity_ns is equivalent to the concept >> of time slice in most other kernels that I've peeked inside and >> computing literature in general (going back over several decades e.g. >> the magic garden). > > note that you can set it to 0 and the box still functions - so > sched_granularity_ns, while useful for performance/bandwidth workloads, > isnt truly inherent to the design. Just like my SPA schedulers. But if you set it to zero you'll get a fairly high context switch rate with associated overhead, won't you? > > So in the announcement i just opted for a short sentence: "there's no > concept of timeslices", albeit like most short stentences it's not a > technically 100% accurate statement - but still it conveyed the intended > information more effectively to the interested lkml reader than the > longer version could ever have =B-) I hope that I implied that I was being picky :-) (I meant to -- imply I was being picky, that is). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 15:16 ` Gene Heskett 2007-04-15 16:43 ` Con Kolivas 1 sibling, 1 reply; 713+ messages in thread From: Gene Heskett @ 2007-04-15 15:16 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Pekka Enberg wrote: >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> The perception here is that there is that there is this expectation that >> sections of the Linux kernel are intentionally "churn squated" to prevent >> any other ideas from creeping in other than of the owner of that subsytem > >Strangely enough, my perception is that Ingo is simply trying to >address the issues Mike's testing discovered in RDSL and SD. It's not >surprising Ingo made it a separate patch set as Con has repeatedly >stated that the "problems" are in fact by design and won't be fixed. I won't get into the middle of this just yet, not having decided which dog I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for about 24 hours, its been generally usable, but gzip still causes lots of 5 to 10+ second lags when its running. I'm coming to the conclusion that gzip simply doesn't play well with others... Amazing to me, the cpu its using stays generally below 80%, and often below 60%, even while the kmail composer has a full sentence in its buffer that it still hasn't shown me when I switch to the htop screen to check, and back to the kmail screen to see if its updated yet. The screen switch doesn't seem to lag so I don't think renicing x would be helpfull. Those are the obvious lags, and I'll build & reboot to the CFS patch at some point this morning (whats left of it that is :). 
And report in due time of course -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) knot in cables caused data stream to become twisted and kinked ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:16 ` Gene Heskett @ 2007-04-15 16:43 ` Con Kolivas 2007-04-15 16:58 ` Gene Heskett 0 siblings, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-15 16:43 UTC (permalink / raw) To: Gene Heskett Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:16, Gene Heskett wrote: > On Sunday 15 April 2007, Pekka Enberg wrote: > >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >> The perception here is that there is that there is this expectation that > >> sections of the Linux kernel are intentionally "churn squated" to > >> prevent any other ideas from creeping in other than of the owner of that > >> subsytem > > > >Strangely enough, my perception is that Ingo is simply trying to > >address the issues Mike's testing discovered in RDSL and SD. It's not > >surprising Ingo made it a separate patch set as Con has repeatedly > >stated that the "problems" are in fact by design and won't be fixed. > > I won't get into the middle of this just yet, not having decided which dog > I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for > about 24 hours, its been generally usable, but gzip still causes lots of 5 > to 10+ second lags when its running. I'm coming to the conclusion that > gzip simply doesn't play well with others... Actually Gene I think you're being bitten here by something I/O bound since the cpu usage never tops out. If that's the case and gzip is dumping truckloads of writes then you're suffering something that irks me even more than the scheduler in linux, and that's how much writes hurt just about everything else. 
Try your testcase with bzip2 instead (since that won't be i/o bound), or drop your dirty ratio to as low as possible which helps a little bit (5% is the minimum) echo 5 > /proc/sys/vm/dirty_ratio and finally try the braindead noop i/o scheduler as well. echo noop > /sys/block/sda/queue/scheduler (replace sda with your drive obviously). I'd wager a big one that's what causes your gzip pain. If it wasn't for the fact that I've decided to all but give up ever trying to provide code for mainline again, trying my best to make writes hurt less on linux would be my next big thing [tm]. Oh and for the others watching, (points to vm hackers) I found a bug when playing with the dirty ratio code. If you modify it to allow it drop below 5% but still above the minimum in the vm code, stalls happen somewhere in the vm where nothing much happens for sometimes 20 or 30 seconds worst case scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be set ultra low because these stalls were gross. > Amazing to me, the cpu its using stays generally below 80%, and often below > 60%, even while the kmail composer has a full sentence in its buffer that > it still hasn't shown me when I switch to the htop screen to check, and > back to the kmail screen to see if its updated yet. The screen switch > doesn't seem to lag so I don't think renicing x would be helpfull. Those > are the obvious lags, and I'll build & reboot to the CFS patch at some > point this morning (whats left of it that is :). And report in due time of > course -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
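To confirm that the `echo noop > /sys/block/sda/queue/scheduler` step above took effect, the same file can be read back: the kernel lists the available I/O schedulers on one line with the active one in square brackets, for example `noop anticipatory deadline [cfq]`. A minimal sketch in C of picking the active name out of such a line follows; the function name active_io_scheduler() is hypothetical, and reading the file itself is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) entry from a line such as
 * "noop anticipatory deadline [cfq]" into out.
 * Returns 0 on success, -1 if no well-formed "[name]" is found
 * or if out is too small to hold the name plus its terminator.
 */
static int active_io_scheduler(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;

    if (!open || !close)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

After the echo, the line read back should yield "noop" here; if it still yields the previous scheduler, the write did not take.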
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:43 ` Con Kolivas @ 2007-04-15 16:58 ` Gene Heskett 2007-04-15 18:00 ` Mike Galbraith 0 siblings, 1 reply; 713+ messages in thread From: Gene Heskett @ 2007-04-15 16:58 UTC (permalink / raw) To: Con Kolivas Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Con Kolivas wrote: >On Monday 16 April 2007 01:16, Gene Heskett wrote: >> On Sunday 15 April 2007, Pekka Enberg wrote: >> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> >> The perception here is that there is that there is this expectation >> >> that sections of the Linux kernel are intentionally "churn squated" to >> >> prevent any other ideas from creeping in other than of the owner of >> >> that subsytem >> > >> >Strangely enough, my perception is that Ingo is simply trying to >> >address the issues Mike's testing discovered in RDSL and SD. It's not >> >surprising Ingo made it a separate patch set as Con has repeatedly >> >stated that the "problems" are in fact by design and won't be fixed. >> >> I won't get into the middle of this just yet, not having decided which dog >> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for >> about 24 hours, its been generally usable, but gzip still causes lots of 5 >> to 10+ second lags when its running. I'm coming to the conclusion that >> gzip simply doesn't play well with others... > >Actually Gene I think you're being bitten here by something I/O bound since >the cpu usage never tops out. If that's the case and gzip is dumping >truckloads of writes then you're suffering something that irks me even more >than the scheduler in linux, and that's how much writes hurt just about >everything else. 
Try your testcase with bzip2 instead (since that won't be >i/o bound), or drop your dirty ratio to as low as possible which helps a >little bit (5% is the minimum) > >echo 5 > /proc/sys/vm/dirty_ratio > >and finally try the braindead noop i/o scheduler as well. > >echo noop > /sys/block/sda/queue/scheduler > >(replace sda with your drive obviously). > >I'd wager a big one that's what causes your gzip pain. If it wasn't for the >fact that I've decided to all but give up ever trying to provide code for >mainline again, trying my best to make writes hurt less on linux would be my >next big thing [tm]. Chuckle, possibly but then I'm not anything even remotely close to an expert here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and profanity as the case may call for. >Oh and for the others watching, (points to vm hackers) I found a bug when >playing with the dirty ratio code. If you modify it to allow it drop below > 5% but still above the minimum in the vm code, stalls happen somewhere in > the vm where nothing much happens for sometimes 20 or 30 seconds worst case > scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to > be set ultra low because these stalls were gross. I think I'd need a bit of tutoring on how to do that. I recall that one other time, several weeks back, I thought I would try one of those famous echo this >/proc/that ideas that went by on this list, but even though I was root, apparently /proc was read-only AFAIWC. >> Amazing to me, the cpu its using stays generally below 80%, and often >> below 60%, even while the kmail composer has a full sentence in its buffer >> that it still hasn't shown me when I switch to the htop screen to check, >> and back to the kmail screen to see if its updated yet. The screen switch >> doesn't seem to lag so I don't think renicing x would be helpfull. 
Those >> are the obvious lags, and I'll build & reboot to the CFS patch at some >> point this morning (what's left of it that is :). And report in due time >> of course And now I wonder if I applied the right patch. This one feels good ATM, but I don't think it's the CFS thingy. No, I'm sure of it now, none of the patches I've saved say a thing about CFS. Backtrack up the list time I guess, ignore me for the nonce. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Microsoft: Re-inventing square wheels -- From a Slashdot.org post ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:58 ` Gene Heskett @ 2007-04-15 18:00 ` Mike Galbraith 2007-04-16 0:18 ` Gene Heskett 0 siblings, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 18:00 UTC (permalink / raw) To: Gene Heskett Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: > Chuckle, possibly but then I'm not anything even remotely close to an expert > here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + > sched-mike-5.patch for grins and giggles, or frowns and profanity as the case > may call for. Erm, that patch is embarrassingly buggy, so profanity should dominate. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 18:00 ` Mike Galbraith @ 2007-04-16 0:18 ` Gene Heskett 0 siblings, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-16 0:18 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Mike Galbraith wrote: >On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: >> Chuckle, possibly but then I'm not anything even remotely close to an >> expert here Con, just reporting what I get. And I just rebooted to >> 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and >> profanity as the case may call for. > >Erm, that patch is embarrassingly buggy, so profanity should dominate. > > -Mike Chuckle, ROTFLMAO even. I didn't run it that long as I immediately rebuilt and rebooted when I found I'd used the wrong patch, and in fact had tested that one and found it sub-optimal before I'd built and ran Con's -0.40 version. As for bugs of the type that make it to the screen or logs, I didn't see any. OTOH, my eyesight is slowly going downhill, now 20/25. It was 20/10 30 years ago. Now thats reason for profanity... -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Unix weanies are as bad at this as anyone. -- Larry Wall in <199702111730.JAA28598@wall.org> ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 0 replies; 713+ messages in thread From: Bernd Eckenfels @ 2007-04-15 16:11 UTC (permalink / raw) To: linux-kernel In article <20070415051645.GA28438@gnuppy.monkey.org> you wrote: > A development process like this is likely to exclude smart people from wanting > to contribute to Linux and folks should be conscious about this issues. Nobody is excluded, you can always have a next iteration. Gruss Bernd ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey 2007-04-17 0:06 ` Peter Williams 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 6:43 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: > On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > The casual observer will be completely confused by what on earth has happened > here so let me try to demystify things for them. [...] Demystify what? The casual observer need only read either your attempt at writing a scheduler, or my attempts at fixing the one we have, to see that it was high time for someone with the necessary skills to step in. Now progress can happen, which was _not_ happening before. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith ` (2 more replies) 2007-04-17 0:06 ` Peter Williams 1 sibling, 3 replies; 713+ messages in thread From: Bill Huey @ 2007-04-15 8:36 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > [...] > > Demystify what? The casual observer need only read either your attempt Here's the problem. You're a casual observer and obviously not paying attention. > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. > Now progress can happen, which was _not_ happening before. I think that's inaccurate and there are plenty of folks that have that technical skill and background. The scheduler code isn't a deep mystery and there are plenty of good kernel hackers out here across many communities. Ingo isn't the only person on this planet to have deep scheduler knowledge. Priority heaps are not new and Solaris has had a pluggable scheduler framework for years. Con's characterization is something that I'm more prone to believe about how Linux kernel development works versus your view. I think it's a great shame to have folks like Bill Irwin and Con waste time trying to do something right only to have their ideas attacked, then copied and held up as the solution for this kind of technical problem, in a complete reversal of technical opinion as it suits the moment. This is just wrong in so many ways. It outlines the problems with Linux kernel development and questionable elitism regarding ownership of certain sections of the kernel code.
I call it "churn squat", and instances like this only support that view, which I would rather were completely wrong and inaccurate. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey @ 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-15 8:45 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Now progress can happen, which was _not_ happening before. > > I think that's inaccurate and there are plenty of folks that have that > technical skill and background. The scheduler code isn't a deep mystery > and there are plenty of good kernel hackers out here across many > communities. Ingo isn't the only person on this planet to have deep > scheduler knowledge. Ok <shrug>, I'm not paying attention, and you can't read. We're even. Have a nice life. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith @ 2007-04-15 9:06 ` Ingo Molnar 2007-04-16 10:00 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 9:06 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your > > attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. guys, please calm down. Judging by the number of contributions to sched.c the main folks who are not 'observers' here and who thus have an unalienable right to be involved in a nasty flamewar about scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else is just a happy bystander, ok? ;-) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-16 10:00 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-16 10:00 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Peter Williams, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > guys, please calm down. Judging by the number of contributions to > sched.c the main folks who are not 'observers' here and who thus have > an unalienable right to be involved in a nasty flamewar about > scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else > is just a happy bystander, ok? ;-) just to make sure: this is a short (and incomplete) list of contributors related to scheduler interactivity code. The full list of contributors to sched.c includes many other people as well: Peter, Suresh, Christoph, Kenneth and many others. Even the git logs, which only span 2 years out of 15, already list 79 individual contributors to kernel/sched.c. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-15 16:25 ` Arjan van de Ven 2007-04-16 5:36 ` Bill Huey 2 siblings, 1 reply; 713+ messages in thread From: Arjan van de Ven @ 2007-04-15 16:25 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner > It outlines the problems with Linux kernel development and questionable > elitism regarding ownership of certain sections of the kernel code. I have to step in and disagree here.... Linux is not about who writes the code. Linux is about getting the best solution for a problem. Who wrote which line of the code is irrelevant in the big picture. That often means that multiple implementations happen, and that a darwinistic process decides that the best solution wins. This darwinistic process often happens in the form of discussion, and that discussion can happen with words or with code. In this case it happened with a code proposal. To make this specific: it has happened many times to me that when I solved an issue with code, someone else stepped in and wrote a different solution (although that was usually for smaller pieces). Was I upset about that? No! I was happy because my *problem got solved* in the best possible way. Now this doesn't mean that people shouldn't be nice to each other, not cooperate or steal credits, but I don't get the impression that that is happening here. Ingo is taking part in the discussion with a counter proposal for discussion *on the mailing list*. What more do you want?? If you or anyone else can improve it or do better, take part in this discussion and show what you mean either in words or in code. Your qualification of the discussion as an elitist takeover... I disagree with that. It's a *discussion*.
Now if you agree that Ingo's patch is better technically, you and others should be happy about that because your problem is getting solved better. If you don't agree that his patch is better technically, take part in the technical discussion. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:25 ` Arjan van de Ven @ 2007-04-16 5:36 ` Bill Huey 2007-04-16 6:17 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Bill Huey @ 2007-04-16 5:36 UTC (permalink / raw) To: Arjan van de Ven Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > Now this doesn't mean that people shouldn't be nice to each other, not > cooperate or steal credits, but I don't get the impression that that is > happening here. Ingo is taking part in the discussion with a counter > proposal for discussion *on the mailing list*. What more do you want?? Con should have been CCed from the first moment this was put into motion to limit the perception of exclusion. That was mistake number one, and a big-time failure to understand this dynamic. After all, it was Con's idea. Why the hell he was excluded from Ingo's development process is baffling to me and him (most likely). He put in a lot of effort into SDL and his experiences with scheduling should still be seriously considered in this development process even if he doesn't write a single line of code from this moment on. What should have happened is that our very busy associate at RH by the name of Ingo Molnar should have leveraged more of Con's and Bill's work and used them as a proxy for his own ideas. They would have loved to have contributed more and our very busy Ingo Molnar would have gotten a lot of his work and ideas implemented without him even opening a single source file for editing. They would have happily done this work for Ingo. Ingo could have been used for something else more important like making KVM less of a freaking ugly hack and we all would have benefitted from this. 
He could have been working on SystemTap so that you stop losing accounts to Sun and Solaris 10's DTrace. He could have been working with Riel to fix your butt ugly page scanning problem causing horrible contention via the Clock/Pro algorithm, etc... He could have been fixing the ugly futex rwsem mapping problem that's killing -rt and anything that uses POSIX threads. He could have created a userspace thread control block (TCB) with Mr. Drepper so that we can turn off preemption in userspace (userspace per CPU local storage) and implement a very quick non-kernel crossing implementation of priority ceilings (userspace check for priority and flags at preempt_schedule() in the TCB) so that our -rt POSIX API doesn't suck donkey shit... Need I say more ? As programmers like Ingo get spread more thinly, he needs super smart folks like Bill Irwin and Con to help him out, and he needs to learn to resist NIH-ing folks' stuff out of some weird fear. When this happens, folks like Ingo must learn to "facilitate" development in addition to implementing it with those kinds of folks. It takes time and practice for him to entrust folks to do things for him. Ingo is the best method of getting new Linux kernel ideas and communicating them to Linus. His value goes beyond just code and he is often the biggest hammer we have in the Linux community to get stuff into the kernel. "Facilitation" of others is something that solo programmers need when groups like the Linux kernel get larger and larger every year. Understand ? Are we in embarrassing agreement here ? bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:36 ` Bill Huey @ 2007-04-16 6:17 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-16 6:17 UTC (permalink / raw) To: Bill Huey Cc: Arjan van de Ven, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner On Sun, Apr 15, 2007 at 10:36:29PM -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > > Now this doesn't mean that people shouldn't be nice to each other, not > > cooperate or steal credits, but I don't get the impression that that is > > happening here. Ingo is taking part in the discussion with a counter > > proposal for discussion *on the mailing list*. What more do you want?? > > Con should have been CCed from the first moment this was put into motion > to limit the perception of exclusion. That was mistake number one and big > time failures to understand this dynamic. After it was Con's idea. Why > the hell he was excluded from Ingo's development process is baffling to > me and him (most likely). Ingo's scheduler is completely different to any I've seen proposed for Linux. And after he did an initial implementation, he did post it to everyone. Maybe something he said offended someone, but the process followed is exactly how Linux kernel development works (ie. if you think you can do better, then write the code). Sometimes you can give suggestions, but other times if you come up with a different idea then it is better just to do it yourself. Con's code is still out there. If it is better than Ingo's then it should win out. Nobody has a monopoly on schedulers or ideas or posting patches. > He put int a lot of effort into SDL and his experiences with scheduling > should still be seriously considered in this development process even if > he doesn't write a single line of code from this moment on. 
> > What should have happened is that our very busy associate at RH by the > name of Ingo Molnar should have leverage more of Con's and Bill's work > and use them as a proxy for his own ideas. They would have loved to have > contributed more and our very busy Ingo Molnar would have gotten a lot > of his work and ideas implemented without him even opening a single > source file for editting. They would have happily done this work for > Ingo. Ingo could have been used for something else more important like > making KVM less of a freaking ugly hack and we all would have benefitted > from this. > > He could have been working on SystemTap so that you stop losing accounts > to Sun and Solaris 10's Dtrace. He could have been working with Riel to > fix your butt ugly page scanning problem causing horrible contention via > the Clock/Pro algorithm, etc... He could have been fixing the ugly futex > rwsem mapping problem that's killing -rt and anything that uses Posix > threads. He could have created a userspace thread control block (TCB) > with Mr. Drepper so that we can turn off preemption in userspace > (userspace per CPU local storage) and implement a very quick non-kernel > crossing implementation of priority ceilings (userspace check for priority > and flags at preempt_schedule() in the TCB) so that our -rt Posix API > doesn't suck donkey shit... Need I say more ? Well that's some pretty strong criticism of Linux and of someone who does a great deal to improve things... Let's stick to the topic of schedulers in this thread and try keeping it constructive. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey @ 2007-04-17 0:06 ` Peter Williams 2007-04-17 2:29 ` Mike Galbraith 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 0:06 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: >> On Saturday 14 April 2007 06:21, Ingo Molnar wrote: >>> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler >>> [CFS] >>> >>> i'm pleased to announce the first release of the "Modular Scheduler Core >>> and Completely Fair Scheduler [CFS]" patchset: >>> >>> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch >>> >>> This project is a complete rewrite of the Linux task scheduler. My goal >>> is to address various feature requests and to fix deficiencies in the >>> vanilla scheduler that were suggested/found in the past few years, both >>> for desktop scheduling and for server scheduling workloads. >> The casual observer will be completely confused by what on earth has happened >> here so let me try to demystify things for them. > > [...] > > Demystify what? The casual observer need only read either your attempt > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. Make that "someone with the necessary clout". > Now progress can happen, which was _not_ happening before. > This is true. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:06 ` Peter Williams @ 2007-04-17 2:29 ` Mike Galbraith 2007-04-17 3:40 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-17 2:29 UTC (permalink / raw) To: Peter Williams Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > Mike Galbraith wrote: > > > > Demystify what? The casual observer need only read either your attempt > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Make that "someone with the necessary clout". No, I was brutally honest to both of us, but quite correct. > > Now progress can happen, which was _not_ happening before. > > > > This is true. Yup, and progress _is_ happening now, quite rapidly. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 2:29 ` Mike Galbraith @ 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:17 ` Peter Williams 0 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 3:40 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > > Mike Galbraith wrote: > > > > > > Demystify what? The casual observer need only read either your attempt > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > > that it was high time for someone with the necessary skills to step in. > > > > Make that "someone with the necessary clout". > > No, I was brutally honest to both of us, but quite correct. > > > > Now progress can happen, which was _not_ happening before. > > > > > > > This is true. > > Yup, and progress _is_ happening now, quite rapidly. Progress as in progress on Ingo's scheduler. I still don't know how we'd decide when to replace the mainline scheduler or with what. I don't think we can say Ingo's is better than the alternatives, can we? If there is some kind of bakeoff, then I'd like one of Con's designs to be involved, and mine, and Peter's... Maybe the progress is that more key people are becoming open to the idea of changing the scheduler. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin @ 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 3:43 ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang ` (2 more replies) 2007-04-17 4:17 ` Peter Williams 1 sibling, 3 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-17 4:01 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? No, that would require massive performance testing of all alternatives. > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... The trouble with a bakeoff is that it's pretty darn hard to get people to test in the first place, and then comes weighing the subjective impressions and the hard performance numbers. If they're close in numbers, do you go with the one which starts the fewest flamewars or what? > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. Could be. All was quiet for quite a while, but when RSDL showed up, it aroused enough interest to show that scheduling woes are on folks' radar. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 3:43 ` David Lang 2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin 2007-04-20 20:36 ` Bill Davidsen 2 siblings, 0 replies; 713+ messages in thread From: David Lang @ 2007-04-17 3:43 UTC (permalink / raw) To: Mike Galbraith Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Mike Galbraith wrote: > Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely > FairScheduler [CFS] > > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: >> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>> Yup, and progress _is_ happening now, quite rapidly. >> >> Progress as in progress on Ingo's scheduler. I still don't know how we'd >> decide when to replace the mainline scheduler or with what. >> >> I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > >> If there is some kind of bakeoff, then I'd like one of Con's designs to >> be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? it's especially hard if the people doing the testing need to find the latest patch and apply it. even having a compile-time option to switch between them at least means that the testers can have confidence that the various patches haven't bitrotted. 
boot time options would be even better, but I understand from previous discussions I've watched that this is performance critical enough that the overhead of this would throw off the results. David Lang ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 3:43 ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang @ 2007-04-17 4:14 ` Nick Piggin 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 2007-04-20 20:36 ` Bill Davidsen 2 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 4:14 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 06:01:29AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > > > Yup, and progress _is_ happening now, quite rapidly. > > > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > > decide when to replace the mainline scheduler or with what. > > > > I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > > > If there is some kind of bakeoff, then I'd like one of Con's designs to > > be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? I don't know how you'd do it. I know you wouldn't count people telling you how good they are (getting people to tell you how bad they are, and whether others do better in a given situation might be slightly more viable). But we have to choose somehow. I'd hope that is going to be based solely on the results and technical properties of the code, so... 
if we were to somehow determine that the results are exactly the same, we'd go for the simpler one, wouldn't we? > > Maybe the progress is that more key people are becoming open to the idea > > of changing the scheduler. > > Could be. All was quiet for quite a while, but when RSDL showed up, it > aroused enough interest to show that scheduling woes is on folks radar. Well I know people have had woes with the scheduler forever (I guess that isn't going to change :P). I think people generally lost a bit of interest in trying to improve the situation because of the upstream problem. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin @ 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > Well I know people have had woes with the scheduler for ever (I guess that > isn't going to change :P). I think people generally lost a bit of interest > in trying to improve the situation because of the upstream problem. Yes. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin 2007-04-17 6:26 ` Peter Williams @ 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams 2007-04-20 20:47 ` Bill Davidsen 1 sibling, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 9:51 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Peter Williams, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > > Maybe the progress is that more key people are becoming open to > > > the idea of changing the scheduler. > > > > Could be. All was quiet for quite a while, but when RSDL showed up, > > it aroused enough interest to show that scheduling woes is on folks > > radar. > > Well I know people have had woes with the scheduler for ever (I guess > that isn't going to change :P). [...] yes, that part isn't going to change, because the CPU is a _scarce resource_ that is perhaps the most frequently overcommitted physical computer resource in existence, and because the kernel does not (yet) track eye movements of humans to figure out which tasks are more important to them. So critical human constraints are unknown to the scheduler and thus complaints will always come. The upstream scheduler thought it had enough information: the sleep average. So now the attempt is to go back and _simplify_ the scheduler and remove that information, and concentrate on getting fairness precisely right. The magic thing about 'fairness' is that it's a pretty good default policy if we decide that we simply do not have enough information to make an intelligent choice. ( Let's be cautious though: the jury is still out whether people actually like this more than the current approach. 
While CFS feedback looks promising after a whopping 3 days of it being released [ ;-) ], the test coverage of all 'fairness centric' schedulers, even considering years of availability is less than 1% i'm afraid, and that < 1% was mostly self-selecting. ) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar @ 2007-04-17 13:44 ` Peter Williams 2007-04-17 23:00 ` Michael K. Edwards 2007-04-20 20:47 ` Bill Davidsen 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 13:44 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Nick Piggin <npiggin@suse.de> wrote: > >>>> Maybe the progress is that more key people are becoming open to >>>> the idea of changing the scheduler. >>> Could be. All was quiet for quite a while, but when RSDL showed up, >>> it aroused enough interest to show that scheduling woes is on folks >>> radar. >> Well I know people have had woes with the scheduler for ever (I guess >> that isn't going to change :P). [...] > > yes, that part isnt going to change, because the CPU is a _scarce > resource_ that is perhaps the most frequently overcommitted physical > computer resource in existence, and because the kernel does not (yet) > track eye movements of humans to figure out which tasks are more > important them. So critical human constraints are unknown to the > scheduler and thus complaints will always come. > > The upstream scheduler thought it had enough information: the sleep > average. So now the attempt is to go back and _simplify_ the scheduler > and remove that information, and concentrate on getting fairness > precisely right. The magic thing about 'fairness' is that it's a pretty > good default policy if we decide that we simply have not enough > information to do an intelligent choice. > > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. 
While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) At this point I'd like to make the observation that spa_ebs is a very fair scheduler if you consider "nice" to be an indication of the relative entitlement of tasks to CPU bandwidth. It works by mapping nice to shares using a function very similar to the one for calculating p->load_weight except it's not offset by the RT priorities as RT is handled separately. In theory, a runnable task's entitlement to CPU bandwidth at any time is the ratio of its shares to the total shares held by runnable tasks on the same CPU (in reality, a smoothed average of this sum is used to make scheduling smoother). The dynamic priorities of the runnable tasks are then fiddled to try to keep each task's CPU bandwidth usage in proportion to its entitlement. That's the theory anyway. The actual implementation looks a bit different due to efficiency considerations. The modifications to the above theory boil down to keeping a running measure of the (recent) highest CPU bandwidth use per share for tasks running on the CPU -- I call this the yardstick for this CPU. When it's time to put a task on the run queue its dynamic priority is determined by comparing its CPU bandwidth per share value with the yardstick for its CPU. If it's greater than the yardstick this value becomes the new yardstick and the task gets given the lowest possible dynamic priority (for its scheduling class). If the value is zero it gets the highest possible priority (for its scheduling class) which would be MAX_RT_PRIO for a SCHED_OTHER task. Otherwise it gets given a priority between these two extremes proportional to the ratio of its CPU bandwidth per share value and the yardstick. Quite simple really. 
The other way in which the code deviates from the original is that (for a few years now) I no longer calculate CPU bandwidth usage directly. I've found that the overhead is less if I keep a running average of the size of a task's CPU bursts and the length of its scheduling cycle (i.e. from on CPU one time to on CPU next time) and use the ratio of these values as a measure of bandwidth usage. Anyway it works and gives very predictable allocations of CPU bandwidth based on nice. Another good feature is that (in this pure form) it's starvation free. However, if you fiddle with it and do things like giving bonus priority boosts to interactive tasks it becomes susceptible to starvation. This can be fixed by using an anti-starvation mechanism such as SPA's promotion scheme and that's what spa_ebs does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
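[ Editor's sketch, not part of the thread: the yardstick scheme Peter describes for spa_ebs can be illustrated roughly as below. This is a toy model under stated assumptions, not spa_ebs code. The names nice_to_shares(), bandwidth(), Cpu and dynamic_priority() are made up for illustration, the linear nice-to-shares mapping is a placeholder for the real function (which the mail only says resembles the load weight calculation), and the yardstick here never ages even though the mail says it tracks a recent maximum. Only MAX_RT_PRIO and the three priority cases come from the description above. ]

```python
# Toy model of the yardstick scheme described above. All names and the
# nice-to-shares mapping are illustrative assumptions, not spa_ebs code.

MAX_RT_PRIO = 100           # best dynamic priority for a SCHED_OTHER task
MAX_PRIO = 140              # priorities run 0..139 in 2.6-era kernels
LOWEST_PRIO = MAX_PRIO - 1  # worst dynamic priority for a SCHED_OTHER task


def nice_to_shares(nice):
    """Hypothetical linear mapping: nice -20..19 -> 40..1 shares."""
    return 20 - nice


def bandwidth(avg_burst, avg_cycle):
    """Estimate CPU bandwidth use as the average CPU burst length divided
    by the average scheduling cycle length (on CPU to next on CPU)."""
    if avg_cycle <= 0:
        return 0.0
    return avg_burst / avg_cycle


class Cpu:
    def __init__(self):
        # Highest bandwidth-per-share seen so far on this CPU. The mail
        # says a *recent* maximum is kept; the ageing rule is not given,
        # so this sketch simply never decays it.
        self.yardstick = 0.0


def dynamic_priority(cpu, nice, avg_burst, avg_cycle):
    """Pick a dynamic priority when a task is put on the run queue."""
    per_share = bandwidth(avg_burst, avg_cycle) / nice_to_shares(nice)
    if per_share == 0:
        return MAX_RT_PRIO            # used nothing: best priority
    if per_share > cpu.yardstick:
        cpu.yardstick = per_share     # new heaviest user per share
        return LOWEST_PRIO
    # Otherwise scale between the two extremes by the ratio to the yardstick.
    fraction = per_share / cpu.yardstick
    return MAX_RT_PRIO + int(fraction * (LOWEST_PRIO - MAX_RT_PRIO))
```

[ In this model a task that uses more CPU per share than anything seen so far lands at the bottom of its class, an idle task lands at the top, and for the same measured usage a task with more shares (a lower nice value) ends up with a numerically lower, i.e. better, priority. ]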
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:44 ` Peter Williams @ 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III 2007-04-18 2:39 ` Peter Williams 0 siblings, 2 replies; 713+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:00 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > The other way in which the code deviates from the original as that (for > a few years now) I no longer calculated CPU bandwidth usage directly. > I've found that the overhead is less if I keep a running average of the > size of a tasks CPU bursts and the length of its scheduling cycle (i.e. > from on CPU one time to on CPU next time) and using the ratio of these > values as a measure of bandwidth usage. > > Anyway it works and gives very predictable allocations of CPU bandwidth > based on nice. Works, that is, right up until you add nonlinear interactions with CPU speed scaling. From my perspective as an embedded platform integrator, clock/voltage scaling is the elephant in the scheduler's living room. Patch in DPM (now OpPoint?) to scale the clock based on what task is being scheduled, and suddenly the dynamic priority calculations go wild. Nip this in the bud by putting an RT priority on the relevant threads (which you have to do anyway if you need remotely audio-grade latency), and the lock affinity heuristics break, so you have to hand-tune all the thread priorities. Blecch. Not to mention the likelihood that the task whose clock speed you're trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority than the application. (You want to crank the CPU for this task because it runs with the RF hot, which may cost you as much power as the rest of the platform.) 
You'd better hope you can remove it from the dynamic priority heuristics with SCHED_BATCH. Otherwise everything _else_ has to be RT priority (or it'll be starved by the soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double blecch! Is it too much to ask for someone with actual engineering training (not me, unfortunately) to sit down and build a negative-feedback control system that handles soft-real-time _and_ dynamic-priority _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock scaling? And actually separates the accounting and control mechanisms from the heuristics, so the latter can be tuned (within a well documented stable range) to reflect the expected system usage patterns? It's not like there isn't a vast literature in this area over the past decade, including some dealing specifically with clock scaling consistent with low-latency applications. It's a pity that people doing academic work in this area rarely wade into LKML, even when they're hacking on a Linux fork. But then, there's not much economic incentive for them to do so, and they can usually get their fill of citation politics and dominance games without leaving their home department. :-P Seriously, though. If you're really going to put the mainline scheduler through this kind of churn, please please pretty please knit in per-task clock scaling (possibly even rejigged during the slice; see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of linger mechanism to keep from taking context switch hits when you're confident that an I/O will complete quickly. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards @ 2007-04-17 23:07 ` William Lee Irwin III 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 2:39 ` Peter Williams 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 23:07 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:00:53PM -0700, Michael K. Edwards wrote: > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. [...not terribly enlightening stuff trimmed...] The ongoing scheduler work is on a much more basic level than these affairs I'm guessing you googled. When the basics work as intended it will be possible to move on to more advanced issues. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 0:36 ` Bill Huey 0 siblings, 1 reply; 713+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:52 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > The ongoing scheduler work is on a much more basic level than these > affairs I'm guessing you googled. When the basics work as intended it > will be possible to move on to more advanced issues. OK, let me try this in smaller words for people who can't tell bitter experience from Google hits. CPU clock scaling for power efficiency is already the only thing that matters about the Linux scheduler in my world, because battery-powered device vendors in their infinite wisdom are abandoning real RTOSes in favor of Linux now that WiFi is the "in" thing (again). And on the timescale that anyone will actually be _using_ this shiny new scheduler of Ingo's, it'll be nearly the only thing that matters about the Linux scheduler in anyone's world, because the amount of work the CPU can get done in a given minute will depend mostly on how intelligently it can spend its heat dissipation budget. Clock scaling schemes that aren't integral to the scheduler design make a bad situation (scheduling embedded loads with shotgun heuristics tuned for desktop CPUs) worse, because the opaque heuristics are now being applied to distorted data. Add a "smoothing" scheme for the distorted data, and you may find that you have introduced an actual control-path instability. 
A small fluctuation in the data (say, two bursts of interrupt traffic at just the right interval) can result in a long-lasting oscillation in some task's "dynamic priority" -- and, on a fully loaded CPU, in the time that task actually gets. If anything else depends on how much work this task gets done each time around, the oscillation can easily propagate throughout the system. Thrash city. (If you haven't seen this happen on real production systems under what shouldn't be a pathological load, you haven't been around long. The classic mechanisms that triggered oscillations in, say, early SMP Solaris boxes haven't bitten recently, perhaps because most modern CPUs don't lose their marbles so comprehensively on context switch. But I got to live this nightmare again recently on ARM Linux, due to some impressively broken application-level threading/locking "design", whose assumptions about scheduler behavior got broken when I switched to an NPTL toolchain.) I don't have the training to design a scheduler that isn't vulnerable to control-feedback oscillations. Neither do you, if you haven't taken (and excelled at) a control theory course, which nowadays seems to be taught by applied math and ECE departments and too often skipped by CS types. But I can recognize an impending train wreck when I see it. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:52 ` Michael K. Edwards @ 2007-04-18 0:36 ` Bill Huey 0 siblings, 0 replies; 713+ messages in thread From: Bill Huey @ 2007-04-18 0:36 UTC (permalink / raw) To: Michael K. Edwards Cc: William Lee Irwin III, Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote: > On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > >The ongoing scheduler work is on a much more basic level than these > >affairs I'm guessing you googled. When the basics work as intended it > >will be possible to move on to more advanced issues. ... Will probably shouldn't have dismissed your points, but he probably means that we can't even get at this stuff until the fundamentals are in place. > Clock scaling schemes that aren't integral to the scheduler design > make a bad situation (scheduling embedded loads with shotgun > heuristics tuned for desktop CPUs) worse, because the opaque > heuristics are now being applied to distorted data. Add a "smoothing" > scheme for the distorted data, and you may find that you have > introduced an actual control-path instability. A small fluctuation in > the data (say, two bursts of interrupt traffic at just the right > interval) can result in a long-lasting oscillation in some task's > "dynamic priority" -- and, on a fully loaded CPU, in the time that > task actually gets. If anything else depends on how much work this > task gets done each time around, the oscillation can easily propagate > throughout the system. Thrash city. Hyperthreading issues are quite similar to clock scaling issues. Con's infrastructure changes to move things in that direction were rejected, as were other infrastructure changes, which further infuriated Con and led him to drop development on RSDL and derivatives.
bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-18 2:39 ` Peter Williams 1 sibling, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-18 2:39 UTC (permalink / raw) To: Michael K. Edwards Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Michael K. Edwards wrote: > On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >> The other way in which the code deviates from the original as that (for >> a few years now) I no longer calculated CPU bandwidth usage directly. >> I've found that the overhead is less if I keep a running average of the >> size of a tasks CPU bursts and the length of its scheduling cycle (i.e. >> from on CPU one time to on CPU next time) and using the ratio of these >> values as a measure of bandwidth usage. >> >> Anyway it works and gives very predictable allocations of CPU bandwidth >> based on nice. > > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. > > Not to mention the likelihood that the task whose clock speed you're > trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority > than the application. 
(You want to crank the CPU for this task > because it runs with the RF hot, which may cost you as much power as > the rest of the platform.) You'd better hope you can remove it from > the dynamic priority heuristics with SCHED_BATCH. Otherwise > everything _else_ has to be RT priority (or it'll be starved by the > soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double > blecch! > > Is it too much to ask for someone with actual engineering training > (not me, unfortunately) to sit down and build a negative-feedback > control system that handles soft-real-time _and_ dynamic-priority > _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock > scaling? And actually separates the accounting and control mechanisms > from the heuristics, so the latter can be tuned (within a well > documented stable range) to reflect the expected system usage > patterns? > > It's not like there isn't a vast literature in this area over the past > decade, including some dealing specifically with clock scaling > consistent with low-latency applications. It's a pity that people > doing academic work in this area rarely wade into LKML, even when > they're hacking on a Linux fork. But then, there's not much economic > incentive for them to do so, and they can usually get their fill of > citation politics and dominance games without leaving their home > department. :-P > > Seriously, though. If you're really going to put the mainline > scheduler through this kind of churn, please please pretty please knit > in per-task clock scaling (possibly even rejigged during the slice; > see e. g. Yuan and Nahrstedt's GRACE-OS papers) and some sort of > linger mechanism to keep from taking context switch hits when you're > confident that an I/O will complete quickly. I think that this doesn't affect the basic design principles of spa_ebs but just means that the statistics that it uses need to be rethought. E.g.
instead of measuring average CPU usage per burst in terms of wall clock time spent on the CPU measure it in terms of CPU capacity (for the want of a better word) used per burst. I don't have suitable hardware for investigating this line of attack further, unfortunately, and have no idea what would be the best way to calculate this new statistic. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
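[ Editorial sketch, not from the thread: one way to read Peter's suggestion above is to keep the same two running averages (burst size and scheduling-cycle length) but weight each burst by the clock speed it ran at. The class, field names, smoothing weight and the assumption that work done scales linearly with clock frequency are all hypothetical; this is not spa_ebs code, and a memory-bound task would not scale this way. ]

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Running averages for one task (all names here are hypothetical)."""
    avg_burst: float = 0.0   # average work per CPU burst, in full-speed nanoseconds
    avg_cycle: float = 0.0   # average wall-clock length of a scheduling cycle, in ns
    alpha: float = 0.25      # weight given to the newest sample (arbitrary choice)

    def record(self, burst_ns, cycle_ns, freq_khz, max_freq_khz):
        # Assumption: work done is proportional to clock frequency, so a burst
        # at half speed counts as half as many "full-speed" nanoseconds.
        work = burst_ns * freq_khz / max_freq_khz
        self.avg_burst += self.alpha * (work - self.avg_burst)
        self.avg_cycle += self.alpha * (cycle_ns - self.avg_cycle)

    def bandwidth(self):
        # Ratio of the two running averages, as in Peter's description.
        return self.avg_burst / self.avg_cycle if self.avg_cycle else 0.0


# Two tasks doing the same work per cycle: one at full clock, one at half
# clock with bursts that take twice as long on the wall clock.
fast, slow = TaskStats(), TaskStats()
for _ in range(20):
    fast.record(burst_ns=1_000_000, cycle_ns=10_000_000,
                freq_khz=600_000, max_freq_khz=600_000)
    slow.record(burst_ns=2_000_000, cycle_ns=10_000_000,
                freq_khz=300_000, max_freq_khz=600_000)
```

Under these assumptions both tasks end up with the same bandwidth estimate, whereas a pure wall-clock measure would charge the down-clocked task twice as much.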
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams @ 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 2 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-20 20:47 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Ingo Molnar wrote: > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) > All of my testing has been on desktop machines, although in most cases they were really loaded desktops which had load avg 10..100 from time to time, and none were low memory machines. Up to CFS v3 I thought nicksched was my winner, now CFSv3 looks better, by not having stumbles under stupid loads. I have not tested: 1 - server loads, nntp, smtp, etc 2 - low memory machines 3 - uniprocessor systems I think this should be done before drawing conclusions. Or if someone has tried this, perhaps they would report what they saw. People are talking about smoothness, but not how many pages per second come out of their overloaded web server. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen @ 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-21 7:39 UTC (permalink / raw) To: Bill Davidsen Cc: Ingo Molnar, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven On Fri, Apr 20, 2007 at 04:47:27PM -0400, Bill Davidsen wrote: > Ingo Molnar wrote: > > >( Lets be cautious though: the jury is still out whether people actually > > like this more than the current approach. While CFS feedback looks > > promising after a whopping 3 days of it being released [ ;-) ], the > > test coverage of all 'fairness centric' schedulers, even considering > > years of availability is less than 1% i'm afraid, and that < 1% was > > mostly self-selecting. ) > > > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time to > time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having stumbles > under stupid loads. What base_timeslice were you using for nicksched, and what HZ? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin @ 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-21 8:33 UTC (permalink / raw) To: Bill Davidsen Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven * Bill Davidsen <davidsen@tmr.com> wrote: > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time > to time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having > stumbles under stupid loads. nice! I hope CFSv4 kept that good tradition too ;) > I have not tested: > 1 - server loads, nntp, smtp, etc > 2 - low memory machines > 3 - uniprocessor systems > > I think this should be done before drawing conclusions. Or if someone > has tried this, perhaps they would report what they saw. People are > talking about smoothness, but not how many pages per second come out > of their overloaded web server. i tested heavily swapping systems. (make -j50 workloads easily trigger that) I also tested UP systems and a handful of SMP systems. I have also tested massive_intr.c which i believe is an indicator of how fairly CPU time is distributed between partly sleeping partly running server threads. But i very much agree that diverse feedback is sought and welcome, both from those who are happy with the current scheduler and those who are unhappy about it. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 3:43 ` [Announce] [patch] Modular Scheduler Core and Completely FairScheduler [CFS] David Lang 2007-04-17 4:14 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Nick Piggin @ 2007-04-20 20:36 ` Bill Davidsen 2 siblings, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-20 20:36 UTC (permalink / raw) To: Mike Galbraith Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: >> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>> Yup, and progress _is_ happening now, quite rapidly. >> Progress as in progress on Ingo's scheduler. I still don't know how we'd >> decide when to replace the mainline scheduler or with what. >> >> I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > >> If there is some kind of bakeoff, then I'd like one of Con's designs to >> be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? > Here we disagree... I picked a scheduler not by running benchmarks, but by running loads which piss me off with the mainline scheduler. And then I ran the other schedulers for a while to find the things, normal things I do, which resulted in bad behavior. And when I found one which had (so far) no such cases I called it my winner, but I haven't tested it under server load, so I can't begin to say it's "the best." 
What we need is for lots of people to run every scheduler in real life, and do "worst case analysis" by finding the cases which cause bad behavior. And if there were a way to easily choose another scheduler, call it pluggable, modular, or Russian Roulette, people who found a worst case would report it (aka bitch about it) and try another. But the average user is better able to boot with an option like "sched=cfs" (or sc, or nick, or ...) than to patch and build a kernel. So if we don't get easily switched schedulers people will not test nearly as well. The best scheduler isn't the one 2% faster than the rest, it's the one with the fewest jackpot cases where it sucks. And if the mainline had multiple schedulers this testing would get done, authors would get more reports and have a better chance of fixing corner cases. Note that we really need multiple schedulers to make people happy, because fairness is not the most desirable behavior on all machines, and adding knobs probably isn't the answer. I want a server to degrade gently, I want my desktop to show my movie and echo my typing, and if that's hard on compiles or the file transfer, so be it. Con doesn't want to compromise his goals, I agree but want to have an option if I don't share them. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 4:17 ` Peter Williams 2007-04-17 4:29 ` Nick Piggin 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 4:17 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>> Mike Galbraith wrote: >>>> Demystify what? The casual observer need only read either your attempt >>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>> that it was high time for someone with the necessary skills to step in. >>> Make that "someone with the necessary clout". >> No, I was brutally honest to both of us, but quite correct. >> >>>> Now progress can happen, which was _not_ happening before. >>>> >>> This is true. >> Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... I myself was thinking of this as the chance for a much needed simplification of the scheduling code and if this can be done with the result being "reasonable" it then gives us the basis on which to propose improvements based on the ideas of others such as you mention. As the size of the cpusched indicates, trying to evaluate alternative proposals based on the current O(1) scheduler is fraught. Hopefully, this initiative can fix this problem. 
Then we just need Ingo to listen to suggestions and he's showing signs of being willing to do this :-) > > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. That too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:17 ` Peter Williams @ 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 4:29 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > >>>Mike Galbraith wrote: > >>>>Demystify what? The casual observer need only read either your attempt > >>>>at writing a scheduler, or my attempts at fixing the one we have, to see > >>>>that it was high time for someone with the necessary skills to step in. > >>>Make that "someone with the necessary clout". > >>No, I was brutally honest to both of us, but quite correct. > >> > >>>>Now progress can happen, which was _not_ happening before. > >>>> > >>>This is true. > >>Yup, and progress _is_ happening now, quite rapidly. > > > >Progress as in progress on Ingo's scheduler. I still don't know how we'd > >decide when to replace the mainline scheduler or with what. > > > >I don't think we can say Ingo's is better than the alternatives, can we? > >If there is some kind of bakeoff, then I'd like one of Con's designs to > >be involved, and mine, and Peter's... > > I myself was thinking of this as the chance for a much needed > simplification of the scheduling code and if this can be done with the > result being "reasonable" it then gives us the basis on which to propose > improvements based on the ideas of others such as you mention. > > As the size of the cpusched indicates, trying to evaluate alternative > proposals based on the current O(1) scheduler is fraught. Hopefully, I don't know why. 
The problem is that you can't really evaluate good proposals by looking at the code (you can say that one is bad, ie. the current one, which has a huge amount of temporal complexity and is explicitly unfair), but it is pretty hard to say one behaves well. And my scheduler for example cuts down the amount of policy code and code size significantly. I haven't looked at Con's ones for a while, but I believe they are also much more straightforward than mainline... For example, let's say all else is equal between them, then why would we go with the O(logN) implementation rather than the O(1)? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin @ 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:10 ` Nick Piggin 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 713+ messages in thread From: Willy Tarreau @ 2007-04-17 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Hi Nick, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: (...) > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? Of course, if this is the case, the question will be raised. But as a general rule, I don't see much potential in O(1) to finely tune scheduling according to several criteria. In O(logN), you can adjust scheduling in realtime at a very low cost. Better processing of varying priorities or fork() comes to mind. Regards, Willy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:10 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 6:10 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote: > Hi Nick, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > (...) > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > Of course, if this is the case, the question will be raised. But as a > general rule, I don't see much potential in O(1) to finely tune scheduling > according to several criteria. What do you mean? By what criteria? > In O(logN), you can adjust scheduling in > realtime at a very low cost. Better processing of varying priorities or > fork() comes to mind. The main problem as I see it is choosing which task to run next and how much time to run it for. And given that there are typically far less than 58 (the number of priorities in nicksched) runnable tasks for a desktop system, I don't find it at all constraining to quantize my "next runnable" criteria onto that size of key. Even if you do expect a huge number of runnable tasks, you would hope for fewer interactive ones toward the higher end of the priority scale. Handwaving or even detailed design descriptions is simply not the best way to decide on a new scheduler, is all I'm saying. ^ permalink raw reply [flat|nested] 713+ messages in thread
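[ Editorial sketch, not from the thread: Nick's point about quantizing the "next runnable" decision onto a fixed number of priority levels can be illustrated with a bitmap-indexed array of FIFO queues. The class and method names are hypothetical, the "lower index runs first" convention is an assumption, and this is not nicksched's actual code. Only the figure of 58 levels comes from Nick's message. ]

```python
from collections import deque

NUM_PRIOS = 58  # number of priority levels Nick cites for nicksched


class PrioArray:
    """Fixed set of FIFO queues plus a bitmap of the non-empty ones."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]
        self.bitmap = 0  # bit p is set while queues[p] is non-empty

    def enqueue(self, task, prio):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if not self.bitmap:
            return None
        # Isolate the lowest set bit; its index is the best non-empty level
        # (assumption: a lower index means a higher priority).
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[prio]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << prio)
        return task


runq = PrioArray()
runq.enqueue("editor", 30)
runq.enqueue("audio", 5)
runq.enqueue("mixer", 5)
order = [runq.pick_next() for _ in range(4)]
```

Picking a task costs the same no matter how many tasks are queued, because the work is bounded by the fixed number of levels. The trade-off Willy raises is that every task has to be mapped onto one of those levels first.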
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:09 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, ie. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? All things are not equal; they all have different properties. I like getting rid of the queue-swapping artifacts as ebs and cfs have done, as the artifacts introduced there are nasty IMNSHO. 
I'm queueing up a demonstration of epoch expiry scheduling artifacts as a testcase, albeit one with no pass/fail semantics for its results, just detecting scheduler properties. That said, inequality/inequivalence is not a superiority/inferiority ranking per se. What needs to come out of these discussions is a set of standards which a candidate for mainline must pass to be considered correct and a set of performance metrics by which to rank them. Video game framerates and some sort of way to automate window wiggle tests sound like good ideas, but automating such is beyond my userspace programming abilities. An organization able to devote manpower to devising such testcases will likely have to get involved for them to happen, I suspect. On a random note, limitations on kernel address space make O(lg(n)) effectively O(1), albeit with large upper bounds on the worst case and an expected case much faster than the worst case. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 6:50 ` Davide Libenzi 0 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 6:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > >> I myself was thinking of this as the chance for a much needed > >> simplification of the scheduling code and if this can be done with the > >> result being "reasonable" it then gives us the basis on which to propose > >> improvements based on the ideas of others such as you mention. > >> As the size of the cpusched indicates, trying to evaluate alternative > >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > > I don't know why. The problem is that you can't really evaluate good > > proposals by looking at the code (you can say that one is bad, ie. the > > current one, which has a huge amount of temporal complexity and is > > explicitly unfair), but it is pretty hard to say one behaves well. > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > All things are not equal; they all have different properties. I like Exactly. So we have to explore those properties and evaluate performance (in all meanings of the word). 
That's only logical. > On a random note, limitations on kernel address space make O(lg(n)) > effectively O(1), albeit with large upper bounds on the worst case > and an expected case much faster than the worst case. Yeah. O(n!) is also O(1) if you can put an upper bound on n ;) ^ permalink raw reply [flat|nested] 713+ messages in thread
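[ Editorial sketch, not from the thread: to put rough numbers on the bounded-n argument, the standard bound on the height of a red-black tree (the structure the CFS announcement says it uses) with n nodes is 2*log2(n+1). The task counts below are arbitrary illustrations, not kernel limits. ]

```python
import math


def rb_height_bound(n):
    """Upper bound on the height of a red-black tree holding n nodes."""
    return math.floor(2 * math.log2(n + 1))


# Arbitrary example task counts: even millions of runnable tasks keep the
# worst-case walk from root to leaf within a few dozen steps.
bounds = {n: rb_height_bound(n) for n in (1, 1023, 2**22 - 1)}
```

So a fixed ceiling on n does turn the logarithmic cost into a constant, but, as both messages note, that constant is a worst case and says little about how the schedulers compare in practice.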
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin @ 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 7:01 ` Nick Piggin 2007-04-17 6:50 ` Davide Libenzi 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >> All things are not equal; they all have different properties. I like On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. Any chance you'd be willing to put down a few thoughts on what sorts of standards you'd like to set for both correctness (i.e. the bare minimum a scheduler implementation must do to be considered valid beyond not oopsing) and performance metrics (i.e. things that produce numbers for each scheduler you can compare to say "this scheduler is better than this other scheduler at this."). -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 21:39 ` Matt Mackall 0 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >> All things are not equal; they all have different properties. I like > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > Any chance you'd be willing to put down a few thoughts on what sorts > of standards you'd like to set for both correctness (i.e. the bare > minimum a scheduler implementation must do to be considered valid > beyond not oopsing) and performance metrics (i.e. things that produce > numbers for each scheduler you can compare to say "this scheduler is > better than this other scheduler at this."). Yeah I guess that's the hard part :) For correctness, I guess fairness is an easy one. I think that unfairness is basically a bug and that it would be very unfortunate to merge something unfair. But this is just within the context of a single runqueue... for better or worse, we allow some unfairness in multiprocessors for performance reasons of course. Latency. Given N tasks in the system, an arbitrary task should get onto the CPU in a bounded amount of time (excluding events like freak IRQ holdoffs and such, obviously -- ie. just considering the context of the scheduler's state machine). 
I wouldn't like to see a significant drop in any micro or macro benchmarks or even worse real workloads, but I could accept some if it means having a fair scheduler by default. Now it isn't actually too hard to achieve the above, I think. The hard bit is trying to compare interactivity. Ideally, we'd be able to get scripted dumps of login sessions, and measure scheduling latencies of key processes (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having problems with any scheduler, and we could compare all of them against it. Wishful thinking! ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin @ 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 22:23 ` Davide Libenzi 2007-04-17 21:39 ` Matt Mackall 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:23 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >> Any chance you'd be willing to put down a few thoughts on what sorts >> of standards you'd like to set for both correctness (i.e. the bare >> minimum a scheduler implementation must do to be considered valid >> beyond not oopsing) and performance metrics (i.e. things that produce >> numbers for each scheduler you can compare to say "this scheduler is >> better than this other scheduler at this."). On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Yeah I guess that's the hard part :) > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. Requiring that identical tasks be allocated equal shares of CPU bandwidth is the easy part here. ringtest.c exercises another aspect of fairness that is extremely important. Generalizing ringtest.c is a good idea for fairness testing. But another aspect of fairness is that "controlled unfairness" is also intended to exist, in no small part by virtue of nice levels, but also in the form of favoring tasks that are considered interactive somehow. 
Testing various forms of controlled unfairness to ensure that they are indeed controlled and otherwise have the semantics intended is IMHO the more difficult aspect of fairness testing. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). ISTR Davide Libenzi having a scheduling latency test a number of years ago. Resurrecting that and tuning it to the needs of this kind of testing sounds relevant here. The test suite Peter Williams mentioned would also help. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > I wouldn't like to see a significant drop in any micro or macro > benchmarks or even worse real workloads, but I could accept some if it > means having a fair scheduler by default. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Now it isn't actually too hard to achieve the above, I think. The hard bit > is trying to compare interactivity. Ideally, we'd be able to get scripted > dumps of login sessions, and measure scheduling latencies of key processes > (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having > problems with any scheduler, and we could compare all of them against it. > Wishful thinking! That's a pretty good idea. I'll queue up writing something of that form as well. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
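As a rough illustration of the "easy part" mentioned in the message above (identical tasks should receive equal shares of CPU bandwidth), here is a minimal sketch of how measured per-task CPU times could be reduced to a single number. The function name `share_deviation` and the choice of metric are assumptions made for illustration; they do not come from ringtest.c or from any test suite named in the thread.

```python
def share_deviation(cpu_seconds):
    """Return the largest relative deviation of any task's CPU time
    from the equal share (0.0 means perfectly even distribution).

    cpu_seconds: list of CPU time consumed by each of the identical tasks.
    """
    total = sum(cpu_seconds)
    if total == 0:
        # Nothing ran (or no tasks were given): report no deviation.
        return 0.0
    fair = total / len(cpu_seconds)  # what each task would get if shares were equal
    return max(abs(t - fair) / fair for t in cpu_seconds)


if __name__ == "__main__":
    # Hypothetical measurements for four identical CPU-bound tasks.
    print(share_deviation([5.0, 5.0, 5.0, 5.0]))
    print(share_deviation([6.0, 4.0]))
```

A metric like this only covers equal-share fairness among identical tasks; the "controlled unfairness" cases (nice levels, interactivity bonuses) would need expected shares supplied per task instead of a single equal share.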
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 22:23 ` Davide Libenzi 0 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-17 22:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > ISTR Davide Libenzi having a scheduling latency test a number of years > ago. Resurrecting that and tuning it to the needs of this kind of > testing sounds relevant here. The test suite Peter Williams mentioned > would also help. That helped me a lot at that time. At every context switch it was sampling critical scheduler parameters for both the entering and the exiting task (and associated timestamps). Then the data was collected through a /dev/idontremember from userspace for analysis. It'd be very useful to have it these days, to study what really happens under the hood (scheduler internal parameter variations and such) when those weird loads make the scheduler unstable. - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams 2007-04-18 3:15 ` Nick Piggin 1 sibling, 2 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-17 21:39 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > >> All things are not equal; they all have different properties. I like > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > Exactly. So we have to explore those properties and evaluate performance > > > (in all meanings of the word). That's only logical. > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > of standards you'd like to set for both correctness (i.e. the bare > > minimum a scheduler implementation must do to be considered valid > > beyond not oopsing) and performance metrics (i.e. things that produce > > numbers for each scheduler you can compare to say "this scheduler is > > better than this other scheduler at this."). > > Yeah I guess that's the hard part :) > > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. I'm a big fan of fairness, but I think it's a bit early to declare it a mandatory feature. 
Bounded unfairness is probably something we can agree on, ie "if we decide to be unfair, no process suffers more than a factor of x". > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). This is a slightly stronger statement than starvation-free (which is obviously mandatory). I think you're looking for something like "worst-case scheduling latency is proportional to the number of runnable tasks". Which I think is quite a reasonable requirement. I'm pretty sure the stock scheduler falls short of both of these guarantees though. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
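To make the proposed requirement "worst-case scheduling latency is proportional to the number of runnable tasks" concrete, here is a small sketch under simplifying assumptions: a plain round-robin policy, all tasks permanently runnable, equal fixed slices, and no priorities. The function name `worst_case_wait` is hypothetical, and the model is not a description of the stock scheduler or of CFS.

```python
from collections import deque


def worst_case_wait(n_tasks, slice_ns, rounds=3):
    """Simulate round-robin over n_tasks always-runnable tasks and return
    the longest time any task waited between becoming ready and running."""
    queue = deque(range(n_tasks))
    now = 0
    # Time at which each task last left the CPU (all are ready at t=0).
    ready_since = {task: 0 for task in range(n_tasks)}
    worst = 0
    for _ in range(rounds * n_tasks):
        task = queue.popleft()
        worst = max(worst, now - ready_since[task])
        now += slice_ns            # the task runs for one full slice
        ready_since[task] = now    # and is immediately runnable again
        queue.append(task)
    return worst


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(n, worst_case_wait(n, slice_ns=10))
```

In this toy model the worst-case wait comes out as (N - 1) times the slice length, which is the kind of linear bound the message asks for; a scheduler with priorities or sleeper bonuses would need a weaker or weighted statement.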
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall @ 2007-04-17 23:23 ` Peter Williams 2007-04-17 23:19 ` Matt Mackall 2007-04-18 3:15 ` Nick Piggin 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 23:23 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: >> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >>> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >>>>> All things are not equal; they all have different properties. I like >>> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: >>>> Exactly. So we have to explore those properties and evaluate performance >>>> (in all meanings of the word). That's only logical. >>> Any chance you'd be willing to put down a few thoughts on what sorts >>> of standards you'd like to set for both correctness (i.e. the bare >>> minimum a scheduler implementation must do to be considered valid >>> beyond not oopsing) and performance metrics (i.e. things that produce >>> numbers for each scheduler you can compare to say "this scheduler is >>> better than this other scheduler at this."). >> Yeah I guess that's the hard part :) >> >> For correctness, I guess fairness is an easy one. I think that unfairness >> is basically a bug and that it would be very unfortunate to merge something >> unfair. But this is just within the context of a single runqueue... for >> better or worse, we allow some unfairness in multiprocessors for performance >> reasons of course. > > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. 
Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". > >> Latency. Given N tasks in the system, an arbitrary task should get >> onto the CPU in a bounded amount of time (excluding events like freak >> IRQ holdoffs and such, obviously -- ie. just considering the context >> of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". add "taking into consideration nice and/or real time priorities of runnable tasks". I.e. if a task is nice 19 it can expect to wait longer to get onto the CPU than if it was nice 0. > Which I think is quite a reasonable requirement. > > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:23 ` Peter Williams @ 2007-04-17 23:19 ` Matt Mackall 0 siblings, 0 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-17 23:19 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 09:23:42AM +1000, Peter Williams wrote: > Matt Mackall wrote: > >On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > >>On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > >>>On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >>>>>All things are not equal; they all have different properties. I like > >>>On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > >>>>Exactly. So we have to explore those properties and evaluate performance > >>>>(in all meanings of the word). That's only logical. > >>>Any chance you'd be willing to put down a few thoughts on what sorts > >>>of standards you'd like to set for both correctness (i.e. the bare > >>>minimum a scheduler implementation must do to be considered valid > >>>beyond not oopsing) and performance metrics (i.e. things that produce > >>>numbers for each scheduler you can compare to say "this scheduler is > >>>better than this other scheduler at this."). > >>Yeah I guess that's the hard part :) > >> > >>For correctness, I guess fairness is an easy one. I think that unfairness > >>is basically a bug and that it would be very unfortunate to merge > >>something > >>unfair. But this is just within the context of a single runqueue... for > >>better or worse, we allow some unfairness in multiprocessors for > >>performance > >>reasons of course. > > > >I'm a big fan of fairness, but I think it's a bit early to declare it > >a mandatory feature. 
Bounded unfairness is probably something we can > >agree on, ie "if we decide to be unfair, no process suffers more than > >a factor of x". > > > >>Latency. Given N tasks in the system, an arbitrary task should get > >>onto the CPU in a bounded amount of time (excluding events like freak > >>IRQ holdoffs and such, obviously -- ie. just considering the context > >>of the scheduler's state machine). > > > >This is a slightly stronger statement than starvation-free (which is > >obviously mandatory). I think you're looking for something like > >"worst-case scheduling latency is proportional to the number of > >runnable tasks". > > add "taking into consideration nice and/or real time priorities of > runnable tasks". I.e. if a task is nice 19 it can expect to wait longer > to get onto the CPU than if it was nice 0. Yes. Assuming we meet the "bounded unfairness" criterion above, this follows. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams @ 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 4:38 ` Matt Mackall 1 sibling, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-18 3:15 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > >> All things are not equal; they all have different properties. I like > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > Exactly. So we have to explore those properties and evaluate performance > > > > (in all meanings of the word). That's only logical. > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > of standards you'd like to set for both correctness (i.e. the bare > > > minimum a scheduler implementation must do to be considered valid > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > numbers for each scheduler you can compare to say "this scheduler is > > > better than this other scheduler at this."). > > > > Yeah I guess that's the hard part :) > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > is basically a bug and that it would be very unfortunate to merge something > > unfair. But this is just within the context of a single runqueue... for > > better or worse, we allow some unfairness in multiprocessors for performance > > reasons of course. 
> > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". I don't know why this would be a useful feature (of course I'm talking about processes at the same nice level). One of the big problems with the current scheduler is that it is unfair in some corner cases. It works OK for most people, but when it breaks down it really hurts. At least if you start with a fair scheduler, you can alter priorities until it satisfies your need... with an unfair one your guess is as good as mine. So on what basis would you allow unfairness? On the basis that it doesn't seem to harm anyone? It doesn't seem to harm testers? I think we should aim for something better. > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". Which I think is quite a reasonable requirement. Yes, bounded and proportional to. > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. And I think that's what its main problems are. Its interactivity obviously can't be too bad for most people. Its performance seems to be pretty good. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin @ 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:38 ` Matt Mackall 1 sibling, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-18 3:45 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? Well, there's short term fair and long term fair. Seems to me a burst load having to always merge with a steady stream load using a short term fairness yardstick absolutely must 'starve' relative to the steady load, so to be long term fair, you have to add some short term unfairness. The mainline scheduler is more long term fair (discounting the rather obnoxious corner cases). -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-18 3:56 UTC (permalink / raw) To: Mike Galbraith Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > > a mandatory feature. Bounded unfairness is probably something we can > > > agree on, ie "if we decide to be unfair, no process suffers more than > > > a factor of x". > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > Well, there's short term fair and long term fair. Seems to me a burst > load having to always merge with a steady stream load using a short term > fairness yardstick absolutely must 'starve' relative to the steady load, > so to be long term fair, you have to add some short term unfairness. > The mainline scheduler is more long term fair (discounting the rather > obnoxious corner cases). Oh yes definitely I mean long term fair. 
I guess it is impossible to be completely fair so long as we have to timeshare the CPU :) So a constant delta is fine and unavoidable. But I don't think I agree with a constant factor: that means you can pick a time where process 1 is allowed an arbitrary T more CPU time than process 2. ^ permalink raw reply [flat|nested] 713+ messages in thread
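The distinction drawn above between a constant delta and a constant factor can be sketched with a toy two-task model. The assumptions here are illustrative only: the scheduler always runs, for one fixed slice, the task with the smallest runtime divided by its weight, and the function name `max_gap` is hypothetical. Equal weights model a fair scheduler; weights of 2:1 model one reading of "unfair by a factor of x".

```python
def max_gap(weights, slice_ns, steps):
    """Run two always-runnable tasks for `steps` slices, always picking the
    task with the smallest weighted runtime, and return the largest
    difference in raw CPU time observed between them."""
    runtime = [0, 0]
    worst = 0
    for _ in range(steps):
        # On a tie, min() returns the first candidate, i.e. task 0.
        idx = min((0, 1), key=lambda i: runtime[i] / weights[i])
        runtime[idx] += slice_ns
        worst = max(worst, abs(runtime[0] - runtime[1]))
    return worst


if __name__ == "__main__":
    # Equal weights: the gap never exceeds one slice (a constant delta).
    print(max_gap((1, 1), slice_ns=10, steps=300))
    # A 2:1 factor: the gap keeps growing with elapsed time.
    print(max_gap((2, 1), slice_ns=10, steps=300))
    print(max_gap((2, 1), slice_ns=10, steps=3000))
```

With equal weights the gap stays at one slice no matter how long the run is, whereas with a fixed ratio it grows roughly linearly with the number of slices, which matches the objection that for any T one can find a time at which one process is ahead by more than T.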
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:56 ` Nick Piggin @ 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-18 4:29 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:56 +0200, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > > > > > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > > seem to harm anyone? It doesn't seem to harm testers? > > > > Well, there's short term fair and long term fair. Seems to me a burst > > load having to always merge with a steady stream load using a short term > > fairness yardstick absolutely must 'starve' relative to the steady load, > > so to be long term fair, you have to add some short term unfairness. > > The mainline scheduler is more long term fair (discounting the rather > > obnoxious corner cases). > > Oh yes definitely I mean long term fair. I guess it is impossible to be > completely fair so long as we have to timeshare the CPU :) > > So a constant delta is fine and unavoidable. But I don't think I agree > with a constant factor: that means you can pick a time where process 1 > is allowed an arbitrary T more CPU time than process 2. Definitely. Using constants with no consideration of what else is running is what causes the fairness mechanism in mainline to break down under load. (aside: What I was experimenting with before this new scheduler came along was to turn the sleep_avg thing into an off-cpu period thing. 
Once a time slice begins execution [runqueue wait doesn't count], that task has the right to use its slice in one go, and _anything_ that knocked it off the cpu added to its credit. Knocking someone else off detracts from credit, and to get to the point where you can knock others off costs you stored credit proportional to the dynamic priority you attain by using it. All tasks that have credit stay active, no favorites.) -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 4:38 ` Matt Mackall 2007-04-18 5:00 ` Nick Piggin 1 sibling, 1 reply; 713+ messages in thread From: Matt Mackall @ 2007-04-18 4:38 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > > >> All things are not equal; they all have different properties. I like > > > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > > Exactly. So we have to explore those properties and evaluate performance > > > > > (in all meanings of the word). That's only logical. > > > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > > of standards you'd like to set for both correctness (i.e. the bare > > > > minimum a scheduler implementation must do to be considered valid > > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > > numbers for each scheduler you can compare to say "this scheduler is > > > > better than this other scheduler at this."). > > > > > > Yeah I guess that's the hard part :) > > > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > > is basically a bug and that it would be very unfortunate to merge something > > > unfair. But this is just within the context of a single runqueue... 
for > > > better or worse, we allow some unfairness in multiprocessors for performance > > > reasons of course. > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? On the basis that there's only anecdotal evidence thus far that fairness is the right approach. It's not yet clear that a fair scheduler can do the right thing with X, with various kernel threads, etc. without fiddling with nice levels. Which makes it no longer "completely fair". It's also not yet clear that a scheduler can't be taught to do the right thing with X without fiddling with nice levels. So I'm just not yet willing to completely rule out systems that aren't "completely fair". But I think we should rule out schedulers that don't have rigid bounds on that unfairness. That's where the really ugly behavior lies. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:38 ` Matt Mackall @ 2007-04-18 5:00 ` Nick Piggin 2007-04-18 5:55 ` Matt Mackall 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-18 5:00 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:38:31PM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > On the basis that there's only anecdotal evidence thus far that > fairness is the right approach. > > It's not yet clear that a fair scheduler can do the right thing with X, > with various kernel threads, etc. without fiddling with nice levels. > Which makes it no longer "completely fair". Of course I mean SCHED_OTHER tasks at the same nice level. Otherwise I would be arguing to make nice basically a noop. > It's also not yet clear that a scheduler can't be taught to do the > right thing with X without fiddling with nice levels. Being fair doesn't prevent that. Implicit unfairness is wrong though, because it will bite people. What's wrong with allowing X to get more than its fair share of CPU time by "fiddling with nice levels"? That's what they're there for.
> So I'm just not yet willing to completely rule out systems that aren't > "completely fair". > > But I think we should rule out schedulers that don't have rigid bounds on > that unfairness. That's where the really ugly behavior lies. Been a while since I really looked at the mainline scheduler, but I don't think it can permanently starve something, so I don't know what your bounded unfairness would help with. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:00 ` Nick Piggin @ 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-18 5:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > It's also not yet clear that a scheduler can't be taught to do the > > right thing with X without fiddling with nice levels. > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > because it will bite people. > > What's wrong with allowing X to get more than it's fair share of CPU > time by "fiddling with nice levels"? That's what they're there for. Why is X special? Because it does work on behalf of other processes? Lots of things do this. Perhaps a scheduler should focus entirely on the implicit and directed wakeup matrix and optimizing that instead[1]. Why are processes special? Should user A be able to get more CPU time for his job than user B by splitting it into N parallel jobs? Should we be fair per process, per user, per thread group, per session, per controlling terminal? Some weighted combination of the preceding?[2] Why is the measure CPU time? I can imagine a scheduler that weighed memory bandwidth in the equation. Or power consumption. Or execution unit usage. Fairness is nice. It's simple, it's obvious, it's predictable. But it's just not clear that it's optimal. If the question is (and it was!) "what should the basic requirements for the scheduler be?" it's not clear that fairness is a requirement or even how to pick a metric for fairness that's obviously and uniquely superior. 
It's instead much easier to try to recognize and rule out really bad behaviour with bounded latencies, minimum service guarantees, etc. [1] That's basically how Google decides to prioritize webpages, which it seems to do moderately well. And how a number of other optimization problems are solved. [2] It's trivial to construct two or more perfectly reasonable and desirable definitions of fairness that are mutually incompatible. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall @ 2007-04-18 6:37 ` Nick Piggin 2007-04-18 6:55 ` Matt Mackall 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-18 6:37 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > > It's also not yet clear that a scheduler can't be taught to do the > > > right thing with X without fiddling with nice levels. > > > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > > because it will bite people. > > > > What's wrong with allowing X to get more than it's fair share of CPU > > time by "fiddling with nice levels"? That's what they're there for. > > Why is X special? Because it does work on behalf of other processes? The high level reason is that giving it more than its fair share of CPU allows a desktop to remain interactive under load. And it isn't just about doing work on behalf of other processes. Mouse interrupts are a big part of it, for example. > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. You could do that, and I tried a variant of it at one point. The problem was that it leads to unexpected bad things too. UNIX programs more or less expect fair SCHED_OTHER scheduling, and given the principle of least surprise... > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? 
Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] I don't know how that supports your argument for unfairness, but processes are special only because that's how we've always done scheduling. I'm not precluding other groupings for fairness, though. > Why is the measure CPU time? I can imagine a scheduler that weighed > memory bandwidth in the equation. Or power consumption. Or execution > unit usage. Feel free. And I'd also argue that once you schedule for those metrics then fairness is also important there too. > Fairness is nice. It's simple, it's obvious, it's predictable. But > it's just not clear that it's optimal. If the question is (and it > was!) "what should the basic requirements for the scheduler be?" it's > not clear that fairness is a requirement or even how to pick a metric > for fairness that's obviously and uniquely superior. What do you mean optimal? If your criteria is fairness, then of course it is optimal. If your criteria is throughput, then it probably isn't. Considering it is simple and what we've always done, measuring fairness by CPU time per process is obvious for a general purpose scheduler. If you accept that, then I argue that fairness is an optimal property given that the alternative is unfairness. > It's instead much easier to try to recognize and rule out really bad > behaviour with bounded latencies, minimum service guarantees, etc. It's the bad behaviour that you didn't recognize that is the problem. If you start with explicit fairness, then unfairness will never be one of those problems. > [1] That's basically how Google decides to prioritize webpages, which > it seems to do moderately well. And how a number of other optimization > problems are solved. This is not an optimization problem, it is a heuristic. There is no right and wrong answer. 
> [2] It's trivial to construct two or more perfectly reasonable and > desirable definitions of fairness that are mutually incompatible. Probably not if you use common sense, and in the context of a replacement for the 2.6 scheduler. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 0 siblings, 2 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-18 6:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > I don't know how that supports your argument for unfairness, I never had such an argument. I like fairness. My argument is that -you- don't have an argument for making fairness a -requirement-. > processes are special only because that's how we've always done > scheduling. I'm not precluding other groupings for fairness, though. If you make one form of fairness a -requirement- for all acceptable algorithms, you -are- precluding most other forms of fairness. If you refuse to define what "fairness" means when specifying your requirement, what's the point of requiring it? > What do you mean optimal? If your criteria is fairness, then of course > it is optimal. If your criteria is throughput, then it probably isn't. I don't know what optimal behavior is. And neither do you. It may or may not be fair. It very likely includes small deviations from fair. > > [2] It's trivial to construct two or more perfectly reasonable and > > desirable definitions of fairness that are mutually incompatible. > > Probably not if you use common sense, and in the context of a replacement > for the 2.6 scheduler. Ok, trivial example. You cannot allocate equal CPU time to processes/tasks and simultaneously allocate equal time to thread groups. Is it common sense that a heavily-threaded app should be able to get hugely more CPU than a well-written app? No. I don't want Joe's stupid Java app to make my compile crawl. 
On the other hand, if my heavily threaded app is, say, a voicemail server serving 30 customers, I probably want it to get 30x the CPU of my gzip job. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
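The incompatibility in the example above can be put in numbers. What follows is only a minimal sketch, assuming two hypothetical CPU-bound workloads on a single CPU (the names `threaded_app` and `gzip` and the function names are illustrative, not anything from the patch): one application with 30 threads and one single-threaded job.

```python
from fractions import Fraction

def per_task_fair(groups):
    """Equal CPU per task: a group's total share grows with its thread count."""
    total_tasks = sum(groups.values())
    return {name: Fraction(n, total_tasks) for name, n in groups.items()}

def per_group_fair(groups):
    """Equal CPU per thread group: thread count inside a group is irrelevant."""
    return {name: Fraction(1, len(groups)) for name in groups}

# Hypothetical workload: thread-group name -> number of runnable threads.
groups = {"threaded_app": 30, "gzip": 1}

task_view = per_task_fair(groups)    # threaded_app gets 30/31, gzip gets 1/31
group_view = per_group_fair(groups)  # each group gets 1/2

for name in groups:
    print(name, task_view[name], group_view[name])
```

Both allocations are "fair" under their own definition, and no single allocation satisfies both unless every group has the same number of threads. Which one is wanted depends on whether the 30 threads are Joe's Java app or a voicemail server with 30 customers.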
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall @ 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-18 7:24 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 01:55:34AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > > I don't know how that supports your argument for unfairness, > > I never had such an argument. I like fairness. > > My argument is that -you- don't have an argument for making fairness a > -requirement-. It seems easy enough that there is no point accepting unfair behaviour like the old scheduler if we're going to go to all this trouble to replace it. The old scheduler seems to have bounded unfairness and bounded starvation, so let the good times roll. > > processes are special only because that's how we've always done > > scheduling. I'm not precluding other groupings for fairness, though. > > If you make one form of fairness a -requirement- for all acceptable > algorithms, your -are- precluding most other forms of fairness. > > If you refuse to define what "fairness" means when specifying your > requirement, what's the point of requiring it? I don't refuse. I'm talking about per-process CPU time fairness. My paragraph above was pointing out that subsequent work to add other classes of fairness are not excluded as configurable features, but this basic type of fairness should be included. > > What do you mean optimal? If your criteria is fairness, then of course > > it is optimal. If your criteria is throughput, then it probably isn't. > > I don't know what optimal behavior is. And neither do you. It may or > may not be fair. 
It very likely includes small deviations from fair. You misunderstand me. There is no single "optimal" when you're talking about fairness (or most other scheduler things). So pondering whether fairness is optimal or not doesn't really make sense. I'm saying it should be a basic axiom, not that it is quantitatively better. It isn't a refutable argument. I state it because it is what users and programs expect. You can reject that, and fine. I guess if a scheduler comes along that does exactly the right thing for everyone, then it is better than any fair scheduler. So OK, while we're talking theoretical, I won't dismiss an unfair scheduler out of hand. > > > [2] It's trivial to construct two or more perfectly reasonable and > > > desirable definitions of fairness that are mutually incompatible. > > > > Probably not if you use common sense, and in the context of a replacement > > for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. So that might be a nice addition, but the base functionality is threads simply because that's what we've always done. Just common sense. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin @ 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-21 13:33 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: >>> [2] It's trivial to construct two or more perfectly reasonable and >>> desirable definitions of fairness that are mutually incompatible. >> Probably not if you use common sense, and in the context of a replacement >> for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. > Matt, you tickled a thought... on one hand we have a single user running a threaded application, and it ideally should get the same total CPU as a user running a single thread process. On the other hand we have a threaded application, call it sendmail, nnrpd, httpd, bind, whatever. In that case each thread is really providing service for an independent user, and should get an appropriate share of the CPU. Perhaps the solution is to add a means for identifying server processes, by capability, or by membership in a "server" group, or by having the initiating process set some flag at exec() time. 
That doesn't necessarily solve problems, but it may provide more information to allow them to be soluble. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 19:48 ` Davide Libenzi 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-18 13:08 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] On a side note, I think a combination of all of the above is a very good idea, plus process groups (pgrp's). All the make -j loads should come up in one pgrp of one session for one user and hence should be automatically kept isolated in its own corner by such policies. Thread bombs, forkbombs, and so on get handled too, which is good when on e.g. a compileserver and someone rudely spawns too many tasks. Thinking of the scheduler as a CPU bandwidth allocator, this means handing out shares of CPU bandwidth to all users on the system, which in turn hand out shares of bandwidth to all sessions, which in turn hand out shares of bandwidth to all process groups, which in turn hand out shares of bandwidth to all thread groups, which in turn hand out shares of bandwidth to threads. The event handlers for the scheduler need not deal with this apart from task creation and exit and various sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). 
They just determine what the scheduler sees as ->load_weight or some analogue of ->static_prio, though it is possible to do this by means of data structure organization instead of numerical prioritization. It'd probably have to be calculated on the fly by, say, doing fixpoint arithmetic something like user_share(p)*session_share(p)*pgrp_share(p)*tgrp_share(p)*task_share(p) so that readjusting the shares of aggregates doesn't have to traverse lists and remains O(1). Each of the share computations can instead just do some analogue of the calculation p->load_weight/rq->raw_weighted_load in fixpoint, though precision issues with this make me queasy. There is maybe a slight nasty point in that the ->raw_weighted_load analogue for users or whatever the highest level chosen is ends up being global. One might as well get users in there and omit intermediate levels if any are to be omitted so that the truly global state is as read-only as possible. I suppose jacking up the fixpoint precision to 128-bit or 256-bit all below the radix point (our max is 1.0 after all) until precision issues vanish can be done but the idea of that much number crunching in the scheduler makes me rather uncomfortable. I hope u64 or u32 [2] can be gotten away with as far as fixpoint goes. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
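The product-of-shares calculation described above can be sketched in a few lines. This is a rough illustration only, assuming a Q0.32 format (32 bits, all below the radix point) and made-up per-level weights; `FRAC_BITS`, `to_fix`, `fix_mul` and `effective_share` are hypothetical names, not kernel symbols. Each level contributes weight/total, in the manner of p->load_weight/rq->raw_weighted_load, and the levels are multiplied together.

```python
from fractions import Fraction

FRAC_BITS = 32                 # assumed format: Q0.32, every bit below the radix point
ONE = (1 << FRAC_BITS) - 1     # largest representable value, just under 1.0

def to_fix(weight, total):
    """weight/total as a Q0.32 integer, rounded down and saturated at ONE."""
    return min((weight << FRAC_BITS) // total, ONE)

def fix_mul(a, b):
    """Product of two Q0.32 fractions, truncated back to Q0.32."""
    return (a * b) >> FRAC_BITS

def effective_share(levels):
    """levels: (weight, total weight at that level) pairs, from user down to task."""
    share = ONE
    for weight, total in levels:
        share = fix_mul(share, to_fix(weight, total))
    return share

# Hypothetical task: 1 of 2 users, sole session, sole pgrp, 1 of 5 thread groups,
# 1 of 2 threads in its group.
levels = [(1, 2), (1, 1), (1, 1), (1, 5), (1, 2)]

fixed = effective_share(levels)
exact = Fraction(1, 2) * Fraction(1, 5) * Fraction(1, 2)   # 1/20
error = exact - Fraction(fixed, 1 << FRAC_BITS)
print(fixed, float(error))
```

Every multiplication truncates, so the fixed-point result always lands slightly below the exact product, and the gap grows with the number of levels. That is the precision concern raised above: the per-task cost stays O(1) in the number of tasks, but the rounding error is cumulative across levels.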
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 19:48 ` Davide Libenzi 0 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 19:48 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, William Lee Irwin III wrote: > Thinking of the scheduler as a CPU bandwidth allocator, this means > handing out shares of CPU bandwidth to all users on the system, which > in turn hand out shares of bandwidth to all sessions, which in turn > hand out shares of bandwidth to all process groups, which in turn hand > out shares of bandwidth to all thread groups, which in turn hand out > shares of bandwidth to threads. The event handlers for the scheduler > need not deal with this apart from task creation and exit and various > sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). Yes, it really becomes a hierarchical problem once you consider user and processes. Top level sees a "user" can be scheduled (put itself on the virtual run queue), and passes the ball to the "process" scheduler inside the "user" container, down to maybe "threads". With all the "key" calculation parameters kept at each level (with up-propagation). - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall ` (2 more replies) 2 siblings, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-18 14:48 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > Why is X special? Because it does work on behalf of other processes? > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. I 100% agree - the perfect scheduler would indeed take into account where the wakeups come from, and try to "weigh" processes that help other processes make progress more. That would naturally give server processes more CPU power, because they help others. I don't believe for a second that "fairness" means "give everybody the same amount of CPU". That's a totally illogical measure of fairness. All processes are _not_ created equal. That said, even trying to do "fairness by effective user ID" would probably already do a lot. In a desktop environment, X would get as much CPU time as the user processes, simply because it's in a different protection domain (and that's really what "effective user ID" means: it's not about "users", it's really about "protection domains"). And "fairness by euid" is probably a hell of a lot easier to do than trying to figure out the wakeup matrix. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds @ 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds ` (2 more replies) 2007-04-19 3:18 ` Nick Piggin 2007-04-21 13:40 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Bill Davidsen 2 siblings, 3 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-18 15:23 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. For the record, you actually don't need to track a whole NxN matrix (or do the implied O(n**3) matrix inversion!) to get to the same result. You can converge on the same node weightings (ie dynamic priorities) by applying a damped function at each transition point (directed wakeup, preemption, fork, exit). The trouble with any scheme like this is that it needs careful tuning of the damping factor to converge rapidly and not oscillate and precise numerical attention to the transition functions so that the sum of dynamic priorities is conserved. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
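One possible reading of the "damped function at each transition point" idea is sketched below. It is an assumption-laden illustration, not Matt's actual algorithm: the update rule, the damping factor of 1/10, and the names `directed_wakeup`, `server`, `client1` and `client2` are all invented here. On each directed wakeup the waker hands a damped fraction of its dynamic priority to the task it wakes, so no NxN matrix is stored and the total is conserved by construction.

```python
from fractions import Fraction

def directed_wakeup(prio, waker, wakee, damping=Fraction(1, 10)):
    """Move a damped fraction of the waker's priority to the wakee.

    The amount subtracted equals the amount added, so sum(prio.values())
    is unchanged by every call.
    """
    delta = prio[waker] * damping
    prio[waker] -= delta
    prio[wakee] += delta

# Hypothetical tasks starting with equal dynamic priority (exact rationals,
# so conservation can be checked without rounding noise).
prio = {name: Fraction(1, 3) for name in ("server", "client1", "client2")}

# Two clients repeatedly wake the server, e.g. by sending it requests.
for _ in range(5):
    directed_wakeup(prio, "client1", "server")
    directed_wakeup(prio, "client2", "server")

print({name: float(p) for name, p in prio.items()})
```

The server's weight rises because other tasks depend on it, which is the behavior the wakeup-matrix argument is after. The sketch says nothing about convergence or oscillation; choosing the damping factor, and keeping the sum conserved with real fixed-width arithmetic instead of exact rationals, are exactly the tuning problems the message above warns about.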
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall @ 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:48 ` [ck] " Mark Glines ` (4 more replies) 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 5 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-18 17:22 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. I'm sure you can do things differently, but the reason I think "fairness by euid" is actually worth looking at is that it's pretty much the *identical* issue that we'll have with "fairness by virtual machine" and a number of other "container" issues. The fact is: - "fairness" is *not* about giving everybody the same amount of CPU time (scaled by some niceness level or not). Anybody who thinks that is "fair" is just being silly and hasn't thought it through. - "fairness" is multi-level. You want to be fair to threads within a thread group (where "process" may be one good approximation of what a "thread group" is, but not necessarily the only one). But you *also* want to be fair in between those "thread groups", and then you want to be fair across "containers" (where "user" may be one such container). So I claim that anything that cannot be fair by user ID is actually really REALLY unfair. I think it's absolutely humongously STUPID to call something the "Completely Fair Scheduler", and then just be fair on a thread level. 
That's not fair AT ALL! It's the antithesis of being fair! So if you have 2 users on a machine running CPU hogs, you should *first* try to be fair among users. If one user then runs 5 programs, and the other one runs just 1, then the *one* program should get 50% of the CPU time (the user's fair share), and the five programs should get 10% of CPU time each. And if one of them uses two threads, each thread should get 5%. So you should see one thread get 50% CPU (single thread of one user), 4 threads get 10% CPU (their fair share of that user's time), and 2 threads get 5% CPU (the fair share within that thread group!). Any scheduling argument that just considers the above to be "7 threads total" and gives each thread 14% of CPU time "fairly" is *anything* but fair. It's a joke if that kind of scheduler then calls itself CFS! And yes, that's largely what the current scheduler will do, but at least the current scheduler doesn't claim to be fair! So the current scheduler is a lot *better* if only in the sense that it doesn't make ridiculous claims that aren't true! Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
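The arithmetic in the two-user example above can be reproduced with a small recursive split. This is a sketch under stated assumptions: the tree layout, the node names (`userA`, `B.prog5.t1`, and so on) and the function `leaf_shares` are hypothetical, and equal weights are assumed at every level (no nice levels).

```python
from fractions import Fraction

def leaf_shares(node, share=Fraction(1)):
    """Split `share` equally among a node's children, recursively.

    A node is a (name, children) pair; a node with no children is a
    runnable thread. Returns {thread name: fraction of total CPU}.
    """
    name, children = node
    if not children:
        return {name: share}
    result = {}
    for child in children:
        result.update(leaf_shares(child, share / len(children)))
    return result

# Hypothetical tree: machine -> users -> programs -> threads.
tree = ("machine", [
    ("userA", [("A.prog1", [])]),
    ("userB", [
        ("B.prog1", []), ("B.prog2", []), ("B.prog3", []), ("B.prog4", []),
        ("B.prog5", [("B.prog5.t1", []), ("B.prog5.t2", [])]),
    ]),
])

shares = leaf_shares(tree)
flat = Fraction(1, len(shares))   # the "7 threads total" view: 1/7 each

for name, s in shares.items():
    print(f"{name}: hierarchical {float(s):.0%}, flat {float(flat):.0%}")
```

The hierarchical split yields one thread at 50%, four at 10% and two at 5%, matching the figures in the message, while the flat per-thread split gives every thread 1/7 (about 14%).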
* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 17:48 ` Mark Glines 2007-04-18 19:27 ` Chris Friesen 2007-04-18 17:49 ` Ingo Molnar ` (3 subsequent siblings) 4 siblings, 1 reply; 713+ messages in thread From: Mark Glines @ 2007-04-18 17:48 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, William Lee Irwin III, linux-kernel, ck list, Thomas Gleixner, Andrew Morton, Arjan van de Ven On Wed, 18 Apr 2007 10:22:59 -0700 (PDT) Linus Torvalds <torvalds@linux-foundation.org> wrote: > So if you have 2 users on a machine running CPU hogs, you should > *first* try to be fair among users. If one user then runs 5 programs, > and the other one runs just 1, then the *one* program should get 50% > of the CPU time (the users fair share), and the five programs should > get 10% of CPU time each. And if one of them uses two threads, each > thread should get 5%. This sounds great, to me. One minor question: is it even possible to be completely fair on SMP? For instance, if you have a 2-way SMP box running 3 applications, one of which has 2 threads, will the threaded app have an advantage here? (The current system seems to try to keep each thread on a specific CPU, to reduce cache thrashing, which means threads and processes alike each get 50% of the CPU.) Mark ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:48 ` [ck] " Mark Glines @ 2007-04-18 19:27 ` Chris Friesen 2007-04-19 0:49 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Chris Friesen @ 2007-04-18 19:27 UTC (permalink / raw) To: Mark Glines Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, William Lee Irwin III, linux-kernel, ck list, Thomas Gleixner, Andrew Morton, Arjan van de Ven Mark Glines wrote: > One minor question: is it even possible to be completely fair on SMP? > For instance, if you have a 2-way SMP box running 3 applications, one of > which has 2 threads, will the threaded app have an advantage here? (The > current system seems to try to keep each thread on a specific CPU, to > reduce cache thrashing, which means threads and processes alike each > get 50% of the CPU.) I think the ideal in this case would be to have both threads on one cpu, with the other app on the other cpu. This gives inter-process fairness while minimizing the amount of task migration required. More interesting is the case of three processes on a 2-cpu system. Do we constantly migrate one of them back and forth to ensure that each of them gets 66% of a cpu? Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [ck] Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:27 ` Chris Friesen @ 2007-04-19 0:49 ` Peter Williams 0 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-19 0:49 UTC (permalink / raw) To: Chris Friesen Cc: Mark Glines, Linus Torvalds, Matt Mackall, Nick Piggin, Bill Huey, Mike Galbraith, William Lee Irwin III, linux-kernel, ck list, Thomas Gleixner, Andrew Morton, Arjan van de Ven Chris Friesen wrote: > Mark Glines wrote: > >> One minor question: is it even possible to be completely fair on SMP? >> For instance, if you have a 2-way SMP box running 3 applications, one of >> which has 2 threads, will the threaded app have an advantage here? (The >> current system seems to try to keep each thread on a specific CPU, to >> reduce cache thrashing, which means threads and processes alike each >> get 50% of the CPU.) > > I think the ideal in this case would be to have both threads on one cpu, > with the other app on the other cpu. This gives inter-process fairness > while minimizing the amount of task migration required. Solving this sort of issue was one of the reasons for the smpnice patches. > > More interesting is the case of three processes on a 2-cpu system. Do > we constantly migrate one of them back and forth to ensure that each of > them gets 66% of a cpu? Depends how keen you are on fairness. Unless the processes are long-term continuously active tasks that never sleep it's probably not an issue as they'll probably move around enough in the long term for them each to get 66% over the long term. Exact load balancing for real work loads (where tasks are coming and going, sleeping and waking semi randomly and over relatively brief periods) is probably unattainable because by the time you've worked out the ideal placement of the currently runnable tasks on the available CPUs it's all changed and the solution is invalid. 
The best you can hope for is that the change isn't so great as to completely invalidate the solution and the changes you make as a result are an improvement on the current allocation of processes to CPUs. The above probably doesn't hold for some systems such as those large super computer jobs that run for several days but they're probably best served by explicit allocation of processes to CPUs using the process affinity mechanism. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
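The "three processes on a 2-cpu system" case can be checked with a toy rotation. This is a sketch, assuming always-runnable tasks, equal weights, fixed-length ticks and zero migration cost (the very cost the discussion above is worried about); `rotate_schedule` and the task names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def rotate_schedule(tasks, ncpus, ticks):
    """Each tick, run the first `ncpus` tasks in the rotation, then rotate by one.

    Returns {task: fraction of one CPU received over the whole run}.
    """
    ran = Counter()
    order = list(tasks)
    for _ in range(ticks):
        for task in order[:ncpus]:
            ran[task] += 1
        order = order[1:] + order[:1]   # the task at the head moves to the back
    return {task: Fraction(ran[task], ticks) for task in tasks}

usage = rotate_schedule(["A", "B", "C"], ncpus=2, ticks=300)
print(usage)
```

With 300 ticks each task runs in 200 of them, i.e. 2/3 of a CPU, which is the "66%" figure. Getting there takes a placement change on every tick in this toy model, so a real balancer has to trade that exactness against cache and migration costs, and for tasks that sleep and wake irregularly the long-run average may even out without forced rotation.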
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:48 ` [ck] " Mark Glines @ 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 18:02 ` William Lee Irwin III ` (2 subsequent siblings) 4 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 17:49 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU > time (scaled by some niceness level or not). Anybody who thinks > that is "fair" is just being silly and hasn't thought it through. yeah, very much so. But note that most of the reported CFS interactivity wins, as surprising as it might be, were due to fairness between _the same user's tasks_. In the typical case, 99% of the desktop CPU time is executed either as X (root user) or under the uid of the logged in user, and X is just one task. Even with a bad hack of making X super-high-prio, interactivity as experienced by users still sucks without having fairness between the other 100-200 user tasks that a desktop system is typically using. 'renicing X to -10' is a broken way of achieving: 'root uid should get its share of CPU time too, no matter how many user tasks are running'. We can do this much cleaner by saying: 'each uid, if it has any tasks running, should get its fair share of CPU time, independently of the number of tasks it is running'. In that sense 'fairness' is not global (and in fact it is almost _never_ a global property, as X runs under root uid [*]), it is only the most lowlevel scheduling machinery that can then be built upon. 
Higher-level controls to allocate CPU power between groups of tasks very much make sense - but according to the CFS interactivity test results i got from people so far, they very much need this basic fairness machinery _within_ the uid group too. So 'fairness' is still a powerful lower level scheduling concept. And this all makes lots of sense to me. One purpose of doing the hierarchical scheduling classes stuff was to enable such higher scope task group decisions too. Next i'll try to figure out whether 'task group bandwidth' logic should live right within sched_fair.c itself, or whether it should be layered separately as a sched_group.c. Intuitively i'd say it should live within sched_fair.c. Ingo [*] There are distributions where X does not run under root uid anymore. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:23 ` Linus Torvalds 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 17:59 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > In that sense 'fairness' is not global (and in fact it is almost > _never_ a global property, as X runs under root uid [*]), it is only > the most lowlevel scheduling machinery that can then be built upon. > [...] perhaps a more fitting term would be 'precise group-scheduling'. Within the lowest level task group entity (be that thread group or uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-18 19:40 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. Within > the lowest level task group entity (be that thread group or uid group, > etc.) 'precise scheduling' is equivalent to 'fairness'. Yes. Absolutely. Except I think that at least if you're going to name something "complete" (or "perfect" or "precise"), you should also admit that groups can be hierarchical. The "threads in a process" thing is a great example of a hierarchical group. Imagine if X was running as a collection of threads - then each server thread would no longer be more important than the clients! But if you have a mix of "bags of threads" and "single process" kind applications, then very arguably the single thread in a single traditional process should get as much time as the "bag of threads" process gets total.
So it really should be a hierarchical notion, where each thread is owned by one "process", and each process is owned by one "user", and each user is in one "virtual machine" - there's at least three different levels to this, and you'd want to schedule this thing top-down: virtual machines should be given CPU time "fairly" (which doesn't need to mean "equally", of course - nice-values could very well work at that level too), and then within each virtual machine users or "scheduling groups" should be scheduled fairly, and then within each scheduling group the processes should be scheduled, and within each process threads should equally get their fair share at _that_ level. And no, I don't think we necessarily need to do something quite that elaborate. But I think that's the kind of "obviously good goal" to keep in mind. Can we perhaps _approximate_ something like that by other means? For example, maybe we can approximate it by spreading out the statistics: right now you have things like - last_ran, wait_runtime, sum_wait_runtime.. be per-thread things. Maybe some of those can be spread out, so that you put a part of them in the "struct vm_struct" thing (to approximate processes), part of them in the "struct user" struct (to approximate the user-level thing), and part of it in a per-container thing for when/if we support that kind of thing? IOW, I don't think the scheduling "groups" have to be explicit boxes or anything like that. I suspect you can make do with just heuristics that penalize the same "struct user" and "struct vm_struct" to get overly much scheduling time, and you'll get the same _effect_. And I don't think it's wrong to look at the "one hundred processes by the same user" case as being an important case. But it should not be the *only* case or even necessarily the *main* case that matters.
I think a benchmark that literally does

	pid_t pid = fork();

	if (pid < 0)
		exit(1);
	if (pid) {
		if (setuid(500) < 0)
			exit(2);
		for (;;)
			/* Do nothing */;
	}
	if (setuid(501) < 0)
		exit(3);
	fork();
	for (;;)
		/* Do nothing in two processes */;

and I think that it's a really valid benchmark: if the scheduler gives 25% of time to each of the two processes of user 501, and 50% to user 500, then THAT is a good scheduler.

If somebody wants to actually write and test the above as a test-script, and add it to a collection of scheduler tests, I think that could be a good thing.

Linus

^ permalink raw reply [flat|nested] 713+ messages in thread
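[ Editorial sketch, not part of the original mail: the arithmetic behind the 50% / 25% / 25% expectation above can be written down as a small C helper. It compares flat per-task fairness with a two-level split (first between uids, then between each uid's tasks). The names struct task, flat_share() and grouped_share() are hypothetical and do not come from the CFS patch; this only computes ideal shares, it does not measure a running scheduler. ]

```c
#include <assert.h>

/* A runnable, CPU-bound task and the uid it runs under (hypothetical type). */
struct task {
	int uid;
};

/* Flat per-task fairness: every runnable task gets the same share. */
static double flat_share(int nr_tasks)
{
	assert(nr_tasks > 0);
	return 1.0 / nr_tasks;
}

/*
 * Two-level fairness: the CPU is split evenly between the uids that have
 * runnable tasks, then each uid's share is split evenly between its tasks.
 */
static double grouped_share(const struct task *tasks, int nr_tasks, int idx)
{
	int nr_uids = 0, nr_same_uid = 0;

	assert(idx >= 0 && idx < nr_tasks);
	for (int a = 0; a < nr_tasks; a++) {
		int seen_before = 0;

		/* Count each distinct uid once. */
		for (int b = 0; b < a; b++)
			if (tasks[b].uid == tasks[a].uid)
				seen_before = 1;
		if (!seen_before)
			nr_uids++;
		/* Count the tasks that share a uid with task idx. */
		if (tasks[a].uid == tasks[idx].uid)
			nr_same_uid++;
	}
	return 1.0 / nr_uids / nr_same_uid;
}
```

[ For the three spinning tasks of the benchmark (uids 500, 501, 501), grouped_share() yields 0.5, 0.25 and 0.25, whereas flat_share(3) gives each task one third, i.e. user 501 would get two thirds of the CPU just by forking once more. ]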
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds @ 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, maybe we can approximate it by spreading out the > statistics: right now you have things like > > - last_ran, wait_runtime, sum_wait_runtime.. > > be per-thread things. [...] yes, yes, yes! :) My thinking is "struct sched_group" embedded into _arbitrary_ other resource containers and abstractions, which sched_group's are then in a simple hierarchy and are driven by the core scheduling machinery. > [...] Maybe some of those can be spread out, so that you put a part of > them in the "struct vm_struct" thing (to approximate processes), part > of them in the "struct user" struct (to approximate the user-level > thing), and part of it in a per-container thing for when/if we support > that kind of thing? yes. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar @ 2007-04-18 20:07 ` Davide Libenzi 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 20:07 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > For example, maybe we can approximate it by spreading out the statistics: > right now you have things like > > - last_ran, wait_runtime, sum_wait_runtime.. > > be per-thread things. Maybe some of those can be spread out, so that you > put a part of them in the "struct vm_struct" thing (to approximate > processes), part of them in the "struct user" struct (to approximate the > user-level thing), and part of it in a per-container thing for when/if we > support that kind of thing? I think Ingo's idea of a new sched_group to contain the generic parameters needed for the "key" calculation, works better than adding more fields to existing structures (that would, of course, host pointers to it). Otherwise I can already see the struct_signal being the target for other unrelated fields :) - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 6:52 ` Mike Galbraith 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 21:48 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Davide Libenzi <davidel@xmailserver.org> wrote: > I think Ingo's idea of a new sched_group to contain the generic > parameters needed for the "key" calculation, works better than adding > more fields to existing structures (that would, of course, host > pointers to it). Otherwise I can already see the struct_signal being > the target for other unrelated fields :) yeah. Another detail is that for global containers like uids, the statistics will have to be percpu_alloc()-ed, both for correctness (runqueues are per CPU) and for performance. That's one reason why i dont think it's necessarily a good idea to group-schedule threads, we dont really want to do a per thread group percpu_alloc(). In fact for threads the _reverse_ problem exists, threaded apps tend to _strive_ for more performance - hence their desperation of using the threaded programming model to begin with ;) (just think of media playback apps which are typically multithreaded) I dont think threads are all that different. Also, the resource-conserving act of using CLONE_VM to share the VM (and to use a different programming environment like Java) should not be 'punished' by forcing the thread group to be accounted as a single, shared entity against other 'fat' tasks.
so my current impression is that we want per UID accounting to solve the X problem, the kernel threads problem and the many-users problem, but i'd not want to do it for threads just yet because for them there's not really any apparent problem to be solved. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar @ 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 17:39 ` Bernd Eckenfels 2007-04-19 6:52 ` Mike Galbraith 1 sibling, 2 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, Ingo Molnar wrote:

> That's one reason why i dont think it's necessarily a good idea to
> group-schedule threads, we dont really want to do a per thread group
> percpu_alloc().

I still do not have clear how much overhead this will bring into the table, but I think (like Linus was pointing out) the hierarchy should look like:

Top (VCPU maybe?)
    User
        Process
            Thread

The "run_queue" concept (and data) that now is bound to a CPU, need to be replicated in:

ROOT <- VCPUs add themselves here
    VCPU <- USERs add themselves here
        USER <- PROCs add themselves here
            PROC <- THREADs add themselves here
                THREAD (ultimate fine grained scheduling unit)

So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking up a new task would mean:

VCPU = ROOT->lookup();
USER = VCPU->lookup();
PROC = USER->lookup();
THREAD = PROC->lookup();

Run-time statistics should propagate back the other way around.

> In fact for threads the _reverse_ problem exists, threaded apps tend to
> _strive_ for more performance - hence their desperation of using the
> threaded programming model to begin with ;) (just think of media
> playback apps which are typically multithreaded)

The same user nicing two different multi-threaded processes would expect a predictable CPU distribution too. Doing that efficiently (the old per-cpu run-queue is pretty nice from many POVs) is the real challenge.
- Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
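[ Editorial sketch, not part of the original mail: the level-by-level lookup() chain and the upward propagation of run-time statistics described above could look roughly like this in userspace C. Everything here is an assumption made for illustration: struct node, add_child(), lookup(), pick_next() and charge() are hypothetical names, the "key" is simply the CPU time consumed so far, and a linear scan over a small array stands in for whatever ordered structure (e.g. CFS's rbtree) a real implementation would use. ]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical node: one ROOT, VCPU, USER, PROC or THREAD in the hierarchy. */
struct node {
	const char *name;
	unsigned long long runtime_ns;    /* CPU time consumed at or below this node */
	struct node *parent;
	struct node *child[MAX_CHILDREN]; /* this node's own "run_queue" */
	int nr_children;
};

/* A lower-level entity "adds itself" to the run_queue one level up. */
static void add_child(struct node *parent, struct node *c)
{
	assert(parent->nr_children < MAX_CHILDREN);
	c->parent = parent;
	parent->child[parent->nr_children++] = c;
}

/* lookup(): pick the child that has consumed the least CPU time so far. */
static struct node *lookup(struct node *q)
{
	struct node *best = NULL;

	for (int i = 0; i < q->nr_children; i++)
		if (!best || q->child[i]->runtime_ns < best->runtime_ns)
			best = q->child[i];
	return best;
}

/* Descend one level at a time until a leaf (a thread) is reached. */
static struct node *pick_next(struct node *root)
{
	struct node *n = root;

	while (n->nr_children > 0)
		n = lookup(n);
	return n;
}

/* Run-time statistics propagate back up, from the thread to the root. */
static void charge(struct node *thread, unsigned long long delta_ns)
{
	for (struct node *n = thread; n; n = n->parent)
		n->runtime_ns += delta_ns;
}
```

[ Calling charge(pick_next(&root), slice) in a loop on a tree with one single-threaded user and one user running two threads gives the first user's thread half of the slices and the other two threads a quarter each, which is the split Linus asks for in his fork()/setuid() benchmark. The per-pick cost grows with the depth of the hierarchy, which is the overhead concern raised in this subthread. ]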
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi 2007-04-21 14:09 ` Bill Davidsen 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 8:00 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

* Davide Libenzi <davidel@xmailserver.org> wrote:

> > That's one reason why i dont think it's necessarily a good idea to
> > group-schedule threads, we dont really want to do a per thread group
> > percpu_alloc().
>
> I still do not have clear how much overhead this will bring into the
> table, but I think (like Linus was pointing out) the hierarchy should
> look like:
>
> Top (VCPU maybe?)
>     User
>         Process
>             Thread
>
> The "run_queue" concept (and data) that now is bound to a CPU, need to be
> replicated in:
>
> ROOT <- VCPUs add themselves here
>     VCPU <- USERs add themselves here
>         USER <- PROCs add themselves here
>             PROC <- THREADs add themselves here
>                 THREAD (ultimate fine grained scheduling unit)
>
> So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking
> up a new task would mean:
>
> VCPU = ROOT->lookup();
> USER = VCPU->lookup();
> PROC = USER->lookup();
> THREAD = PROC->lookup();
>
> Run-time statistics should propagate back the other way around.

yeah, but this looks quite bad from an overhead POV ... i think we can do alot simpler to solve X and kernel threads prioritization.
> > In fact for threads the _reverse_ problem exists, threaded apps tend > > to _strive_ for more performance - hence their desperation of using > > the threaded programming model to begin with ;) (just think of media > > playback apps which are typically multithreaded) > > The same user nicing two different multi-threaded processes would > expect a predictable CPU distribution too. [...] i disagree that the user 'would expect' this. Some users might. Others would say: 'my 10-thread rendering engine is more important than a 1-thread job because it's using 10 threads for a reason'. And the CFS feedback so far strengthens this point: the default behavior of treating the thread as a single scheduling (and CPU time accounting) unit works pretty well on the desktop. think about it in another, 'kernel policy' way as well: we'd like to _encourage_ more parallel user applications. Hurting them by accounting all threads together sends the exact opposite message. > [...] Doing that efficiently (the old per-cpu run-queue is pretty nice > from many POVs) is the real challenge. yeah. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 8:00 ` Ingo Molnar @ 2007-04-19 15:43 ` Davide Libenzi 2007-04-21 14:09 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-19 15:43 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Thu, 19 Apr 2007, Ingo Molnar wrote: > i disagree that the user 'would expect' this. Some users might. Others > would say: 'my 10-thread rendering engine is more important than a > 1-thread job because it's using 10 threads for a reason'. And the CFS > feedback so far strengthens this point: the default behavior of treating > the thread as a single scheduling (and CPU time accounting) unit works > pretty well on the desktop. > > think about it in another, 'kernel policy' way as well: we'd like to > _encourage_ more parallel user applications. Hurting them by accounting > all threads together sends the exact opposite message. There are counter arguments too. Like, not every user knows if a certain process is MT or not. I agree though that doing accounting and fairness at a depth lower than USER is messy, and not only for performance. - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi @ 2007-04-21 14:09 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-21 14:09 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Davide Libenzi <davidel@xmailserver.org> wrote: >> The same user nicing two different multi-threaded processes would >> expect a predictable CPU distribution too. [...] > > i disagree that the user 'would expect' this. Some users might. Others > would say: 'my 10-thread rendering engine is more important than a > 1-thread job because it's using 10 threads for a reason'. And the CFS > feedback so far strengthens this point: the default behavior of treating > the thread as a single scheduling (and CPU time accounting) unit works > pretty well on the desktop. > If by desktop you mean "one and only one interactive user," that's true. On a shared machine it's very hard to preserve any semblance of fairness when one user gets far more than another, based not on the value of what they're doing but the tools they use to do it. > think about it in another, 'kernel policy' way as well: we'd like to > _encourage_ more parallel user applications. Hurting them by accounting > all threads together sends the exact opposite message. > Why is that? There are lots of things which are intrinsically single threaded, how are we hurting multi-threaded applications by refusing to give them more CPU than an application running on behalf of another user? By accounting all threads together we encourage writing an application in the most logical way. Threads are a solution, not a goal in themselves. >> [...]
Doing that efficiently (the old per-cpu run-queue is pretty nice >> from many POVs) is the real challenge. > > yeah. > > Ingo -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar @ 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 0 replies; 713+ messages in thread From: Bernd Eckenfels @ 2007-04-19 17:39 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.64.0704181515290.25880@alien.or.mcafeemobile.com> you wrote: > Top (VCPU maybe?) > User > Process > Thread The problem with that is, that not all Schedulers might work on the User level. You can think of Batch/Job, Parent, Group, Session or namespace level. That would be IMHO a generic Top, with no need for a level above. Greetings Bernd ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 2 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-19 6:52 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > so my current impression is that we want per UID accounting to solve the > X problem, the kernel threads problem and the many-users problem, but > i'd not want to do it for threads just yet because for them there's not > really any apparent problem to be solved. If you really mean UID vs EUID as Linus mentioned, I suppose I could learn to login as !root, and set KDE up to always give me root shells. With a heavily reniced X (perfectly fine), that should indeed solve my daily usage pattern nicely (always need godmode for shells, but not for mozilla and ilk. 50/50 split automatic without renice of entire gui) -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith @ 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:32 ` Mike Galbraith 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 7:09 UTC (permalink / raw) To: Mike Galbraith; +Cc: linux-kernel * Mike Galbraith <efault@gmx.de> wrote: > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not > for mozilla and ilk. 50/50 split automatic without renice of entire > gui) how about the first-approximation solution i suggested in the previous mail: to add a per UID default nice level? (With this default defaulting to '-10' for all root-owned processes, and defaulting to '0' for everything else.) That would solve most of the current CFS regressions at hand. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
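[ Editorial sketch, not part of the original mail: the "per UID default nice level" first approximation suggested above amounts to a small lookup. The C below is only an illustration under stated assumptions: struct uid_nice, builtin_default_nice() and default_nice_for_uid() are hypothetical names, and the override table is one guess at how a configurable per-uid default might be stored. Only the two default values (-10 for root-owned tasks, 0 for everything else) come from the mail. ]

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-uid override entry; not an existing kernel structure. */
struct uid_nice {
	uid_t uid;
	int nice;
};

/* The rule from the mail: -10 for root-owned tasks, 0 for everything else. */
static int builtin_default_nice(uid_t uid)
{
	return uid == 0 ? -10 : 0;
}

/* Use an explicit per-uid setting if one exists, else fall back to the rule. */
static int default_nice_for_uid(const struct uid_nice *tbl, size_t n, uid_t uid)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].uid == uid) {
			/* Nice levels range from -20 to 19. */
			assert(tbl[i].nice >= -20 && tbl[i].nice <= 19);
			return tbl[i].nice;
		}
	}
	return builtin_default_nice(uid);
}
```

[ Note that this only changes the starting nice level of each task. It is not uid group scheduling: a uid running many tasks still gets proportionally more CPU time than a uid running one. ]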
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:09 ` Ingo Molnar @ 2007-04-19 7:32 ` Mike Galbraith 2007-04-19 16:55 ` Davide Libenzi 0 siblings, 1 reply; 713+ messages in thread From: Mike Galbraith @ 2007-04-19 7:32 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > daily usage pattern nicely (always need godmode for shells, but not > > for mozilla and ilk. 50/50 split automatic without renice of entire > > gui) > > how about the first-approximation solution i suggested in the previous > mail: to add a per UID default nice level? (With this default defaulting > to '-10' for all root-owned processes, and defaulting to '0' for > everything else.) That would solve most of the current CFS regressions > at hand. That would make my kernel builds etc interfere with my other self's surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the X portion of my Joe-User activity pushes the compile portion of root down in bandwidth utilization automagically, which is exactly the right thing, because the root me in not as important as the Joe-User me using the GUI at that time. If the idea of X disturbing root upsets some, they can move X to another UID. Generally, it seems perfect for here. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:32 ` Mike Galbraith @ 2007-04-19 16:55 ` Davide Libenzi 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-19 16:55 UTC (permalink / raw) To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel On Thu, 19 Apr 2007, Mike Galbraith wrote: > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > daily usage pattern nicely (always need godmode for shells, but not > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > gui) > > > > how about the first-approximation solution i suggested in the previous > > mail: to add a per UID default nice level? (With this default defaulting > > to '-10' for all root-owned processes, and defaulting to '0' for > > everything else.) That would solve most of the current CFS regressions > > at hand. > > That would make my kernel builds etc interfere with my other self's > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > X portion of my Joe-User activity pushes the compile portion of root > down in bandwidth utilization automagically, which is exactly the right > thing, because the root me in not as important as the Joe-User me using > the GUI at that time. If the idea of X disturbing root upsets some, > they can move X to another UID. Generally, it seems perfect for here. Now guys, I did not follow the whole lengthy and feisty thread, but IIRC Con's scheduler has been attacked because, among other argouments, was requiring X to be reniced. This happened like a month ago IINM. I did not have time to look at Con's scheduler, and I only had a brief look at Ingo's one (looks very promising IMO, but so was the initial O(1) post before all the corner-cases fixes went in). 
But this is not about technical merit, this is about applying the same rules of judgement to others as well as to ourselves. We went from a "renicing X to -10 is bad because the scheduler should be able to correctly handle the problem w/out additional external plugs" to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads class, on top of all the tasks owned by root" [1]. From a spectator POV like myself in this case, this looks rather "unfair". [1] I think, before and now, that that's more a duct tape patch than a real solution. OTOH if the "solution" is gonna be another maze of macros and heuristics filled with pretty bad corner cases, I may prefer the former. - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 16:55 ` Davide Libenzi @ 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-20 5:16 UTC (permalink / raw) To: Davide Libenzi; +Cc: Ingo Molnar, linux-kernel On Thu, 2007-04-19 at 09:55 -0700, Davide Libenzi wrote: > On Thu, 19 Apr 2007, Mike Galbraith wrote: > > > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > > daily usage pattern nicely (always need godmode for shells, but not > > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > > gui) > > > > > > how about the first-approximation solution i suggested in the previous > > > mail: to add a per UID default nice level? (With this default defaulting > > > to '-10' for all root-owned processes, and defaulting to '0' for > > > everything else.) That would solve most of the current CFS regressions > > > at hand. > > > > That would make my kernel builds etc interfere with my other self's > > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > > X portion of my Joe-User activity pushes the compile portion of root > > down in bandwidth utilization automagically, which is exactly the right > > thing, because the root me in not as important as the Joe-User me using > > the GUI at that time. If the idea of X disturbing root upsets some, > > they can move X to another UID. Generally, it seems perfect for here. > > Now guys, I did not follow the whole lengthy and feisty thread, but IIRC > Con's scheduler has been attacked because, among other argouments, was > requiring X to be reniced. This happened like a month ago IINM. I don't object to renicing X if you want it to receive _more_ than it's fair share. I do object to having to renice X in order for it to _get_ it's fair share. 
That's what I attacked. > I did not have time to look at Con's scheduler, and I only had a brief > look at Ingo's one (looks very promising IMO, but so was the initial O(1) > post before all the corner-cases fixes went in). > But this is not a about technical merit, this is about applying the same > rules of judgement to others as well to ourselves. I'm running the same tests with CFS that I ran for RSDL/SD. It falls short in one key area (to me) in that X+client cannot yet split my box 50/50 with two concurrent tasks. In the CFS case, renicing both X and client does work, but it should not be necessary IMHO. With RSDL/SD renicing didn't help. > We went from a "renicing X to -10 is bad because the scheduler should > be able to correctly handle the problem w/out additional external plugs" > to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads > class, on top of all the tasks owned by root" [1]. > >From a spectator POV like myself in this case, this looks rather "unfair". Well, for me, the renicing I mentioned above is only interesting as a way to improve long term fairness with schedulers with no history. I found Linus' EUID idea intriguing in that by putting the server together with a steady load in one 'fair' domain, and clients in another, X can, if prioritized to empower it to do so, modulate the steady load in it's domain (but can't starve it!), the clients modulate X, and the steady load gets it all when X and clients are idle. The nice level of X determines to what _extent_ X can modulate the constant load rather like a mixer slider. The synchronous (I'm told) nature of X/client then becomes kind of an asset to the desktop instead of a liability. The specific case I was thinking about is the X+Gforce test where both RSDL and CFS fail to provide fairness (as defined by me;). X and Gforce are mostly not concurrent. The make -j2 I put them up against are mostly concurrent. 
I don't call giving 1/3 of my CPU to X+Client fair at _all_, but that's what you'll get if your fairstick of the instant generally can't see the fourth competing task. Seemed pretty cool to me because it creates the missing connection between client and server, though also likely complicated (and maybe full of perils, who knows). -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar @ 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-19 7:14 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Thu, 2007-04-19 at 08:52 +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > > > so my current impression is that we want per UID accounting to solve the > > X problem, the kernel threads problem and the many-users problem, but > > i'd not want to do it for threads just yet because for them there's not > > really any apparent problem to be solved. > > If you really mean UID vs EUID as Linus mentioned, I suppose I could > learn to login as !root, and set KDE up to always give me root shells. > > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not for > mozilla and ilk. 50/50 split automatic without renice of entire gui) Backward, needs to be EUID as Linus suggested. Kernel builds etc along with reniced X in root's bucket, surfing and whatnot in Joe-User's bucket. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. > > Within the lowest level task group entity (be that thread group or > > uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. > > Yes. Absolutely. Except I think that at least if you're going to name > somethign "complete" (or "perfect" or "precise"), you should also > admit that groups can be hierarchical. yes. Am i correct to sum up your impression as: " Ingo, for you the hierarchy still appears to be an after-thought, while in practice it's easily the most important thing! Why are you so hung up about 'fairness', it makes no sense!" right? and you would definitely be right if you suggested that i neglected the 'group scheduling' aspects of CFS (except for a minimalistic nice level implementation, which is a poor-man's-non-automatic-group-scheduling), but i very much know its important and i'll definitely fix it for -v4. But please let me explain my reasons for my different focus: yes, group scheduling in practice is the most important first-layer thing, and without it any of the other 'CFS wins' can easily be useless. Firstly, i have not neglected the group scheduling related CFS regressions at all, mainly because there _is_ already a quick hack to check whether group scheduling would solve these regressions: renice. 
And it was tried in both of the two CFS regression cases i'm aware of: Mike's X starvation problem and Willy's "kevents starvation with thousands of scheddos tasks running" problem. And in both cases, applying the renice hack [which should be properly and automatically implemented as uid group scheduling] fixed the regression for them! So i was not worried at all, group scheduling _provably solves_ these CFS regressions. I rather concentrated on the CFS regressions that were much less clear. But PLEASE believe me: even with perfect cross-group CPU allocation but with a simple non-heuristic scheduler underlying it, you can _easily_ get a sucky desktop experience! I know it because i tried it and others tried it too. (in fact the first version of sched_fair.c was tick based and low-res, and it sucked) Two more things were needed: - the high precision of nsec/64-bit accounting ('reliability of scheduling') - extremely even time-distribution of CPU power ('determinism/smoothness, human perception') (i'm expanding on these two concepts further below) take out any of these and group scheduling or not, you are easily going to have a sucky desktop! (We know that from years of experiments: many people tried to rip out the unfairness from the scheduler and there were always nasty corner cases that 'should' have worked but didnt.) Without these we'd in essence start again at square one, just at a different square, this time with another group of people being irritated! But the biggest and hardest to achieve _wins_ of CFS are _NOT_ achieved via a simple 'get rid of the unfairness of the upstream scheduler and apply group scheduling'. (I know that because i tried it before and because others tried it before, for many many years.) You will _easily_ get sucky desktop experience. The other two things are very much needed too: - the high precision of nsec/64-bit accounting, and the many corner-cases this solves. 
(For example on a typical desktop there are _lots_ of timing-driven workloads that are in essence 'invisible' to low-resolution, timer-tick based accounting and are heavily skewed.) - extremely even time-distribution of CPU power. CFS behaves pretty well even under the dreaded 'make -jN in an xterm' kernel build workload as reported by Mark Lord, because it also distributes CPU power in a _finegrained_ way. A shell prompt under CFS still behaves acceptably on a single-CPU testbox of mine with a "make -j50" workload. (yes, fifty) Humans react a lot more negatively to sudden changes in application behavior ('lags', pauses, short hangs) than they react to fine, gradual, all-encompassing slowdowns. This is a key property of CFS. ( Otherwise renicing X to -10 would have solved most of the interactivity complaints against the vanilla scheduler, otherwise renicing X to -10 would have fixed Mike's setup under SD (it didnt) while it worked much better under CFS, otherwise Gene wouldnt have found CFS markedly better than SD, etc., etc. So getting rid of the heuristics is less than 50% of the road to the perfect desktop scheduler. ) and i claim that these were the really hard bits, and i spent most of the CFS coding only on getting _these_ details 100% right under various workloads, and it makes a night and day difference _even without any group scheduling help_. and note another reason here: group scheduling _masks_ many other scheduling deficiencies that are possible in a scheduler. So since CFS doesnt do group scheduling, i get a _fuller_ picture of the behavior of the core "precise scheduling" engine. At the initial stage i didnt want to hide bugs by masking them via group scheduling, especially because the renice workaround/hack was available. Guess how nice it all will get if we also add group scheduling to the mix, and people wouldnt have to add nasty and fragile renice based hacks, it will 'just work' out of the box?
Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
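Ingo's remark above, that timing-driven workloads can be 'invisible' to timer-tick based accounting, can be illustrated with a toy model. The sketch below is only an assumption-laden illustration in Python, not code from the CFS patch or from the vanilla scheduler: the 1 ms tick, the function names and the synthetic task are all hypothetical. It charges a full tick to a task only if the task happens to be running at the instant the tick fires, and compares that with summing the exact nanosecond run intervals.

```python
TICK_NS = 1_000_000  # hypothetical 1 ms timer tick (HZ=1000), chosen for illustration


def tick_accounted_ns(run_intervals, total_ns, tick_ns=TICK_NS):
    """Toy tick accounting: charge one whole tick per tick that fires while the task runs."""
    charged = 0
    for t in range(0, total_ns, tick_ns):
        if any(start <= t < end for start, end in run_intervals):
            charged += tick_ns
    return charged


def precise_accounted_ns(run_intervals):
    """Toy precise accounting: sum the exact length of every run interval."""
    return sum(end - start for start, end in run_intervals)


# A synthetic timing-driven task: it wakes 100 us after every tick, runs for
# 800 us, and is asleep again before the next tick fires.
PERIODS = 100
intervals = [(k * TICK_NS + 100_000, k * TICK_NS + 900_000) for k in range(PERIODS)]
total_ns = PERIODS * TICK_NS

tick_view = tick_accounted_ns(intervals, total_ns)
precise_view = precise_accounted_ns(intervals)
print(f"tick-based view:  {tick_view / total_ns:.0%} of the CPU")
print(f"nanosecond view:  {precise_view / total_ns:.0%} of the CPU")
```

In this toy model the task is never running when a tick fires, so the tick-based view charges it nothing even though it uses most of the CPU. Real kernels are more subtle than this, but the sketch shows the kind of skew that higher-resolution accounting is meant to remove.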
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 19:56 ` Davide Libenzi 1 sibling, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-18 19:23 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > But note that most of the reported CFS interactivity wins, as surprising > as it might be, were due to fairness between _the same user's tasks_. And *ALL* of the CFS interactivity *losses* and complaints have been because it did the wrong thing _between different users' tasks_ So what's your point? Your point was that when people try it out as a single user, it is indeed fair. But that's no point at all, since it totally missed _my_ point. The problem with X scheduling is exactly that "other user" kind of thing. The problem with kernel thread starvation due to user threads getting all the CPU time is exactly the same issue. As long as you think that all threads are equal, and should be treated equally, you CANNOT make it work well. People can say "ok, you can renice X", but the whole problem stems from the fact that you're trying to be fair based on A TOTALLY INVALID NOTION of what "fair" is. > In the typical case, 99% of the desktop CPU time is executed either as X > (root user) or under the uid of the logged in user, and X is just one > task. So? You are ignoring the argument again. You're totally bringing up a red herring: > Even with a bad hack of making X super-high-prio, interactivity as > experienced by users still sucks without having fairness between the > other 100-200 user tasks that a desktop system is typically using. I didn't say that you should be *unfair* within one user group.
What kind of *idiotic* argument are you trying to put forth? OF COURSE you should be fair "within the user group". Nobody contests that the "other 100-200 user tasks" should be scheduled fairly _amongst themselves_. The only point I had was that you cannot just lump all threads together and say "these threads are equally important". The 100-200 user tasks may be equally important, and should get equal amounts of preference, but that has absolutely _zero_ bearing on the _single_ task run in another "scheduling group", ie by other users or by X. I'm not arguing against fairness. I'm arguing against YOUR notion of fairness, which is obviously bogus. It is *not* fair to try to give out CPU time evenly! Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:23 ` Linus Torvalds @ 2007-04-18 19:56 ` Davide Libenzi 2007-04-18 20:11 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 19:56 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > I'm not arguing against fairness. I'm arguing against YOUR notion of > fairness, which is obviously bogus. It is *not* fair to try to give out > CPU time evenly! "Perhaps on the rare occasion pursuing the right course demands an act of unfairness, unfairness itself can be the right course?" - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:56 ` Davide Libenzi @ 2007-04-18 20:11 ` Linus Torvalds 2007-04-19 0:22 ` Davide Libenzi 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-18 20:11 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > "Perhaps on the rare occasion pursuing the right course demands an act of > unfairness, unfairness itself can be the right course?" I don't think that's the right issue. It's just that "fairness" != "equal". Do you think it "fair" to pay everybody the same regardless of how good a job they do? I don't think anybody really believes that. Equating "fair" and "equal" is simply a very fundamental mistake. They're not the same thing. Never have been, and never will. Now, there's no question that "equal" is much easier to implement, if only because it's a lot easier to agree what it means. "Equal parts" is something everybody can agree on. "Fair parts" automatically involves a balancing act, and people will invariably count things differently and thus disagree about what is "fair" and what is not. I don't think we can ever get a "perfect" setup for that reason, but I think we can get something that at least gets reasonably close, at least for the obvious cases. So my suggested test-case of running one process as one user and two processes as another one has a fairly "obviously correct" solution if you have just one CPU, and you can probably be pretty fair in practice on two CPU's (there's an obvious theoretical solution, whether you can get there with a practical algorithm is another thing).
On three or more CPU's, you obviously wouldn't even *want* to be fair, since you can very naturally just give a CPU to each.. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
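Linus's test case above (one process run by one user, two processes run by another) can be worked through numerically. The message does not spell out the "obvious theoretical solution", so the sketch below is only one plausible reading: split the CPUs max-min fairly between users first, never give a single-threaded process more than one CPU, and then split each user's share between that user's processes. The function names and the user/process labels are hypothetical, and nothing here comes from the CFS patch.

```python
from fractions import Fraction


def water_fill(capacity, caps):
    """Max-min fair split of `capacity` among claimants, each limited to caps[name]."""
    alloc = {name: Fraction(0) for name in caps}
    active = set(caps)
    remaining = Fraction(capacity)
    while active and remaining > 0:
        share = remaining / len(active)
        # Claimants whose remaining headroom fits within an equal share get capped.
        saturated = {n for n in active if caps[n] - alloc[n] <= share}
        if not saturated:
            for n in active:
                alloc[n] += share
            break
        for n in saturated:
            remaining -= caps[n] - alloc[n]
            alloc[n] = caps[n]
        active -= saturated
    return alloc


def per_process_cpu(ncpus, users):
    """Two-level split: between users first, then between each user's processes."""
    # Assumption: every process is single-threaded, so it can use at most 1 CPU.
    user_caps = {user: Fraction(len(procs)) for user, procs in users.items()}
    user_alloc = water_fill(ncpus, user_caps)
    result = {}
    for user, procs in users.items():
        result.update(water_fill(user_alloc[user], {p: Fraction(1) for p in procs}))
    return result


users = {"alice": ["a1"], "bob": ["b1", "b2"]}
for ncpus in (1, 2, 3):
    shares = per_process_cpu(ncpus, users)
    print(ncpus, {proc: str(cpu) for proc, cpu in shares.items()})
```

Under these assumptions, one CPU gives the lone process half a CPU and the other two a quarter each; two CPUs give a full CPU to the lone process and half a CPU to each of the others; with three CPUs every process simply gets its own CPU, which matches the remark that fairness stops being a concern there.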
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:11 ` Linus Torvalds @ 2007-04-19 0:22 ` Davide Libenzi 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-19 0:22 UTC (permalink / raw) To: Linus Torvalds Cc: Davide Libenzi, Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > On Wed, 18 Apr 2007, Davide Libenzi wrote: > > > > "Perhaps on the rare occasion pursuing the right course demands an act of > > unfairness, unfairness itself can be the right course?" > > I don't think that's the right issue. > > It's just that "fairness" != "equal". > > Do you think it "fair" to pay everybody the same regardless of how good a > job they do? I don't think anybody really believes that. > > Equating "fair" and "equal" is simply a very fundamental mistake. They're > not the same thing. Never have been, and never will. I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 0:22 ` Davide Libenzi @ 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-19 0:30 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) Ahh, I'm clearly not cultured enough, I didn't catch that reference. Linus "yes, I've seen the movie, but it apparently left more of a mark in other people" Torvalds ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:48 ` [ck] " Mark Glines 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 18:02 ` William Lee Irwin III 2007-04-18 18:12 ` Ingo Molnar 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 4 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-18 18:02 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the antithesis of being fair! > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the user's fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > So you should see one thread get 50% CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that user's time), and 2 threads > get 5% CPU (the fair share within that thread group!). > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! I don't think it's completely fair [sic] to come down on it that hard. It does largely achieve the sort of fairness it set out for itself as its design goal.
One should also note that the queueing mechanism is more than flexible enough to handle prioritization by a number of different methods, and the large precision of its priorities is useful there. So a rather broad variety of policies can be implemented by changing the ->fair_key calculations. In some respects, the vast priority space and very high clock precision are two of its most crucial advantages. On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! The name chosen was somewhat buzzwordy. I'd have named it something more descriptive of the algorithm, though what's implemented in the current dynamic priority (i.e. ->fair_key) calculations is somewhat difficult to precisely categorize. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:12 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 18:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > It does largely achieve the sort of fairness it set out for itself as > its design goal. One should also note that the queueing mechanism is > more than flexible enough to handle prioritization by a number of > different methods, and the large precision of its priorities is useful > there. So a rather broad variety of policies can be implemented by > changing the ->fair_key calculations. yeah. Note that i concentrated on the bit that makes the largest interactivity improvement: to implement "precise scheduling" (a.k.a. complete fairness) between the 100+ user tasks that do a complex scheduling dance on a typical desktop on various workloads. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds ` (2 preceding siblings ...) 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 4 siblings, 0 replies; 713+ messages in thread From: Diego Calleja @ 2007-04-18 18:36 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007 10:22:59 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> wrote: > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the user's fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. "Fairness between users" was implemented a long time ago by Rik van Riel (http://surriel.com/patches/2.4/2.4.19-fairsched). Some people have been asking for functionality like that for a long time, e.g. universities that want to keep the gcc processes of one student who is trying to learn how fork() works from starving the processes of the rest of the students. But they want not only "fairness between users", they also want "priorities between users and/or groups of users", e.g. "the 'students' group shouldn't starve the 'admins' group". ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds ` (3 preceding siblings ...) 2007-04-18 18:36 ` Diego Calleja @ 2007-04-19 0:37 ` Peter Williams 4 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-19 0:37 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: >>> And "fairness by euid" is probably a hell of a lot easier to do than >>> trying to figure out the wakeup matrix. >> For the record, you actually don't need to track a whole NxN matrix >> (or do the implied O(n**3) matrix inversion!) to get to the same >> result. > > I'm sure you can do things differently, but the reason I think "fairness > by euid" is actually worth looking at is that it's pretty much the > *identical* issue that we'll have with "fairness by virtual machine" and a > number of other "container" issues. > > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU time > (scaled by some niceness level or not). Anybody who thinks that is > "fair" is just being silly and hasn't thought it through. > > - "fairness" is multi-level. You want to be fair to threads within a > thread group (where "process" may be one good approximation of what a > "thread group" is, but not necessarily the only one). > > But you *also* want to be fair in between those "thread groups", and > then you want to be fair across "containers" (where "user" may be one > such container). > > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. 
I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the antithesis of being fair! > > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the user's fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > > So you should see one thread get 50% CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that user's time), and 2 threads > get 5% CPU (the fair share within that thread group!). > > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! > > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! > > Linus Sounds a lot like the PLFS (process level fair sharing) scheduler in Aurema's ARMTech (for whom I used to work). The "fair" in the title is a bit misleading as it's all about unfair scheduling in order to meet specific policies. But it's based on the principle that if you can allocate CPU bandwidth "fairly" (which really means in proportion to the entitlement each process is allocated) then you can allocate CPU bandwidth "fairly" between higher level entities such as process groups, user groups and so on by subdividing the entitlements downwards.
The tricky part of implementing this was the fact that not all entities at the various levels have sufficient demand for CPU bandwidth to use their entitlements and this in turn means that the entities above them will have difficulty using their entitlements even if others of their subordinates have sufficient demand (because their entitlements will be too small). The trick is to have a measure of each entity's demand for CPU bandwidth and use that to modify the way entitlement is divided among subordinates. As a first guess, an entity's CPU bandwidth usage is an indicator of demand but doesn't take into account unmet demand due to tasks waiting on a run queue waiting for access to the CPU. On the other hand, usage plus time waiting on the queue isn't a good measure of demand either (although it's probably a good upper bound) as it's unlikely that the task would have used the same amount of CPU as the waiting time if it had gone straight to the CPU. But my main point is that it is possible to build schedulers that can achieve higher level scheduling policies. Versions of PLFS work on Windows from user space by twiddling process priorities. Part of my more recent work at Aurema had been involved in patching Linux's scheduler so that nice worked more predictably so that we could release a user space version of PLFS for Linux. The other part was to add hard CPU bandwidth caps for processes so that ARMTech could enforce hard CPU bandwidth caps on higher level entities (as this can't be done without the kernel being able to do it at that level). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
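The 2-user example Linus gives in the quoted text, and the idea of "subdividing the entitlements downwards" that Peter describes, amount to splitting a share recursively down a tree of users, programs and threads. The sketch below is a toy illustration in Python and not ARMTech's PLFS or any kernel code: the tree layout, the names and the equal-split rule are assumptions, and it deliberately ignores the hard part Peter points out, namely entities that lack the demand to use their entitlement.

```python
from fractions import Fraction


def leaf_shares(node, share=Fraction(1), path=""):
    """Split `share` equally among a node's children, recursively down to the threads."""
    name, children = node
    full_path = f"{path}/{name}" if path else name
    if not children:
        return {full_path: share}
    shares = {}
    for child in children:
        shares.update(leaf_shares(child, share / len(children), full_path))
    return shares


def thread(name):
    """A leaf of the tree: a thread has no children."""
    return (name, [])


# Hypothetical layout of the quoted example: user1 runs one single-threaded
# program, user2 runs five programs, one of which has two threads.
machine = ("cpu", [
    ("user1", [("prog1", [thread("t1")])]),
    ("user2", [
        ("prog1", [thread("t1"), thread("t2")]),
        ("prog2", [thread("t1")]),
        ("prog3", [thread("t1")]),
        ("prog4", [thread("t1")]),
        ("prog5", [thread("t1")]),
    ]),
])

shares = leaf_shares(machine)
for path, share in shares.items():
    print(f"{path:22s} {float(share):6.1%}")

# The flat "7 threads total" view gives every thread the same slice instead.
flat_share = Fraction(1, len(shares))
print(f"flat per-thread share  {float(flat_share):6.1%}")
```

With these assumptions the hierarchical split yields one thread at 50%, four at 10% and two at 5%, while the flat view gives each of the seven threads one seventh of the CPU.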
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 19:05 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. Doing that within the time constraints imposed by a scheduler is the interesting part, given also that the size (and membership) of the matrix is dynamic. Also, a "wakeup matrix" (if the name correctly pictures what it is for) would help with latencies and priority inheritance, but not with global fairness. The maniacal fairness focus we're seeing now is due to the fact that mainline can have extremely unfair behaviour under certain conditions. IMO fairness, although important, should not be the main objective of the scheduler rewrite.
Simplification and predictability should be a higher priority, with interactivity achievements bound to decent fairness constraints. - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 19:05 ` Davide Libenzi @ 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 713+ messages in thread From: Michael K. Edwards @ 2007-04-18 19:13 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/18/07, Matt Mackall <mpm@selenic.com> wrote: > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. That would be the control theory approach. And yes, you have to get both the theoretical transfer function and the numerics right. It sometimes helps to use a control-systems framework like the classic Takagi-Sugeno-Kang fuzzy logic controller; get the numerics right once and for all, and treat the heuristics as data, not logic. (I haven't worked in this area in almost twenty years, but Google -- yes, I do use Google+brain for fact-checking; what do you do? -- says that people are still doing active research on TSK models, and solid fixed-point reference implementations are readily available.) That seems like an attractive strategy here because you could easily embed the control engine in the kernel and load rule sets dynamically. 
Done right, that could give most of the advantages of pluggable schedulers (different heuristic strokes for different folks) without diluting the tester pool for the actual engine code. (Of course, different scheduling strategies require different input data, and you might not want the overhead of collecting data that your chosen heuristics won't use. But that's not much different from the netfilter situation, and is obviously a solvable problem, if anyone cares to put that much work in. The people who ought to be funding this kind of work are Sun and IBM, who don't have a chance on the desktop and are in big trouble in the database tier; their future as processor vendors depends on being able to service presentation-tier and business-logic-tier loads efficiently on their massively multi-core chips. MIPS should pitch in too, on behalf of licensees like Cavium who need more predictable behavior on multi-core embedded Linux.) Note also that you might not even want to persistently prioritize particular processes or process groups. You might want a heuristic that notices that some task (say, the X server) often responds to being awakened by doing a little work and then unblocking the task that awakened it. When it is pinged from some highly interactive task, you want it to jump the scheduler queue just long enough to unblock the interactive task, which may mean letting it flush some work out of its internal queue. But otherwise you want to batch things up until there's too much "scheduler pressure" behind it, then let it work more or less until it runs out of things to do, because its working set is so large that repeatedly scheduling it in and out is hell on caches. (Priority inheritance is the classic solution to the blocked-high-priority-task problem _in_isolation_. It is not without its pitfalls, especially when the designer of the "server" didn't expect to lose his timeslice instantly on releasing the lock. 
True priority inheritance is probably not something you want to inflict on a non-real-time system, but you do need some urgency heuristic. What a "fuzzy logic" framework does for you is to let you combine competing heuristics in a way that remains amenable to analysis using control theory techniques.) What does any of this have to do with "fairness"? Nothing whatsoever! There's work that has to be done, and choosing when to do it is almost entirely a matter of staying out of the way of more urgent work while minimizing the task's negative impact on the rest of the system. Does that mean that the X server is "special", kind of the way that latency-sensitive A/V applications are "special", and belongs in a separate scheduler class? No. Nowadays, workloads where the kernel has any idea what tasks belong to what "users" are the exception, not the norm. The X server is the canary in the coal mine, and a scheduler that won't do the right thing for X without hand tweaking won't do the right thing for other eyeball-driven, multiple-tiers-on-one-box scenarios either. If you want fairness among users to the extent that their demands _compete_, you might as well partition the whole machine, and have a separate fairness-oriented scheduler (let's call it a "hypervisor") that lives outside the kernel. (Talk about two students running gcc on the same shell server, with more important people also doing things on the same system, is so 1990's!) Not that the design of scheduler heuristics shouldn't include "fairness"-like considerations; but they're probably only interesting as a fallback for when the scheduler has no idea what it ought to schedule next. So why is Ingo's scheduler apparently working well for desktop loads? I haven't tried it or even looked at its code, but from its marketing I would guess that it effectively penalizes tasks whose I/O requests can be serviced from (or directed to) cache long enough to actually consume a whole timeslice. 
This is prima facie evidence that their _current_behavior_ is non-interactive. Presumably this penalty expires quickly when the task again asks for information that is not readily at hand (or writes data that the system is not willing to cache) -- which usually implies either actual user interaction or a change of working set, both of which deserve an "urgency premium". The mainline scheduler seems to contain various heuristics that mistake a burst of non-interactive _activity_ for a persistently non-interactive _task_. Take them away in the name of "fairness", and the system adapts more quickly to the change of working set implied by a change of user focus. There are probably fewer pathological load patterns too, since manual knob-turning uninformed by control theory is a lot less likely to get you into trouble when there are few knobs and no deliberately inserted long-time-constant feedback paths. But you can't say there are _no_ pathological load patterns, or even that the major economic drivers of the Linux ecosystem don't generate them, until you do some authentic engineering analysis. In short (too late!) -- alternate schedulers are fun to experiment with, and the sort of people who would actually try out patches floated on LKML may find that they improve their desktop experience, hosting farm throughput, etc. But even if the mainline scheduler is a hack atop a kludge covering a crock, it's more or less what production applications have expected since the last major architectural shift (NPTL). There's just no sense in replacing it until you can either add real value (say, integral clock scaling for power efficiency, with a reasonable "spinning reserve" for peaking load) or demonstrate stability by engineering analysis instead of trial and error. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall @ 2007-04-19 3:18 ` Nick Piggin 2007-04-19 5:14 ` Andrew Morton 2007-04-21 13:40 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Bill Davidsen 2 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-19 3:18 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > > On Wed, 18 Apr 2007, Matt Mackall wrote: > > > > Why is X special? Because it does work on behalf of other processes? > > Lots of things do this. Perhaps a scheduler should focus entirely on > > the implicit and directed wakeup matrix and optimizing that > > instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. I believe that unless the kernel is told of these inequalities, then it must schedule fairly. And yes, by fairly, I mean fairly among all threads as a base resource class, because that's what Linux has always done (and if you aggregate into higher classes, you still need that per-thread scheduling). So I'm not excluding extra scheduling classes like per-process, per-user, but among any class of equal schedulable entities, fair scheduling is the only option because the alternative of unfairness is just insane. 
> That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. Well my X server has an euid of root, which would mean my X clients can cause X to do work and eat into root's resources. Or as Ingo said, X may not be running as root. Seems like just another hack to try to implicitly solve the X problem and probably create a lot of others along the way. All fairness issues aside, in the context of keeping a very heavily loaded desktop interactive, X is special. That you are trying to think up funny rules that would implicitly give X better priority is kind of indicative of that. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 3:18 ` Nick Piggin @ 2007-04-19 5:14 ` Andrew Morton 2007-04-19 6:38 ` Ingo Molnar 0 siblings, 1 reply; 713+ messages in thread From: Andrew Morton @ 2007-04-19 5:14 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, 19 Apr 2007 05:18:07 +0200 Nick Piggin <npiggin@suse.de> wrote: > And yes, by fairly, I mean fairly among all threads as a base resource > class, because that's what Linux has always done Yes, there are potential compatibility problems. Example: a machine with 100 busy httpd processes and suddenly a big gzip starts up from console or cron. Under current kernels, that gzip will take ages and the httpds will take a 1% slowdown, which may well be exactly the behaviour which is desired. If we were to schedule by UID then the gzip suddenly gets 50% of the CPU and those httpd's all take a 50% hit, which could be quite serious. That's simple to fix via nicing, but people have to know to do that, and there will be a transition period where some disruption is possible. ^ permalink raw reply [flat|nested] 713+ messages in thread
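The arithmetic behind the httpd/gzip example above can be checked with a short sketch. It assumes a single CPU, 100 always-busy httpd threads under one UID and one gzip under another, and idealised "equal split" policies; the function names are illustrative only and do not correspond to any kernel interface.

```python
def per_thread_shares(groups):
    """Split one CPU equally among all runnable threads.

    `groups` maps a user name to its number of busy threads.
    Returns the CPU share that a single thread of each user receives.
    """
    total_threads = sum(groups.values())
    return {user: 1.0 / total_threads for user in groups}


def per_uid_shares(groups):
    """Split one CPU equally among users, then among each user's threads."""
    per_user = 1.0 / len(groups)
    return {user: per_user / count for user, count in groups.items()}


# Hypothetical workload taken from the example in the message above.
workload = {"httpd": 100, "gzip": 1}

before = 1.0 / workload["httpd"]      # share of one httpd before gzip starts
by_thread = per_thread_shares(workload)
by_uid = per_uid_shares(workload)

for name, shares in (("per-thread", by_thread), ("per-UID", by_uid)):
    slowdown = 1.0 - shares["httpd"] / before
    print(f"{name}: gzip gets {shares['gzip']:.1%} of the CPU, "
          f"each httpd slows down by {slowdown:.1%}")
```

Under these assumptions, per-thread fairness gives gzip about 1% of the CPU and costs each httpd about 1%, while per-UID fairness gives gzip 50% and halves every httpd's share, which matches the figures quoted in the message.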
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 5:14 ` Andrew Morton @ 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 6:38 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: > > And yes, by fairly, I mean fairly among all threads as a base > > resource class, because that's what Linux has always done > > Yes, there are potential compatibility problems. Example: a machine > with 100 busy httpd processes and suddenly a big gzip starts up from > console or cron. > > Under current kernels, that gzip will take ages and the httpds will > take a 1% slowdown, which may well be exactly the behaviour which is > desired. > > If we were to schedule by UID then the gzip suddenly gets 50% of the > CPU and those httpd's all take a 50% hit, which could be quite > serious. > > That's simple to fix via nicing, but people have to know to do that, > and there will be a transition period where some disruption is > possible. hmmmm. How about the following then: default to nice -10 for all (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ special: root already has disk space reserved to it, root has special memory allocation allowances, etc. I dont see a reason why we couldnt by default make all root tasks have nice -10. This would be instantly loved by sysadmins i suspect ;-) (distros that go the extra mile of making Xorg run under non-root could also go another extra one foot to renice that X server to -10.) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar @ 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 11:50 ` Peter Williams 2007-04-19 8:33 ` Nick Piggin 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas 2 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-19 7:57 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: >> Yes, there are potential compatibility problems. Example: a machine >> with 100 busy httpd processes and suddenly a big gzip starts up from >> console or cron. [...] On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) > (distros that go the extra mile of making Xorg run under non-root could > also go another extra one foot to renice that X server to -10.) I'd further recommend making priority levels accessible to kernel threads that are not otherwise accessible to processes, both above and below user-available priority levels. Basically, if you can get SCHED_RR and SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN scheduler class can coexist with SCHED_OTHER in like fashion, but with availability of higher and lower priorities than any userspace process is allowed, and potentially some differing scheduling semantics. 
In such a manner nonessential background processing intended not to ever disturb userspace can be given priorities appropriate to it (perhaps even con's SCHED_IDLEPRIO would make sense), and other, urgent processing can be given priority over userspace altogether. I believe root's default priority can be adjusted in userspace as things now stand somewhere in /etc/ but I'm not sure of the specifics. Word is somewhere in /etc/security/limits.conf -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 11:50 ` Peter Williams 2007-04-20 5:26 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-19 11:50 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: >>> Yes, there are potential compatibility problems. Example: a machine >>> with 100 busy httpd processes and suddenly a big gzip starts up from >>> console or cron. > [...] > > On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: >> hmmmm. How about the following then: default to nice -10 for all >> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ >> special: root already has disk space reserved to it, root has special >> memory allocation allowances, etc. I dont see a reason why we couldnt by >> default make all root tasks have nice -10. This would be instantly loved >> by sysadmins i suspect ;-) >> (distros that go the extra mile of making Xorg run under non-root could >> also go another extra one foot to renice that X server to -10.) > > I'd further recommend making priority levels accessible to kernel threads > that are not otherwise accessible to processes, both above and below > user-available priority levels. Basically, if you can get SCHED_RR and > SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN > scheduler class can coexist with SCHED_OTHER in like fashion, but with > availability of higher and lower priorities than any userspace process > is allowed, and potentially some differing scheduling semantics. 
In such > a manner nonessential background processing intended not to ever disturb > userspace can be given priorities appropriate to it (perhaps even con's > SCHED_IDLEPRIO would make sense), and other, urgent processing can be > given priority over userspace altogether. > > I believe root's default priority can be adjusted in userspace as > things now stand somewhere in /etc/ but I'm not sure of the specifics. > Word is somewhere in /etc/security/limits.conf This is sounding very much like System V Release 4 (and descendants) except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that are in system mode dynamic priorities in the SCHED_SYS range (to avoid priority inversion, I believe). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 11:50 ` Peter Williams @ 2007-04-20 5:26 ` William Lee Irwin III 2007-04-20 6:16 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-20 5:26 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> I'd further recommend making priority levels accessible to kernel threads >> that are not otherwise accessible to processes, both above and below >> user-available priority levels. Basically, if you can get SCHED_RR and >> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >> scheduler class can coexist with SCHED_OTHER in like fashion, but with >> availability of higher and lower priorities than any userspace process >> is allowed, and potentially some differing scheduling semantics. In such >> a manner nonessential background processing intended not to ever disturb >> userspace can be given priorities appropriate to it (perhaps even con's >> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >> given priority over userspace altogether. On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: > This is sounding very much like System V Release 4 (and descendants) > except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that > are in system mode dynamic priorities in the SCHED_SYS range (to avoid > priority inversion, I believe). Descriptions of that are probably where I got the idea (hurrah for OS textbooks). It makes a fair amount of sense. Not sure what the take on the specific precedent is. The only content here is expanding the priority range with ranges above and below for the exclusive use of ultra-privileged tasks, so it's really trivial. 
Actually it might be so trivial it should just be some permission checks in the SCHED_OTHER renicing code. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 5:26 ` William Lee Irwin III @ 2007-04-20 6:16 ` Peter Williams 0 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-20 6:16 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> I'd further recommend making priority levels accessible to kernel threads >>> that are not otherwise accessible to processes, both above and below >>> user-available priority levels. Basically, if you can get SCHED_RR and >>> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >>> scheduler class can coexist with SCHED_OTHER in like fashion, but with >>> availability of higher and lower priorities than any userspace process >>> is allowed, and potentially some differing scheduling semantics. In such >>> a manner nonessential background processing intended not to ever disturb >>> userspace can be given priorities appropriate to it (perhaps even con's >>> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >>> given priority over userspace altogether. > > On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: >> This is sounding very much like System V Release 4 (and descendants) >> except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that >> are in system mode dynamic priorities in the SCHED_SYS range (to avoid >> priority inversion, I believe). > > Descriptions of that are probably where I got the idea (hurrah for OS > textbooks). And long term background memory. :-) > It makes a fair amount of sense. Yes. You could also add a SCHED_IA in between SCHED_SYS and SCHED_OTHER (a la Solaris) for interactive tasks. The only problem is how to get a task into SCHED_IA without root privileges. 
> Not sure what the take on > the specific precedent is. The only content here is expanding the > priority range with ranges above and below for the exclusive use of > ultra-privileged tasks, so it's really trivial. Actually it might be so > trivial it should just be some permission checks in the SCHED_OTHER > renicing code. Perhaps. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 8:33 ` Nick Piggin 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas 2 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-19 8:33 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > And yes, by fairly, I mean fairly among all threads as a base > > > resource class, because that's what Linux has always done > > > > Yes, there are potential compatibility problems. Example: a machine > > with 100 busy httpd processes and suddenly a big gzip starts up from > > console or cron. > > > > Under current kernels, that gzip will take ages and the httpds will > > take a 1% slowdown, which may well be exactly the behaviour which is > > desired. > > > > If we were to schedule by UID then the gzip suddenly gets 50% of the > > CPU and those httpd's all take a 50% hit, which could be quite > > serious. > > > > That's simple to fix via nicing, but people have to know to do that, > > and there will be a transition period where some disruption is > > possible. > > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) I have no problem with doing fancy new fairness classes and things. 
But considering that we _need_ to have per-thread fairness and that is also what the current scheduler has and what we need to do well for obvious reasons, the best path to take is to get per-thread scheduling up to a point where it is able to replace the current scheduler, then look at more complex things after that. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Renice X for cpu schedulers 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 8:33 ` Nick Piggin @ 2007-04-19 11:59 ` Con Kolivas 2007-04-19 12:42 ` Peter Williams ` (3 more replies) 2 siblings, 4 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-19 11:59 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Ok, there are 3 known schedulers currently being "promoted" as solid replacements for the mainline scheduler which address most of the issues with mainline (and about 10 other ones not currently being promoted). The main way they do this is through attempting to maintain solid fairness. There is enough evidence mounting now from the numerous test cases fixed by much fairer designs that this is the way forward for a general purpose cpu scheduler which is what linux needs. Interactivity of just about everything that needs low latency (i.e. audio and video players) is easily managed by maintaining low latency between wakeups and scheduling of all these low cpu users. The one fly in the ointment for linux remains X. I am still, to this moment, completely and utterly stunned at why everyone is trying to find increasingly complex unique ways to manage X when all it needs is more cpu[1]. Now most of these are actually very good ideas about _extra_ features that would be desirable in the long run for linux, but given the ludicrous simplicity of renicing X I cannot fathom why people keep promoting these alternatives. At the time of 2.6.0 coming out we were desperately trying to get half decent interactivity within a reasonable time frame to release 2.6.0 without rewiring the whole scheduler. So I tweaked the crap out of the tunables that were already there[2]. So let's hear from the 3 people who generated the schedulers under the spotlight.
These are recent snippets and by no means the only time these comments have been said. Without sounding too bold, we do know a thing or two about scheduling. CFS: On Thursday 19 April 2007 16:38, Ingo Molnar wrote: > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) > > (distros that go the extra mile of making Xorg run under non-root could > also go another extra one foot to renice that X server to -10.) Nicksched: On Wednesday 18 April 2007 15:00, Nick Piggin wrote: > What's wrong with allowing X to get more than it's fair share of CPU > time by "fiddling with nice levels"? That's what they're there for. and Staircase-Deadline: On Thursday 19 April 2007 09:59, Con Kolivas wrote: > Remember to renice X to -10 for nicest desktop behaviour :) [1]The one caveat I can think of is that when you share X sessions across multiple users -with a fair cpu scheduler-, having them all nice 0 also makes the distribution of cpu across the multiple users very even and smooth, without the expense of burning away the other person's cpu time they'd like for compute intensive non gui things. If you make a scheduler that always favours X this becomes impossible. I've had enough users offlist ask me to help them set up multiuser environments just like this with my schedulers because they were unable to do it with mainline, even with SCHED_BATCH, short of nicing everything +19. This makes the argument for not favouring X within the scheduler with tweaks even stronger. 
[2] Nick was promoting renicing X with his Nicksched alternative at the time of 2.6.0 and while we were not violently opposed to renicing X, Nicksched was still very new on the scene and didn't have the luxury of extended testing and reiteration in time for 2.6 and he put the project on hold for some time after that. The correctness of his views on renicing certainly has become more obvious over time. So yes go ahead and think up great ideas for other ways of metering out cpu bandwidth for different purposes, but for X, given the absurd simplicity of renicing, why keep fighting it? Again I reiterate that most users of SD have not found the need to renice X anyway except if they stick to old habits of make -j4 on uniprocessor and the like, and I expect that those on CFS and Nicksched would also have similar experiences. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
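For readers who want to see what "renicing" amounts to programmatically, here is a minimal sketch using Python's `os.setpriority`, which wraps the same system call that the `renice` command uses. The helper name `renice` is made up for this illustration; finding the X server's pid is left out, and lowering a nice value (for example to -10) normally requires root or CAP_SYS_NICE, so the helper reports failure instead of raising.

```python
import os


def renice(pid, niceness):
    """Try to set the nice value of process `pid`; return True on success.

    Negative values such as -10 usually need root or CAP_SYS_NICE, so a
    permission failure is reported as False rather than raised.
    """
    try:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
    except (PermissionError, ProcessLookupError):
        return False
    return True


# Harmless demonstration: re-apply this process's current nice value.
# For the case discussed in the thread one would pass the X server's pid
# and -10 instead (hypothetical usage, not executed here).
me = os.getpid()
current = os.getpriority(os.PRIO_PROCESS, me)
print("current nice value:", current)
print("re-applying it succeeded:", renice(me, current))
```

The one-line shell equivalent is the standard `renice` utility run as root against the X server's pid.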
* Re: Renice X for cpu schedulers 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas @ 2007-04-19 12:42 ` Peter Williams 2007-04-19 13:20 ` Peter Williams 2007-04-19 13:17 ` Mark Lord ` (2 subsequent siblings) 3 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-19 12:42 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Con Kolivas wrote: > Ok, there are 3 known schedulers currently being "promoted" as solid > replacements for the mainline scheduler which address most of the issues with > mainline (and about 10 other ones not currently being promoted). The main way > they do this is through attempting to maintain solid fairness. There is > enough evidence mounting now from the numerous test cases fixed by much > fairer designs that this is the way forward for a general purpose cpu > scheduler which is what linux needs. > > Interactivity of just about everything that needs low latency (ie audio and > video players) are easily managed by maintaining low latency between wakeups > and scheduling of all these low cpu users. On a "fair" scheduler these will all get high priority (and good response) because their CPU bandwidth usage will be much smaller than their entitlement and the scheduler will be trying to help them "catch up". So (as you say) they shouldn't be a problem. > The one fly in the ointment for > linux remains X. I am still, to this moment, completely and utterly stunned > at why everyone is trying to find increasingly complex unique ways to manage > X when all it needs is more cpu[1]. Now most of these are actually very good > ideas about _extra_ features that would be desirable in the long run for > linux, but given the ludicrous simplicity of renicing X I cannot fathom why > people keep promoting these alternatives. 
At the time of 2.6.0 coming out we > were desparately trying to get half decent interactivity within a reasonable > time frame to release 2.6.0 without rewiring the whole scheduler. So I > tweaked the crap out of the tunables that were already there[2]. X's needs are more complex than that (from my observations) in that the part of X that processes input doesn't use much CPU but the part that does output can be quite a heavy user of CPU (e.g. do a "ls -lR /" in an xterm and watch X chew up the CPU). At the same time, the part of X that processes input needs quick responsiveness as it's part of the interactive chain where this is less so for the output part. Where X comes unstuck in the current scheduler is that when the output part goes on one of its CPU storms it ceases to look like an interactive task and gets given lower priority. Ironically, this doesn't affect the output part of X but it does affect the input part and is manifest as crappy interactive response. One wonders whether modifying X so that it has two threads, one for output and one for input, that could be scheduled separately might help. I guess it would depend on whether there is sufficient independence between the two halves. Part of this issue is that giving X a high static priority runs the risk of the CPU hog output part disrupting scheduling of other important tasks. So don't give it too big a boost. > > So let's hear from the 3 people who generated the schedulers under the > spotlight. These are recent snippets and by no means the only time these > comments have been said. Without sounding too bold, we do know a thing or two > about scheduling. > > CFS: > On Thursday 19 April 2007 16:38, Ingo Molnar wrote: >> hmmmm. How about the following then: default to nice -10 for all >> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ >> special: root already has disk space reserved to it, root has special >> memory allocation allowances, etc.
I dont see a reason why we couldnt by >> default make all root tasks have nice -10. This would be instantly loved >> by sysadmins i suspect ;-) It's worth noting that the -10 mentioned is roughly equivalent (in the old scheduler) to restoring interactive task status to X in those cases where it loses it due to a CPU storm in its output part. >> >> (distros that go the extra mile of making Xorg run under non-root could >> also go another extra one foot to renice that X server to -10.) > > Nicksched: > On Wednesday 18 April 2007 15:00, Nick Piggin wrote: >> What's wrong with allowing X to get more than it's fair share of CPU >> time by "fiddling with nice levels"? That's what they're there for. > > and > > Staircase-Deadline: > On Thursday 19 April 2007 09:59, Con Kolivas wrote: >> Remember to renice X to -10 for nicest desktop behaviour :) I'd like to add the EBS scheduler (posted by Aurema Pty Ltd a couple of years back) to this list as it also recommended running X at nice -5 to -10. Also some of the "interactive bonus" mechanisms in my SPA schedulers could be removed if X was reniced. In fact, with a reniced X the spa_svr (server oriented scheduler which attempts to minimise the time tasks spend on the queue waiting for CPU access and which doesn't have interactive bonuses) might be usable on a work station. > > > [1]The one caveat I can think of is that when you share X sessions across > multiple users -with a fair cpu scheduler-, having them all nice 0 also makes > the distribution of cpu across the multiple users very even and smooth, > without the expense of burning away the other person's cpu time they'd like > for compute intensive non gui things. If you make a scheduler that always > favours X this becomes impossible. I've had enough users offlist ask me to > help them set up multiuser environments just like this with my schedulers > because they were unable to do it with mainline, even with SCHED_BATCH, short > of nicing everything +19. 
This makes the argument for not favouring X within > the scheduler with tweaks even stronger. > > [2] Nick was promoting renicing X with his Nicksched alternative at the time > of 2.6.0 and while we were not violently opposed to renicing X, Nicksched was > still very new on the scene and didn't have the luxury of extended testing > and reiteration in time for 2.6 and he put the project on hold for some time > after that. The correctness of his views on renicing certainly have become > more obvious over time. > > > So yes go ahead and think up great ideas for other ways of metering out cpu > bandwidth for different purposes, but for X, given the absurd simplicity of > renicing, why keep fighting it? Again I reiterate that most users of SD have > not found the need to renice X anyway except if they stick to old habits of > make -j4 on uniprocessor and the like, and I expect that those on CFS and > Nicksched would also have similar experiences. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 12:42 ` Peter Williams @ 2007-04-19 13:20 ` Peter Williams 2007-04-19 14:22 ` Lee Revell 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-19 13:20 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Con Kolivas wrote: >> Ok, there are 3 known schedulers currently being "promoted" as solid >> replacements for the mainline scheduler which address most of the >> issues with mainline (and about 10 other ones not currently being >> promoted). The main way they do this is through attempting to maintain >> solid fairness. There is enough evidence mounting now from the >> numerous test cases fixed by much fairer designs that this is the way >> forward for a general purpose cpu scheduler which is what linux needs. >> Interactivity of just about everything that needs low latency (ie >> audio and video players) are easily managed by maintaining low latency >> between wakeups and scheduling of all these low cpu users. > > On a "fair" scheduler these will all get high priority (and good > response) because their CPU bandwidth usage will be much smaller than > their entitlement and the scheduler will be trying to help them "catch > up". So (as you say) they shouldn't be a problem. > >> The one fly in the ointment for linux remains X. I am still, to this >> moment, completely and utterly stunned at why everyone is trying to >> find increasingly complex unique ways to manage X when all it needs is >> more cpu[1]. Now most of these are actually very good ideas about >> _extra_ features that would be desirable in the long run for linux, >> but given the ludicrous simplicity of renicing X I cannot fathom why >> people keep promoting these alternatives. 
At the time of 2.6.0 coming >> out we were desparately trying to get half decent interactivity within >> a reasonable time frame to release 2.6.0 without rewiring the whole >> scheduler. So I tweaked the crap out of the tunables that were already >> there[2]. > > X's needs are more complex than that (from my observations) in that the > part of X that processes input doesn't use much CPU but the part that > does output can be quite a heavy user of CPU (e.g. do a "ls -lR /" in an > xterm and watch X chew up the CPU). At the same time, the part of X > that processes input needs quick responsiveness as it's part of the > interactive chain where this is less so for the output part. > > Where X comes unstuck in the current scheduler is that when the output > part goes on one of its CPU storms it ceases to look like an interactive > task and gets given lower priority. Ironically, this doesn't effect the > output part of X but it does effect the input part and is manifest as > crappy interactive response. One wonders whether modifying X so that it > has two threads: one for output and one for input; that could be > scheduled separately might help. I guess it would depend on whether > there is insufficient independence between the two halves. I forgot to make my point here and that was that if X could be split in two neither half would need to be reniced. As a very low CPU bandwidth user the input half would get along just fine like the other interactive tasks that you mention. And the output part isn't adversely affected by not having a boost so it would get along just fine as well and you don't want it having a boost when it's in a CPU storm anyway. Of course, this would not work if the interdependence between the two halves is such that the equivalent of priority inversion occurs between the two threads. However, that might be solved by making the division between the two halves on a dimension other than the input/output one.
Peter PS I think that the tasks most likely to be adversely affected by X's CPU storms (enough to annoy the user) are audio streamers so when you're doing tests to determine the best nice value for X I suggest that would be a good criterion. Video streamers are also susceptible but glitches in video don't seem to annoy users as much as audio ones. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 13:20 ` Peter Williams @ 2007-04-19 14:22 ` Lee Revell 2007-04-20 1:32 ` Michael K. Edwards 0 siblings, 1 reply; 713+ messages in thread From: Lee Revell @ 2007-04-19 14:22 UTC (permalink / raw) To: Peter Williams Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > PS I think that the tasks most likely to be adversely effected by X's > CPU storms (enough to annoy the user) are audio streamers so when you're > doing tests to determine the best nice value for X I suggest that would > be a good criterion. Video streamers are also susceptible but glitches > in video don't seem to annoy users as much as audio ones. IMHO audio streamers should use SCHED_FIFO thread for time critical work. I think it's insane to expect the scheduler to figure out that these processes need low latency when they can just be explicit about it. "Professional" audio software does it already, on Linux as well as other OS... Lee ^ permalink raw reply [flat|nested] 713+ messages in thread
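Being "explicit about it", as suggested above, usually comes down to one system call. The following is a minimal sketch in C, assuming Linux with glibc; the function name make_self_fifo is hypothetical. The call fails with EPERM unless the caller has CAP_SYS_NICE or a suitable RLIMIT_RTPRIO, and a real audio application would typically also lock its memory and keep the SCHED_FIFO work short.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Put the caller under SCHED_FIFO at the requested static priority.
 * The priority is clamped to the range the system reports for
 * SCHED_FIFO. Returns 0 on success or a negative errno value
 * (typically -EPERM when the caller lacks the privilege).
 */
int make_self_fifo(int priority)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);
    if (lo == -1 || hi == -1)
        return -errno;

    if (priority < lo)
        priority = lo;
    if (priority > hi)
        priority = hi;
    assert(priority >= lo && priority <= hi);

    struct sched_param param = { .sched_priority = priority };
    /* A pid of 0 means "the caller". */
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return -errno;
    return 0;
}
```

In a multi-threaded player only the thread feeding the device would normally be switched, for example with pthread_setschedparam(), leaving decoding and UI threads under the default policy.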
* Re: Renice X for cpu schedulers 2007-04-19 14:22 ` Lee Revell @ 2007-04-20 1:32 ` Michael K. Edwards 2007-04-20 5:25 ` Bill Huey 0 siblings, 1 reply; 713+ messages in thread From: Michael K. Edwards @ 2007-04-20 1:32 UTC (permalink / raw) To: Lee Revell Cc: Peter Williams, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, Lee Revell <rlrevell@joe-job.com> wrote: > IMHO audio streamers should use SCHED_FIFO thread for time critical > work. I think it's insane to expect the scheduler to figure out that > these processes need low latency when they can just be explicit about > it. "Professional" audio software does it already, on Linux as well > as other OS... It is certainly true that SCHED_FIFO is currently necessary in the layers of an audio application lying closest to the hardware, if you don't want to throw a monstrous hardware ring buffer at the problem. See the alsa-devel archives for a patch to aplay (sched_setscheduler plus some cleanups) that converts it from "unsafe at any speed" (on a non-RT kernel) to a rock-solid 18ms round trip from PCM in to PCM out. (The hardware and driver aren't terribly exotic for an SoC, and the measurement was done with aplay -C | aplay -P -- on a not-particularly-tuned CONFIG_PREEMPT kernel with a 12ms+ peak scheduling latency according to cyclictest. A similar test via /dev/dsp, done through a slightly modified OSS emulation layer to the same driver, measures at 40ms and is probably tuned too conservatively.) Note that SCHED_FIFO may be less necessary on an -rt kernel, but I haven't had that option on the embedded hardware I've been working with lately. Ingo, please please pretty please pick a -stable branch one of these days and provide a git repo with -rt integrated against that branch. 
Then I could port our chip support to it -- all of which will be GPLed after the impending code review -- after which I might have a prayer of strong-arming our chip vendor into porting their WiFi driver onto -rt. It's really a much more interesting scheduler use case than make -j200 under X, because it's a best-effort SCHED_BATCH-ish load that wants to be temporally clustered for power management reasons. (Believe it or not, a stable -rt branch with a clock-scaling-aware scheduler is the one thing that might lead to this major WiFi vendor's GPLing their driver core. They're starting to see the light on the biz dev side, and the nature of the devices their chip will go in makes them somewhat less concerned about the regulatory fig leaf aspect of a closed-source driver; but they would have to port off of the third-party real-time executive embedded within the driver, and mainline's task and timer granularity won't cut it. I can't even get more detail about _why_ it won't cut it unless there's some remotely supportable -rt base they could port to.) But I think SCHED_FIFO on a chain of tasks is fundamentally not the right way to handle low audio latency. The object with a low latency requirement isn't the task, it's the device. When it's starting to get urgent to deliver more data to the device, the task that it's waiting on should slide up the urgency scale; and if it's waiting on something else, that something else should slide up the scale; and so forth. Similarly, responding to user input is urgent; so when user input is available (by whatever mechanism), the task that's waiting for it should slide up the urgency scale, etc. In practice, you probably don't want to burden desktop Linux with priority inheritance where you don't have to. Priority queues with algorithmically efficient decrease-key operations (Fibonacci heaps and their ilk) are complicated to implement and have correspondingly high constant factors. 
(However, a sufficiently clever heuristic for assigning quasi-static task priorities would usually short-circuit the priority cascade; if you can keep N small in the tasks-with-unpredictable-priority queue, you can probably use a simpler flavor with O(log N) decrease-key. Ask someone who knows more about data structures than I do.) More importantly, non-real-time application coders aren't very smart about grouping data structure accesses on one side or the other of a system call that is likely to release a lock and let something else run, flushing application data out of cache. (Kernel coders aren't always smart about this either; see LKML threads a few weeks ago about racy, and cache-stall-prone, f_pos handling in VFS.) So switching tasks immediately on lock release is usually the wrong thing to do if letting the task run a little longer would allow it to reach a point where it has to block anyway. Anyway, I already described the urgency-driven strategy to the extent that I've thought it out, elsewhere in this thread. I only held this draft back because I wanted to double-check my latency measurements. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
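One "simpler flavor with O(log N) decrease-key" is an ordinary binary heap that also remembers where each entry currently sits, so raising a task's urgency needs no search. The sketch below is an illustration only, not code from any scheduler mentioned in this thread: the struct and function names are hypothetical, task ids are assumed to be small integers below HEAP_CAP, and a smaller key means "more urgent".

```c
#include <assert.h>

#define HEAP_CAP 64              /* assumed upper bound on task ids */

struct urgency_heap {
    int key[HEAP_CAP];           /* key[id]: smaller means more urgent */
    int pos[HEAP_CAP];           /* pos[id]: index in slot[], -1 if absent */
    int slot[HEAP_CAP];          /* slot[i]: task id stored at heap index i */
    int len;
};

void heap_init(struct urgency_heap *h)
{
    h->len = 0;
    for (int i = 0; i < HEAP_CAP; i++)
        h->pos[i] = -1;
}

/* Store id at heap index i and keep the position index in sync. */
static void heap_place(struct urgency_heap *h, int i, int id)
{
    h->slot[i] = id;
    h->pos[id] = i;
}

static void sift_up(struct urgency_heap *h, int i)
{
    int id = h->slot[i];
    while (i > 0) {
        int parent = (i - 1) / 2;
        if (h->key[h->slot[parent]] <= h->key[id])
            break;
        heap_place(h, i, h->slot[parent]);
        i = parent;
    }
    heap_place(h, i, id);
}

static void sift_down(struct urgency_heap *h, int i)
{
    int id = h->slot[i];
    for (;;) {
        int child = 2 * i + 1;
        if (child >= h->len)
            break;
        if (child + 1 < h->len &&
            h->key[h->slot[child + 1]] < h->key[h->slot[child]])
            child++;             /* pick the more urgent child */
        if (h->key[id] <= h->key[h->slot[child]])
            break;
        heap_place(h, i, h->slot[child]);
        i = child;
    }
    heap_place(h, i, id);
}

void heap_insert(struct urgency_heap *h, int id, int key)
{
    assert(id >= 0 && id < HEAP_CAP && h->pos[id] == -1);
    h->key[id] = key;
    heap_place(h, h->len, id);
    h->len++;
    sift_up(h, h->len - 1);
}

/* Make a queued task more urgent in O(log N); pos[] replaces the search. */
void heap_decrease_key(struct urgency_heap *h, int id, int new_key)
{
    assert(id >= 0 && id < HEAP_CAP && h->pos[id] != -1);
    assert(new_key <= h->key[id]);
    h->key[id] = new_key;
    sift_up(h, h->pos[id]);
}

/* Remove and return the most urgent task id, or -1 if the heap is empty. */
int heap_pop_min(struct urgency_heap *h)
{
    if (h->len == 0)
        return -1;
    int top = h->slot[0];
    h->pos[top] = -1;
    h->len--;
    if (h->len > 0) {
        heap_place(h, 0, h->slot[h->len]);
        sift_down(h, 0);
    }
    return top;
}
```

The extra pos[] array is the whole trick: it costs one integer per task and one store per move, which keeps the constant factors far below those of a Fibonacci heap while still bounding reprioritization at O(log N).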
* Re: Renice X for cpu schedulers 2007-04-20 1:32 ` Michael K. Edwards @ 2007-04-20 5:25 ` Bill Huey 2007-04-20 7:12 ` Michael K. Edwards 0 siblings, 1 reply; 713+ messages in thread From: Bill Huey @ 2007-04-20 5:25 UTC (permalink / raw) To: Michael K. Edwards Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, linux-kernel, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Thu, Apr 19, 2007 at 06:32:15PM -0700, Michael K. Edwards wrote: > But I think SCHED_FIFO on a chain of tasks is fundamentally not the > right way to handle low audio latency. The object with a low latency > requirement isn't the task, it's the device. When it's starting to > get urgent to deliver more data to the device, the task that it's > waiting on should slide up the urgency scale; and if it's waiting on > something else, that something else should slide up the scale; and so > forth. Similarly, responding to user input is urgent; so when user > input is available (by whatever mechanism), the task that's waiting > for it should slide up the urgency scale, etc. DSP operations like, particularly with digital synthesis, tend to max the CPU doing vector operations on as many processors as it can get a hold of. In a live performance critical application, it's important to be able to deliver a protected amount of CPU to a thread doing that work as well as response to external input such as controllers, etc... > In practice, you probably don't want to burden desktop Linux with > priority inheritance where you don't have to. Priority queues with > algorithmically efficient decrease-key operations (Fibonacci heaps and > their ilk) are complicated to implement and have correspondingly high > constant factors. 
(However, a sufficiently clever heuristic for > assigning quasi-static task priorities would usually short-circuit the > priority cascade; if you can keep N small in the > tasks-with-unpredictable-priority queue, you can probably use a > simpler flavor with O(log N) decrease-key. Ask someone who knows more > about data structures than I do.) These are app issues and not really something that's mutable in the kernel per se with regard to the -rt patch. > More importantly, non-real-time application coders aren't very smart > about grouping data structure accesses on one side or the other of a > system call that is likely to release a lock and let something else > run, flushing application data out of cache. (Kernel coders aren't > always smart about this either; see LKML threads a few weeks ago about > racy, and cache-stall-prone, f_pos handling in VFS.) So switching > tasks immediately on lock release is usually the wrong thing to do if > letting the task run a little longer would allow it to reach a point > where it has to block anyway. I have Solaris-style adaptive locks in my tree with my lockstat patch under -rt. I've also modified my lockstat patch to track readers correctly now with rwsem and the like to see where the single reader limitation in the rtmutex blows it. So far I've seen less than 10 percent of in-kernel contention events actually worth spinning on and the rest of the stats imply that the mutex owner in question is either preempted or blocked on something else. I've been trying to get folks to try this on a larger machine than my 2x AMD64 box so that there is more data regarding Linux contention and overscheduling in -rt. > Anyway, I already described the urgency-driven strategy to the extent > that I've thought it out, elsewhere in this thread. I only held this > draft back because I wanted to double-check my latency measurements. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-20 5:25 ` Bill Huey @ 2007-04-20 7:12 ` Michael K. Edwards 2007-04-20 8:21 ` Bill Huey 0 siblings, 1 reply; 713+ messages in thread From: Michael K. Edwards @ 2007-04-20 7:12 UTC (permalink / raw) To: hui Bill Huey Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > DSP operations like, particularly with digital synthesis, tend to max > the CPU doing vector operations on as many processors as it can get > a hold of. In a live performance critical application, it's important > to be able to deliver a protected amount of CPU to a thread doing that > work as well as response to external input such as controllers, etc... Actual fractional CPU reservation is a bit different, and is probably best handled with "container"-type infrastructure (not quite virtualization, but not quite scheduling classes either). SGI pioneered this (in "open systems" space -- IBM probably had it first, as usual) with GRIO in XFS. (That was I/O throughput reservation of course, not "CPU bandwidth" -- but IIRC IRIX had CPU reservation too). There's a more general class of techniques in which it's worth spending idle cycles speculating along paths that might or might not be taken depending on unpredictable I/O; I'd be surprised if you couldn't approximate most of the sane balancing strategies in this area within the "economic dispatch" scheduler model. (Good JIT bytecode engines more or less do this already if you let them, with a cap on JIT cache size serving as a crude CPU throttle.) > > In practice, you probably don't want to burden desktop Linux with > > priority inheritance where you don't have to. 
Priority queues with > > algorithmically efficient decrease-key operations (Fibonacci heaps and > > their ilk) are complicated to implement and have correspondingly high > > constant factors. (However, a sufficiently clever heuristic for > > assigning quasi-static task priorities would usually short-circuit the > > priority cascade; if you can keep N small in the > > tasks-with-unpredictable-priority queue, you can probably use a > > simpler flavor with O(log N) decrease-key. Ask someone who knows more > > about data structures than I do.) > > These are app issue and not really somethings that's mutable in kernel > per se with regard to the -rt patch. I don't know where the -rt patch enters in. But if you need agile reprioritization with a deep runnable queue, either under direct application control or as a side effect of priority inheritance or a related OS-enforced protocol, then you need a kernel-level data structure with a fancier interface than the classic insert/find/delete-min priority queue. From what I've read (this is not my area of expertise and I don't have Knuth handy), the relatively simple heap-based implementations of priority queues can't reprioritize an entry any more quickly than find+delete+insert, which pretty much rules them out as a basis for a scalable scheduler with priority inheritance (let alone PCP emulation). > I have Solaris style adaptive locks in my tree with my lockstat patch > under -rt. I've also modified my lockstat patch to track readers > correctly now with rwsem and the like to see where the single reader > limitation in the rtmutex blows it. Ooh, that's neat. The next time I can cook up an excuse to run a kernel that won't load this damn WiFi driver, I'll try it out. Some of the people I work with are real engineers and respect in-system instrumentation. 
> So far I've seen less than 10 percent of in-kernel contention events > actually worth spinning on and the rest of the stats imply that the > mutex owner in question is either preempted or blocked on something > else. That's a good thing; it implies that in-kernel algorithms don't take locks needlessly as a matter of cargo-cult habit. Attempting to take a lock (other than an uncontended futex, which is practically free) should almost always transfer control to the thread that has the power to deliver the information (or the free slot) that you're looking for -- or in the case of an external data source/sink, should send you into low-power mode until time and tide give you something new to do. Think of it as a just-in-time inventory system; if you keep too much product in stock (or free warehouse space), you're wasting space and harming responsiveness to a shift in demand. Once in a while you have to play Sokoban in order to service a request promptly; that's exactly the case that priority inheritance is meant to help with. The fiddly part, on a non-real-time-no-matter-what-the-label-says system with an opaque cache architecture and mysterious hidden costs of context switching, is to minimize the lossage resulting from brutal timer- or priority-inheritance-driven preemption. Given the way people code these days -- OK, it was probably just as bad back in the day -- the only thing worse than letting the axe fall at random is to steal the CPU away the moment a contended lock is released, because the next 20 lines of code probably poke one last time at all the data structures the task had in cache right before entering the critical section. That doesn't hurt so bad on RTOS-friendly hardware -- an MMU-less system with either zero or near-infinite cache -- but it's got to make this year's Intel/AMD/whatever kotatsu stall left and right when that task gets rescheduled. 
> I've been trying to get folks to try this on a larger machine than my > 2x AMD64 box so that I there is more data regarding Linux contention > and overschedulling in -rt. Dave Miller, maybe? He seems to be one of the few people around here with the skills and the institutional motivation to push for scalability under the typical half-assed middle-tier application workload. Which, make no mistake, stands to benefit a lot more from a properly engineered SCHED_OTHER scheduler than any "real-time" media gadget that sells by the forklift-load at Fry's. (NUMA boxen might also benefit, but AFAICT their target applications are different enough not to count.) And if anyone from Cavium is lurking, now would be a real good time to show up and shoulder some of the load. Heck, even Intel ought to pitch in -- the Yonah team may have saved your ass for now, but you'll be playing the 32-thread throughput-per-erg stakes soon enough. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-20 7:12 ` Michael K. Edwards @ 2007-04-20 8:21 ` Bill Huey 0 siblings, 0 replies; 713+ messages in thread From: Bill Huey @ 2007-04-20 8:21 UTC (permalink / raw) To: Michael K. Edwards Cc: Lee Revell, Peter Williams, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Mike Galbraith, ck list, linux-kernel, Arjan van de Ven, Thomas Gleixner, Steven Rostedt, Bill Huey (hui) On Fri, Apr 20, 2007 at 12:12:29AM -0700, Michael K. Edwards wrote: > Actual fractional CPU reservation is a bit different, and is probably > best handled with "container"-type infrastructure (not quite > virtualization, but not quite scheduling classes either). SGI > pioneered this (in "open systems" space -- IBM probably had it first, > as usual) with GRIO in XFS. (That was I/O throughput reservation of I'm very aware of this, having grown up on those systems and seen what 30k USD of hardware can do for you with the right kernel facilities. It would be a mind blower to get OpenGL and friends back to that level of performance with regard to React/Pro's rt abilities; frame drops would just be gone and we'd own gaming. No joke. We have a number of former SGI XFS engineers here at NetApp and I should ask them about the GRIO implementation. > course, not "CPU bandwidth" -- but IIRC IRIX had CPU reservation too). > There's a more general class of techniques in which it's worth > spending idle cycles speculating along paths that might or might not > be taken depending on unpredictable I/O; I'd be surprised if you > couldn't approximate most of the sane balancing strategies in this > area within the "economic dispatch" scheduler model. (Good JIT What is that? Never heard of it before. > I don't know where the -rt patch enters in.
But if you need agile > reprioritization with a deep runnable queue, either under direct > application control or as a side effect of priority inheritance or a > related OS-enforced protocol, then you need a kernel-level data > structure with a fancier interface than the classic > insert/find/delete-min priority queue. From what I've read (this is > not my area of expertise and I don't have Knuth handy), the relatively > simple heap-based implementations of priority queues can't > reprioritize an entry any more quickly than find+delete+insert, which > pretty much rules them out as a basis for a scalable scheduler with > priority inheritance (let alone PCP emulation). The -rt patch has turnstile-esque infrastructure that's stack allocated. Linux's lock hierarchy is relatively shallow (compensated with a heavy use of per-CPU methods and RCU-ified algorithms in place of rwlocks) so I've encountered nothing close to this that would demand such an overly sophisticated mechanism. I'm aware of PCP and preemption thresholds. I created the lockstat infrastructure as a means of precisely measuring contention in -rt in anticipation of experimenting with these techniques. I mention -rt because it's the most likely place to encounter what you're talking about, not an app. > >I have Solaris style adaptive locks in my tree with my lockstat patch > >under -rt. I've also modified my lockstat patch to track readers ... > Ooh, that's neat. The next time I can cook up an excuse to run a > kernel that won't load this damn WiFi driver, I'll try it out. Some > of the people I work with are real engineers and respect in-system > instrumentation. It's not publicly released yet since I'm still stuck in .20-rc6 land and the soft lockup detector triggers. I need to forward port it and my lockstat changes to the most recent -rt patch. I've been stalled on a revision control problem that I'm trying to solve with monotone for at least a month (of my own spare time).
> That's a good thing; it implies that in-kernel algorithms don't take > locks needlessly as a matter of cargo-cult habit. Attempting to take The jury is still out on this until I can record what state the rtmutex owner is in. No further conclusion can be made until then. I think this is a very interesting pursuit/investigation. > a lock (other than an uncontended futex, which is practically free) > should almost always transfer control to the thread that has the power > to deliver the information (or the free slot) that you're looking for > -- or in the case of an external data source/sink, should send you > into low-power mode until time and tide give you something new to do. > Think of it as a just-in-time inventory system; if you keep too much > product in stock (or free warehouse space), you're wasting space and > harming responsiveness to a shift in demand. Once in a while you have > to play Sokoban in order to service a request promptly; that's exactly > the case that priority inheritance is meant to help with. What did you mean by this? Victor Yodaiken's stuff? > The fiddly part, on a non-real-time-no-matter-what-the-label-says > system with an opaque cache architecture and mysterious hidden costs > of context switching, is to minimize the lossage resulting from brutal > timer- or priority-inheritance-driven preemption. Given the way > people code these days -- OK, it was probably just as bad back in the > day -- the only thing worse than letting the axe fall at random is to > steal the CPU away the moment a contended lock is released, because My adaptive spin stuff in front of an rtmutex is designed to complement Steve Rostedt's owner stealing code also in that path and prevent this from happening. I record stealing events as well. > the next 20 lines of code probably poke one last time at all the data > structures the task had in cache right before entering the critical > section.
That doesn't hurt so bad on RTOS-friendly hardware -- an > MMU-less system with either zero or near-infinite cache -- but it's > got to make this year's Intel/AMD/whatever kotatsu stall left and > right when that task gets rescheduled. Caches and TLBs are a bitch. > >I've been trying to get folks to try this on a larger machine than my > >2x AMD64 box so that I there is more data regarding Linux contention > >and overschedulling in -rt. > > Dave Miller, maybe? He seems to be one of the few people around here Don't know. I do know that somebody is going to try -rt on large hardware because they got some crazy app that needs both tons of CPU and rt abilities, like IRIX. I wouldn't mind the help and access to an 8x machine. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas 2007-04-19 12:42 ` Peter Williams @ 2007-04-19 13:17 ` Mark Lord 2007-04-19 15:10 ` Con Kolivas 2007-04-20 3:57 ` Nick Piggin 2007-04-19 18:16 ` Gene Heskett 2007-04-19 19:26 ` Ray Lee 3 siblings, 2 replies; 713+ messages in thread From: Mark Lord @ 2007-04-19 13:17 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Con Kolivas wrote: s go ahead and think up great ideas for other ways of metering out cpu > bandwidth for different purposes, but for X, given the absurd simplicity of > renicing, why keep fighting it? Again I reiterate that most users of SD have > not found the need to renice X anyway except if they stick to old habits of > make -j4 on uniprocessor and the like, and I expect that those on CFS and > Nicksched would also have similar experiences. Just plain "make" (no -j2 or -j9999) is enough to kill interactivity on my 2GHz P-M single-core non-HT machine with SD. But with the very first posted version of CFS by Ingo, I can do "make -j2" no problem and still have a nicely interactive desktop. -ml ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 13:17 ` Mark Lord @ 2007-04-19 15:10 ` Con Kolivas 2007-04-19 16:15 ` Mark Lord 2007-04-20 3:57 ` Nick Piggin 1 sibling, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-19 15:10 UTC (permalink / raw) To: Mark Lord Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007 23:17, Mark Lord wrote: > Con Kolivas wrote: > s go ahead and think up great ideas for other ways of metering out cpu > > > bandwidth for different purposes, but for X, given the absurd simplicity > > of renicing, why keep fighting it? Again I reiterate that most users of > > SD have not found the need to renice X anyway except if they stick to old > > habits of make -j4 on uniprocessor and the like, and I expect that those > > on CFS and Nicksched would also have similar experiences. > > Just plain "make" (no -j2 or -j9999) is enough to kill interactivity > on my 2GHz P-M single-core non-HT machine with SD. > > But with the very first posted version of CFS by Ingo, > I can do "make -j2" no problem and still have a nicely interactive destop. Cool. Then there's clearly a bug with SD that manifests on your machine as it should not have that effect at all (and doesn't on other people's machines). I suggest trying the latest version which fixes some bugs. Thanks. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 15:10 ` Con Kolivas @ 2007-04-19 16:15 ` Mark Lord 2007-04-19 18:21 ` Gene Heskett ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Mark Lord @ 2007-04-19 16:15 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Con Kolivas wrote: > On Thursday 19 April 2007 23:17, Mark Lord wrote: >> Con Kolivas wrote: >> s go ahead and think up great ideas for other ways of metering out cpu >> >>> bandwidth for different purposes, but for X, given the absurd simplicity >>> of renicing, why keep fighting it? Again I reiterate that most users of >>> SD have not found the need to renice X anyway except if they stick to old >>> habits of make -j4 on uniprocessor and the like, and I expect that those >>> on CFS and Nicksched would also have similar experiences. >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity >> on my 2GHz P-M single-core non-HT machine with SD. >> >> But with the very first posted version of CFS by Ingo, >> I can do "make -j2" no problem and still have a nicely interactive destop. > > Cool. Then there's clearly a bug with SD that manifests on your machine as it > should not have that effect at all (and doesn't on other people's machines). > I suggest trying the latest version which fixes some bugs. SD just doesn't do nearly as good as the stock scheduler, or CFS, here. I'm quite likely one of the few single-CPU/non-HT testers of this stuff. If it should ever get more widely used I think we'd hear a lot more complaints. Cheers ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 16:15 ` Mark Lord @ 2007-04-19 18:21 ` Gene Heskett 2007-04-20 0:17 ` Con Kolivas 2007-04-20 1:17 ` Ed Tomlinson 2 siblings, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-19 18:21 UTC (permalink / raw) To: Mark Lord Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Mark Lord wrote: >Con Kolivas wrote: >> On Thursday 19 April 2007 23:17, Mark Lord wrote: >>> Con Kolivas wrote: >>> s go ahead and think up great ideas for other ways of metering out cpu >>> >>>> bandwidth for different purposes, but for X, given the absurd simplicity >>>> of renicing, why keep fighting it? Again I reiterate that most users of >>>> SD have not found the need to renice X anyway except if they stick to >>>> old habits of make -j4 on uniprocessor and the like, and I expect that >>>> those on CFS and Nicksched would also have similar experiences. >>> >>> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity >>> on my 2GHz P-M single-core non-HT machine with SD. >>> >>> But with the very first posted version of CFS by Ingo, >>> I can do "make -j2" no problem and still have a nicely interactive >>> destop. >> >> Cool. Then there's clearly a bug with SD that manifests on your machine as >> it should not have that effect at all (and doesn't on other people's >> machines). I suggest trying the latest version which fixes some bugs. > >SD just doesn't do nearly as good as the stock scheduler, or CFS, here. I found the early SD's much friendlier here, but I also think that at that point I was comparing SD to stock 2.6.21-rc5 and 6, and to say that it sucked would be a slight understatement. >I'm quite likely one of the few single-CPU/non-HT testers of this stuff. 
>If it should ever get more widely used I think we'd hear a lot more > complaints. I'm in that row of seats too Mark. Someday I have to build a new box, that's all there is to it... -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Lots of folks confuse bad management with destiny. -- Frank Hubbard ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 16:15 ` Mark Lord 2007-04-19 18:21 ` Gene Heskett @ 2007-04-20 0:17 ` Con Kolivas 2007-04-20 1:17 ` Ed Tomlinson 2 siblings, 0 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-20 0:17 UTC (permalink / raw) To: Mark Lord Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Friday 20 April 2007 02:15, Mark Lord wrote: > Con Kolivas wrote: > > On Thursday 19 April 2007 23:17, Mark Lord wrote: > >> Con Kolivas wrote: > >> s go ahead and think up great ideas for other ways of metering out cpu > >> > >>> bandwidth for different purposes, but for X, given the absurd > >>> simplicity of renicing, why keep fighting it? Again I reiterate that > >>> most users of SD have not found the need to renice X anyway except if > >>> they stick to old habits of make -j4 on uniprocessor and the like, and > >>> I expect that those on CFS and Nicksched would also have similar > >>> experiences. > >> > >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity > >> on my 2GHz P-M single-core non-HT machine with SD. > >> > >> But with the very first posted version of CFS by Ingo, > >> I can do "make -j2" no problem and still have a nicely interactive > >> destop. > > > > Cool. Then there's clearly a bug with SD that manifests on your machine > > as it should not have that effect at all (and doesn't on other people's > > machines). I suggest trying the latest version which fixes some bugs. > > SD just doesn't do nearly as good as the stock scheduler, or CFS, here. > > I'm quite likely one of the few single-CPU/non-HT testers of this stuff. > If it should ever get more widely used I think we'd hear a lot more > complaints. You are not really one of the few. A lot of my own work is done on a single core pentium M 1.7Ghz laptop. 
I am not endowed with truckloads of hardware like all the paid developers are. I recall extreme frustration myself when a developer a few years ago (around 2002) said he couldn't reproduce poor behaviour on his 4GB ram 4 x Xeon machine. Even today if I add up every machine I have in my house and work at my disposal it doesn't amount to that many cpus and that much ram. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 16:15 ` Mark Lord 2007-04-19 18:21 ` Gene Heskett 2007-04-20 0:17 ` Con Kolivas @ 2007-04-20 1:17 ` Ed Tomlinson 2007-04-20 1:27 ` Linus Torvalds 2 siblings, 1 reply; 713+ messages in thread From: Ed Tomlinson @ 2007-04-20 1:17 UTC (permalink / raw) To: Mark Lord Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007 12:15, Mark Lord wrote: > Con Kolivas wrote: > > On Thursday 19 April 2007 23:17, Mark Lord wrote: > >> Con Kolivas wrote: > >>> So yes go ahead and think up great ideas for other ways of metering out cpu > >>> bandwidth for different purposes, but for X, given the absurd simplicity > >>> of renicing, why keep fighting it? Again I reiterate that most users of > >>> SD have not found the need to renice X anyway except if they stick to old > >>> habits of make -j4 on uniprocessor and the like, and I expect that those > >>> on CFS and Nicksched would also have similar experiences. > >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity > >> on my 2GHz P-M single-core non-HT machine with SD. > >> > >> But with the very first posted version of CFS by Ingo, > >> I can do "make -j2" no problem and still have a nicely interactive desktop. > > > > Cool. Then there's clearly a bug with SD that manifests on your machine as it > > should not have that effect at all (and doesn't on other people's machines). > > I suggest trying the latest version which fixes some bugs. > > SD just doesn't do nearly as good as the stock scheduler, or CFS, here. > > I'm quite likely one of the few single-CPU/non-HT testers of this stuff. > If it should ever get more widely used I think we'd hear a lot more complaints. amd64 UP here. SD with several makes running works just fine.
Ed Tomlinson ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-20 1:17 ` Ed Tomlinson @ 2007-04-20 1:27 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-20 1:27 UTC (permalink / raw) To: Ed Tomlinson Cc: Mark Lord, Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, 19 Apr 2007, Ed Tomlinson wrote: > > > > SD just doesn't do nearly as good as the stock scheduler, or CFS, here. > > > > I'm quite likely one of the few single-CPU/non-HT testers of this stuff. > > If it should ever get more widely used I think we'd hear a lot more complaints. > > amd64 UP here. SD with several makes running works just fine. The thing is, it probably depends *heavily* on just how much work the X server ends up doing. Fast video hardware? The X server doesn't need to busy-wait much. Not a lot of eye-candy? The X server is likely fast enough even with a slower card that it still gets sufficient CPU time and isn't getting dinged by any balancing. DRI vs non-DRI? Which window manager (maybe some of the user-visible lags come from there..) etc etc. Anyway, I'd ask people to look a bit at the current *regressions* instead of spending all their time on something that won't even be merged before 2.6.21 is released, and we thus have some more pressing issues. Please? Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 13:17 ` Mark Lord 2007-04-19 15:10 ` Con Kolivas @ 2007-04-20 3:57 ` Nick Piggin 2007-04-21 14:55 ` Mark Lord 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-20 3:57 UTC (permalink / raw) To: Mark Lord Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 09:17:25AM -0400, Mark Lord wrote: > Con Kolivas wrote: > >So yes go ahead and think up great ideas for other ways of metering out cpu > >bandwidth for different purposes, but for X, given the absurd simplicity > >of renicing, why keep fighting it? Again I reiterate that most users of SD > >have not found the need to renice X anyway except if they stick to old > >habits of make -j4 on uniprocessor and the like, and I expect that those > >on CFS and Nicksched would also have similar experiences. > > Just plain "make" (no -j2 or -j9999) is enough to kill interactivity > on my 2GHz P-M single-core non-HT machine with SD. Is this with or without X reniced? > But with the very first posted version of CFS by Ingo, > I can do "make -j2" no problem and still have a nicely interactive desktop. How well does cfs run if you have the granularity set to something like 30ms (30000000)? ^ permalink raw reply [flat|nested] 713+ messages in thread
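For anyone wanting to try the 30ms setting Nick asks about: the CFS announcement names /proc/sys/kernel/sched_granularity_ns as the tunable, and the value is in nanoseconds, so 30ms is 30000000. The following is only a rough sketch of that conversion. The helper names ms_to_ns and set_granularity_ms are made up for illustration, and the write would presumably need root and a kernel carrying the CFS patch.

```python
from pathlib import Path

# Tunable named in the CFS announcement; it only exists on a CFS-patched kernel.
GRANULARITY_PATH = Path("/proc/sys/kernel/sched_granularity_ns")

NS_PER_MS = 1_000_000


def ms_to_ns(ms: int) -> int:
    """Convert whole milliseconds to nanoseconds."""
    return ms * NS_PER_MS


def set_granularity_ms(ms: int, path: Path = GRANULARITY_PATH) -> int:
    """Write the granularity (given in ms) to the tunable and return the ns value.

    Hypothetical helper: assumes the file exists and the caller may write it.
    """
    ns = ms_to_ns(ms)
    path.write_text(f"{ns}\n")
    return ns


if __name__ == "__main__":
    # Only the conversion is exercised here; nothing is written to /proc.
    print(ms_to_ns(30))
```

Calling set_granularity_ms(30) would attempt the actual write; the path is left as a parameter so the sketch can be pointed at a scratch file instead.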
* Re: Renice X for cpu schedulers 2007-04-20 3:57 ` Nick Piggin @ 2007-04-21 14:55 ` Mark Lord 2007-04-22 12:54 ` Mark Lord 0 siblings, 1 reply; 713+ messages in thread From: Mark Lord @ 2007-04-21 14:55 UTC (permalink / raw) To: Nick Piggin Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Thu, Apr 19, 2007 at 09:17:25AM -0400, Mark Lord wrote: >> Just plain "make" (no -j2 or -j9999) is enough to kill interactivity >> on my 2GHz P-M single-core non-HT machine with SD. > > Is this with or without X reniced? That was with no manual jiggling, everything the same as with stock kernels, except that stock kernels don't kill interactivity here. >> But with the very first posted version of CFS by Ingo, >> I can do "make -j2" no problem and still have a nicely interactive desktop. > > How well does cfs run if you have the granularity set to something > like 30ms (30000000)? Dunno, I've put this stuff aside for now until things settle down. With four schedulers, and lots of patches / revisions / tuning-knobs, there's just no way to keep up with it all here. Cheers ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-21 14:55 ` Mark Lord @ 2007-04-22 12:54 ` Mark Lord 2007-04-22 12:58 ` Con Kolivas 0 siblings, 1 reply; 713+ messages in thread From: Mark Lord @ 2007-04-22 12:54 UTC (permalink / raw) To: Nick Piggin Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Just to throw another possibly-overlooked variable into the mess: My system here is using the on-demand cpufreq policy governor. I wonder how that interacts with the various schedulers here? I suppose for the "make" kernel case, after a couple of seconds the cpufreq would hit max and stay there for the rest of the build, so it shouldn't really be a factor for (non-)interactivity during the build. Or should it? Cheers ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-22 12:54 ` Mark Lord @ 2007-04-22 12:58 ` Con Kolivas 0 siblings, 0 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-22 12:58 UTC (permalink / raw) To: Mark Lord Cc: Nick Piggin, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Sunday 22 April 2007 22:54, Mark Lord wrote: > Just to throw another possibly-overlooked variable into the mess: > > My system here is using the on-demand cpufreq policy governor. > I wonder how that interacts with the various schedulers here? > > I suppose for the "make" kernel case, after a couple of seconds > the cpufreq would hit max and stay there for the rest of the build, > so it shouldn't really be a factor for (non-)interactivity during the > build. > > Or should it? Short answer: shouldn't matter :) -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas 2007-04-19 12:42 ` Peter Williams 2007-04-19 13:17 ` Mark Lord @ 2007-04-19 18:16 ` Gene Heskett 2007-04-19 21:35 ` Michael K. Edwards 2007-04-19 22:47 ` Con Kolivas 2007-04-19 19:26 ` Ray Lee 3 siblings, 2 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-19 18:16 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Con Kolivas wrote: [and I snipped a good overview] >So yes go ahead and think up great ideas for other ways of metering out cpu >bandwidth for different purposes, but for X, given the absurd simplicity of >renicing, why keep fighting it? Again I reiterate that most users of SD have >not found the need to renice X anyway except if they stick to old habits of >make -j4 on uniprocessor and the like, and I expect that those on CFS and >Nicksched would also have similar experiences. FWIW folks, I have never touched x's niceness, its running at the default -1 for all of my so-called 'tests', and I have another set to be rebooted to right now. And yes, my kernel makeit script uses -j4 by default, and has used -j8 just for effects, which weren't all that different from what I expected in 'abusing' a UP system that way. The system DID remain usable, not snappy, but usable. Having tried re-nicing X a while back, and having the rest of the system suffer in quite obvious ways for even 1 + or - from its default felt pretty bad from this users perspective. It is my considered opinion (yeah I know, I'm just a leaf in the hurricane of this list) that if X has to be re-niced from the 1 point advantage its had for ages, then something is basicly wrong with the overall scheduling, cpu or i/o, or both in combination. FWIW I'm using cfq for i/o. 
-- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Moore's Constant: Everybody sets out to do something, and everybody does something, but no one does what he sets out to do. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 18:16 ` Gene Heskett @ 2007-04-19 21:35 ` Michael K. Edwards 2007-04-19 22:47 ` Con Kolivas 1 sibling, 0 replies; 713+ messages in thread From: Michael K. Edwards @ 2007-04-19 21:35 UTC (permalink / raw) To: Gene Heskett Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, Gene Heskett <gene.heskett@gmail.com> wrote: > Having tried re-nicing X a while back, and having the rest of the system > suffer in quite obvious ways for even 1 + or - from its default felt pretty > bad from this users perspective. > > It is my considered opinion (yeah I know, I'm just a leaf in the hurricane of > this list) that if X has to be re-niced from the 1 point advantage its had > for ages, then something is basicly wrong with the overall scheduling, cpu or > i/o, or both in combination. FWIW I'm using cfq for i/o. I think I just realized why the X server is such a problem. If it gets preempted when it's not actually selecting/polling over a set of fds that includes the input devices, the scheduler doesn't know that it's a good candidate for scheduling when data arrives on those devices. (That's all that any of these dynamic priority heuristics really seem to do -- weight the scheduler towards switching to conspicuously I/O bound tasks when they become runnable, without the forced preemption on lock release that would result from a true priority inheritance mechanism.) One way of looking at this is that "fairness-driven" scheduling is a poor man's priority ceiling protocol for I/O bound workloads, with the implicit priority of an fd or lock given by how desperately the reader side needs more data in order to accomplish anything. "Nice" on a task is sort of an indirect way of boosting or dropping the base priority of the fds it commonly waits on. 
I recognize this is a drastic oversimplification, and possibly even a misrepresentation of the design _intent_; but I think it's fairly accurate in terms of the design _effect_. The event-driven, non-threaded design of the X server makes it particularly vulnerable to "non-interactive behavior" penalties, which is appropriate to the extent that it's an output device having trouble keeping up with rendering -- in fact, that's exactly the throttling mechanism you need in order to exert back-pressure on the X client. (Trying to exert back-pressure over Linux's local domain sockets seems to be like pushing on a rope, but that's a different problem.) That same event-driven design would prioritize input events just fine -- except the scheduler won't wake the task in order to deliver them, because as far as it's concerned the X server is getting more than enough I/O to keep it busy. It's not only not blocked on the input device, it isn't even selecting on it at the moment that its timeslice expires -- so no amount of poor-man's PCP emulation is going to help. What "more negative nice on the X server than on any CPU-bound process" seems to do is to put the X server on a hair-trigger, boosting its dynamic priority in a render-limited scenario (on some graphics cards!) just enough to cancel the penalty for non-interactive behavior. It's forced to share _some_ CPU cycles, but nobody else is allowed a long enough timeslice to keep the X server off the CPU (and insensitive to input events) for long. Not terribly efficient in terms of context switch / cache eviction overhead, but certainly friendlier to the PEBCAK (who is clearly putting totally inappropriate load on a single-threaded CPU by running both a local X server and non-SCHED_BATCH compute jobs) than a frozen mouse cursor. So what's the right answer? Not special-casing the X server, that's for sure. 
If this analysis is correct (and as of now it's pure speculation), any event-driven application that does compute work opportunistically in the absence of user interaction is vulnerable to the same overzealous squelching. I wouldn't design a new application that way, of course -- user interaction belongs in a separate thread on any UNIX-legacy system which assigns priorities to threads of control instead of to patterns of activity. But all sorts of Linux applications have been designed to implicitly elevate artificial throughput benchmarks over user responsiveness -- that has been the UNIX way at least since SVR4, and Linux's history of expensive thread switches prior to NPTL didn't help. If you want responsiveness when the CPU is oversubscribed -- and I for one do, which is one reason why I abandoned the Linux desktop once both Microsoft and Apple figured out how to make hyperthreading work in their favor -- you should probably think about how to get it without rewriting half of userspace. IMHO, dinking around with "fairness", as if there were any relationship these days between UIDs or process groups or any other control structure and the work that's trying to flow through the system, is not going to get you there. If this were my problem, I might start by attaching urgency to behavior instead of to thread ID, which demands a scheduler queue built around a data structure with a cheap decrease-key operation. I'd figure out how to propagate this urgency not just along lock chains but also along chains of fds that need flushing (or refilling) -- even if the reader (or writer) got preempted for unrelated reasons. Tie appropriate urgency to audio and input devices, and SCHED_FIFO can pretty much go away along with nice -10 X. 
Then I would use the fact that taking an uncontended futex is impressively cheap, ask Ulrich for an extra class of thread-private recursive mutexes streamlined for the never-contended case, and encourage application developers to use them to bracket any short section that would prefer not to have its cache footprint evicted out from under it. Unless things get pretty nasty, when you're too impatient to wait for a task to block, you want to preempt it at a local minimum in its working set; the easy way to do this is to have a per-CPU "soft preemption" timer that causes a per-CPU kernel thread to attempt to take the foreground task's thread-private mutex. The foreground task will then block on the next entry into a "cache-hot" section, and you can remember its appetite for CPU cycles by the urgency on its thread-private futex. For better or for worse, this is far more work than _I'm_ likely to do on a volunteer basis, and whether or not this analysis is right, it is guaranteed to fail with -ENOPATCH. Oh, by the way -- maybe someone should look at whether a little backpressure on /tmp/.X11-unix/X0 and friends helps, too. (Evidently it works right over pipes, or else hardly anything on Linux would DWIM.) That might succeed in papering over the problem, without requiring actual design effort before wading into the code. Then again, I haven't actually looked at the local domain socket code, so for all I know there's already backpressure there but X.org's excessive cleverness defeats it. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
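The "scheduler queue built around a data structure with a cheap decrease-key operation" from Michael's message can be illustrated roughly as follows. This is a userspace toy, not kernel code and not anything Michael posted: it uses Python's heapq with lazy invalidation as a stand-in for a heap with a genuinely cheap decrease-key (a pairing or Fibonacci heap, say), and the class name, the task names and the "smaller key means more urgent" convention are all assumptions made for the example.

```python
import heapq
import itertools


class UrgencyQueue:
    """Min-queue of tasks keyed by urgency (smaller key = more urgent).

    decrease_key is done lazily: the old heap entry is marked stale and a
    new one is pushed, so stale entries are skipped when popping.
    """

    def __init__(self):
        self._heap = []                # entries: [key, seq, task, live]
        self._entries = {}             # task -> its current live entry
        self._seq = itertools.count()  # tie-breaker: FIFO among equal keys

    def push(self, task, key):
        """Add a task, replacing any entry it already has."""
        old = self._entries.get(task)
        if old is not None:
            old[3] = False             # invalidate the previous entry
        entry = [key, next(self._seq), task, True]
        self._entries[task] = entry
        heapq.heappush(self._heap, entry)

    def decrease_key(self, task, new_key):
        """Make a queued task more urgent; ignored if new_key is not smaller."""
        if new_key < self._entries[task][0]:
            self.push(task, new_key)

    def pop(self):
        """Remove and return (task, key) for the most urgent live task."""
        while self._heap:
            key, _, task, live = heapq.heappop(self._heap)
            if live:
                del self._entries[task]
                return task, key
        raise IndexError("pop from empty UrgencyQueue")


if __name__ == "__main__":
    q = UrgencyQueue()
    q.push("make", 50)
    q.push("X", 80)            # X was preempted while busy rendering
    q.decrease_key("X", 10)    # input arrives on a device X reads from
    print(q.pop())             # X now runs ahead of make
    print(q.pop())
```

The point of the sketch is only the last few lines: urgency is attached to the event (data arriving on an fd) and propagated to whichever task will consume it, even if that task was preempted for unrelated reasons.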
* Re: Renice X for cpu schedulers 2007-04-19 18:16 ` Gene Heskett 2007-04-19 21:35 ` Michael K. Edwards @ 2007-04-19 22:47 ` Con Kolivas 2007-04-20 2:00 ` Gene Heskett ` (2 more replies) 1 sibling, 3 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-19 22:47 UTC (permalink / raw) To: Gene Heskett Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Friday 20 April 2007 04:16, Gene Heskett wrote: > On Thursday 19 April 2007, Con Kolivas wrote: > > [and I snipped a good overview] > > >So yes go ahead and think up great ideas for other ways of metering out > > cpu bandwidth for different purposes, but for X, given the absurd > > simplicity of renicing, why keep fighting it? Again I reiterate that most > > users of SD have not found the need to renice X anyway except if they > > stick to old habits of make -j4 on uniprocessor and the like, and I > > expect that those on CFS and Nicksched would also have similar > > experiences. > > FWIW folks, I have never touched x's niceness, its running at the default > -1 for all of my so-called 'tests', and I have another set to be rebooted > to right now. And yes, my kernel makeit script uses -j4 by default, and > has used -j8 just for effects, which weren't all that different from what I > expected in 'abusing' a UP system that way. The system DID remain usable, > not snappy, but usable. Gene, you're agreeing with me. You've shown that you're very happy with a fair distribution of cpu and leaving X at nice 0. > > Having tried re-nicing X a while back, and having the rest of the system > suffer in quite obvious ways for even 1 + or - from its default felt pretty > bad from this users perspective. 
> > It is my considered opinion (yeah I know, I'm just a leaf in the hurricane > of this list) that if X has to be re-niced from the 1 point advantage its > had for ages, then something is basicly wrong with the overall scheduling, > cpu or i/o, or both in combination. FWIW I'm using cfq for i/o. It's those who want X to have an unfair advantage that want it to do something "special". Your agreement that it works fine at nice 0 shows you don't want it to have an unfair advantage. Others who want it to have an unfair advantage _can_ renice it if they desire. But if the cpu scheduler gives X an unfair advantage within the kernel by default then you have _no_ choice. If you leave the choice up to userspace (renice or not) then both parties get their way. If you put it into the kernel only one party wins and there is no way for the Genes (and Cons) of this world to get it back. Your opinion is as valuable as everyone else's, Gene. It is hard to get people to speak on as frightening a playground as the linux kernel mailing list so please do. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 22:47 ` Con Kolivas @ 2007-04-20 2:00 ` Gene Heskett 2007-04-20 2:01 ` Gene Heskett 2007-04-20 5:24 ` Mike Galbraith 2 siblings, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-20 2:00 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Con Kolivas wrote: >On Friday 20 April 2007 04:16, Gene Heskett wrote: >> On Thursday 19 April 2007, Con Kolivas wrote: >> >> [and I snipped a good overview] >> >> >So yes go ahead and think up great ideas for other ways of metering out >> > cpu bandwidth for different purposes, but for X, given the absurd >> > simplicity of renicing, why keep fighting it? Again I reiterate that >> > most users of SD have not found the need to renice X anyway except if >> > they stick to old habits of make -j4 on uniprocessor and the like, and I >> > expect that those on CFS and Nicksched would also have similar >> > experiences. >> >> FWIW folks, I have never touched x's niceness, its running at the default >> -1 for all of my so-called 'tests', and I have another set to be rebooted >> to right now. And yes, my kernel makeit script uses -j4 by default, and >> has used -j8 just for effects, which weren't all that different from what >> I expected in 'abusing' a UP system that way. The system DID remain >> usable, not snappy, but usable. > >Gene, you're agreeing with me. You've shown that you're very happy with a > fair distribution of cpu and leaving X at nice 0. I was quite happy till Ingo's first patch came out, and it was even better, but I over-wrote it, and we're still figuring out just exactly what the magic twanger was that made it all click for me. OTOH, I don't think that patch passed muster with Mike G., either. 
We have obviously different workloads, and critical points in them. >> Having tried re-nicing X a while back, and having the rest of the system >> suffer in quite obvious ways for even 1 + or - from its default felt >> pretty bad from this users perspective. >> >> It is my considered opinion (yeah I know, I'm just a leaf in the hurricane >> of this list) that if X has to be re-niced from the 1 point advantage its >> had for ages, then something is basicly wrong with the overall scheduling, >> cpu or i/o, or both in combination. FWIW I'm using cfq for i/o. > >It's those who want X to have an unfair advantage that want it to do >something "special". Your agreement that it works fine at nice 0 shows you >don't want it to have an unfair advantage. Others who want it to have an >unfair advantage _can_ renice it if they desire. But if the cpu scheduler >gives X an unfair advantage within the kernel by default then you have _no_ >choice. If you leave the choice up to userspace (renice or not) then both >parties get their way. If you put it into the kernel only one party wins and >there is no way for the Genes (and Cons) of this world to get it back. > >Your opinion is as valuable as eveyone else's Gene. It is hard to get people >to speak on as frightening a playground as the linux kernel mailing list so >please do. In the FWIW category, htop has always told me that x is running at -1, not zero. Now, I have NDI where this is actually set at, so I'd have to ask stupid questions here if I did wanna play with it. Which I really don't, the last time I tried to -5 x, kde got a whole lot LESS responsive. But heck, 2.6.2 was freshly minted then too and I've long since forgot how I went about that unless I used htop to change it, the most likely scenario that I can picture at this late date. As for speaking my mind, yes, and I've been slapped down a few times, as much because I do a lot of bitching and microscopic amounts of patch submission. 
The only patch I ever submitted was for something in the floppy driver, way back in the middle of 2.2 days, rejected because I didn't know how to use the tools correctly. I didn't, so it was a shrug and my feelings weren't hurt. Some see that as an unbalanced set of books and I'm aware of it. OTOH, I think I do a pretty good job of playing the canary here, and that should be worth something if for no other reason than I can turn into a burr under somebody's saddle when things go all aglay. But I figure if it's happening to me, then if I don't fuss, and that gotcha gets into a distro kernel, there are gonna be a hell of a lot more folks than me trying to grab the microphone. BTW, I'm glad you are feeling well enough to get into this again. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) There cannot be a crisis next week. My schedule is already full. -- Henry Kissinger ^ permalink raw reply [flat|nested] 713+ messages in thread
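On the question of what nice value a process is actually running at: the number htop shows can also be read directly. A small sketch, assuming a Unix system and Python's os.getpriority / os.setpriority; the helper names niceness and renice are just for illustration, and finding the X server's pid (or working out who set it to -1 in the first place) is left out.

```python
import os


def niceness(pid: int = 0) -> int:
    """Return the nice value of a process (pid 0 means the calling process)."""
    return os.getpriority(os.PRIO_PROCESS, pid)


def renice(pid: int, value: int) -> None:
    """Set a process's nice value (-20 is most favoured, 19 least).

    Lowering the value below its current setting normally needs root.
    """
    os.setpriority(os.PRIO_PROCESS, pid, value)


if __name__ == "__main__":
    print("this process runs at nice", niceness())
```

Something like renice(x_pid, 0) would put the server back on equal footing with everything else, where x_pid is whatever pid the X server has on the machine in question.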
* Re: Renice X for cpu schedulers 2007-04-19 22:47 ` Con Kolivas 2007-04-20 2:00 ` Gene Heskett 2007-04-20 2:01 ` Gene Heskett @ 2007-04-20 5:24 ` Mike Galbraith 2 siblings, 0 replies; 713+ messages in thread From: Mike Galbraith @ 2007-04-20 5:24 UTC (permalink / raw) To: Con Kolivas Cc: Gene Heskett, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Fri, 2007-04-20 at 08:47 +1000, Con Kolivas wrote: > It's those who want X to have an unfair advantage that want it to do > something "special". I hope you're not lumping me in with "those". If X + client had been able to get their fair share and do so in the low latency manner they need, I would have been one of the carrots instead of being the stick. -Mike ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 11:59 ` Renice X for cpu schedulers Con Kolivas ` (2 preceding siblings ...) 2007-04-19 18:16 ` Gene Heskett @ 2007-04-19 19:26 ` Ray Lee 2007-04-19 22:56 ` Con Kolivas 2007-04-20 4:09 ` Nick Piggin 3 siblings, 2 replies; 713+ messages in thread From: Ray Lee @ 2007-04-19 19:26 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote: > The one fly in the ointment for > linux remains X. I am still, to this moment, completely and utterly stunned > at why everyone is trying to find increasingly complex unique ways to manage > X when all it needs is more cpu[1]. [...and hence should be reniced] The problem is that X is not unique. There's postgresql, memcached, mysql, db2, a little embedded app I wrote... all of these perform work on behalf of another process. It's just most *noticeable* with X, as pretty much everyone is running that. If we had some way for the scheduler to decide to donate part of a client process's time slice to the server it just spoke to (with an exponential dampening factor -- take 50% from the client, give 25% to the server, toss the rest on the floor), that -- from my naive point of view -- would be a step toward fixing the underlying issue. Or I might be spouting crap, who knows. The problem is real, though, and not limited to X. While I have the floor, thank you, Con, for all your work. Ray ^ permalink raw reply [flat|nested] 713+ messages in thread
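The arithmetic of Ray's donation idea (take 50% from the client, give 25% to the server, drop the rest) can be written out as below. This only illustrates the numbers in his message; the function name, the nanosecond units and the integer rounding are assumptions, and nothing here reflects how any of the schedulers under discussion behave.

```python
from typing import Tuple


def donate(client_slice_ns: int, server_slice_ns: int) -> Tuple[int, int]:
    """Apply one donation from a client to the server it just spoke to.

    Half of the client's remaining slice is taken away; half of what was
    taken is credited to the server and the other half is discarded, so
    the total shrinks on every hop and chains of donations die out.
    Returns the new (client, server) slices.
    """
    taken = client_slice_ns // 2   # 50% leaves the client
    given = taken // 2             # 25% of the original reaches the server
    return client_slice_ns - taken, server_slice_ns + given


if __name__ == "__main__":
    client, server = donate(100_000_000, 100_000_000)
    print(client, server)
```

With two 100ms slices, the client ends up with 50ms and the server with 125ms; the remaining 25ms is the damping that keeps a client/server pair from bouncing an ever-growing slice back and forth.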
* Re: Renice X for cpu schedulers 2007-04-19 19:26 ` Ray Lee @ 2007-04-19 22:56 ` Con Kolivas 2007-04-20 0:20 ` Michael K. Edwards 2007-04-20 0:56 ` Ray Lee 2007-04-20 4:09 ` Nick Piggin 1 sibling, 2 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-19 22:56 UTC (permalink / raw) To: ray-gmail Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Friday 20 April 2007 05:26, Ray Lee wrote: > On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote: > > The one fly in the ointment for > > linux remains X. I am still, to this moment, completely and utterly > > stunned at why everyone is trying to find increasingly complex unique > > ways to manage X when all it needs is more cpu[1]. > > [...and hence should be reniced] > > The problem is that X is not unique. There's postgresql, memcached, > mysql, db2, a little embedded app I wrote... all of these perform work > on behalf of another process. It's just most *noticeable* with X, as > pretty much everyone is running that. > > If we had some way for the scheduler to decide to donate part of a > client process's time slice to the server it just spoke to (with an > exponential dampening factor -- take 50% from the client, give 25% to > the server, toss the rest on the floor), that -- from my naive point > of view -- would be a step toward fixing the underlying issue. Or I > might be spouting crap, who knows. > > The problem is real, though, and not limited to X. > > While I have the floor, thank you, Con, for all your work. You're welcome and thanks for taking the floor to speak. I would say you have actually agreed with me though. X is not unique, it's just an obvious so let's not design the cpu scheduler around the problem with X. Same goes for every other application. 
The choice to hand out differential cpu usage to applications when they seem to need it should be left up to the users. The donation idea has been done before in some fashion or other in things like "back-boost", which Linus himself tried in the 2.5.X days. It worked lovely till it did the wrong thing and wreaked havoc. As is shown repeatedly, the workarounds and the tweaks and the bonuses and the deciding on who to give advantage to, when done by the cpu scheduler, are also its undoing, as it can't always get it right. The consequences of getting it wrong, on the other hand, are disastrous. The cpu scheduler core is a cpu bandwidth and latency proportionator and should be nothing more or less. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 22:56 ` Con Kolivas @ 2007-04-20 0:20 ` Michael K. Edwards 2007-04-20 5:34 ` Bill Huey 2007-04-20 0:56 ` Ray Lee 1 sibling, 1 reply; 713+ messages in thread From: Michael K. Edwards @ 2007-04-20 0:20 UTC (permalink / raw) To: Con Kolivas Cc: ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote: > The cpu scheduler core is a cpu bandwidth and latency > proportionator and should be nothing more or less. Not really. The CPU scheduler is (or ought to be) what electric utilities call an economic dispatch mechanism -- a real-time controller whose goal is to service competing demands cost-effectively from a limited supply, without compromising system stability. If you live in the 1960's, coal and nuclear (and a little bit of fig-leaf hydro) are all you have, it takes you twelve hours to bring plants on and off line, and there's no live operational control or pricing signal between you and your customers. So you're stuck running your system at projected peak + operating margin, dumping excess power as waste heat most of the time, and browning or blacking people out willy-nilly when there's excess demand. Maybe you get to trade off shedding the loads with the worst transmission efficiency against degrading the customers with the most tolerance for brownouts (or the least regulatory clout). That's life without modern economic dispatch. If you live in 2007, natural gas and (outside the US) better control over nuclear plants give you more ability to ramp supply up and down with demand on something like a 15-minute cycle. Better yet, you can store a little energy "in the grid" to smooth out instantaneous demand fluctuations; if you're lucky, you also have enough fast-twitch hydro (thanks, Canada!) 
that you can run your coal and lame-ass nuclear very close to base load even when gas is expensive, and even pump water back uphill when demand dips. (Coal is nasty stuff and a worse contributor by far to radiation exposure than nuclear generation; but on current trends it's going to last a lot longer than oil and gas, and it's a lot easier to stockpile next to the generator.) Best of all, you have industrial customers who will trade you live control (within limits) over when and how much power they take in return for a lower price per unit energy. Some of them will even dump power back into the grid when you ask them to. So now the biggest challenge in making supply and demand meet (in the short term) is to damp all the different ways that a control feedback path might result in an oscillation -- or in runaway pricing. Because there's always some asshole greedhead who will gamble with system stability in order to game the pricing mechanism. Lots of 'em, if you're in California and your legislature is so dumb, or so bought, that they let the asshole greedheads design the whole system so they can game it to the max. (But that's a whole 'nother rant.) Embedded systems are already in 2007, and the mainline Linux scheduler frankly sucks on them, because it thinks it's back in the 1960's with a fixed supply and captive demand, pissing away "CPU bandwidth" as waste heat. Not to say it's an easy problem; even academics with a dozen publications in this area don't seem to be able to model energy usage to the nearest big O, let alone design a stable economic dispatch engine. But it helps to acknowledge what the problem is: even in a 1960's raised-floor screaming-air-conditioners screw-the-power-bill machine room, you can't actually run a half-decent CPU flat out any more without burning it to a crisp. You can act ignorant and let the PMIC brown you out when it has to. 
Or you can start coping in mainline the way that organizations big enough (and smart enough) to feel the heat in their pocketbooks do in their pet kernels. (Boo on Google for not sharing, and props to IBM for doing their damnedest.) And guess what? The system will actually get simpler, and stabler, and faster, and easier to maintain, because it'll be based on a real theory of operation with equations and things instead of a bunch of opaque, undocumented shotgun heuristics. This hypothetical economic-dispatch scheduler will still _have_ heuristics, of course -- you can't begin to model a modern CPU accurately on-line. But they will be contained in _data_ rather than _code_, and issues of numerical stability will be separated cleanly from the rule set. You'll be able to characterize the rule set's domain of stability, given a conservative set of assumptions about the feedback paths in the system under control, with the sort of techniques they teach in the engineering schools that none of us (me included) seem to have attended. (I went to school thinking I was going to be a physicist. Wishful thinking -- but I was young and stupid. What's your excuse? ;-) OK, it feels better to have that off my chest. Apologies to those readers -- doubtless the vast majority of LKML, including everyone else in this thread -- for whom it's irrelevant, pseudo-learned pontification with no patch attached. And my sincere thanks to Ingo, Con, and really everyone else CC'ed, without whom Linux wouldn't be as good as it is (really quite good, all things considered) and wouldn't contribute as much as it does to my own livelihood. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-20 0:20 ` Michael K. Edwards @ 2007-04-20 5:34 ` Bill Huey 0 siblings, 0 replies; 713+ messages in thread From: Bill Huey @ 2007-04-20 5:34 UTC (permalink / raw) To: Michael K. Edwards Cc: Con Kolivas, ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, linux-kernel, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Thu, Apr 19, 2007 at 05:20:53PM -0700, Michael K. Edwards wrote: > Embedded systems are already in 2007, and the mainline Linux scheduler > frankly sucks on them, because it thinks it's back in the 1960's with > a fixed supply and captive demand, pissing away "CPU bandwidth" as > waste heat. Not to say it's an easy problem; even academics with a > dozen publications in this area don't seem to be able to model energy > usage to the nearest big O, let alone design a stable economic > dispatch engine. But it helps to acknowledge what the problem is: > even in a 1960's raised-floor screaming-air-conditioners > screw-the-power-bill machine room, you can't actually run a > half-decent CPU flat out any more without burning it to a crisp. > stupid. What's your excuse? ;-) It's now possible to QoS significant parts of the kernel since we now have a deadline mechanism in place. In the original 2.4 kernel, TimeSys's irq-thread allowed for the processing of skbuffs in a thread under a CPU reservation run category, which was used to provide QoS, I believe. This basic mechanism can now be generalized to many places in the kernel and put under scheduler control. It's just a matter of who takes on this task, and when. bill ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 22:56 ` Con Kolivas 2007-04-20 0:20 ` Michael K. Edwards @ 2007-04-20 0:56 ` Ray Lee 1 sibling, 0 replies; 713+ messages in thread From: Ray Lee @ 2007-04-20 0:56 UTC (permalink / raw) To: Con Kolivas Cc: ray-gmail, Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Con Kolivas wrote: > You're welcome and thanks for taking the floor to speak. I would say you have > actually agreed with me though. X is not unique, it's just an obvious so > let's not design the cpu scheduler around the problem with X. Same goes for > every other application. Leaving the choice to hand out differential cpu > usage when they seem to need is should be up to the users. The donation idea > has been done before in some fashion or other in things like "back-boost" > which Linus himself tried in 2.5.X days. It worked lovely till it did the > wrong thing and wreaked havoc. <nod> I know. I came to the party late, or I would have played with it back then. Perhaps you could correct me, but it seems his back-boost didn't do any dampening, which means the system could get into nasty capture scenarios, where two processes bouncing messages back and forth could take over the scheduler and starve out the rest. It seems pretty obvious in hind-sight that something without exponential dampening would allow feedback loops. Regardless, perhaps we are in agreement. I just don't like the idea of having to guess how much work postgresql is going to be doing on my client processes' behalf. Worse, I don't necessarily want it to have that -10 priority when it's going and updating statistics or whatnot, or any other housekeeping activity that shouldn't make a noticeable impact on the rest of the system. 
Worst, I'm leery of the idea that if I get its nice level wrong, that I'm going to be affecting the overall throughput of the server. All of which are only hypothetical worries, granted. Anyway, I'll shut up now. Thanks again for stickin' with it. Ray ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-19 19:26 ` Ray Lee 2007-04-19 22:56 ` Con Kolivas @ 2007-04-20 4:09 ` Nick Piggin 2007-04-24 15:50 ` Ray Lee 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-20 4:09 UTC (permalink / raw) To: ray-gmail Cc: Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 12:26:03PM -0700, Ray Lee wrote: > On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote: > >The one fly in the ointment for > >linux remains X. I am still, to this moment, completely and utterly stunned > >at why everyone is trying to find increasingly complex unique ways to > >manage > >X when all it needs is more cpu[1]. > [...and hence should be reniced] > > The problem is that X is not unique. There's postgresql, memcached, > mysql, db2, a little embedded app I wrote... all of these perform work > on behalf of another process. It's just most *noticeable* with X, as > pretty much everyone is running that. But for most of those apps, we don't actually care if they do fairly degrade in performance as other loads on the system ramp up. However the user prefers X to be given priority in these situations. Whether that is the design of X, x clients, or the human condition really doesn't matter two hoots to the scheduler. > If we had some way for the scheduler to decide to donate part of a > client process's time slice to the server it just spoke to (with an > exponential dampening factor -- take 50% from the client, give 25% to > the server, toss the rest on the floor), that -- from my naive point > of view -- would be a step toward fixing the underlying issue. Or I > might be spouting crap, who knows. Firstly, lots of clients in your list are remote. X usually isn't. However for X, a syscall or something to donate time might not be such a bad idea... 
but given a couple of X clients and a server against a parallel make, this is probably just going to make the clients slow down as well without giving enough priority to the server. X isn't special so much because it does work on behalf of others (as you said, lots of things do that). It is special simply because we _want_ rendering to have priority on the CPU (if you shifted CPU-intensive rendering to the clients, you'd most likely want to give them priority too); nice, right? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-20 4:09 ` Nick Piggin @ 2007-04-24 15:50 ` Ray Lee 2007-04-24 16:23 ` Matt Mackall 0 siblings, 1 reply; 713+ messages in thread From: Ray Lee @ 2007-04-24 15:50 UTC (permalink / raw) To: Nick Piggin Cc: ray-gmail, Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Thu, Apr 19, 2007 at 12:26:03PM -0700, Ray Lee wrote: >> On 4/19/07, Con Kolivas <kernel@kolivas.org> wrote: >>> The one fly in the ointment for >>> linux remains X. I am still, to this moment, completely and utterly stunned >>> at why everyone is trying to find increasingly complex unique ways to >>> manage >>> X when all it needs is more cpu[1]. >> [...and hence should be reniced] >> >> The problem is that X is not unique. There's postgresql, memcached, >> mysql, db2, a little embedded app I wrote... all of these perform work >> on behalf of another process. It's just most *noticeable* with X, as >> pretty much everyone is running that. > > But for most of those apps, we don't actually care if they do fairly > degrade in performance as other loads on the system ramp up. (Who's this 'we' kemosabe? I do. Desktop systems are increasingly using databases for their day-to-day tasks. As they should, a db is not something that should be reinvented poorly.) > However > the user prefers X to be given priority in these situations. Whether > that is the design of X, x clients, or the human condition really > doesn't matter two hoots to the scheduler. Hmm, let's try this again. Anything that communicates out of process as part of its normal usage for Getting Work Done gets impacted by the scheduler. That means pipelines in the shell, d-bus on the desktop, and lots of other things that follow the unix philosophy of lots of little programs communicating. 
>> If we had some way for the scheduler to decide to donate part of a >> client process's time slice to the server it just spoke to (with an >> exponential dampening factor -- take 50% from the client, give 25% to >> the server, toss the rest on the floor), that -- from my naive point >> of view -- would be a step toward fixing the underlying issue. Or I >> might be spouting crap, who knows. > > Firstly, lots of clients in your list are remote. X usually isn't. They really aren't, unless you happen to work somewhere that can afford to dedicate a box to a db, which suddenly makes the scheduler a dull topic. For example, I have a db and web server installed on my laptop, so that the few times that I have to do web app programming (while wearing a mustache and glasses so that I don't have to admit to it in polite company), I can be functional with just one computer. > However for X, a syscall or something to donate time might not be > such a bad idea... We have one already, it's called write(). We have another called read(), too. Okay, so they have some data related side-effects other than the scheduler hints, but I claim the scheduler hint is already implicitly there. > but given a couple of X clients and a server > against a parallel make, this is probably just going to make the > clients slow down as well without giving enough priority to the > server. Do you have data, or at least a theory to back up that hypothesis? > X isn't special so much because it does work on behalf of others > (as you said, lots of things do that). It is special simply because > we _want_ rendering to have priority of the CPU Really not. I'm trying to get across that this is a general problem with interprocess communication, or any systems that rely on multiple processes to make forward progress on a problem. Sure, let the clients make forward progress until they can't any more. 
If they stop making forward progress by blocking on a read or sleeping after a write to another process, then there's a big hint there as to who should get focus next. > (if you shifted CPU > intensive rendering to the clients, you'd most likely want to give > them priority too); nice, right? They'd have it automatically, if they were spending their time computing rather than rendering. Ray ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Renice X for cpu schedulers 2007-04-24 15:50 ` Ray Lee @ 2007-04-24 16:23 ` Matt Mackall 0 siblings, 0 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-24 16:23 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, ray-gmail, Con Kolivas, Ingo Molnar, Andrew Morton, Linus Torvalds, William Lee Irwin III, Peter Williams, Mike Galbraith, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Tue, Apr 24, 2007 at 08:50:20AM -0700, Ray Lee wrote: > > Firstly, lots of clients in your list are remote. X usually isn't. > > They really aren't, unless you happen to work somewhere that can afford > to dedicate a box to a db, which suddenly makes the scheduler a dull > topic. > > For example, I have a db and web server installed on my laptop, so > that the few times that I have to do web app programming (while wearing > a mustache and glasses so that I don't have to admit to it in polite > company), I can be functional with just one computer. Indeed. The vast majority of people doing "LAMP" web services are doing it on a single machine. Or VM for that matter. It seems that this is a lot like the priority inheritance problem. If a nice -19 process blocks on the db running at nice 0, the db ought to get a boost until it wakes the original process up. The same should apply at the level of dynamic priorities at the same nice level. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
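A minimal sketch of the priority-inheritance analogy in the message above, assuming a toy `Task` object with a static nice level and a set of tasks currently blocked on it. The names `Task`, `block_on`, `wake` and `effective_nice` are hypothetical, and the boost is deliberately not transitive (a waiter's own waiters are ignored) to keep the example short.

```python
class Task:
    """Toy task: a static nice level plus the tasks currently blocked on it."""

    def __init__(self, name, nice):
        self.name = name
        self.nice = nice
        self.waiters = set()

    def effective_nice(self):
        # Lower nice means higher priority, so the server inherits the
        # minimum nice among itself and everything waiting on it.
        return min([self.nice] + [w.nice for w in self.waiters])


def block_on(client, server):
    """Client sleeps waiting for the server; the server is boosted meanwhile."""
    server.waiters.add(client)


def wake(client, server):
    """Server has answered; the boost from this client ends."""
    server.waiters.discard(client)
```

For example, a db task at nice 0 with a nice -19 client blocked on it reports an effective nice of -19 until `wake` is called, after which it falls back to 0. How a real scheduler would discover who is blocked on whom is left open here, just as it is in the message.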
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall 2007-04-19 3:18 ` Nick Piggin @ 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-21 13:40 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> Why is X special? Because it does work on behalf of other processes? >> Lots of things do this. Perhaps a scheduler should focus entirely on >> the implicit and directed wakeup matrix and optimizing that >> instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. > > That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. > You probably want to consider the controlling terminal as well... do you want to have people starting 'at' jobs competing on equal footing with people typing at a terminal? I'm not offering an answer, just raising the question. 
And for some database applications, everyone in a group may connect with the same login-id, then do sub authorization to the database application. euid may be an issue there as well. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
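A minimal sketch of "fairness by effective user ID" as described in the quoted text, assuming every task is always runnable and that CPU time is first split evenly between euids and then evenly between the tasks of each euid. The function name `shares_by_euid` and the example euids are hypothetical.

```python
from collections import defaultdict


def shares_by_euid(tasks):
    """tasks: iterable of (task_name, euid). Return {task_name: cpu_fraction}."""
    groups = defaultdict(list)
    for name, euid in tasks:
        groups[euid].append(name)
    if not groups:
        return {}
    per_group = 1.0 / len(groups)  # each protection domain gets an equal slice
    return {
        name: per_group / len(names)  # split evenly inside the domain
        for names in groups.values()
        for name in names
    }
```

With one X server under euid 0 and four compile jobs under euid 1000, X ends up with half the CPU and each job with an eighth, which matches the "X would get as much CPU time as the user processes" remark. The same arithmetic also shows the concern raised in the reply: if every database user connects under one login id, they all land in a single group and share one slice.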
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:11 ` Nick Piggin 1 sibling, 2 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-17 6:50 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > > All things are not equal; they all have different properties. I like > > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. I had a quick look at Ingo's code yesterday. Ingo is always smart to prepare a main dish (feature) with a nice side (code cleanup) for Linus ;) And even this code does that pretty nicely. The deadline design looks good, although I think the final "key" calculation code will end up quite different from how it looks now. I would suggest thoroughly testing all your alternatives before deciding. Some code and design may look very good and small at the beginning, but when you start patching it to cover all the dark spots, you effectively end up with another thing (in both design and code footprint). About O(1), I never thought it was a must (besides being good marketing material), and O(log(N)) *may* be just fine (to be verified, of course). - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi @ 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams ` (3 more replies) 2007-04-17 7:11 ` Nick Piggin 1 sibling, 4 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 7:09 UTC (permalink / raw) To: Davide Libenzi Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline designs looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. The additive nice_offset breaks nice levels. A multiplicative priority weighting of a different, nonnegative metric of cpu utilization from what's now used is required for nice levels to work. I've been trying to point this out politely by strongly suggesting testing whether nice levels work. On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). The trouble with thorough testing right now is that no one agrees on what the tests should be and a number of the testcases are not in great shape. 
An agreed-upon set of testcases for basic correctness should be devised; the implementations of those testcases need to be maintainable code, and the tests should be set up for automated testing, with their parameters changeable via command-line options without recompiling. Once there's a standard regression test suite for correctness, one needs to be devised for performance, including interactive performance. The primary difficulty I see along these lines is finding a way to automate tests of graphics and input device response performance. Others, like how deterministically priorities are respected over progressively smaller time intervals, and noninteractive workload performance, are nowhere near as difficult to arrange and in many cases already exist. Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
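A minimal sketch of what a multiplicative priority weighting, as called for at the start of the message above, could look like. It assumes a toy model in which each task accumulates virtual time at a rate divided by its weight and the task with the smallest virtual time runs next. The names `weight`, `simulate` and `WEIGHT_STEP` are hypothetical, the step of 1.25 per nice level is an arbitrary illustrative value, and none of this is taken from the patch under discussion; a binary heap stands in for any time-ordered structure.

```python
import heapq

WEIGHT_STEP = 1.25  # illustrative assumption: each nice level scales weight by 1.25


def weight(nice):
    """Multiplicative weight: lower nice gives a geometrically larger weight."""
    return WEIGHT_STEP ** (-nice)


def simulate(nices, ticks, tick_ns=1_000_000):
    """Run `ticks` fixed-length ticks; return each task's fraction of CPU time."""
    heap = [(0.0, i) for i in range(len(nices))]  # (virtual_time, task index)
    heapq.heapify(heap)
    ran = [0] * len(nices)
    for _ in range(ticks):
        vtime, i = heapq.heappop(heap)  # task that is furthest behind runs
        ran[i] += tick_ns
        # Heavier tasks age more slowly, so they are picked more often.
        heapq.heappush(heap, (vtime + tick_ns / weight(nices[i]), i))
    return [r / (ticks * tick_ns) for r in ran]
```

In this model the long-run CPU fractions come out proportional to the weights, so the ratio between two tasks depends only on the difference in their nice levels and not on how many other tasks are runnable. The sketch says nothing about how the additive `nice_offset` behaves; it only illustrates the alternative being proposed.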
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin ` (2 subsequent siblings) 3 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-17 7:22 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I had a quick look at Ingo's code yesterday. Ingo is always smart to >> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) >> And even this code does that pretty nicely. The deadline designs looks >> good, although I think the final "key" calculation code will end up quite >> different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I would suggest to thoroughly test all your alternatives before deciding. >> Some code and design may look very good and small at the beginning, but >> when you start patching it to cover all the dark spots, you effectively >> end up with another thing (in both design and code footprint). >> About O(1), I never thought it was a must (besides a good marketing >> material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. 
An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. At this point, I'd like direct everyone's attention to the simloads package: <http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz> which contains a set of programs designed to be used in the construction of CPU scheduler tests. Of particular use is the aspin program which can be used to launch tasks with specified sleep/wake characteristics. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams @ 2007-04-17 7:23 ` Nick Piggin 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote: > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. Definitely. It would be really good if we could have interactivity regression tests too (see my earlier wishful email). The problem with a lot of the scripted interactivity tests I see is that they don't really capture the complexities of the interactions between, say, an interactive X session. 
Others just go straight for trying to exploit the design by making lots of high priority processes runnable at once. This just provides an unrealistic decoy and you end up trying to tune for the wrong thing. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin @ 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Nick Piggin 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 1 reply; 713+ messages in thread From: Davide Libenzi @ 2007-04-17 7:27 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I would suggest to thoroughly test all your alternatives before deciding. > > Some code and design may look very good and small at the beginning, but > > when you start patching it to cover all the dark spots, you effectively > > end up with another thing (in both design and code footprint). > > About O(1), I never thought it was a must (besides a good marketing > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. 
Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. What I meant was, that the rules (requirements and associated test cases) for this new Scheduler Amazing Race should be set forward, and not kept a moving target to fit&follow one or the other implementation. - Davide ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:33 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I would suggest to thoroughly test all your alternatives before deciding. > > > Some code and design may look very good and small at the beginning, but > > > when you start patching it to cover all the dark spots, you effectively > > > end up with another thing (in both design and code footprint). > > > About O(1), I never thought it was a must (besides a good marketing > > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > > > The trouble with thorough testing right now is that no one agrees on > > what the tests should be and a number of the testcases are not in great > > shape. An agreed-upon set of testcases for basic correctness should be > > devised and the implementations of those testcases need to be > > maintainable code and the tests set up for automated testing and > > changing their parameters without recompiling via command-line options. > > > > Once there's a standard regression test suite for correctness, one > > needs to be devised for performance, including interactive performance. > > The primary difficulty I see along these lines is finding a way to > > automate tests of graphics and input device response performance. 
Others, > > like how deterministically priorities are respected over progressively > > smaller time intervals and noninteractive workload performance are > > nowhere near as difficult to arrange and in many cases already exist. > > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. > > What I meant was, that the rules (requirements and associated test cases) > for this new Scheduler Amazing Race should be set forward, and not kept a > moving target to fit&follow one or the other implementation. Exactly. Well I don't mind if it is a moving target as such, just as long as the decisions are rational (no "blah is more important because I say so"). ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III ` (2 preceding siblings ...) 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin 2007-04-17 9:05 ` William Lee Irwin III 3 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 7:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > prepare a main dish (feature) with a nice sider (code cleanup) to > > Linus ;) And even this code does that pretty nicely. The deadline > > designs looks good, although I think the final "key" calculation > > code will end up quite different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. granted, CFS's nice code is still incomplete, but you err quite significantly with this extreme statement that they are "broken". nice levels certainly work to a fair degree even in the current code and much of the focus is elsewhere - just try it. (In fact i claim that CFS's nice levels often work _better_ than the mainline scheduler's nice level support, for the testcases that matter to users.) The precise behavior of nice levels, as i pointed it out in previous mails, is largely 'uninteresting' and it has changed multiple times in the past 10 years. 
What matters to users is mainly: whether X reniced to -10 does get enough CPU time and whether stuff reniced to +19 doesnt take away too much CPU time from the rest of the system. _How_ a Linux scheduler achieves this is an internal matter and certainly CFS does it in a hacky way at the moment. All the rest, 'CPU bandwidth utilization' or whatever abstract metric we could come up with is just a fancy academic technicality that has no real significance to any of the testers who are trying CFS right now. Sure we prefer final solutions that are clean and make sense (because such things are the easiest to maintain long-term), and often such final solutions are quite close to academic concepts, and i think Davide correctly observed this by saying that "the final key calculation code will end up quite different from what it looks now", but your extreme-end claim of 'breakage' for something that is just plain incomplete is not really a fair characterisation at this point. Anyone who thinks that there exists only two kinds of code: 100% correct and 100% incorrect with no shades of grey inbetween is in reality a sort of an extremist: whom, depending on mood and affection, we could call either a 'coding purist' or a 'coding taliban' ;-) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar @ 2007-04-17 7:40 ` Nick Piggin 2007-04-17 7:58 ` Ingo Molnar 2007-04-17 9:05 ` William Lee Irwin III 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:40 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > > prepare a main dish (feature) with a nice sider (code cleanup) to > > > Linus ;) And even this code does that pretty nicely. The deadline > > > designs looks good, although I think the final "key" calculation > > > code will end up quite different from what it looks now. > > > > The additive nice_offset breaks nice levels. A multiplicative priority > > weighting of a different, nonnegative metric of cpu utilization from > > what's now used is required for nice levels to work. I've been trying > > to point this out politely by strongly suggesting testing whether nice > > levels work. > > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". > > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) > > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. 
> > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. I agree there. > _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. > > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. > > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. > > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) Only if you are an extremist-naming extremist with no shades of grey. Others, like myself, also include 'coding al-qaeda' and 'coding john howard' in that scale. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 7:58 ` Ingo Molnar 0 siblings, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 7:58 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > Anyone who thinks that there exists only two kinds of code: 100% > > correct and 100% incorrect with no shades of grey inbetween is in > > reality a sort of an extremist: whom, depending on mood and > > affection, we could call either a 'coding purist' or a 'coding > > taliban' ;-) > > Only if you are an extremist-naming extremist with no shades of grey. > Others, like myself, also include 'coding al-qaeda' and 'coding john > howard' in that scale. heh ;) You, you ... nitpicking extremist! ;) And beware that you just committed another act of extremism too: > I agree there. because you just went to the extreme position of saying that "i agree with this portion 100%", instead of saying "this seems to be 91.5% correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200". and the nasty thing is, that in reality even shades of grey, if you print them out, are just a set of extreme black dots on an extreme white sheet of paper! ;) [ so i guess we've got to consider the scope of extremism too: the larger the scope, the more limiting and hence the more dangerous it is. 
] Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 9:05 ` William Lee Irwin III 2007-04-17 9:24 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:05 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> The additive nice_offset breaks nice levels. A multiplicative priority >> weighting of a different, nonnegative metric of cpu utilization from >> what's now used is required for nice levels to work. I've been trying >> to point this out politely by strongly suggesting testing whether nice >> levels work. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". I used the word relatively loosely. Nothing extreme is going on. Maybe the phrasing exaggerated the force of the opinion. I'm sorry about having misspoken so. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) Al Boldi's testcase appears to reveal some issues. I'm plotting a testcase of my own if I can ever get past responding to email. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. 
I expect that whether a scheduler can handle such prioritization has a rather strong predictive quality regarding whether it can handle, say, CKRM controls. I remain convinced that there should be some target behavior and that some attempt should be made to achieve it. I don't think any particular behavior is best, just that the behavior should be well-defined. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. It's not so far out. Basically just changing the key calculation in a relatively simple manner should get things into relatively good shape. It can, of course, be done other ways (I did it a rather different way in vdls, though that method is not likely to be considered desirable). I can't really write a testcase for such loose semantics, so the above description is useless to me. These squishy sorts of definitions of semantics are also uninformative to users, who, I would argue, do have some interest in what nice levels mean. There have been at least a small number of concerns about the strength of nice levels, and it would reveal issues surrounding that area earlier if there were an objective one could test to see if it were achieved. It's furthermore a user-visible change in system call semantics we should be more careful about changing out from beneath users. So I see a lot of good reasons to pin down nice numbers. Incompleteness is not a particularly mortal sin, but the proliferation of competing schedulers is creating a need for standards, and that's what I'm really on about. 
On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. I could say "percent cpu" if it sounds less like formal jargon, which "CPU bandwidth utilization" isn't really. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. It wasn't meant to be quite as strong a statement as it came out. Sorry about that. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) I've made no such claims. Also rest assured that the tone of the critique is not hostile, and wasn't meant to sound that way. Also, given the general comments it appears clear that some statistical metric of deviation from the intended behavior furthermore qualified by timescale is necessary, so this appears to be headed toward a sort of performance metric as opposed to a pass/fail test anyway. However, to even measure this at all, some statement of intention is required. 
I'd prefer that there be a Linux-standard semantics for nice so results are more directly comparable and so that users also get similar nice behavior from the scheduler as it varies over time and possibly implementations if users should care to switch them out with some scheduler patch or other. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
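One way to make "some statistical metric of deviation from the intended behavior" concrete is sketched below. It is only an illustration: the thread does not define per-nice weights, so the `weight` array stands in for whatever statement of intention is eventually agreed on, and `nice_share_deviation()` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical metric: the largest absolute gap between each task's
 * observed CPU share and its intended share over one measurement window.
 *
 *   weight[i]   - intended relative weight of task i (assumed > 0)
 *   cpu_time[i] - CPU time task i actually received in the window
 *
 * Returns a value in [0, 1]; 0 means the window matched the intent.
 */
static double nice_share_deviation(const double *weight,
				   const double *cpu_time, size_t n)
{
	double wsum = 0.0, tsum = 0.0, worst = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(weight[i] > 0.0 && cpu_time[i] >= 0.0);
		wsum += weight[i];
		tsum += cpu_time[i];
	}
	if (n == 0 || tsum == 0.0)
		return 0.0;	/* nothing ran, so there is nothing to compare */

	for (i = 0; i < n; i++) {
		double gap = cpu_time[i] / tsum - weight[i] / wsum;

		if (gap < 0.0)
			gap = -gap;
		if (gap > worst)
			worst = gap;
	}
	return worst;
}
```

Evaluating the same function over windows of, say, 10 ms, 100 ms and 1 s would give the timescale qualification: a scheduler may track the intended shares closely over a second and still drift noticeably over shorter intervals.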
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:05 ` William Lee Irwin III @ 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 9:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > [...] Also rest assured that the tone of the critique is not hostile, > and wasn't meant to sound that way. ok :) (And i guess i was too touchy - sorry about coming out swinging.) > Also, given the general comments it appears clear that some > statistical metric of deviation from the intended behavior furthermore > qualified by timescale is necessary, so this appears to be headed > toward a sort of performance metric as opposed to a pass/fail test > anyway. However, to even measure this at all, some statement of > intention is required. I'd prefer that there be a Linux-standard > semantics for nice so results are more directly comparable and so that > users also get similar nice behavior from the scheduler as it varies > over time and possibly implementations if users should care to switch > them out with some scheduler patch or other. yeah. If you could come up with a sane definition that also translates into low overhead on the algorithm side that would be great! 
The only good generic definition i could come up with (nice levels are isolated buckets with a constant maximum relative percentage of CPU time available to every active bucket) resulted in having a per-nice-level array of rbtree roots, which did not look worth the hassle at first sight :-) until now the main approach for nice levels in Linux was always: "implement your main scheduling logic for nice 0 and then look for some low-overhead method that can be glued to it that does something that behaves like nice levels". Feel free to turn that around into a more natural approach, but the algorithm should remain fairly simple i think. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar @ 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 1 sibling, 2 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:57 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Also, given the general comments it appears clear that some >> statistical metric of deviation from the intended behavior furthermore >> qualified by timescale is necessary, so this appears to be headed >> toward a sort of performance metric as opposed to a pass/fail test [...] On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! The only > good generic definition i could come up with (nice levels are isolated > buckets with a constant maximum relative percentage of CPU time > available to every active bucket) resulted in having a per-nice-level > array of rbtree roots, which did not look worth the hassle at first > sight :-) Interesting! That's what vdls did, except its fundamental data structure was more like a circular buffer data structure (resembling Davide Libenzi's timer ring in concept, but with all the details different). I'm not entirely sure how that would've turned out performancewise if I'd done any tuning at all. I was mostly interested in doing something like what I heard Bob Mullens did in 1976 for basic pedagogical value about schedulers to prepare for writing patches for gang scheduling as opposed to creating a viable replacement for the mainline scheduler. 
I'm relatively certain a different key calculation will suffice, but it may disturb other desired semantics since they really need to be nonnegative for multiplying by a scaling factor corresponding to its nice number to work properly. Well, as the cfs code now stands, it would correspond to negative keys. Dividing positive keys by the nice scaling factor is my first thought of how to extend the method to the current key semantics. Or such are my thoughts on the subject. I expect that all that's needed is to fiddle with those numbers a bit. There's quite some capacity for expression there given the precision. On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > until now the main approach for nice levels in Linux was always: > "implement your main scheduling logic for nice 0 and then look for some > low-overhead method that can be glued to it that does something that > behaves like nice levels". Feel free to turn that around into a more > natural approach, but the algorithm should remain fairly simple i think. Part of my insistence was because it seemed to be relatively close to a one-liner, though I'm not entirely sure what particular computation to use to handle the signedness of the keys. I guess I could pick some particular nice semantics myself and then sweep the extant schedulers to use them after getting a testcase hammered out. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
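The difference between an additive offset and a multiplicative weighting can be seen in a toy model. The sketch below is not the CFS or vdls key calculation; it assumes two always-runnable tasks, one tick of CPU per decision, and a key built from a nonnegative runtime counter. All names (`toy_task`, `toy_key`, `toy_run`) and constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: how the per-task parameter enters the sort key. */
enum key_mode { KEY_ADDITIVE, KEY_MULTIPLICATIVE };

struct toy_task {
	uint64_t runtime;	/* ticks of CPU received so far (nonnegative) */
	uint64_t param;		/* additive offset, or multiplicative scale */
};

static uint64_t toy_key(const struct toy_task *t, enum key_mode mode)
{
	if (mode == KEY_ADDITIVE)
		return t->runtime + t->param;	/* constant head start only */
	return t->runtime * t->param;		/* larger scale = nicer task */
}

/*
 * Hand out `ticks` ticks one at a time, always to the task with the
 * smaller key (ties go to task a).
 */
static void toy_run(struct toy_task *a, struct toy_task *b,
		    enum key_mode mode, uint64_t ticks)
{
	assert(mode == KEY_ADDITIVE || (a->param > 0 && b->param > 0));

	for (uint64_t i = 0; i < ticks; i++) {
		if (toy_key(a, mode) <= toy_key(b, mode))
			a->runtime++;
		else
			b->runtime++;
	}
}
```

In this model an additive offset of 100 ticks gives the favoured task a fixed 100-tick lead, after which both tasks alternate and their long-run shares approach 50/50. A scale of 1 versus 3 keeps the shares near 3:1 for as long as both tasks run. Multiplying the key by a scale is the same as dividing it by a weight, which is the variant suggested above for positive keys.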
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 10:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > > until now the main approach for nice levels in Linux was always: > > "implement your main scheduling logic for nice 0 and then look for > > some low-overhead method that can be glued to it that does something > > that behaves like nice levels". Feel free to turn that around into a > > more natural approach, but the algorithm should remain fairly simple > > i think. > > Part of my insistence was because it seemed to be relatively close to > a one-liner, though I'm not entirely sure what particular computation > to use to handle the signedness of the keys. I guess I could pick some > particular nice semantics myself and then sweep the extant schedulers > to use them after getting a testcase hammered out. i'd love to have a oneliner solution :-) wrt. signedness: note that in v2 i have made rq_running signed, and most calculations (especially those related to nice) are signed values. (On 64-bit systems this all isnt a big issue - most of the arithmetics gymnastics in CFS are done to keep deltas within 32 bits, so that divisions and multiplications are sane.) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
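The remark about keeping deltas within 32 bits can be illustrated with a small sketch. This is not the CFS code: `clamp_delta_ns()` and `scale_delta()` are hypothetical helpers, and they use unsigned values for brevity even though the message above notes that the nice-related calculations in v2 are signed. The point is that once a nanosecond delta fits in 32 bits, a 32x32 multiply cannot overflow 64 bits and the divisor stays 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp the elapsed time between two 64-bit nanosecond timestamps to
 * 32 bits (about 4.29 seconds). Assumes now_ns >= then_ns.
 */
static uint32_t clamp_delta_ns(uint64_t now_ns, uint64_t then_ns)
{
	uint64_t delta = now_ns - then_ns;

	return delta > UINT32_MAX ? UINT32_MAX : (uint32_t)delta;
}

/*
 * Scale a 32-bit delta by mul/div. The product of two 32-bit values
 * always fits in 64 bits, and the division has a 32-bit divisor.
 * The result is clamped back to 32 bits.
 */
static uint32_t scale_delta(uint32_t delta_ns, uint32_t mul, uint32_t div)
{
	assert(div != 0);

	uint64_t scaled = (uint64_t)delta_ns * mul / div;

	return scaled > UINT32_MAX ? UINT32_MAX : (uint32_t)scaled;
}
```

A caller would typically clamp first and scale second, for example `scale_delta(clamp_delta_ns(now, last), weight, total_weight)`, where `weight` and `total_weight` are again assumed names.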
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar @ 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 11:31 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:57:49AM -0700, William Lee Irwin III wrote: > Interesting! That's what vdls did, except its fundamental data structure > was more like a circular buffer data structure (resembling Davide > Libenzi's timer ring in concept, but with all the details different). > I'm not entirely sure how that would've turned out performancewise if > I'd done any tuning at all. I was mostly interested in doing something > like what I heard Bob Mullens did in 1976 for basic pedagogical value > about schedulers to prepare for writing patches for gang scheduling as > opposed to creating a viable replacement for the mainline scheduler. Con helped me dredge up the vdls bits, so here is the last version I had before I got tired of toying with the idea. It's not all that clean, with a fair amount of debug code floating around and a number of idiocies (it seems there was a plot to use a heap somewhere I forgot about entirely, never mind other cruft), but I thought I should at least say something more provable than "there was a patch I never posted." Enjoy! 
-- wli

diff -prauN linux-2.6.0-test11/fs/proc/array.c sched-2.6.0-test11-5/fs/proc/array.c
--- linux-2.6.0-test11/fs/proc/array.c 2003-11-26 12:44:26.000000000 -0800
+++ sched-2.6.0-test11-5/fs/proc/array.c 2003-12-17 07:37:11.000000000 -0800
@@ -162,7 +162,7 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1000000000/1024),
+		0UL, /* was ->sleep_avg */
 		p->tgid,
 		p->pid, p->pid ? p->real_parent->pid : 0,
 		p->pid && p->ptrace ? p->parent->pid : 0,
@@ -345,7 +345,7 @@ int proc_pid_stat(struct task_struct *ta
 	read_unlock(&tasklist_lock);
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d %d\n",
 		task->pid,
 		task->comm,
 		state,
@@ -390,8 +390,8 @@ int proc_pid_stat(struct task_struct *ta
 		task->cnswap,
 		task->exit_signal,
 		task_cpu(task),
-		task->rt_priority,
-		task->policy);
+		task_prio(task),
+		task_sched_policy(task));
 	if(mm)
 		mmput(mm);
 	return res;
diff -prauN linux-2.6.0-test11/include/asm-i386/thread_info.h sched-2.6.0-test11-5/include/asm-i386/thread_info.h
--- linux-2.6.0-test11/include/asm-i386/thread_info.h 2003-11-26 12:43:06.000000000 -0800
+++ sched-2.6.0-test11-5/include/asm-i386/thread_info.h 2003-12-17 04:55:22.000000000 -0800
@@ -114,6 +114,8 @@ static inline struct thread_info *curren
 #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */
 #define TIF_IRET 5 /* return with iret */
 #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED 17
+#define TIF_PREEMPT 18
 
 #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
diff -prauN linux-2.6.0-test11/include/linux/binomial.h sched-2.6.0-test11-5/include/linux/binomial.h
--- linux-2.6.0-test11/include/linux/binomial.h 1969-12-31 16:00:00.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/binomial.h 2003-12-20 15:53:33.000000000 -0800
@@ -0,0 +1,16 @@
+/*
+ * Simple binomial heaps.
+ */
+
+struct binomial {
+	unsigned priority, degree;
+	struct binomial *parent, *child, *sibling;
+};
+
+
+struct binomial *binomial_minimum(struct binomial **);
+void binomial_union(struct binomial **, struct binomial **, struct binomial **);
+void binomial_insert(struct binomial **, struct binomial *);
+struct binomial *binomial_extract_min(struct binomial **);
+void binomial_decrease(struct binomial **, struct binomial *, unsigned);
+void binomial_delete(struct binomial **, struct binomial *);
diff -prauN linux-2.6.0-test11/include/linux/init_task.h sched-2.6.0-test11-5/include/linux/init_task.h
--- linux-2.6.0-test11/include/linux/init_task.h 2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/init_task.h 2003-12-18 05:51:16.000000000 -0800
@@ -56,6 +56,12 @@
 	.siglock = SPIN_LOCK_UNLOCKED, \
 }
 
+#define INIT_SCHED_INFO(info) \
+{ \
+	.run_list = LIST_HEAD_INIT((info).run_list), \
+	.policy = 1 /* SCHED_POLICY_TS */, \
+}
+
 /*
  * INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -67,14 +73,10 @@
 	.usage = ATOMIC_INIT(2), \
 	.flags = 0, \
 	.lock_depth = -1, \
-	.prio = MAX_PRIO-20, \
-	.static_prio = MAX_PRIO-20, \
-	.policy = SCHED_NORMAL, \
+	.sched_info = INIT_SCHED_INFO(tsk.sched_info), \
 	.cpus_allowed = CPU_MASK_ALL, \
 	.mm = NULL, \
 	.active_mm = &init_mm, \
-	.run_list = LIST_HEAD_INIT(tsk.run_list), \
-	.time_slice = HZ, \
 	.tasks = LIST_HEAD_INIT(tsk.tasks), \
 	.ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \
 	.ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \
diff -prauN linux-2.6.0-test11/include/linux/sched.h sched-2.6.0-test11-5/include/linux/sched.h
--- linux-2.6.0-test11/include/linux/sched.h 2003-11-26 12:42:58.000000000 -0800
+++ sched-2.6.0-test11-5/include/linux/sched.h 2003-12-23 03:47:45.000000000 -0800
@@ -126,6 +126,8 @@ extern unsigned long nr_iowait(void);
 #define SCHED_NORMAL 0
 #define SCHED_FIFO 1
 #define SCHED_RR 2
+#define SCHED_BATCH 3
+#define SCHED_IDLE 4
 
 struct sched_param {
 	int sched_priority;
@@ -281,10 +283,14 @@ struct signal_struct {
 
 #define MAX_USER_RT_PRIO 100
 #define MAX_RT_PRIO MAX_USER_RT_PRIO
-
-#define MAX_PRIO (MAX_RT_PRIO + 40)
-
-#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+#define NICE_QLEN 128
+#define MIN_TS_PRIO MAX_RT_PRIO
+#define MAX_TS_PRIO (40*NICE_QLEN)
+#define MIN_BATCH_PRIO (MAX_RT_PRIO + MAX_TS_PRIO)
+#define MAX_BATCH_PRIO 100
+#define MAX_PRIO (MIN_BATCH_PRIO + MAX_BATCH_PRIO)
+#define USER_PRIO(prio) ((prio) - MAX_RT_PRIO)
+#define MAX_USER_PRIO USER_PRIO(MAX_PRIO)
 
 /*
  * Some day this will be a full-fledged user tracking system..
@@ -330,6 +336,36 @@ struct k_itimer {
 struct io_context; /* See blkdev.h */
 void exit_io_context(void);
 
+struct rt_data {
+	int prio, rt_policy;
+	unsigned long quantum, ticks;
+};
+
+/* XXX: do %cpu estimation for ts wakeup levels */
+struct ts_data {
+	int nice;
+	unsigned long ticks, frac_cpu;
+	unsigned long sample_start, sample_ticks;
+};
+
+struct bt_data {
+	int prio;
+	unsigned long ticks;
+};
+
+union class_data {
+	struct rt_data rt;
+	struct ts_data ts;
+	struct bt_data bt;
+};
+
+struct sched_info {
+	int idx; /* queue index, used by all classes */
+	unsigned long policy; /* scheduling policy */
+	struct list_head run_list; /* list links for priority queues */
+	union class_data cl_data; /* class-specific data */
+};
+
 struct task_struct {
 	volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -339,18 +375,9 @@ struct task_struct {
 
 	int lock_depth; /* Lock depth */
 
-	int prio, static_prio;
-	struct list_head run_list;
-	prio_array_t *array;
-
-	unsigned long sleep_avg;
-	long interactive_credit;
-	unsigned long long timestamp;
-	int activated;
+	struct sched_info sched_info;
 
-	unsigned long policy;
 	cpumask_t cpus_allowed;
-	unsigned int time_slice, first_time_slice;
 
 	struct list_head tasks;
 	struct list_head ptrace_children;
@@ -391,7 +418,6 @@ struct task_struct {
 	int __user *set_child_tid; /* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
 
-	unsigned long rt_priority;
 	unsigned long it_real_value, it_prof_value, it_virt_value;
 	unsigned long it_real_incr, it_prof_incr, it_virt_incr;
 	struct timer_list real_timer;
@@ -520,12 +546,14 @@ extern void node_nr_running_init(void);
 #define node_nr_running_init() {}
 #endif
 
-extern void set_user_nice(task_t *p, long nice);
-extern int task_prio(task_t *p);
-extern int task_nice(task_t *p);
-extern int task_curr(task_t *p);
-extern int idle_cpu(int cpu);
-
+void set_user_nice(task_t *task, long nice);
+int task_prio(task_t *task);
+int task_nice(task_t *task);
+int task_sched_policy(task_t *task);
+void set_task_sched_policy(task_t *task, int policy);
+int rt_task(task_t *task);
+int task_curr(task_t *task);
+int idle_cpu(int cpu);
 void yield(void);
 
 /*
@@ -844,6 +872,21 @@ static inline int need_resched(void)
 	return unlikely(test_thread_flag(TIF_NEED_RESCHED));
 }
 
+static inline void set_task_queued(task_t *task)
+{
+	set_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline void clear_task_queued(task_t *task)
+{
+	clear_tsk_thread_flag(task, TIF_QUEUED);
+}
+
+static inline int task_queued(task_t *task)
+{
+	return test_tsk_thread_flag(task, TIF_QUEUED);
+}
+
 extern void __cond_resched(void);
 static inline void cond_resched(void)
 {
diff -prauN linux-2.6.0-test11/kernel/Makefile sched-2.6.0-test11-5/kernel/Makefile
--- linux-2.6.0-test11/kernel/Makefile 2003-11-26 12:43:24.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/Makefile 2003-12-17 03:30:08.000000000 -0800
@@ -6,7 +6,7 @@
 obj-y = sched.o fork.o exec_domain.o exit.o itimer.o time.o softirq.o resource.o \
 	sysctl.o capability.o ptrace.o timer.o user.o \
 	signal.o sys.o kmod.o workqueue.o pid.o \
-	rcupdate.o intermodule.o extable.o params.o posix-timers.o
+	rcupdate.o intermodule.o extable.o params.o posix-timers.o sched/
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
diff -prauN linux-2.6.0-test11/kernel/exit.c sched-2.6.0-test11-5/kernel/exit.c
--- linux-2.6.0-test11/kernel/exit.c 2003-11-26 12:45:29.000000000 -0800
+++ sched-2.6.0-test11-5/kernel/exit.c 2003-12-17 07:04:02.000000000 -0800
@@ -225,7 +225,7 @@ void reparent_to_init(void)
 	/* Set the exit signal to SIGCHLD so we signal init on exit */
 	current->exit_signal = SIGCHLD;
 
-	if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0))
+	if (task_nice(current) < 0)
 		set_user_nice(current, 0);
 	/* cpus_allowed? */
 	/* rt_priority?
*/ diff -prauN linux-2.6.0-test11/kernel/fork.c sched-2.6.0-test11-5/kernel/fork.c --- linux-2.6.0-test11/kernel/fork.c 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/fork.c 2003-12-23 06:22:59.000000000 -0800 @@ -836,6 +836,9 @@ struct task_struct *copy_process(unsigne atomic_inc(&p->user->__count); atomic_inc(&p->user->processes); + clear_tsk_thread_flag(p, TIF_SIGPENDING); + clear_tsk_thread_flag(p, TIF_QUEUED); + /* * If multiple threads are within copy_process(), then this check * triggers too late. This doesn't hurt, the check is only there @@ -861,13 +864,21 @@ struct task_struct *copy_process(unsigne p->state = TASK_UNINTERRUPTIBLE; copy_flags(clone_flags, p); - if (clone_flags & CLONE_IDLETASK) + if (clone_flags & CLONE_IDLETASK) { p->pid = 0; - else { + set_task_sched_policy(p, SCHED_IDLE); + } else { + if (task_sched_policy(p) == SCHED_IDLE) { + memset(&p->sched_info, 0, sizeof(struct sched_info)); + set_task_sched_policy(p, SCHED_NORMAL); + set_user_nice(p, 0); + } p->pid = alloc_pidmap(); if (p->pid == -1) goto bad_fork_cleanup; } + if (p->pid == 1) + BUG_ON(task_nice(p)); retval = -EFAULT; if (clone_flags & CLONE_PARENT_SETTID) if (put_user(p->pid, parent_tidptr)) @@ -875,8 +886,7 @@ struct task_struct *copy_process(unsigne p->proc_dentry = NULL; - INIT_LIST_HEAD(&p->run_list); - + INIT_LIST_HEAD(&p->sched_info.run_list); INIT_LIST_HEAD(&p->children); INIT_LIST_HEAD(&p->sibling); INIT_LIST_HEAD(&p->posix_timers); @@ -885,8 +895,6 @@ struct task_struct *copy_process(unsigne spin_lock_init(&p->alloc_lock); spin_lock_init(&p->switch_lock); spin_lock_init(&p->proc_lock); - - clear_tsk_thread_flag(p, TIF_SIGPENDING); init_sigpending(&p->pending); p->it_real_value = p->it_virt_value = p->it_prof_value = 0; @@ -898,7 +906,6 @@ struct task_struct *copy_process(unsigne p->tty_old_pgrp = 0; p->utime = p->stime = 0; p->cutime = p->cstime = 0; - p->array = NULL; p->lock_depth = -1; /* -1 = no lock */ p->start_time = get_jiffies_64(); 
p->security = NULL; @@ -948,33 +955,6 @@ struct task_struct *copy_process(unsigne p->pdeath_signal = 0; /* - * Share the timeslice between parent and child, thus the - * total amount of pending timeslices in the system doesn't change, - * resulting in more scheduling fairness. - */ - local_irq_disable(); - p->time_slice = (current->time_slice + 1) >> 1; - /* - * The remainder of the first timeslice might be recovered by - * the parent if the child exits early enough. - */ - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); - if (!current->time_slice) { - /* - * This case is rare, it happens when the parent has only - * a single jiffy left from its timeslice. Taking the - * runqueue lock is not a problem. - */ - current->time_slice = 1; - preempt_disable(); - scheduler_tick(0, 0); - local_irq_enable(); - preempt_enable(); - } else - local_irq_enable(); - /* * Ok, add it to the run-queues and make it * visible to the rest of the system. * diff -prauN linux-2.6.0-test11/kernel/sched/Makefile sched-2.6.0-test11-5/kernel/sched/Makefile --- linux-2.6.0-test11/kernel/sched/Makefile 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/Makefile 2003-12-17 03:32:21.000000000 -0800 @@ -0,0 +1 @@ +obj-y = util.o ts.o idle.o rt.o batch.o diff -prauN linux-2.6.0-test11/kernel/sched/batch.c sched-2.6.0-test11-5/kernel/sched/batch.c --- linux-2.6.0-test11/kernel/sched/batch.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/batch.c 2003-12-19 21:32:49.000000000 -0800 @@ -0,0 +1,190 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +struct batch_queue { + int base, tasks; + task_t *curr; + unsigned long bitmap[BITS_TO_LONGS(MAX_BATCH_PRIO)]; + struct list_head queue[MAX_BATCH_PRIO]; +}; + +static int batch_quantum = 1024; +static DEFINE_PER_CPU(struct batch_queue, batch_queues); + +static int 
batch_init(struct policy *policy, int cpu) +{ + int k; + struct batch_queue *queue = &per_cpu(batch_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_BATCH_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static int batch_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + + cpustat->nice += user_ticks; + cpustat->system += sys_ticks; + + task->sched_info.cl_data.bt.ticks--; + if (!task->sched_info.cl_data.bt.ticks) { + int new_idx; + + task->sched_info.cl_data.bt.ticks = batch_quantum; + new_idx = (task->sched_info.idx + task->sched_info.cl_data.bt.prio) + % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); + } + return 0; +} + +static void batch_yield(struct queue *__queue, task_t *task) +{ + int new_idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + new_idx = (queue->base + MAX_BATCH_PRIO - 1) % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); +} + +static task_t *batch_curr(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->curr; +} + +static void batch_set_curr(struct queue 
*__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + queue->curr = task; +} + +static task_t *batch_best(struct queue *__queue) +{ + int idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + BUG_ON(idx >= MAX_BATCH_PRIO); + BUG_ON(list_empty(&queue->queue[idx])); + return list_entry(queue->queue[idx].next, task_t, sched_info.run_list); +} + +static void batch_enqueue(struct queue *__queue, task_t *task) +{ + int idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = (queue->base + task->sched_info.cl_data.bt.prio) % MAX_BATCH_PRIO; + if (!test_bit(idx, queue->bitmap)) + __set_bit(idx, queue->bitmap); + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); + task->sched_info.idx = idx; + task->sched_info.cl_data.bt.ticks = batch_quantum; + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void batch_dequeue(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + queue->tasks--; + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = batch_best(__queue); +} + +static int batch_preempt(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + if (!queue->curr) + return 1; + else + return task->sched_info.cl_data.bt.prio + < queue->curr->sched_info.cl_data.bt.prio; +} + +static int batch_tasks(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->tasks; +} + +static int batch_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int batch_prio(task_t *task) +{ + return USER_PRIO(task->sched_info.cl_data.bt.prio + MIN_BATCH_PRIO); +} + +static void 
batch_setprio(task_t *task, int prio) +{ + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_BATCH_PRIO); + task->sched_info.cl_data.bt.prio = prio; +} + +struct queue_ops batch_ops = { + .init = batch_init, + .fini = nop_fini, + .tick = batch_tick, + .yield = batch_yield, + .curr = batch_curr, + .set_curr = batch_set_curr, + .tasks = batch_tasks, + .best = batch_best, + .enqueue = batch_enqueue, + .dequeue = batch_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = batch_preempt, + .nice = batch_nice, + .renice = nop_renice, + .prio = batch_prio, + .setprio = batch_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy batch_policy = { + .ops = &batch_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/idle.c sched-2.6.0-test11-5/kernel/sched/idle.c --- linux-2.6.0-test11/kernel/sched/idle.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/idle.c 2003-12-19 17:31:39.000000000 -0800 @@ -0,0 +1,99 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +static DEFINE_PER_CPU(task_t *, idle_tasks) = NULL; + +static int idle_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int idle_tasks(struct queue *queue) +{ + task_t **idle = (task_t **)queue; + return !!(*idle); +} + +static task_t *idle_task(struct queue *queue) +{ + return *((task_t **)queue); +} + +static void idle_yield(struct queue *queue, task_t *task) +{ + set_need_resched(); +} + +static void idle_enqueue(struct queue *queue, task_t *task) +{ + task_t **idle = (task_t **)queue; + *idle = task; +} + +static void idle_dequeue(struct queue *queue, task_t *task) +{ +} + +static int idle_preempt(struct queue *queue, task_t *task) +{ + return 0; +} + +static int idle_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat 
*cpustat = &kstat_this_cpu.cpustat; + runqueue_t *rq = &per_cpu(runqueues, smp_processor_id()); + + if (atomic_read(&rq->nr_iowait) > 0) + cpustat->iowait += sys_ticks; + else + cpustat->idle += sys_ticks; + return 1; +} + +static int idle_init(struct policy *policy, int cpu) +{ + policy->queue = (struct queue *)&per_cpu(idle_tasks, cpu); + return 0; +} + +static int idle_prio(task_t *task) +{ + return MAX_USER_PRIO; +} + +static void idle_setprio(task_t *task, int prio) +{ +} + +static struct queue_ops idle_ops = { + .init = idle_init, + .fini = nop_fini, + .tick = idle_tick, + .yield = idle_yield, + .curr = idle_task, + .set_curr = queue_nop, + .tasks = idle_tasks, + .best = idle_task, + .enqueue = idle_enqueue, + .dequeue = idle_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = idle_preempt, + .nice = idle_nice, + .renice = nop_renice, + .prio = idle_prio, + .setprio = idle_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy idle_policy = { + .ops = &idle_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/queue.h sched-2.6.0-test11-5/kernel/sched/queue.h --- linux-2.6.0-test11/kernel/sched/queue.h 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/queue.h 2003-12-23 03:58:02.000000000 -0800 @@ -0,0 +1,104 @@ +#define SCHED_POLICY_RT 0 +#define SCHED_POLICY_TS 1 +#define SCHED_POLICY_BATCH 2 +#define SCHED_POLICY_IDLE 3 + +#define RT_POLICY_FIFO 0 +#define RT_POLICY_RR 1 + +#define NODE_THRESHOLD 125 + +struct queue; +struct queue_ops; + +struct policy { + struct queue *queue; + struct queue_ops *ops; +}; + +extern struct policy rt_policy, ts_policy, batch_policy, idle_policy; + +struct runqueue { + spinlock_t lock; + int curr; + task_t *__curr; + unsigned long policy_bitmap; + struct policy *policies[BITS_PER_LONG]; + unsigned long nr_running, nr_switches, nr_uninterruptible; + struct mm_struct *prev_mm; + int 
prev_cpu_load[NR_CPUS]; +#ifdef CONFIG_NUMA + atomic_t *node_nr_running; + int prev_node_load[MAX_NUMNODES]; +#endif + task_t *migration_thread; + struct list_head migration_queue; + + atomic_t nr_iowait; +}; + +typedef struct runqueue runqueue_t; + +struct queue_ops { + int (*init)(struct policy *, int); + void (*fini)(struct policy *, int); + task_t *(*curr)(struct queue *); + void (*set_curr)(struct queue *, task_t *); + task_t *(*best)(struct queue *); + int (*tick)(struct queue *, task_t *, int, int); + int (*tasks)(struct queue *); + void (*enqueue)(struct queue *, task_t *); + void (*dequeue)(struct queue *, task_t *); + void (*start_wait)(struct queue *, task_t *); + void (*stop_wait)(struct queue *, task_t *); + void (*sleep)(struct queue *, task_t *); + void (*wake)(struct queue *, task_t *); + int (*preempt)(struct queue *, task_t *); + void (*yield)(struct queue *, task_t *); + int (*prio)(task_t *); + void (*setprio)(task_t *, int); + int (*nice)(struct queue *, task_t *); + void (*renice)(struct queue *, task_t *, int); + unsigned long (*timeslice)(struct queue *, task_t *); + void (*set_timeslice)(struct queue *, task_t *, unsigned long); +}; + +DECLARE_PER_CPU(runqueue_t, runqueues); + +int find_first_circular_bit(unsigned long *, int, int); +void queue_nop(struct queue *, task_t *); +void nop_renice(struct queue *, task_t *, int); +void nop_fini(struct policy *, int); +unsigned long nop_timeslice(struct queue *, task_t *); +void nop_set_timeslice(struct queue *, task_t *, unsigned long); + +/* #define DEBUG_SCHED */ + +#ifdef DEBUG_SCHED +#define __check_task_policy(idx) \ +do { \ + unsigned long __idx__ = (idx); \ + if (__idx__ > SCHED_POLICY_IDLE) { \ + printk("invalid policy 0x%lx\n", __idx__); \ + BUG(); \ + } \ +} while (0) + +#define check_task_policy(task) \ +do { \ + __check_task_policy((task)->sched_info.policy); \ +} while (0) + +#define check_policy(policy) \ +do { \ + BUG_ON((policy) != &rt_policy && \ + (policy) != &ts_policy && \ + 
(policy) != &batch_policy && \ + (policy) != &idle_policy); \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define __check_task_policy(idx) do { } while (0) +#define check_task_policy(task) do { } while (0) +#define check_policy(policy) do { } while (0) +#endif /* !DEBUG_SCHED */ diff -prauN linux-2.6.0-test11/kernel/sched/rt.c sched-2.6.0-test11-5/kernel/sched/rt.c --- linux-2.6.0-test11/kernel/sched/rt.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/rt.c 2003-12-19 18:16:07.000000000 -0800 @@ -0,0 +1,208 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_rt_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_RT); \ + BUG_ON((task)->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR \ + && \ + (task)->sched_info.cl_data.rt.rt_policy!=RT_POLICY_FIFO); \ + BUG_ON((task)->sched_info.cl_data.rt.prio < 0); \ + BUG_ON((task)->sched_info.cl_data.rt.prio >= MAX_RT_PRIO); \ +} while (0) +#else +#define check_rt_policy(task) do { } while (0) +#endif + +struct rt_queue { + unsigned long bitmap[BITS_TO_LONGS(MAX_RT_PRIO)]; + struct list_head queue[MAX_RT_PRIO]; + task_t *curr; + int tasks; +}; + +static DEFINE_PER_CPU(struct rt_queue, rt_queues); + +static int rt_init(struct policy *policy, int cpu) +{ + int k; + struct rt_queue *queue = &per_cpu(rt_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_RT_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static void rt_yield(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + set_need_resched(); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); +} + +static int 
rt_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + check_rt_policy(task); + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) { + task->sched_info.cl_data.rt.ticks--; + if (!task->sched_info.cl_data.rt.ticks) { + task->sched_info.cl_data.rt.ticks = + task->sched_info.cl_data.rt.quantum; + rt_yield(queue, task); + } + } + check_rt_policy(task); + return 0; +} + +static task_t *rt_curr(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task = queue->curr; + check_rt_policy(task); + return task; +} + +static void rt_set_curr(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + queue->curr = task; + check_rt_policy(task); +} + +static task_t *rt_best(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task; + int idx; + idx = find_first_bit(queue->bitmap, MAX_RT_PRIO); + BUG_ON(idx >= MAX_RT_PRIO); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_rt_policy(task); + return task; +} + +static void rt_enqueue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!test_bit(task->sched_info.cl_data.rt.prio, queue->bitmap)) + __set_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void rt_dequeue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + __clear_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + queue->tasks--; 
+ check_rt_policy(task); + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = rt_best(__queue); +} + +static int rt_preempt(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!queue->curr) + return 1; + check_rt_policy(queue->curr); + return task->sched_info.cl_data.rt.prio + < queue->curr->sched_info.cl_data.rt.prio; +} + +static int rt_tasks(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + return queue->tasks; +} + +static int rt_nice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + return -20; +} + +static unsigned long rt_timeslice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR) + return 0; + else + return task->sched_info.cl_data.rt.quantum; +} + +static void rt_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) + task->sched_info.cl_data.rt.quantum = n; + check_rt_policy(task); +} + +static void rt_setprio(task_t *task, int prio) +{ + check_rt_policy(task); + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_RT_PRIO); + task->sched_info.cl_data.rt.prio = prio; +} + +static int rt_prio(task_t *task) +{ + check_rt_policy(task); + return USER_PRIO(task->sched_info.cl_data.rt.prio); +} + +static struct queue_ops rt_ops = { + .init = rt_init, + .fini = nop_fini, + .tick = rt_tick, + .yield = rt_yield, + .curr = rt_curr, + .set_curr = rt_set_curr, + .tasks = rt_tasks, + .best = rt_best, + .enqueue = rt_enqueue, + .dequeue = rt_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = rt_preempt, + .nice = rt_nice, + .renice = nop_renice, + .prio = rt_prio, + .setprio = rt_setprio, + .timeslice = rt_timeslice, + .set_timeslice = rt_set_timeslice, +}; + +struct policy rt_policy = { 
+ .ops = &rt_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/ts.c sched-2.6.0-test11-5/kernel/sched/ts.c --- linux-2.6.0-test11/kernel/sched/ts.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/ts.c 2003-12-23 08:24:55.000000000 -0800 @@ -0,0 +1,841 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_ts_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_TS); \ +} while (0) + +#define check_nice(__queue__) \ +({ \ + int __k__, __count__ = 0; \ + if ((__queue__)->tasks < 0) { \ + printk("negative nice task count %d\n", \ + (__queue__)->tasks); \ + BUG(); \ + } \ + for (__k__ = 0; __k__ < NICE_QLEN; ++__k__) { \ + task_t *__task__; \ + if (list_empty(&(__queue__)->queue[__k__])) { \ + if (test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__task__, \ + &(__queue__)->queue[__k__], \ + sched_info.run_list) { \ + check_ts_policy(__task__); \ + if (__task__->sched_info.idx != __k__) { \ + printk("nice index mismatch\n"); \ + BUG(); \ + } \ + ++__count__; \ + } \ + } \ + if ((__queue__)->tasks != __count__) { \ + printk("wrong nice task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue__)->tasks, \ + __count__); \ + BUG(); \ + } \ + __count__; \ +}) + +#define check_queue(__queue) \ +do { \ + int __k, __count = 0; \ + if ((__queue)->tasks < 0) { \ + printk("negative queue task count %d\n", \ + (__queue)->tasks); \ + BUG(); \ + } \ + for (__k = 0; __k < 40; ++__k) { \ + struct nice_queue *__nice; \ + if (list_empty(&(__queue)->nices[__k])) { \ + if (test_bit(__k, (__queue)->bitmap)) { \ + printk("wrong queue bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k, 
(__queue)->bitmap)) { \ + printk("wrong queue bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__nice, \ + &(__queue)->nices[__k], \ + list) { \ + __count += check_nice(__nice); \ + if (__nice->idx != __k) { \ + printk("queue index mismatch\n"); \ + BUG(); \ + } \ + } \ + } \ + if ((__queue)->tasks != __count) { \ + printk("wrong queue task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue)->tasks, \ + __count); \ + BUG(); \ + } \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define check_ts_policy(task) do { } while (0) +#define check_nice(nice) do { } while (0) +#define check_queue(queue) do { } while (0) +#endif + +/* + * Hybrid deadline/multilevel scheduling. Cpu utilization + * -dependent deadlines at wake. Queue rotation every 50ms or when + * demotions empty the highest level, setting demoted deadlines + * relative to the new highest level. Intra-level RR quantum at 10ms. + */ +struct nice_queue { + int idx, nice, base, tasks, level_quantum, expired; + unsigned long bitmap[BITS_TO_LONGS(NICE_QLEN)]; + struct list_head list, queue[NICE_QLEN]; + task_t *curr; +}; + +/* + * Deadline schedule nice levels with priority-dependent deadlines, + * default quantum of 100ms. Queue rotates at demotions emptying the + * highest level, setting the demoted deadline relative to the new + * highest level. + */ +struct ts_queue { + struct nice_queue nice_levels[40]; + struct list_head nices[40]; + int base, quantum, tasks; + unsigned long bitmap[BITS_TO_LONGS(40)]; + struct nice_queue *curr; +}; + +/* + * Make these sysctl-tunable. 
+ */ +static int nice_quantum = 100; +static int rr_quantum = 10; +static int level_quantum = 50; +static int sample_interval = HZ; + +static DEFINE_PER_CPU(struct ts_queue, ts_queues); + +static task_t *nice_best(struct nice_queue *); +static struct nice_queue *ts_best_nice(struct ts_queue *); + +static void nice_init(struct nice_queue *queue) +{ + int k; + + INIT_LIST_HEAD(&queue->list); + for (k = 0; k < NICE_QLEN; ++k) { + INIT_LIST_HEAD(&queue->queue[k]); + } +} + +static int ts_init(struct policy *policy, int cpu) +{ + int k; + struct ts_queue *queue = &per_cpu(ts_queues, cpu); + + policy->queue = (struct queue *)queue; + queue->quantum = nice_quantum; + + for (k = 0; k < 40; ++k) { + nice_init(&queue->nice_levels[k]); + queue->nice_levels[k].nice = k; + INIT_LIST_HEAD(&queue->nices[k]); + } + return 0; +} + +static int task_deadline(task_t *task) +{ + u64 frac_cpu = task->sched_info.cl_data.ts.frac_cpu; + frac_cpu *= (u64)NICE_QLEN; + frac_cpu >>= 32; + return (int)min((u32)(NICE_QLEN - 1), (u32)frac_cpu); +} + +static void nice_rotate_queue(struct nice_queue *queue) +{ + int idx, new_idx, deadline, idxdiff; + task_t *task = queue->curr; + + check_nice(queue); + + /* shit what if idxdiff == NICE_QLEN - 1?? 
*/ + idx = queue->curr->sched_info.idx; + idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + deadline = min(1 + task_deadline(task), NICE_QLEN - idxdiff - 1); + new_idx = (idx + deadline) % NICE_QLEN; +#if 0 + if (idx == new_idx) { + /* + * buggy; it sets queue->base = idx because in this case + * we have task_deadline(task) == 0 + */ + new_idx = (idx - task_deadline(task) + NICE_QLEN) % NICE_QLEN; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (0 && idx == new_idx) { + printk("FUCKUP: pid = %d, tdl = %d, dl = %d, idx = %d, " + "base = %d, diff = %d, fcpu = 0x%lx\n", + queue->curr->pid, + task_deadline(queue->curr), + deadline, + idx, + queue->base, + idxdiff, + task->sched_info.cl_data.ts.frac_cpu); + BUG(); + } +#else + /* + * RR in the last deadline + * special-cased so as not to trip BUG_ON()'s below + */ + if (idx == new_idx) { + /* if we got here these two things must hold */ + BUG_ON(idxdiff != NICE_QLEN - 1); + BUG_ON(deadline); + list_move_tail(&task->sched_info.run_list, &queue->queue[idx]); + if (queue->expired) { + queue->level_quantum = level_quantum; + queue->expired = 0; + } + return; + } +#endif + task->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + + /* expired until list drains */ + if (!list_empty(&queue->queue[idx])) + queue->expired = 1; + else { + int k, w, m = NICE_QLEN % BITS_PER_LONG; + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + + for (w = 0, k = 0; k < NICE_QLEN/BITS_PER_LONG; ++k) + w += hweight_long(queue->bitmap[k]); + if (NICE_QLEN % BITS_PER_LONG) + w += hweight_long(queue->bitmap[k] & ((1UL << m) - 1)); + if (w > 1) 
+ queue->base = (queue->base + 1) % NICE_QLEN; + queue->level_quantum = level_quantum; + queue->expired = 0; + } + check_nice(queue); +} + +static void nice_tick(struct nice_queue *queue, task_t *task) +{ + int idx = task->sched_info.idx; + BUG_ON(!task_queued(task)); + BUG_ON(task != queue->curr); + BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + check_ts_policy(task); + check_nice(queue); + + if (task->sched_info.cl_data.ts.ticks) + task->sched_info.cl_data.ts.ticks--; + + if (queue->level_quantum > level_quantum) { + WARN_ON(1); + queue->level_quantum = 1; + } + + if (!queue->expired) { + if (queue->level_quantum) + queue->level_quantum--; + } else if (0 && queue->queue[idx].prev != &task->sched_info.run_list) { + int queued = 0, new_idx = (queue->base + 1) % NICE_QLEN; + task_t *curr, *sav; + task_t *victim = list_entry(queue->queue[idx].prev, + task_t, + sched_info.run_list); + victim->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); +#if 1 + list_for_each_entry_safe(curr, sav, &queue->queue[new_idx], sched_info.run_list) { + if (victim->sched_info.cl_data.ts.frac_cpu + < curr->sched_info.cl_data.ts.frac_cpu) { + queued = 1; + list_move(&victim->sched_info.run_list, + curr->sched_info.run_list.prev); + break; + } + } + if (!queued) + list_move_tail(&victim->sched_info.run_list, + &queue->queue[new_idx]); +#else + list_move(&victim->sched_info.run_list, &queue->queue[new_idx]); +#endif + BUG_ON(list_empty(&queue->queue[idx])); + } + + if (!queue->level_quantum && !queue->expired) { + check_nice(queue); + nice_rotate_queue(queue); + check_nice(queue); + set_need_resched(); + } else if (!task->sched_info.cl_data.ts.ticks) { + int idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + check_nice(queue); + task->sched_info.cl_data.ts.ticks = rr_quantum; + BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + if (queue->expired) + 
nice_rotate_queue(queue); + else if (idxdiff == NICE_QLEN - 1) + list_move_tail(&task->sched_info.run_list, + &queue->queue[idx]); + else { + int new_idx = (idx + 1) % NICE_QLEN; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + task->sched_info.idx = new_idx; + list_add(&task->sched_info.run_list, + &queue->queue[new_idx]); + } + check_nice(queue); + set_need_resched(); + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_rotate_queue(struct ts_queue *queue) +{ + int idx, new_idx, idxdiff, off, deadline; + + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + + /* shit what if idxdiff == 39?? */ + check_queue(queue); + idx = queue->curr->idx; + idxdiff = (idx - queue->base + 40) % 40; + off = (int)(queue->curr - queue->nice_levels); + deadline = min(1 + off, 40 - idxdiff - 1); + new_idx = (idx + deadline) % 40; + if (idx == new_idx) { + new_idx = (idx - off + 40) % 40; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&queue->curr->list, &queue->nices[new_idx]); + queue->curr->idx = new_idx; + + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + queue->base = (queue->base + 1) % 40; + } + check_queue(queue); +} + +static int ts_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = 
queue->curr; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + int nice_idx = (int)(queue->curr - queue->nice_levels); + unsigned long sample_end, delta; + + check_queue(queue); + check_ts_policy(task); + BUG_ON(!nice); + BUG_ON(nice_idx != task->sched_info.cl_data.ts.nice); + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + BUG_ON(list_empty(&queue->nices[nice->idx])); + + sample_end = jiffies; + delta = sample_end - task->sched_info.cl_data.ts.sample_start; + if (delta) + task->sched_info.cl_data.ts.sample_ticks++; + else { + task->sched_info.cl_data.ts.sample_start = jiffies; + task->sched_info.cl_data.ts.sample_ticks = 1; + } + + if (delta >= sample_interval) { + u64 frac_cpu; + frac_cpu = (u64)task->sched_info.cl_data.ts.sample_ticks << 32; + do_div(frac_cpu, delta); + frac_cpu = 2*frac_cpu + task->sched_info.cl_data.ts.frac_cpu; + do_div(frac_cpu, 3); + frac_cpu = min(frac_cpu, (1ULL << 32) - 1); + task->sched_info.cl_data.ts.frac_cpu = (unsigned long)frac_cpu; + task->sched_info.cl_data.ts.sample_start = sample_end; + task->sched_info.cl_data.ts.sample_ticks = 0; + } + + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + nice_tick(nice, task); + if (queue->quantum > nice_quantum) { + queue->quantum = 0; + WARN_ON(1); + } else if (queue->quantum) + queue->quantum--; + if (!queue->quantum) { + queue->quantum = nice_quantum; + ts_rotate_queue(queue); + set_need_resched(); + } + check_queue(queue); + check_ts_policy(task); + return 0; +} + +static void nice_yield(struct nice_queue *queue, task_t *task) +{ + int idx, new_idx = (queue->base + NICE_QLEN - 1) % NICE_QLEN; + + check_nice(queue); + check_ts_policy(task); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + idx = task->sched_info.idx; + task->sched_info.idx = new_idx; + set_need_resched(); + + if (list_empty(&queue->queue[idx])) { + 
BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + queue->curr = nice_best(queue); +#if 0 + if (queue->curr->sched_info.idx != queue->base) + queue->base = queue->curr->sched_info.idx; +#endif + check_nice(queue); + check_ts_policy(task); +} + +/* + * This is somewhat problematic; nice_yield() only parks tasks on + * the end of their current nice levels. + */ +static void ts_yield(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + nice_yield(nice, task); + + /* + * If there's no one to yield to, move the whole nice level. + * If this is problematic, setting nice-dependent deadlines + * on a single unified queue may be in order. + */ + if (nice->tasks == 1) { + int idx, new_idx = (queue->base + 40 - 1) % 40; + idx = nice->idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&nice->list, &queue->nices[new_idx]); + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + nice->idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + 40); + BUG_ON(queue->base >= 40); + BUG_ON(!test_bit(queue->base, queue->bitmap)); + queue->curr = ts_best_nice(queue); + } + check_queue(queue); + check_ts_policy(task); +} + +static task_t *ts_curr(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + task_t *task = queue->curr->curr; + check_queue(queue); + if (task) + check_ts_policy(task); + return task; +} + +static void ts_set_curr(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + queue->curr = nice; + nice->curr = task; 
+ check_queue(queue); + check_ts_policy(task); +} + +static task_t *nice_best(struct nice_queue *queue) +{ + task_t *task; + int idx = find_first_circular_bit(queue->bitmap, + queue->base, + NICE_QLEN); + check_nice(queue); + if (idx >= NICE_QLEN) + return NULL; + BUG_ON(list_empty(&queue->queue[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_nice(queue); + check_ts_policy(task); + return task; +} + +static struct nice_queue *ts_best_nice(struct ts_queue *queue) +{ + int idx = find_first_circular_bit(queue->bitmap, queue->base, 40); + check_queue(queue); + if (idx >= 40) + return NULL; + BUG_ON(list_empty(&queue->nices[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + return list_entry(queue->nices[idx].next, struct nice_queue, list); +} + +static task_t *ts_best(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = ts_best_nice(queue); + return nice ? 
nice_best(nice) : NULL; +} + +static void nice_enqueue(struct nice_queue *queue, task_t *task) +{ + task_t *curr, *sav; + int queued = 0, idx, deadline, base, idxdiff; + check_nice(queue); + check_ts_policy(task); + + /* don't livelock when queue->expired */ + deadline = min(!!queue->expired + task_deadline(task), NICE_QLEN - 1); + idx = (queue->base + deadline) % NICE_QLEN; + + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[idx])); + __set_bit(idx, queue->bitmap); + } + +#if 1 + /* keep nice level's queue sorted -- use binomial heaps here soon */ + list_for_each_entry_safe(curr, sav, &queue->queue[idx], sched_info.run_list) { + if (task->sched_info.cl_data.ts.frac_cpu + >= curr->sched_info.cl_data.ts.frac_cpu) { + list_add(&task->sched_info.run_list, + curr->sched_info.run_list.prev); + queued = 1; + break; + } + } + if (!queued) + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#else + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#endif + task->sched_info.idx = idx; + /* if (!task->sched_info.cl_data.ts.ticks) */ + task->sched_info.cl_data.ts.ticks = rr_quantum; + + if (queue->tasks) + BUG_ON(!queue->curr); + else { + BUG_ON(queue->curr); + queue->curr = task; + } + queue->tasks++; + check_nice(queue); + check_ts_policy(task); +} + +static void ts_enqueue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + if (!nice->tasks) { + int idx = (queue->base + task->sched_info.cl_data.ts.nice) % 40; + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[idx])); + __set_bit(idx, queue->bitmap); + } + list_add_tail(&nice->list, &queue->nices[idx]); + nice->idx = idx; + if (!queue->curr) + queue->curr = nice; + } + nice_enqueue(nice, task); + queue->tasks++; + queue->base = find_first_circular_bit(queue->bitmap, 
queue->base, 40); + check_queue(queue); + check_ts_policy(task); +} + +static void nice_dequeue(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) { + BUG_ON(!test_bit(task->sched_info.idx, queue->bitmap)); + __clear_bit(task->sched_info.idx, queue->bitmap); + } + queue->tasks--; + if (task == queue->curr) { + queue->curr = nice_best(queue); +#if 0 + if (queue->curr) + queue->base = queue->curr->sched_info.idx; +#endif + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_dequeue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + BUG_ON(!queue->tasks); + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + + nice_dequeue(nice, task); + queue->tasks--; + if (!nice->tasks) { + list_del_init(&nice->list); + if (list_empty(&queue->nices[nice->idx])) { + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + __clear_bit(nice->idx, queue->bitmap); + } + if (nice == queue->curr) + queue->curr = ts_best_nice(queue); + } + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + if (queue->base >= 40) + queue->base = 0; + check_queue(queue); + check_ts_policy(task); +} + +static int ts_tasks(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + check_queue(queue); + return queue->tasks; +} + +static int ts_nice(struct queue *__queue, task_t *task) +{ + int nice = task->sched_info.cl_data.ts.nice - 20; + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + return nice; +} + +static void ts_renice(struct queue *queue, task_t *task, int nice) +{ + check_queue((struct ts_queue *)queue); + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + task->sched_info.cl_data.ts.nice = nice + 20; + check_queue((struct ts_queue *)queue); +} + 
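A note on the enqueue/dequeue paths above: both ts_enqueue() and ts_dequeue() finish by re-deriving queue->base with a circular first-set-bit search over the 40-entry bitmap, and ts_dequeue() resets base to 0 when the search reports "no bit set" by returning the end value. The following is only a userspace sketch of that search, not the code from this patch. The names map_set(), map_test(), first_circular_bit() and circular_demo() are hypothetical stand-ins, and the plain loop is an assumed simplification of whatever bit-scanning helpers the kernel side uses.

```c
#include <assert.h>
#include <limits.h>

#define QLEN 40	/* assumed to mirror the 40 nice levels used above */
#define BITS_PER_WORD ((int)(sizeof(unsigned long) * CHAR_BIT))
#define NWORDS ((QLEN + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* hypothetical helper: set one bit in a multi-word bitmap */
static void map_set(unsigned long *map, int bit)
{
	map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* hypothetical helper: test one bit in a multi-word bitmap */
static int map_test(const unsigned long *map, int bit)
{
	return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}

/*
 * Return the first set bit at or after 'start', wrapping around to
 * bit 0 once 'end' is reached. Returns 'end' if no bit is set.
 * Requires 0 <= start < end.
 */
static int first_circular_bit(const unsigned long *map, int start, int end)
{
	int i;

	for (i = 0; i < end; i++) {
		int bit = (start + i) % end;

		if (map_test(map, bit))
			return bit;
	}
	return end;
}

/* walks through the "re-derive base" step on a small example */
int circular_demo(void)
{
	unsigned long map[NWORDS] = { 0 };
	int base = 5;

	/* empty bitmap: the search reports 'end', i.e. nothing queued */
	assert(first_circular_bit(map, base, QLEN) == QLEN);

	/* slots 3 and 39 occupied: from base 5 the scan reaches 39 first */
	map_set(map, 3);
	map_set(map, 39);
	base = first_circular_bit(map, base, QLEN);
	return base;
}
```

The point of the wrap-around is that a slot numerically below base (slot 3 here) is treated as being furthest in the future, so it is only picked once nothing at or after base remains.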
+static int nice_task_prio(struct nice_queue *nice, task_t *task) +{ + if (!task_queued(task)) + return task_deadline(task); + else { + int prio = task->sched_info.idx - nice->base; + return prio < 0 ? prio + NICE_QLEN : prio; + } +} + +static int ts_nice_prio(struct ts_queue *ts, struct nice_queue *nice) +{ + if (list_empty(&nice->list)) + return (int)(nice - ts->nice_levels); + else { + int prio = nice->idx - ts->base; + return prio < 0 ? prio + 40 : prio; + } +} + +/* 100% fake priority to report heuristics and the like */ +static int ts_prio(task_t *task) +{ + int policy_idx; + struct policy *policy; + struct ts_queue *ts; + struct nice_queue *nice; + + policy_idx = task->sched_info.policy; + policy = per_cpu(runqueues, task_cpu(task)).policies[policy_idx]; + ts = (struct ts_queue *)policy->queue; + nice = &ts->nice_levels[task->sched_info.cl_data.ts.nice]; + return 40*ts_nice_prio(ts, nice) + nice_task_prio(nice, task); +} + +static void ts_setprio(task_t *task, int prio) +{ +} + +static void ts_start_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_stop_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_sleep(struct queue *__queue, task_t *task) +{ +} + +static void ts_wake(struct queue *__queue, task_t *task) +{ +} + +static int nice_preempt(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + /* assume FB style preemption at wakeup */ + if (!task_queued(task) || !queue->curr) + return 1; + else { + int delta_t, delta_q; + delta_t = (task->sched_info.idx - queue->base + NICE_QLEN) + % NICE_QLEN; + delta_q = (queue->curr->sched_info.idx - queue->base + + NICE_QLEN) + % NICE_QLEN; + if (delta_t < delta_q) + return 1; + else if (task->sched_info.cl_data.ts.frac_cpu + < queue->curr->sched_info.cl_data.ts.frac_cpu) + return 1; + else + return 0; + } + check_nice(queue); +} + +static int ts_preempt(struct queue *__queue, task_t *task) +{ + int curr_nice; + struct ts_queue *queue = (struct 
ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + if (!queue->curr) + return 1; + + curr_nice = (int)(nice - queue->nice_levels); + + /* preempt when nice number is lower, or the above for matches */ + if (task->sched_info.cl_data.ts.nice != curr_nice) + return task->sched_info.cl_data.ts.nice < curr_nice; + else + return nice_preempt(nice, task); +} + +static struct queue_ops ts_ops = { + .init = ts_init, + .fini = nop_fini, + .tick = ts_tick, + .yield = ts_yield, + .curr = ts_curr, + .set_curr = ts_set_curr, + .tasks = ts_tasks, + .best = ts_best, + .enqueue = ts_enqueue, + .dequeue = ts_dequeue, + .start_wait = ts_start_wait, + .stop_wait = ts_stop_wait, + .sleep = ts_sleep, + .wake = ts_wake, + .preempt = ts_preempt, + .nice = ts_nice, + .renice = ts_renice, + .prio = ts_prio, + .setprio = ts_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy ts_policy = { + .ops = &ts_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/util.c sched-2.6.0-test11-5/kernel/sched/util.c --- linux-2.6.0-test11/kernel/sched/util.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/util.c 2003-12-19 08:43:20.000000000 -0800 @@ -0,0 +1,37 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <asm/page.h> +#include "queue.h" + +int find_first_circular_bit(unsigned long *addr, int start, int end) +{ + int bit = find_next_bit(addr, end, start); + if (bit < end) + return bit; + bit = find_first_bit(addr, start); + if (bit < start) + return bit; + return end; +} + +void queue_nop(struct queue *queue, task_t *task) +{ +} + +void nop_renice(struct queue *queue, task_t *task, int nice) +{ +} + +void nop_fini(struct policy *policy, int cpu) +{ +} + +unsigned long nop_timeslice(struct queue *queue, task_t *task) +{ + return 0; +} + +void nop_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ +} diff -prauN 
linux-2.6.0-test11/kernel/sched.c sched-2.6.0-test11-5/kernel/sched.c --- linux-2.6.0-test11/kernel/sched.c 2003-11-26 12:45:17.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched.c 2003-12-21 06:06:32.000000000 -0800 @@ -15,6 +15,8 @@ * and per-CPU runqueues. Cleanups and useful suggestions * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. + * 2003-12-17 Total rewrite and generalized scheduler policies + * by William Irwin. */ #include <linux/mm.h> @@ -38,6 +40,8 @@ #include <linux/cpu.h> #include <linux/percpu.h> +#include "sched/queue.h" + #ifdef CONFIG_NUMA #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu)) #else @@ -45,181 +49,79 @@ #endif /* - * Convert user-nice values [ -20 ... 0 ... 19 ] - * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ], - * and back. - */ -#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20) -#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20) -#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio) - -/* - * 'User priority' is the nice value converted to something we - * can work with better when scaling various scheduler parameters, - * it's a [ 0 ... 39 ] range. - */ -#define USER_PRIO(p) ((p)-MAX_RT_PRIO) -#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio) -#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO)) -#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\ - (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1))) - -/* - * Some helpers for converting nanosecond timing to jiffy resolution - */ -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ)) -#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ)) - -/* - * These are the 'tuning knobs' of the scheduler: - * - * Minimum timeslice is 10 msecs, default timeslice is 100 msecs, - * maximum timeslice is 200 msecs. Timeslices get refilled after - * they expire. 
- */ -#define MIN_TIMESLICE ( 10 * HZ / 1000) -#define MAX_TIMESLICE (200 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 25 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) -#define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) -#define NODE_THRESHOLD 125 -#define CREDIT_LIMIT 100 - -/* - * If a task is 'interactive' then we reinsert it in the active - * array after it has expired its current timeslice. (it will not - * continue to run immediately, it will still roundrobin with - * other interactive tasks.) - * - * This part scales the interactivity limit depending on niceness. - * - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. - * Here are a few examples of different nice levels: - * - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] - * - * (the X axis represents the possible -5 ... 0 ... +5 dynamic - * priority range a task can explore, a value of '1' means the - * task is rated interactive.) - * - * Ie. nice +19 tasks can never get 'interactive' enough to be - * reinserted into the active array. And only heavily CPU-hog nice -20 - * tasks will be expired. Default nice 0 tasks are somewhere between, - * it takes some effort for them to get interactive, but it's not - * too hard. - */ - -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) - -#ifdef CONFIG_SMP -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? 
: 1) - 1)) * \ - num_online_cpus()) -#else -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) -#endif - -#define SCALE(v1,v1_max,v2_max) \ - (v1) * (v2_max) / (v1_max) - -#define DELTA(p) \ - (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \ - INTERACTIVE_DELTA) - -#define TASK_INTERACTIVE(p) \ - ((p)->prio <= (p)->static_prio - DELTA(p)) - -#define JUST_INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - -#define HIGH_CREDIT(p) \ - ((p)->interactive_credit > CREDIT_LIMIT) - -#define LOW_CREDIT(p) \ - ((p)->interactive_credit < -CREDIT_LIMIT) - -#define TASK_PREEMPTS_CURR(p, rq) \ - ((p)->prio < (rq)->curr->prio) - -/* - * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ] - * to time slice values. - * - * The higher a thread's priority, the bigger timeslices - * it gets during one round of execution. But even the lowest - * priority thread gets MIN_TIMESLICE worth of execution time. - * - * task_timeslice() is the interface that is used by the scheduler. - */ - -#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \ - ((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1))) - -static inline unsigned int task_timeslice(task_t *p) -{ - return BASE_TIMESLICE(p); -} - -/* - * These are the runqueue data structures: - */ - -#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long)) - -typedef struct runqueue runqueue_t; - -struct prio_array { - int nr_active; - unsigned long bitmap[BITMAP_SIZE]; - struct list_head queue[MAX_PRIO]; -}; - -/* * This is the main, per-CPU runqueue data structure. * * Locking rule: those places that want to lock multiple runqueues * (such as the load balancing or the thread migration code), lock * acquire operations must be ordered by ascending &runqueue. 
*/ -struct runqueue { - spinlock_t lock; - unsigned long nr_running, nr_switches, expired_timestamp, - nr_uninterruptible; - task_t *curr, *idle; - struct mm_struct *prev_mm; - prio_array_t *active, *expired, arrays[2]; - int prev_cpu_load[NR_CPUS]; -#ifdef CONFIG_NUMA - atomic_t *node_nr_running; - int prev_node_load[MAX_NUMNODES]; -#endif - task_t *migration_thread; - struct list_head migration_queue; +DEFINE_PER_CPU(struct runqueue, runqueues); - atomic_t nr_iowait; +struct policy *policies[] = { + &rt_policy, + &ts_policy, + &batch_policy, + &idle_policy, + NULL, }; -static DEFINE_PER_CPU(struct runqueue, runqueues); - #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() (&__get_cpu_var(runqueues)) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) +#define rq_curr(rq) (rq)->__curr +#define cpu_curr(cpu) rq_curr(cpu_rq(cpu)) + +static inline struct policy *task_policy(task_t *task) +{ + unsigned long idx; + struct policy *policy; + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = task_rq(task)->policies[idx]; + check_policy(policy); + return policy; +} + +static inline struct policy *rq_policy(runqueue_t *rq) +{ + unsigned long idx; + task_t *task; + struct policy *policy; + + task = rq_curr(rq); + BUG_ON(!task); + BUG_ON((unsigned long)task < PAGE_OFFSET); + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy; +} + +static int __task_nice(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->nice(policy->queue, task); +} + +static inline void set_rq_curr(runqueue_t *rq, task_t *task) +{ + rq->curr = task->sched_info.policy; + __check_task_policy(rq->curr); + rq->__curr = task; +} + +static inline int task_preempts_curr(task_t *task, runqueue_t *rq) +{ + check_task_policy(rq_curr(rq)); + check_task_policy(task); + if (rq_curr(rq)->sched_info.policy != task->sched_info.policy) + return 
task->sched_info.policy < rq_curr(rq)->sched_info.policy; + else { + struct policy *policy = rq_policy(rq); + return policy->ops->preempt(policy->queue, task); + } +} /* * Default context-switch locking: @@ -227,7 +129,7 @@ static DEFINE_PER_CPU(struct runqueue, r #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(rq, p) (rq_curr(rq) == (p)) #endif #ifdef CONFIG_NUMA @@ -320,53 +222,32 @@ static inline void rq_unlock(runqueue_t } /* - * Adding/removing a task to/from a priority array: + * Adding/removing a task to/from a policy's queue. + * We dare not BUG_ON() a wrong task_queued() as boot-time + * calls may trip it. */ -static inline void dequeue_task(struct task_struct *p, prio_array_t *array) +static inline void dequeue_task(task_t *task, runqueue_t *rq) { - array->nr_active--; - list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); + struct policy *policy = task_policy(task); + BUG_ON(!task_queued(task)); + policy->ops->dequeue(policy->queue, task); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(!test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __clear_bit(task->sched_info.policy, &rq->policy_bitmap); + } + clear_task_queued(task); } -static inline void enqueue_task(struct task_struct *p, prio_array_t *array) +static inline void enqueue_task(task_t *task, runqueue_t *rq) { - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} - -/* - * effective_prio - return the priority that is based on the static - * priority but is modified by bonuses/penalties. - * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. 
- * - * We use 25% of the full 0...39 priority range so that: - * - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. - * - * Both properties are important to certain workloads. - */ -static int effective_prio(task_t *p) -{ - int bonus, prio; - - if (rt_task(p)) - return p->prio; - - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2; - - prio = p->static_prio - bonus; - if (prio < MAX_RT_PRIO) - prio = MAX_RT_PRIO; - if (prio > MAX_PRIO-1) - prio = MAX_PRIO-1; - return prio; + struct policy *policy = task_policy(task); + BUG_ON(task_queued(task)); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __set_bit(task->sched_info.policy, &rq->policy_bitmap); + } + policy->ops->enqueue(policy->queue, task); + set_task_queued(task); } /* @@ -374,134 +255,34 @@ static int effective_prio(task_t *p) */ static inline void __activate_task(task_t *p, runqueue_t *rq) { - enqueue_task(p, rq->active); + enqueue_task(p, rq); nr_running_inc(rq); } -static void recalc_task_prio(task_t *p, unsigned long long now) -{ - unsigned long long __sleep_time = now - p->timestamp; - unsigned long sleep_time; - - if (__sleep_time > NS_MAX_SLEEP_AVG) - sleep_time = NS_MAX_SLEEP_AVG; - else - sleep_time = (unsigned long)__sleep_time; - - if (likely(sleep_time > 0)) { - /* - * User tasks that sleep a long time are categorised as - * idle and will get just interactive status to stay active & - * prevent them suddenly becoming cpu hogs and starving - * other processes. - */ - if (p->mm && p->activated != -1 && - sleep_time > JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - - AVG_TIMESLICE); - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } else { - /* - * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. - */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? 
: 1; - - /* - * Tasks with low interactive_credit are limited to - * one timeslice worth of sleep avg bonus. - */ - if (LOW_CREDIT(p) && - sleep_time > JIFFIES_TO_NS(task_timeslice(p))) - sleep_time = - JIFFIES_TO_NS(task_timeslice(p)); - - /* - * Non high_credit tasks waking from uninterruptible - * sleep are limited in their sleep_avg rise as they - * are likely to be cpu hogs waiting on I/O - */ - if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){ - if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p)) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = - JUST_INTERACTIVE_SLEEP(p); - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a task - * spends sleeping, the higher the average gets - and the - * higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - if (p->sleep_avg > NS_MAX_SLEEP_AVG){ - p->sleep_avg = NS_MAX_SLEEP_AVG; - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } - } - } - - p->prio = effective_prio(p); -} - /* * activate_task - move a task to the runqueue and do priority recalculation * * Update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ -static inline void activate_task(task_t *p, runqueue_t *rq) +static inline void activate_task(task_t *task, runqueue_t *rq) { - unsigned long long now = sched_clock(); - - recalc_task_prio(p, now); - - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (!p->activated){ - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. 
So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->activated = 2; - else - /* - * Normal first-time wakeups get a credit too for on-runqueue - * time, but it will be weighted down: - */ - p->activated = 1; - } - p->timestamp = now; - - __activate_task(p, rq); + struct policy *policy = task_policy(task); + policy->ops->wake(policy->queue, task); + __activate_task(task, rq); } /* * deactivate_task - remove a task from the runqueue. */ -static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) +static inline void deactivate_task(task_t *task, runqueue_t *rq) { + struct policy *policy = task_policy(task); nr_running_dec(rq); - if (p->state == TASK_UNINTERRUPTIBLE) + if (task->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; - dequeue_task(p, p->array); - p->array = NULL; + policy->ops->sleep(policy->queue, task); + dequeue_task(task, rq); } /* @@ -625,7 +406,7 @@ repeat_lock_task: rq = task_rq_lock(p, &flags); old_state = p->state; if (old_state & state) { - if (!p->array) { + if (!task_queued(p)) { /* * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. @@ -644,14 +425,13 @@ repeat_lock_task: * Tasks on involuntary sleep don't earn * sleep_avg beyond just interactive state. */ - p->activated = -1; } if (sync) __activate_task(p, rq); else { activate_task(p, rq); - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + if (task_preempts_curr(p, rq)) + resched_task(rq_curr(rq)); } success = 1; } @@ -679,68 +459,26 @@ int wake_up_state(task_t *p, unsigned in * This function will do some initial scheduler statistics housekeeping * that must be done for every newly created process. 
*/ -void wake_up_forked_process(task_t * p) +void wake_up_forked_process(task_t *task) { unsigned long flags; runqueue_t *rq = task_rq_lock(current, &flags); - p->state = TASK_RUNNING; - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. - */ - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->interactive_credit = 0; - - p->prio = effective_prio(p); - set_task_cpu(p, smp_processor_id()); - - if (unlikely(!current->array)) - __activate_task(p, rq); - else { - p->prio = current->prio; - list_add_tail(&p->run_list, &current->run_list); - p->array = current->array; - p->array->nr_active++; - nr_running_inc(rq); - } + task->state = TASK_RUNNING; + set_task_cpu(task, smp_processor_id()); + if (unlikely(!task_queued(current))) + __activate_task(task, rq); + else + activate_task(task, rq); task_rq_unlock(rq, &flags); } /* - * Potentially available exiting-child timeslices are - * retrieved here - this way the parent does not get - * penalized for creating too many threads. - * - * (this cannot be used to 'generate' timeslices - * artificially, because any timeslice recovered here - * was given away by the parent in the first place.) + * Policies that depend on trapping fork() and exit() may need to + * put a hook here. */ -void sched_exit(task_t * p) +void sched_exit(task_t *task) { - unsigned long flags; - - local_irq_save(flags); - if (p->first_time_slice) { - p->parent->time_slice += p->time_slice; - if (unlikely(p->parent->time_slice > MAX_TIMESLICE)) - p->parent->time_slice = MAX_TIMESLICE; - } - local_irq_restore(flags); - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. 
- */ - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); } /** @@ -1128,18 +866,18 @@ out: * pull_task - move a task from a remote runqueue to the local runqueue. * Both runqueues must be locked. */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline void pull_task(runqueue_t *src_rq, task_t *p, runqueue_t *this_rq, int this_cpu) { - dequeue_task(p, src_array); + dequeue_task(p, src_rq); nr_running_dec(src_rq); set_task_cpu(p, this_cpu); nr_running_inc(this_rq); - enqueue_task(p, this_rq->active); + enqueue_task(p, this_rq); /* * Note that idle threads have a prio of MAX_PRIO, for this test * to be always true for them. */ - if (TASK_PREEMPTS_CURR(p, this_rq)) + if (task_preempts_curr(p, this_rq)) set_need_resched(); } @@ -1150,14 +888,14 @@ static inline void pull_task(runqueue_t * ((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \ * cache_decay_ticks)) && !task_running(rq, p) && \ * cpu_isset(this_cpu, (p)->cpus_allowed)) + * + * Since there isn't a timestamp anymore, this needs adjustment. 
*/ static inline int can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle) { - unsigned long delta = sched_clock() - tsk->timestamp; - - if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks))) + if (!idle) return 0; if (task_running(rq, tsk)) return 0; @@ -1176,11 +914,8 @@ can_migrate_task(task_t *tsk, runqueue_t */ static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, this_cpu = smp_processor_id(); runqueue_t *busiest; - prio_array_t *array; - struct list_head *head, *curr; - task_t *tmp; busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask); if (!busiest) @@ -1192,37 +927,6 @@ static void load_balance(runqueue_t *thi */ imbalance /= 2; - /* - * We first consider expired tasks. Those will likely not be - * executed in the near future, and they are most likely to - * be cache-cold, thus switching CPUs has the least effect - * on them. - */ - if (busiest->expired->nr_active) - array = busiest->expired; - else - array = busiest->active; - -new_array: - /* Start searching at priority 0: */ - idx = 0; -skip_bitmap: - if (!idx) - idx = sched_find_first_bit(array->bitmap); - else - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); - if (idx >= MAX_PRIO) { - if (array == busiest->expired) { - array = busiest->active; - goto new_array; - } - goto out_unlock; - } - - head = array->queue + idx; - curr = head->prev; -skip_queue: - tmp = list_entry(curr, task_t, run_list); /* * We do not migrate tasks that are: @@ -1231,21 +935,19 @@ skip_queue: * 3) are cache-hot on their current CPU. 
*/ - curr = curr->prev; + do { + struct policy *policy; + task_t *task; + + policy = rq_migrate_policy(busiest); + if (!policy) + break; + task = policy->migrate(policy->queue); + if (!task) + break; + pull_task(busiest, task, this_rq, this_cpu); + } while (!idle && --imbalance); - if (!can_migrate_task(tmp, busiest, this_cpu, idle)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } out_unlock: spin_unlock(&busiest->lock); out: @@ -1356,10 +1058,10 @@ EXPORT_PER_CPU_SYMBOL(kstat); */ void scheduler_tick(int user_ticks, int sys_ticks) { - int cpu = smp_processor_id(); + int idle, cpu = smp_processor_id(); struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + struct policy *policy; runqueue_t *rq = this_rq(); - task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); @@ -1373,98 +1075,28 @@ void scheduler_tick(int user_ticks, int sys_ticks = 0; } - if (p == rq->idle) { - if (atomic_read(&rq->nr_iowait) > 0) - cpustat->iowait += sys_ticks; - else - cpustat->idle += sys_ticks; - rebalance_tick(rq, 1); - return; - } - if (TASK_NICE(p) > 0) - cpustat->nice += user_ticks; - else - cpustat->user += user_ticks; - cpustat->system += sys_ticks; - - /* Task might have expired already, but not scheduled off yet */ - if (p->array != rq->active) { - set_tsk_need_resched(p); - goto out; - } spin_lock(&rq->lock); - /* - * The task was running during this tick - update the - * time slice counter. Note: we do not update a thread's - * priority until it either goes to sleep or uses up its - * timeslice. This makes it possible for interactive tasks - * to use up their timeslices at their highest priority levels. - */ - if (unlikely(rt_task(p))) { - /* - * RR tasks need a special form of timeslice management. - * FIFO tasks have no timeslices. 
- */ - if ((p->policy == SCHED_RR) && !--p->time_slice) { - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - set_tsk_need_resched(p); - - /* put it at the end of the queue: */ - dequeue_task(p, rq->active); - enqueue_task(p, rq->active); - } - goto out_unlock; - } - if (!--p->time_slice) { - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - - if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; - if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { - enqueue_task(p, rq->expired); - } else - enqueue_task(p, rq->active); - } else { - /* - * Prevent a too long timeslice allowing a task to monopolize - * the CPU. We do this by splitting up the timeslice into - * smaller pieces. - * - * Note: this does not mean the task's timeslices expire or - * get lost in any way, they just might be preempted by - * another task of equal priority. (one with higher - * priority would have preempted this task already.) We - * requeue this task to the end of the list on this priority - * level, which is in essence a round-robin of tasks with - * equal priority. - * - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. 
- */ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - - p->time_slice) % TIMESLICE_GRANULARITY(p)) && - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && - (p->array == rq->active)) { - - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - enqueue_task(p, rq->active); - } - } -out_unlock: + policy = rq_policy(rq); + idle = policy->ops->tick(policy->queue, current, user_ticks, sys_ticks); spin_unlock(&rq->lock); -out: - rebalance_tick(rq, 0); + rebalance_tick(rq, idle); } void scheduling_functions_start_here(void) { } +static inline task_t *find_best_task(runqueue_t *rq) +{ + int idx; + struct policy *policy; + + BUG_ON(!rq->policy_bitmap); + idx = __ffs(rq->policy_bitmap); + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy->ops->best(policy->queue); +} + /* * schedule() is the main scheduler function. */ @@ -1472,11 +1104,7 @@ asmlinkage void schedule(void) { task_t *prev, *next; runqueue_t *rq; - prio_array_t *array; - struct list_head *queue; - unsigned long long now; - unsigned long run_time; - int idx; + struct policy *policy; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1494,22 +1122,9 @@ need_resched: preempt_disable(); prev = current; rq = this_rq(); + policy = rq_policy(rq); release_kernel_lock(prev); - now = sched_clock(); - if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG)) - run_time = now - prev->timestamp; - else - run_time = NS_MAX_SLEEP_AVG; - - /* - * Tasks with interactive credits get charged less run_time - * at high sleep_avg to delay them losing their interactive - * status - */ - if (HIGH_CREDIT(prev)) - run_time /= (CURRENT_BONUS(prev) ? 
: 1); - spin_lock_irq(&rq->lock); /* @@ -1530,66 +1145,27 @@ need_resched: prev->nvcsw++; break; case TASK_RUNNING: + policy->ops->start_wait(policy->queue, prev); prev->nivcsw++; } + pick_next_task: - if (unlikely(!rq->nr_running)) { #ifdef CONFIG_SMP + if (unlikely(!rq->nr_running)) load_balance(rq, 1, cpu_to_node_mask(smp_processor_id())); - if (rq->nr_running) - goto pick_next_task; #endif - next = rq->idle; - rq->expired_timestamp = 0; - goto switch_tasks; - } - - array = rq->active; - if (unlikely(!array->nr_active)) { - /* - * Switch the active and expired arrays. - */ - rq->active = rq->expired; - rq->expired = array; - array = rq->active; - rq->expired_timestamp = 0; - } - - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, task_t, run_list); - - if (next->activated > 0) { - unsigned long long delta = now - next->timestamp; - - if (next->activated == 1) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - dequeue_task(next, array); - recalc_task_prio(next, next->timestamp + delta); - enqueue_task(next, array); - } - next->activated = 0; -switch_tasks: + next = find_best_task(rq); + BUG_ON(!next); prefetch(next); clear_tsk_need_resched(prev); RCU_qsctr(task_cpu(prev))++; - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0){ - prev->sleep_avg = 0; - if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev))) - prev->interactive_credit--; - } - prev->timestamp = now; - if (likely(prev != next)) { - next->timestamp = now; rq->nr_switches++; - rq->curr = next; - prepare_arch_switch(rq, next); + policy = task_policy(next); + policy->ops->set_curr(policy->queue, next); + set_rq_curr(rq, next); prev = context_switch(rq, prev, next); barrier(); @@ -1845,45 +1421,46 @@ void scheduling_functions_end_here(void) void set_user_nice(task_t *p, long nice) { unsigned long flags; - prio_array_t *array; runqueue_t *rq; - int old_prio, new_prio, delta; + struct policy *policy; + int delta, 
queued; - if (TASK_NICE(p) == nice || nice < -20 || nice > 19) + if (nice < -20 || nice > 19) return; /* * We have to be careful, if called from sys_setpriority(), * the task might be in the middle of scheduling on another CPU. */ rq = task_rq_lock(p, &flags); + delta = nice - __task_nice(p); + if (!delta) { + if (p->pid == 0 || p->pid == 1) + printk("no change in nice, set_user_nice() nops!\n"); + goto out_unlock; + } + + policy = task_policy(p); + /* * The RT priorities are set via setscheduler(), but we still * allow the 'normal' nice value to be set - but as expected * it wont have any effect on scheduling until the task is * not SCHED_NORMAL: */ - if (rt_task(p)) { - p->static_prio = NICE_TO_PRIO(nice); - goto out_unlock; - } - array = p->array; - if (array) - dequeue_task(p, array); - - old_prio = p->prio; - new_prio = NICE_TO_PRIO(nice); - delta = new_prio - old_prio; - p->static_prio = NICE_TO_PRIO(nice); - p->prio += delta; + queued = task_queued(p); + if (queued) + dequeue_task(p, rq); + + policy->ops->renice(policy->queue, p, nice); - if (array) { - enqueue_task(p, array); + if (queued) { + enqueue_task(p, rq); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: */ if (delta < 0 || (delta > 0 && task_running(rq, p))) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: task_rq_unlock(rq, &flags); @@ -1919,7 +1496,7 @@ asmlinkage long sys_nice(int increment) if (increment > 40) increment = 40; - nice = PRIO_TO_NICE(current->static_prio) + increment; + nice = task_nice(current) + increment; if (nice < -20) nice = -20; if (nice > 19) @@ -1935,6 +1512,12 @@ asmlinkage long sys_nice(int increment) #endif +static int __task_prio(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->prio(task); +} + /** * task_prio - return the priority value of a given task. * @p: the task in question. 
@@ -1943,29 +1526,111 @@ asmlinkage long sys_nice(int increment) * RT tasks are offset by -200. Normal tasks are centered * around 0, value goes from -16 to +15. */ -int task_prio(task_t *p) +int task_prio(task_t *task) { - return p->prio - MAX_RT_PRIO; + int prio; + unsigned long flags; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + prio = __task_prio(task); + task_rq_unlock(rq, &flags); + return prio; } /** * task_nice - return the nice value of a given task. * @p: the task in question. */ -int task_nice(task_t *p) +int task_nice(task_t *task) { - return TASK_NICE(p); + int nice; + unsigned long flags; + runqueue_t *rq; + + + rq = task_rq_lock(task, &flags); + nice = __task_nice(task); + task_rq_unlock(rq, &flags); + return nice; } EXPORT_SYMBOL(task_nice); +int task_sched_policy(task_t *task) +{ + check_task_policy(task); + switch (task->sched_info.policy) { + case SCHED_POLICY_RT: + if (task->sched_info.cl_data.rt.rt_policy + == RT_POLICY_RR) + return SCHED_RR; + else + return SCHED_FIFO; + case SCHED_POLICY_TS: + return SCHED_NORMAL; + case SCHED_POLICY_BATCH: + return SCHED_BATCH; + case SCHED_POLICY_IDLE: + return SCHED_IDLE; + default: + BUG(); + return -1; + } +} +EXPORT_SYMBOL(task_sched_policy); + +void set_task_sched_policy(task_t *task, int policy) +{ + check_task_policy(task); + BUG_ON(task_queued(task)); + switch (policy) { + case SCHED_FIFO: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_FIFO; + break; + case SCHED_RR: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_RR; + break; + case SCHED_NORMAL: + task->sched_info.policy = SCHED_POLICY_TS; + break; + case SCHED_BATCH: + task->sched_info.policy = SCHED_POLICY_BATCH; + break; + case SCHED_IDLE: + task->sched_info.policy = SCHED_POLICY_IDLE; + break; + default: + BUG(); + break; + } + check_task_policy(task); +} +EXPORT_SYMBOL(set_task_sched_policy); + +int rt_task(task_t *task) +{ + 
check_task_policy(task); + return !!(task->sched_info.policy == SCHED_POLICY_RT); +} +EXPORT_SYMBOL(rt_task); + /** * idle_cpu - is a given cpu idle currently? * @cpu: the processor in question. */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + int idle; + unsigned long flags; + runqueue_t *rq = cpu_rq(cpu); + + spin_lock_irqsave(&rq->lock, flags); + idle = !!(rq->curr == SCHED_POLICY_IDLE); + spin_unlock_irqrestore(&rq->lock, flags); + return idle; } EXPORT_SYMBOL_GPL(idle_cpu); @@ -1985,11 +1650,10 @@ static inline task_t *find_process_by_pi static int setscheduler(pid_t pid, int policy, struct sched_param __user *param) { struct sched_param lp; - int retval = -EINVAL; - int oldprio; - prio_array_t *array; + int queued, retval = -EINVAL; unsigned long flags; runqueue_t *rq; + struct policy *rq_policy; task_t *p; if (!param || pid < 0) @@ -2017,7 +1681,7 @@ static int setscheduler(pid_t pid, int p rq = task_rq_lock(p, &flags); if (policy < 0) - policy = p->policy; + policy = task_sched_policy(p); else { retval = -EINVAL; if (policy != SCHED_FIFO && policy != SCHED_RR && @@ -2047,29 +1711,23 @@ static int setscheduler(pid_t pid, int p if (retval) goto out_unlock; - array = p->array; - if (array) + queued = task_queued(p); + if (queued) deactivate_task(p, task_rq(p)); retval = 0; - p->policy = policy; - p->rt_priority = lp.sched_priority; - oldprio = p->prio; - if (policy != SCHED_NORMAL) - p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority; - else - p->prio = p->static_prio; - if (array) { + set_task_sched_policy(p, policy); + check_task_policy(p); + rq_policy = rq->policies[p->sched_info.policy]; + check_policy(rq_policy); + rq_policy->ops->setprio(p, lp.sched_priority); + if (queued) { __activate_task(p, task_rq(p)); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on * this runqueue and our priority is higher than the current's */ - if (rq->curr == p) { - if (p->prio 
> oldprio) - resched_task(rq->curr); - } else if (p->prio < rq->curr->prio) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: @@ -2121,7 +1779,7 @@ asmlinkage long sys_sched_getscheduler(p if (p) { retval = security_task_getscheduler(p); if (!retval) - retval = p->policy; + retval = task_sched_policy(p); } read_unlock(&tasklist_lock); @@ -2153,7 +1811,7 @@ asmlinkage long sys_sched_getparam(pid_t if (retval) goto out_unlock; - lp.sched_priority = p->rt_priority; + lp.sched_priority = task_prio(p); read_unlock(&tasklist_lock); /* @@ -2262,32 +1920,13 @@ out_unlock: */ asmlinkage long sys_sched_yield(void) { + struct policy *policy; runqueue_t *rq = this_rq_lock(); - prio_array_t *array = current->array; - - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (likely(!rt_task(current))) { - dequeue_task(current, array); - enqueue_task(current, rq->expired); - } else { - list_del(&current->run_list); - list_add_tail(&current->run_list, array->queue + current->prio); - } - /* - * Since we are going to call schedule() anyway, there's - * no need to preempt: - */ + policy = rq_policy(rq); + policy->ops->yield(policy->queue, current); _raw_spin_unlock(&rq->lock); preempt_enable_no_resched(); - schedule(); - return 0; } @@ -2387,6 +2026,19 @@ asmlinkage long sys_sched_get_priority_m return ret; } +static inline unsigned long task_timeslice(task_t *task) +{ + unsigned long flags, timeslice; + struct policy *policy; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + policy = task_policy(task); + timeslice = policy->ops->timeslice(policy->queue, task); + task_rq_unlock(rq, &flags); + return timeslice; +} + /** * sys_sched_rr_get_interval - return the default timeslice of a process. * @pid: pid of the process. @@ -2414,8 +2066,7 @@ asmlinkage long sys_sched_rr_get_interva if (retval) goto out_unlock; - jiffies_to_timespec(p->policy & SCHED_FIFO ? 
- 0 : task_timeslice(p), &t); + jiffies_to_timespec(task_timeslice(p), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0; out_nounlock: @@ -2523,17 +2174,22 @@ void show_state(void) void __init init_idle(task_t *idle, int cpu) { runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle)); + struct policy *policy; unsigned long flags; local_irq_save(flags); double_rq_lock(idle_rq, rq); - - idle_rq->curr = idle_rq->idle = idle; + policy = rq_policy(rq); + BUG_ON(policy != task_policy(idle)); + printk("deactivating, have %d tasks\n", + policy->ops->tasks(policy->queue)); deactivate_task(idle, rq); - idle->array = NULL; - idle->prio = MAX_PRIO; + set_task_sched_policy(idle, SCHED_IDLE); idle->state = TASK_RUNNING; set_task_cpu(idle, cpu); + activate_task(idle, rq); + nr_running_dec(rq); + set_rq_curr(rq, idle); double_rq_unlock(idle_rq, rq); set_tsk_need_resched(idle); local_irq_restore(flags); @@ -2804,38 +2460,27 @@ __init static void init_kstat(void) { void __init sched_init(void) { runqueue_t *rq; - int i, j, k; + int i, j; /* Init the kstat counters */ init_kstat(); for (i = 0; i < NR_CPUS; i++) { - prio_array_t *array; - rq = cpu_rq(i); - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); INIT_LIST_HEAD(&rq->migration_queue); atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); - - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); - } + memcpy(rq->policies, policies, sizeof(policies)); + for (j = 0; j < BITS_PER_LONG && rq->policies[j]; ++j) + rq->policies[j]->ops->init(rq->policies[j], i); } /* * We have to do a little magic to get the first * thread right in SMP mode. 
*/ rq = this_rq(); - rq->curr = current; - rq->idle = current; + set_task_sched_policy(current, SCHED_NORMAL); + set_rq_curr(rq, current); set_task_cpu(current, smp_processor_id()); wake_up_forked_process(current); diff -prauN linux-2.6.0-test11/lib/Makefile sched-2.6.0-test11-5/lib/Makefile --- linux-2.6.0-test11/lib/Makefile 2003-11-26 12:42:55.000000000 -0800 +++ sched-2.6.0-test11-5/lib/Makefile 2003-12-20 15:09:16.000000000 -0800 @@ -5,7 +5,7 @@ lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \ bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \ - kobject.o idr.o div64.o parser.o + kobject.o idr.o div64.o parser.o binomial.o lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o diff -prauN linux-2.6.0-test11/lib/binomial.c sched-2.6.0-test11-5/lib/binomial.c --- linux-2.6.0-test11/lib/binomial.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/lib/binomial.c 2003-12-20 17:32:09.000000000 -0800 @@ -0,0 +1,138 @@ +#include <linux/kernel.h> +#include <linux/binomial.h> + +struct binomial *binomial_minimum(struct binomial **heap) +{ + struct binomial *minimum, *tmp; + + for (minimum = NULL, tmp = *heap; tmp; tmp = tmp->sibling) { + if (!minimum || minimum->priority > tmp->priority) + minimum = tmp; + } + return minimum; +} + +static void binomial_link(struct binomial *left, struct binomial *right) +{ + left->parent = right; + left->sibling = right->child; + right->child = left; + right->degree++; +} + +static void binomial_merge(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + while (*left && *right) { + if ((*left)->degree < (*right)->degree) { + *both = *left; + left = &(*left)->sibling; + } else { + *both = *right; + right = &(*right)->sibling; + } + both = &(*both)->sibling; + } + /* + * for more safety: + * *left = *right = NULL; + */ +} + +void binomial_union(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + struct binomial 
*prev, *tmp, *next; + + binomial_merge(both, left, right); + if (!(tmp = *both)) + return; + + for (prev = NULL, next = tmp->sibling; next; next = tmp->sibling) { + if ((next->sibling && next->sibling->degree == tmp->degree) + || tmp->degree != next->degree) { + prev = tmp; + tmp = next; + } else if (tmp->priority <= next->priority) { + tmp->sibling = next->sibling; + binomial_link(next, tmp); + } else { + if (!prev) + *both = next; + else + prev->sibling = next; + binomial_link(tmp, next); + tmp = next; + } + } +} + +void binomial_insert(struct binomial **heap, struct binomial *element) +{ + element->parent = NULL; + element->child = NULL; + element->sibling = NULL; + element->degree = 0; + binomial_union(heap, heap, &element); +} + +static void binomial_reverse(struct binomial **in, struct binomial **out) +{ + while (*in) { + struct binomial *tmp = *in; + *in = (*in)->sibling; + tmp->sibling = *out; + *out = tmp; + } +} + +struct binomial *binomial_extract_min(struct binomial **heap) +{ + struct binomial *tmp, *minimum, *last, *min_last, *new_heap; + + minimum = last = min_last = new_heap = NULL; + for (tmp = *heap; tmp; last = tmp, tmp = tmp->sibling) { + if (!minimum || tmp->priority < minimum->priority) { + minimum = tmp; + min_last = last; + } + } + if (min_last && minimum) + min_last->sibling = minimum->sibling; + else if (minimum) + (*heap)->sibling = minimum->sibling; + else + return NULL; + binomial_reverse(&minimum->child, &new_heap); + binomial_union(heap, heap, &new_heap); + return minimum; +} + +void binomial_decrease(struct binomial **heap, struct binomial *element, + unsigned increment) +{ + struct binomial *tmp, *last = NULL; + + element->priority -= min(element->priority, increment); + last = element; + tmp = last->parent; + while (tmp && last->priority < tmp->priority) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + last = tmp; + tmp = tmp->parent; + } +} + +void binomial_delete(struct 
binomial **heap, struct binomial *element) +{ + struct binomial *tmp, *last = element; + for (tmp = last->parent; tmp; last = tmp, tmp = tmp->parent) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + } + binomial_reverse(&last->child, &tmp); + binomial_union(heap, heap, &tmp); +} diff -prauN linux-2.6.0-test11/mm/oom_kill.c sched-2.6.0-test11-5/mm/oom_kill.c --- linux-2.6.0-test11/mm/oom_kill.c 2003-11-26 12:44:16.000000000 -0800 +++ sched-2.6.0-test11-5/mm/oom_kill.c 2003-12-17 07:07:53.000000000 -0800 @@ -158,7 +158,6 @@ static void __oom_kill_task(task_t *p) * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ - p->time_slice = HZ; p->flags |= PF_MEMALLOC | PF_MEMDIE; /* This process has hardware access, be more careful. */
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 22:08 ` Matt Mackall 2007-04-17 22:32 ` William Lee Irwin III 1 sibling, 1 reply; 713+ messages in thread From: Matt Mackall @ 2007-04-17 22:08 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > [...] Also rest assured that the tone of the critique is not hostile, > > and wasn't meant to sound that way. > > ok :) (And i guess i was too touchy - sorry about coming out swinging.) > > > Also, given the general comments it appears clear that some > > statistical metric of deviation from the intended behavior furthermore > > qualified by timescale is necessary, so this appears to be headed > > toward a sort of performance metric as opposed to a pass/fail test > > anyway. However, to even measure this at all, some statement of > > intention is required. I'd prefer that there be a Linux-standard > > semantics for nice so results are more directly comparable and so that > > users also get similar nice behavior from the scheduler as it varies > > over time and possibly implementations if users should care to switch > > them out with some scheduler patch or other. > > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! How's this: If you're running two identical CPU hog tasks A and B differing only by nice level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a constant f(Anice - Bnice). Other definitions make things hard to analyze and probably not well-bounded when confronted with > 2 tasks. 
I -think- this implies keeping a separate scaled CPU usage counter, where the scaling factor is a trivial exponential function of nice level where f(0) == 1. Then you schedule based on this scaled usage counter rather than unscaled. I also suspect we want to keep the exponential base small so that the maximal difference is 10x-100x. -- Mathematics is the supreme nostalgia of our time.
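The scaled usage counter described above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not code from any patch in this thread: the names nice_weight() and simulate_two_hogs() are hypothetical, the weight is taken to be base^(-nice) so that f(0) == 1, and time advances in whole ticks.

```c
#include <assert.h>

/* Hypothetical weight function: f(nice) = base^(-nice), so f(0) == 1. */
double nice_weight(int nice, double base)
{
	double w = 1.0;
	int i;

	for (i = 0; i < nice; i++)
		w /= base;	/* positive nice: smaller weight */
	for (i = 0; i > nice; i--)
		w *= base;	/* negative nice: larger weight */
	return w;
}

/*
 * Run two always-runnable CPU hogs A and B for `ticks` ticks.  Each tick
 * goes to the task with the smaller scaled usage; running for one tick
 * adds 1/weight to that task's scaled usage.  Returns the ratio
 * cputime(A)/cputime(B) measured in ticks.
 */
double simulate_two_hogs(int nice_a, int nice_b, double base, long ticks)
{
	double weight_a = nice_weight(nice_a, base);
	double weight_b = nice_weight(nice_b, base);
	double scaled_a = 0.0, scaled_b = 0.0;
	long ran_a = 0, ran_b = 0;
	long t;

	/* Two ticks are enough to guarantee that B runs at least once. */
	assert(ticks >= 2 && base > 1.0);

	for (t = 0; t < ticks; t++) {
		if (scaled_a <= scaled_b) {
			scaled_a += 1.0 / weight_a;
			ran_a++;
		} else {
			scaled_b += 1.0 / weight_b;
			ran_b++;
		}
	}
	return (double)ran_a / (double)ran_b;
}
```

With base 1.25 and nice levels 0 and 5, the ratio should settle near 1.25^5, roughly 3.05, and it depends only on the difference of the two nice levels, which is the f(Anice - Bnice) property asked for above.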
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:08 ` Matt Mackall @ 2007-04-17 22:32 ` William Lee Irwin III 2007-04-17 22:39 ` Matt Mackall 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:32 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: >> yeah. If you could come up with a sane definition that also translates >> into low overhead on the algorithm side that would be great! On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > How's this: > If you're running two identical CPU hog tasks A and B differing only by nice > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > constant f(Anice - Bnice). > Other definitions make things hard to analyze and probably not > well-bounded when confronted with > 2 tasks. > I -think- this implies keeping a separate scaled CPU usage counter, > where the scaling factor is a trivial exponential function of nice > level where f(0) == 1. Then you schedule based on this scaled usage > counter rather than unscaled. > I also suspect we want to keep the exponential base small so that the > maximal difference is 10x-100x. I'm already working with this as my assumed nice semantics (actually something with a specific exponential base, suggested in other emails) until others start saying they want something different and agree. -- wli
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:32 ` William Lee Irwin III @ 2007-04-17 22:39 ` Matt Mackall 2007-04-17 22:59 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Matt Mackall @ 2007-04-17 22:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > >> yeah. If you could come up with a sane definition that also translates > >> into low overhead on the algorithm side that would be great! > > On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > > How's this: > > If you're running two identical CPU hog tasks A and B differing only by nice > > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > > constant f(Anice - Bnice). > > Other definitions make things hard to analyze and probably not > > well-bounded when confronted with > 2 tasks. > > I -think- this implies keeping a separate scaled CPU usage counter, > > where the scaling factor is a trivial exponential function of nice > > level where f(0) == 1. Then you schedule based on this scaled usage > > counter rather than unscaled. > > I also suspect we want to keep the exponential base small so that the > > maximal difference is 10x-100x. > > I'm already working with this as my assumed nice semantics (actually > something with a specific exponential base, suggested in other emails) > until others start saying they want something different and agree. Good. This has a couple nice mathematical properties, including "bounded unfairness" which I mentioned earlier. What base are you looking at? -- Mathematics is the supreme nostalgia of our time. 
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:39 ` Matt Mackall @ 2007-04-17 22:59 ` William Lee Irwin III 2007-04-17 22:57 ` Matt Mackall 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:59 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm already working with this as my assumed nice semantics (actually >> something with a specific exponential base, suggested in other emails) >> until others start saying they want something different and agree. On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > Good. This has a couple nice mathematical properties, including > "bounded unfairness" which I mentioned earlier. What base are you > looking at? I'm working with the following suggestion: On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. I'm not likely to write the testcase until this upcoming weekend, though. -- wli
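The property James Bruce describes for X ~= 1.2589 can be checked with a short sketch. The helper names share_vs_nice0() and cpu_fraction() are hypothetical, and the model (share proportional to X^(-nice), two CPU hogs on one CPU) is an assumption taken from the discussion rather than from any scheduler code.

```c
#include <assert.h>

/* Per-nice-level step quoted in the thread: exp(ln(10)/10), rounded. */
#define NICE_STEP 1.2589

/*
 * CPU time of a task at `nice` relative to a competing nice-0 task,
 * assuming share(nice) = NICE_STEP^(-nice).  nice 20 is accepted here
 * only so that the "1/100" claim can be checked.
 */
double share_vs_nice0(int nice)
{
	double share = 1.0;
	int i;

	assert(nice >= -20 && nice <= 20);
	for (i = 0; i < nice; i++)
		share /= NICE_STEP;
	for (i = 0; i > nice; i--)
		share *= NICE_STEP;
	return share;
}

/*
 * Fraction of one CPU the niced task receives when it competes with
 * exactly one nice-0 hog: share / (share + 1).
 */
double cpu_fraction(int nice)
{
	double share = share_vs_nice0(nice);

	return share / (share + 1.0);
}
```

Under these assumptions share_vs_nice0(10) comes out at about 0.100 and share_vs_nice0(20) at about 0.010, matching the quoted 1/10 and 1/100. Note that "1/10th the cpu of a nice=0 task" is a ratio between the two tasks: as a fraction of the whole CPU the nice=10 task gets about 1/11, roughly 9.1%.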
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:59 ` William Lee Irwin III @ 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 7:29 ` James Bruce 0 siblings, 2 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-17 22:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > >> I'm already working with this as my assumed nice semantics (actually > >> something with a specific exponential base, suggested in other emails) > >> until others start saying they want something different and agree. > > On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > > Good. This has a couple nice mathematical properties, including > > "bounded unfairness" which I mentioned earlier. What base are you > > looking at? > > I'm working with the following suggestion: > > > On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > > That value has the property that a nice=10 task gets 1/10th the cpu of a > > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > > would be fairly easy to explain to admins and users so that they can > > know what to expect from nicing tasks. > > I'm not likely to write the testcase until this upcoming weekend, though. So that means there's a 10000:1 ratio between nice 20 and nice -19. In that sort of dynamic range, you're likely to have non-trivial numerical accuracy issues in integer/fixed-point math. (Especially if your clock is jiffies-scale, which a significant number of machines will continue to be.) 
I really think if we want to have vastly different ratios, we probably want to be looking at BATCH and RT scheduling classes instead. -- Mathematics is the supreme nostalgia of our time.
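A rough check of the dynamic range being discussed, as a small Python sketch. It assumes the weight of a task is X**(-nice) with X = exp(ln(10)/10), which is one reading of James Bruce's suggestion; the helper name weight() and the 10-bit fixed-point scale are illustrative assumptions and are not taken from the CFS patch.

```python
import math

X = math.exp(math.log(10) / 10)   # ~1.2589, the base suggested in the thread

def weight(nice: int) -> float:
    """Relative CPU weight, assumed to be X**(-nice), so nice=0 has weight 1."""
    return X ** (-nice)

# Ratio between the two ends of the nice range (-19 .. 20): X**39 = 10**3.9.
ratio = weight(-19) / weight(20)

# Assumed 10-bit fixed-point representation (nice=0 maps to 1024), to show how
# little resolution is left for the weakest levels.
FIXED_SHIFT = 10
fixed_weights = {n: int(weight(n) * (1 << FIXED_SHIFT)) for n in (-19, 0, 10, 19, 20)}

print(f"ratio nice -19 : nice 20 = {ratio:.0f}:1")
print(fixed_weights)
```

Under these assumptions the span is 39 steps, so the ratio comes out near 8000:1, the same order of magnitude as the 10000:1 figure above. With the assumed 10-bit scale, nice=19 and nice=20 end up as the integers 12 and 10, which gives a feel for the accuracy concern in integer math.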
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall @ 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 4:42 ` Davide Libenzi 2007-04-18 7:29 ` James Bruce 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-18 4:29 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> I'm not likely to write the testcase until this upcoming weekend, though. On Tue, Apr 17, 2007 at 05:57:23PM -0500, Matt Mackall wrote: > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and even 1000**(1/39.0) ~= 1.19378 still seems weak. I suspect that in order to get low nice numbers strong enough without making high nice numbers too strong something sub-exponential may need to be used. Maybe just picking percentages outright as opposed to some particular function. 
We may also be better off defining it in terms of a share weighting as opposed to two tasks in competition. In such a manner the extension to N tasks is more automatic. f(n) would be a univariate function of nice numbers and two tasks in competition with nice numbers n_1 and n_2 would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In the exponential case f(n) = K*e**(r*n) this ends up as 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for other choices it's not so. f(n) = n+K for K >= 20 results in a share weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n when n <= 0 is highly plausible. An exponent or an additive constant may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, and the ratio of shares is 420, which is still arithmetically feasible. -10 vs. 0 and 0 vs. 10 are both 10:1. -- wli
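The share-weighting idea can be sketched in a few lines of Python. This is only an illustration of the piecewise f(n) proposed in the message, evaluated with exact fractions; the helper names f() and shares() are made up for the example.

```python
from fractions import Fraction

def f(n: int) -> Fraction:
    """Piecewise weight from the message: 1/(n+1) for n >= 0, 1-n for n <= 0."""
    return Fraction(1, n + 1) if n >= 0 else Fraction(1 - n)

def shares(nices):
    """CPU share of each task: f(n_i) divided by the sum of all f(n_j)."""
    weights = [f(n) for n in nices]
    total = sum(weights)
    return [w / total for w in weights]

extreme_ratio = f(-19) / f(20)   # 20 / (1/21)
ratio_m10_0 = f(-10) / f(0)      # nice -10 against nice 0
ratio_0_10 = f(0) / f(10)        # nice 0 against nice 10

print(extreme_ratio, ratio_m10_0, ratio_0_10)
print(shares([-19, 0, 20]))      # extension to three tasks
```

Evaluated exactly as written, the extreme ratio is 420, while -10 vs. 0 and 0 vs. 10 both come out as 11:1 rather than exactly 10:1; an additive constant of the kind mentioned in the message could be used to adjust that. Note also that the ratio between adjacent levels is not constant here (f(0)/f(1) is 2, f(1)/f(2) is 3/2), which is the "not a function of n_1-n_2" property of the non-exponential choices.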
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 4:42 ` Davide Libenzi 0 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-18 4:42 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Ingo Molnar, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and > even 1000**(1/39.0) ~= 1.19378 still seems weak. > > I suspect that in order to get low nice numbers strong enough without > making high nice numbers too strong something sub-exponential may need > to be used. Maybe just picking percentages outright as opposed to some > particular function. > > We may also be better off defining it in terms of a share weighting as > opposed to two tasks in competition. In such a manner the extension to > N tasks is more automatic. f(n) would be a univariate function of nice > numbers and two tasks in competition with nice numbers n_1 and n_2 > would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In > the exponential case f(n) = K*e**(r*n) this ends up as > 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for > other choices it's not so. f(n) = n+K for K >= 20 results in a share > weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear > in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n > when n <= 0 is highly plausible. An exponent or an additive constant > may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, > and the ratio of shares is 420, which is still arithmeticaly feasible. > -10 vs. 0 and 0 vs. 10 are both 10:1. This makes more sense, and the ratio at the extremes is something reasonable. 
- Davide
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 7:29 ` James Bruce 1 sibling, 0 replies; 713+ messages in thread From: James Bruce @ 2007-04-18 7:29 UTC (permalink / raw) To: linux-kernel; +Cc: ck Matt Mackall wrote: > On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm working with the following suggestion: >> >> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. >> I'm not likely to write the testcase until this upcoming weekend, though. > > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. Well, you *are* specifying vastly different priorities. The question is how many nice=20 tasks should it take to interfere with a nice=-19 task? If you've only got a 100:1 ratio, 100 nice=20 tasks will take ~50% of the CPU away from a nice=-19 task. I don't think that's ideal, as in my mind a -19 task shouldn't have to care how many nice=20 tasks there are (within reason). IMHO, if a user is running a CPU hog at nice=-19, and expecting nice=20 tasks to run immediately, I don't think the scheduler is the problem. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. 
I, like all users, can live with anything, but there should be a clear specification of what the user should expect. Magic changes in the function at nice=0, or no real clear meaning at all (mainline), are both things that don't help the users to figure that out. I like the exponential base because shifting all tasks up or down one nice level does not change the relative cpu distribution (i.e. two tasks {nice=-5,nice=0} get the same relative cpu distribution as if they were {nice=0,nice=5}). An exponential base is the only way that property can hold. Now, perhaps implementation issues may prevent something like the "1.2589" ratio rule from being realized, but I'm not sure we should throw it out _before_ we know that it's actually a problem. This is the same sort of resistance that the timekeeping code updates faced (using nanoseconds everywhere instead of "natural" clock bases), but that got addressed eventually. - Jim Bruce
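Two of the claims in this message are easy to check numerically. The Python sketch below assumes weights of the form X**(-nice) with X = 10**(1/10); the function names weight(), shares() and hog_share() are hypothetical helpers written for this illustration only.

```python
X = 10 ** 0.1   # ~1.2589, assumed per-nice-level ratio

def weight(nice: int) -> float:
    """Assumed exponential weight; nice=0 has weight 1."""
    return X ** (-nice)

def shares(nices):
    """CPU share of each task: its weight divided by the total weight."""
    w = [weight(n) for n in nices]
    total = sum(w)
    return [x / total for x in w]

# Shift invariance: {-5, 0} and {0, 5} should give the same distribution.
a = shares([-5, 0])
b = shares([0, 5])

def hog_share(ratio: float, n_low: int) -> float:
    """Share kept by one high-priority task against n_low low-priority tasks,
    when one high-priority task is worth `ratio` low-priority tasks."""
    return ratio / (ratio + n_low)

print(a, b)
print(hog_share(100, 100))     # 100:1 ratio, 100 competing tasks
print(hog_share(10000, 100))   # 10000:1 ratio, 100 competing tasks
```

With a 100:1 ratio, 100 low-priority tasks leave the high-priority task exactly half of the CPU, which matches the ~50% figure above; with 10000:1 it keeps about 99%.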
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:11 ` Nick Piggin 2007-04-17 7:21 ` Davide Libenzi 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:11 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, Nick Piggin wrote: > > > > All things are not equal; they all have different properties. I like > > > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice side (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline design looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). To be clear, I'm not saying O(logN) itself is a big problem. Type plot [10:100] x with lines, log(x) with lines, 1 with lines into gnuplot. I was just trying to point out that we need to evaluate things. Considering how long we've had this scheduler with its known deficiencies, let's pick a new one wisely.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:11 ` Nick Piggin @ 2007-04-17 7:21 ` Davide Libenzi 0 siblings, 0 replies; 713+ messages in thread From: Davide Libenzi @ 2007-04-17 7:21 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > To be clear, I'm not saying O(logN) itself is a big problem. Type > > plot [10:100] x with lines, log(x) with lines, 1 with lines Haha, Nick, I know what a log() looks like :) The Time Ring I posted as an example (which is nothing other than a ring-based bucket sort) keeps O(1) if you can concede some timer clustering. - Davide
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:23 ` Peter Williams 2007-04-17 6:44 ` Nick Piggin 2007-04-17 8:44 ` Ingo Molnar 2 siblings, 2 replies; 713+ messages in thread From: Peter Williams @ 2007-04-17 6:23 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >>>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>>>> Mike Galbraith wrote: >>>>>> Demystify what? The casual observer need only read either your attempt >>>>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>>>> that it was high time for someone with the necessary skills to step in. >>>>> Make that "someone with the necessary clout". >>>> No, I was brutally honest to both of us, but quite correct. >>>> >>>>>> Now progress can happen, which was _not_ happening before. >>>>>> >>>>> This is true. >>>> Yup, and progress _is_ happening now, quite rapidly. >>> Progress as in progress on Ingo's scheduler. I still don't know how we'd >>> decide when to replace the mainline scheduler or with what. >>> >>> I don't think we can say Ingo's is better than the alternatives, can we? >>> If there is some kind of bakeoff, then I'd like one of Con's designs to >>> be involved, and mine, and Peter's... >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. 
>> >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, i.e. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. I meant that it's indicative of the amount of work that you have to do to implement a new scheduling discipline for evaluation. > > And my scheduler for example cuts down the amount of policy code and > code size significantly. Yours is one of the smaller patches mainly because you perpetuate (or you did in the last one I looked at) the (horrible to my eyes) dual array (active/expired) mechanism. That this idea was bad should have been apparent to all as soon as the decision was made to excuse some tasks from being moved from the active array to the expired array. This essentially meant that there would be circumstances where extreme unfairness (to the extent of starvation in some cases) could occur -- the very thing that the mechanism was originally designed to prevent (as far as I can gather). Right about then in the development of the O(1) scheduler, alternative solutions should have been sought. Another hint that it was a bad idea was the need to transfer time slices between children and parents during fork() and exit(). This disregard for the dual array mechanism has prevented me from looking at the rest of your scheduler in any great detail so I can't comment on any other ideas that may be in there. > I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... I like Con's scheduler (partly because it uses a single array) but mainly because it's nice and simple.
However, his earlier schedulers were prone to starvation (admittedly, only if you went out of your way to make it happen) and I tried to convince him to use the anti-starvation mechanism in my SPA schedulers but was unsuccessful. I haven't looked at his latest scheduler that sparked all this furore so can't comment on it. > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? In the highly unlikely event that you can't separate them on technical grounds, Occam's razor recommends choosing the simplest solution. :-) To digress, my main concern is that load balancing is being lumped in with this new change. It's becoming "accept this big lump of new code or nothing". I'd rather see a good fix to the intra runqueue/CPU scheduler problem implemented first and then, if there really are any outstanding problems with the load balancer, attack them later. Having them all mixed up together gives me a nasty deja vu of impending disaster. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:23 ` Peter Williams @ 2007-04-17 6:44 ` Nick Piggin 2007-04-17 7:48 ` Peter Williams 2007-04-17 8:44 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 6:44 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >And my scheduler for example cuts down the amount of policy code and > >code size significantly. > > Yours is one of the smaller patches mainly because you perpetuate (or > you did in the last one I looked at) the (horrible to my eyes) dual > array (active/expired) mechanism. Actually, I wasn't comparing with other out of tree schedulers (but it is good to know mine is among the smaller ones). I was comparing with the mainline scheduler, which also has the dual arrays. > That this idea was bad should have > been apparent to all as soon as the decision was made to excuse some > tasks from being moved from the active array to the expired array. This My patch doesn't implement any such excusing. > essentially meant that there would be circumstances where extreme > unfairness (to the extent of starvation in some cases) -- the very > things that the mechanism was originally designed to ensure (as far as I > can gather). Right about then in the development of the O(1) scheduler > alternative solutions should have been sought. Fairness has always been my first priority, and I consider it a bug if it is possible for any process to get more CPU time than a CPU hog over the long term. Or over another task doing the same thing, for that matter. > Other hints that it was a bad idea was the need to transfer time slices > between children and parents during fork() and exit(). 
I don't see how that has anything to do with dual arrays. If you put a new child at the back of the queue, then your various interactive shell commands that typically do a lot of dependent forking get slowed right down behind your compile job. If you give a new child its own timeslice irrespective of the parent, then you have things like 'make' (which doesn't use a lot of CPU time) spawning off lots of high priority children. You need to do _something_ (Ingo's does). I don't see why this would be tied with a dual array. FWIW, mine doesn't do anything on exit() like most others, but it may need more tuning in this area. > This disregard for the dual array mechanism has prevented me from > looking at the rest of your scheduler in any great detail so I can't > comment on any other ideas that may be in there. Well I wasn't really asking you to review it. As I said, everyone has their own idea of what a good design does, and review can't really distinguish between the better of two reasonable designs. A fair evaluation of the alternatives seems like a good idea though. Nobody is actually against this, are they? > >I haven't looked at Con's ones for a while, > >but I believe they are also much more straightforward than mainline... > > I like Con's scheduler (partly because it uses a single array) but > mainly because it's nice and simple. However, his earlier schedulers > were prone to starvation (admittedly, only if you went out of your way > to make it happen) and I tried to convince him to use the anti > starvation mechanism in my SPA schedulers but was unsuccessful. I > haven't looked at his latest scheduler that sparked all this furore so > can't comment on it. I agree starvation or unfairness is unacceptable for a new scheduler. > >For example, let's say all else is equal between them, then why would > >we go with the O(logN) implementation rather than the O(1)?
> > In the highly unlikely event that you can't separate them on technical > grounds, Occam's razor recommends choosing the simplest solution. :-) O(logN) vs O(1) is technical grounds. But yeah, see my earlier comment: simplicity would be a factor too.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:44 ` Nick Piggin @ 2007-04-17 7:48 ` Peter Williams 2007-04-17 7:56 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 7:48 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> And my scheduler for example cuts down the amount of policy code and >>> code size significantly. >> Yours is one of the smaller patches mainly because you perpetuate (or >> you did in the last one I looked at) the (horrible to my eyes) dual >> array (active/expired) mechanism. > > Actually, I wasn't comparing with other out of tree schedulers (but it > is good to know mine is among the smaller ones). I was comparing with > the mainline scheduler, which also has the dual arrays. > > >> That this idea was bad should have >> been apparent to all as soon as the decision was made to excuse some >> tasks from being moved from the active array to the expired array. This > > My patch doesn't implement any such excusing. > > >> essentially meant that there would be circumstances where extreme >> unfairness (to the extent of starvation in some cases) -- the very >> things that the mechanism was originally designed to ensure (as far as I >> can gather). Right about then in the development of the O(1) scheduler >> alternative solutions should have been sought. > > Fairness has always been my first priority, and I consider it a bug > if it is possible for any process to get more CPU time than a CPU hog > over the long term. Or over another task doing the same thing, for > that matter. > > >> Other hints that it was a bad idea was the need to transfer time slices >> between children and parents during fork() and exit(). 
> > I don't see how that has anything to do with dual arrays. It's totally to do with the dual arrays. The only real purpose of the time slice in O(1) (regardless of what its perceived purpose was) was to control the switching between the arrays. > If you put > a new child at the back of the queue, then your various interactive > shell commands that typically do a lot of dependent forking get slowed > right down behind your compile job. If you give a new child its own > timeslice irrespective of the parent, then you have things like 'make' > (which doesn't use a lot of CPU time) spawning off lots of high > priority children. This is an artefact of trying to control nice using time slices while using them for controlling array switching and whatever else they were being used for. Priority (static and dynamic) is the best way to implement nice. > > You need to do _something_ (Ingo's does). I don't see why this would > be tied with a dual array. FWIW, mine doesn't do anything on exit() > like most others, but it may need more tuning in this area. > > >> This disregard for the dual array mechanism has prevented me from >> looking at the rest of your scheduler in any great detail so I can't >> comment on any other ideas that may be in there. > > Well I wasn't really asking you to review it. As I said, everyone > has their own idea of what a good design does, and review can't really > distinguish between the better of two reasonable designs. > > A fair evaluation of the alternatives seems like a good idea though. > Nobody is actually against this, are they? No. It would be nice if the basic ideas that each scheduler tries to implement could be extracted and explained though. This could lead to a melding of ideas that leads to something quite good. > > >>> I haven't looked at Con's ones for a while, >>> but I believe they are also much more straightforward than mainline...
>> I like Con's scheduler (partly because it uses a single array) but >> mainly because it's nice and simple. However, his earlier schedulers >> were prone to starvation (admittedly, only if you went out of your way >> to make it happen) and I tried to convince him to use the anti >> starvation mechanism in my SPA schedulers but was unsuccessful. I >> haven't looked at his latest scheduler that sparked all this furore so >> can't comment on it. > > I agree starvation or unfairness is unacceptable for a new scheduler. > > >>> For example, let's say all else is equal between them, then why would >>> we go with the O(logN) implementation rather than the O(1)? >> In the highly unlikely event that you can't separate them on technical >> grounds, Occam's razor recommends choosing the simplest solution. :-) > > O(logN) vs O(1) is technical grounds. In that case I'd go O(1) provided that the k factor for the O(1) wasn't greater than O(logN)'s k factor multiplied by logMaxN. > > But yeah, see my earlier comment: simplicity would be a factor too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
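The break-even criterion stated in this message can be written down as a small Python sketch. Everything here is an assumption made for illustration: the function names, the choice of a base-2 logarithm (as for a balanced binary tree), and the example cost figures, which are made-up numbers rather than measurements of any scheduler.

```python
import math

def prefer_constant_time(k_const: float, k_log: float, max_n: int) -> bool:
    """True if the O(1) design costs no more per operation than the
    O(log N) design evaluated at N = max_n (log base 2 is assumed)."""
    return k_const <= k_log * math.log2(max_n)

def break_even_n(k_const: float, k_log: float) -> float:
    """Task count at which both designs cost the same per operation."""
    return 2 ** (k_const / k_log)

# Hypothetical per-operation costs in arbitrary units.
print(prefer_constant_time(k_const=50, k_log=10, max_n=32768))    # 50 vs 10*15
print(prefer_constant_time(k_const=200, k_log=10, max_n=32768))   # 200 vs 10*15
print(break_even_n(k_const=50, k_log=10))
```

With these made-up numbers the O(1) design wins at 32768 tasks when its constant is 50 units but loses when it is 200 units, and in the first case the two designs break even at 32 tasks. Real constants would have to come from benchmarks.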
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:48 ` Peter Williams @ 2007-04-17 7:56 ` Nick Piggin 2007-04-17 13:16 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 7:56 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >>Other hints that it was a bad idea was the need to transfer time slices > >>between children and parents during fork() and exit(). > > > >I don't see how that has anything to do with dual arrays. > > It's totally to do with the dual arrays. The only real purpose of the > time slice in O(1) (regardless of what its perceived purpose was) was to > control the switching between the arrays. The O(1) design is pretty convoluted in that regard. In my scheduler, the only purpose of the arrays is to renew time slices. The fork/exit logic is added to make interactivity better. Ingo's scheduler has similar equivalent logic. > >If you put > >a new child at the back of the queue, then your various interactive > >shell commands that typically do a lot of dependant forking get slowed > >right down behind your compile job. If you give a new child its own > >timeslice irrespective of the parent, then you have things like 'make' > >(which doesn't use a lot of CPU time) spawning off lots of high > >priority children. > > This is an artefact of trying to control nice using time slices while > using them for controlling array switching and whatever else they were > being used for. Priority (static and dynamic) is the the best way to > implement nice. I don't like the timeslice based nice in mainline. It's too nasty with latencies. nicksched is far better in that regard IMO. 
But I don't know how you can assert a particular way is the best way to do something. > >You need to do _something_ (Ingo's does). I don't see why this would > >be tied with a dual array. FWIW, mine doesn't do anything on exit() > >like most others, but it may need more tuning in this area. > > > > > >>This disregard for the dual array mechanism has prevented me from > >>looking at the rest of your scheduler in any great detail so I can't > >>comment on any other ideas that may be in there. > > > >Well I wasn't really asking you to review it. As I said, everyone > >has their own idea of what a good design does, and review can't really > >distinguish between the better of two reasonable designs. > > > >A fair evaluation of the alternatives seems like a good idea though. > >Nobody is actually against this, are they? > > No. It would be nice if the basic ideas that each scheduler tries to > implement could be extracted and explained though. This could lead to a > melding of ideas that leads to something quite good. > > > > > > >>>I haven't looked at Con's ones for a while, > >>>but I believe they are also much more straightforward than mainline... > >>I like Con's scheduler (partly because it uses a single array) but > >>mainly because it's nice and simple. However, his earlier schedulers > >>were prone to starvation (admittedly, only if you went out of your way > >>to make it happen) and I tried to convince him to use the anti > >>starvation mechanism in my SPA schedulers but was unsuccessful. I > >>haven't looked at his latest scheduler that sparked all this furore so > >>can't comment on it. > > > >I agree starvation or unfairness is unacceptable for a new scheduler. > > > > > >>>For example, let's say all else is equal between them, then why would > >>>we go with the O(logN) implementation rather than the O(1)? > >>In the highly unlikely event that you can't separate them on technical > >>grounds, Occam's razor recommends choosing the simplest solution. 
:-) > > > >O(logN) vs O(1) is technical grounds. > > In that case I'd go O(1) provided that the k factor for the O(1) wasn't > greater than O(logN)'s k factor multiplied by logMaxN. Yes, or even significantly greater around typical large sizes of N.
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Nick Piggin @ 2007-04-17 13:16 ` Peter Williams 2007-04-18 4:46 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 13:16 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>>> Other hints that it was a bad idea was the need to transfer time slices >>>> between children and parents during fork() and exit(). >>> I don't see how that has anything to do with dual arrays. >> It's totally to do with the dual arrays. The only real purpose of the >> time slice in O(1) (regardless of what its perceived purpose was) was to >> control the switching between the arrays. > > The O(1) design is pretty convoluted in that regard. In my scheduler, > the only purpose of the arrays is to renew time slices. > > The fork/exit logic is added to make interactivity better. Ingo's > scheduler has similar equivalent logic. > > >>> If you put >>> a new child at the back of the queue, then your various interactive >>> shell commands that typically do a lot of dependant forking get slowed >>> right down behind your compile job. If you give a new child its own >>> timeslice irrespective of the parent, then you have things like 'make' >>> (which doesn't use a lot of CPU time) spawning off lots of high >>> priority children. >> This is an artefact of trying to control nice using time slices while >> using them for controlling array switching and whatever else they were >> being used for. Priority (static and dynamic) is the the best way to >> implement nice. > > I don't like the timeslice based nice in mainline. It's too nasty > with latencies. nicksched is far better in that regard IMO. 
> > But I don't know how you can assert a particular way is the best way > to do something. I should have added "I may be wrong but I think that ...". My opinion is based on a lot of experience with different types of scheduler design and the observation from gathering scheduling statistics while playing with these schedulers that the size of the time slices we're talking about is much larger than the CPU chunks most tasks use in any one go so time slice size has no real effect on most tasks and the faster CPUs become the more this becomes true. > > >>> You need to do _something_ (Ingo's does). I don't see why this would >>> be tied with a dual array. FWIW, mine doesn't do anything on exit() >>> like most others, but it may need more tuning in this area. >>> >>> >>>> This disregard for the dual array mechanism has prevented me from >>>> looking at the rest of your scheduler in any great detail so I can't >>>> comment on any other ideas that may be in there. >>> Well I wasn't really asking you to review it. As I said, everyone >>> has their own idea of what a good design does, and review can't really >>> distinguish between the better of two reasonable designs. >>> >>> A fair evaluation of the alternatives seems like a good idea though. >>> Nobody is actually against this, are they? >> No. It would be nice if the basic ideas that each scheduler tries to >> implement could be extracted and explained though. This could lead to a >> melding of ideas that leads to something quite good. >> >>> >>>>> I haven't looked at Con's ones for a while, >>>>> but I believe they are also much more straightforward than mainline... >>>> I like Con's scheduler (partly because it uses a single array) but >>>> mainly because it's nice and simple. However, his earlier schedulers >>>> were prone to starvation (admittedly, only if you went out of your way >>>> to make it happen) and I tried to convince him to use the anti >>>> starvation mechanism in my SPA schedulers but was unsuccessful. 
I >>>> haven't looked at his latest scheduler that sparked all this furore so >>>> can't comment on it. >>> I agree starvation or unfairness is unacceptable for a new scheduler. >>> >>> >>>>> For example, let's say all else is equal between them, then why would >>>>> we go with the O(logN) implementation rather than the O(1)? >>>> In the highly unlikely event that you can't separate them on technical >>>> grounds, Occam's razor recommends choosing the simplest solution. :-) >>> O(logN) vs O(1) is technical grounds. >> In that case I'd go O(1) provided that the k factor for the O(1) wasn't >> greater than O(logN)'s k factor multiplied by logMaxN. > > Yes, or even significantly greater around typical large sizes of N. Yes. In fact its' probably better to use the maximum number of threads allowed on the system for N. We know that value don't we? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 13:16 ` Peter Williams
@ 2007-04-18  4:46 ` Nick Piggin
  0 siblings, 0 replies; 713+ messages in thread
From: Nick Piggin @ 2007-04-18 4:46 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Tue, Apr 17, 2007 at 11:16:54PM +1000, Peter Williams wrote:
> Nick Piggin wrote:
> >I don't like the timeslice based nice in mainline. It's too nasty
> >with latencies. nicksched is far better in that regard IMO.
> >
> >But I don't know how you can assert a particular way is the best way
> >to do something.
>
> I should have added "I may be wrong but I think that ...".
>
> My opinion is based on a lot of experience with different types of
> scheduler design and the observation from gathering scheduling
> statistics while playing with these schedulers that the size of the time
> slices we're talking about is much larger than the CPU chunks most tasks
> use in any one go so time slice size has no real effect on most tasks
> and the faster CPUs become the more this becomes true.

For desktop loads, maybe. But for things that are compute bound, the
cost of context switching I believe still gets worse as CPUs continue to
be able to execute more instructions per cycle, get clocked faster, and
get larger caches.

> >>In that case I'd go O(1) provided that the k factor for the O(1) wasn't
> >>greater than O(logN)'s k factor multiplied by logMaxN.
> >
> >Yes, or even significantly greater around typical large sizes of N.
>
> Yes. In fact it's probably better to use the maximum number of threads
> allowed on the system for N. We know that value don't we?

Well we might be able to work it out by looking at the tunables or
amount of kernel memory available, but I guess it is hard to just pick a
number.

I'll try running a few more benchmarks.
^ permalink raw reply [flat|nested] 713+ messages in thread
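[ Archive note: the O(1) versus O(logN) rule of thumb traded in the message above can be written down as a small comparison. This is only a sketch under stated assumptions: the logarithm is taken base 2 (the thread does not say which base is meant), the cost factors are made-up numbers rather than measurements, and the function names are hypothetical. ]

```python
import math


def prefer_constant_time(k_const, k_log, max_tasks):
    """Return True if the O(1) structure is the cheaper choice at N = max_tasks.

    k_const   -- assumed per-operation cost of the O(1) implementation
    k_log     -- assumed cost factor of the O(log N) implementation
    max_tasks -- largest N considered, e.g. the system's thread limit
    The base-2 logarithm is an assumption; the thread only says "logMaxN".
    """
    if max_tasks < 2:
        raise ValueError("max_tasks must be at least 2")
    return k_const <= k_log * math.log2(max_tasks)


def break_even_tasks(k_const, k_log):
    """Smallest N at which the O(log N) cost reaches the O(1) cost."""
    return math.ceil(2 ** (k_const / k_log))


if __name__ == "__main__":
    # Illustrative numbers only: an O(1) operation ten times as costly as
    # one tree level still wins if the system may hold 32768 tasks.
    print(prefer_constant_time(10, 1, 32768))
    print(break_even_tasks(10, 1))
```

[ Nick's caveat still applies: what matters in practice is the cost around typical large N, which this worst-case comparison does not capture, so benchmarks remain the deciding evidence. ]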
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:23 ` Peter Williams
  2007-04-17  6:44 ` Nick Piggin
@ 2007-04-17  8:44 ` Ingo Molnar
  2007-04-19  2:20 ` Peter Williams
  1 sibling, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:44 UTC (permalink / raw)
To: Peter Williams
Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

* Peter Williams <pwil3058@bigpond.net.au> wrote:

> > And my scheduler for example cuts down the amount of policy code and
> > code size significantly.
>
> Yours is one of the smaller patches mainly because you perpetuate (or
> you did in the last one I looked at) the (horrible to my eyes) dual
> array (active/expired) mechanism. That this idea was bad should have
> been apparent to all as soon as the decision was made to excuse some
> tasks from being moved from the active array to the expired array.
> This essentially meant that there would be circumstances where extreme
> unfairness (to the extent of starvation in some cases) -- the very
> things that the mechanism was originally designed to ensure (as far as
> I can gather). Right about then in the development of the O(1)
> scheduler alternative solutions should have been sought.

in hindsight i'd agree. But back then we were clearly not ready for
fine-grained accurate statistics + trees (cpus are a lot faster at more
complex arithmetics today, plus people still believed that low-res can
be done well enough), and taking out any of these two concepts from CFS
would result in a similarly complex runqueue implementation. Also, the
array switch was just thought to be another piece of 'if the
heuristics go wrong, we fall back to an array switch' logic, right in
line with the other heuristics. And you have to accept it, mainline's
ability to auto-renice make -j jobs (and other CPU hogs) was quite a
plus for developers, so it had (and probably still has) quite some
inertia.

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  8:44 ` Ingo Molnar
@ 2007-04-19  2:20 ` Peter Williams
  0 siblings, 0 replies; 713+ messages in thread
From: Peter Williams @ 2007-04-19 2:20 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
>
>>> And my scheduler for example cuts down the amount of policy code and
>>> code size significantly.
>> Yours is one of the smaller patches mainly because you perpetuate (or
>> you did in the last one I looked at) the (horrible to my eyes) dual
>> array (active/expired) mechanism. That this idea was bad should have
>> been apparent to all as soon as the decision was made to excuse some
>> tasks from being moved from the active array to the expired array.
>> This essentially meant that there would be circumstances where extreme
>> unfairness (to the extent of starvation in some cases) -- the very
>> things that the mechanism was originally designed to ensure (as far as
>> I can gather). Right about then in the development of the O(1)
>> scheduler alternative solutions should have been sought.
>
> in hindsight i'd agree.

Hindsight's a wonderful place isn't it :-) and, of course, it's where I
was making my comments from.

> But back then we were clearly not ready for
> fine-grained accurate statistics + trees (cpus are a lot faster at more
> complex arithmetics today, plus people still believed that low-res can
> be done well enough), and taking out any of these two concepts from CFS
> would result in a similarly complex runqueue implementation.

I disagree. The single priority array with a promotion mechanism that I
use in the SPA schedulers can do the job of avoiding starvation with no
measurable increase in the overhead. Fairness, nice, good interactive
responsiveness can then be managed by how you determine tasks' dynamic
priorities.

> Also, the
> array switch was just thought to be of another piece of 'if the
> heuristics go wrong, we fall back to an array switch' logic, right in
> line with the other heuristics. And you have to accept it, mainline's
> ability to auto-renice make -j jobs (and other CPU hogs) was quite a
> plus for developers, so it had (and probably still has) quite some
> inertia.

I agree, it wasn't totally useless especially for the average user. My
main problem with it was that the effect of "nice" wasn't consistent or
predictable enough for reliable resource allocation.

I also agree with the aims of the various heuristics i.e. you have to be
unfair and give some tasks preferential treatment in order to give the
users the type of responsiveness that they want. It's just a shame that
it got broken in the process but as you say it's easier to see these
things in hindsight than in the middle of the melee.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply [flat|nested] 713+ messages in thread
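[ Archive note: the "single priority array with a promotion mechanism" mentioned above is not spelled out in this message, so the following is only a toy model of the general idea and not the SPA scheduler code. Everything in it is an assumption made for illustration: the class and function names, the `promote_every` knob, and the choice to trigger promotion by pick count instead of by elapsed time. ]

```python
from collections import deque


class PromotingPriorityArray:
    """Single array of FIFO queues; index 0 is the highest priority."""

    def __init__(self, levels, promote_every):
        self.queues = [deque() for _ in range(levels)]
        self.promote_every = promote_every  # hypothetical tuning knob
        self.picks = 0

    def enqueue(self, task, prio):
        self.queues[prio].append(task)

    def _promote(self):
        # Move every waiting task up exactly one level, so a task that is
        # never picked eventually reaches level 0 and cannot starve.
        for prio in range(1, len(self.queues)):
            self.queues[prio - 1].extend(self.queues[prio])
            self.queues[prio].clear()

    def pick_next(self):
        self.picks += 1
        if self.picks % self.promote_every == 0:
            self._promote()
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


def simulate(promote_every, picks=10):
    """Run a CPU hog at level 0 against one task parked at level 3."""
    array = PromotingPriorityArray(levels=4, promote_every=promote_every)
    array.enqueue("hog", 0)
    array.enqueue("low", 3)
    order = []
    for _ in range(picks):
        task = array.pick_next()
        order.append(task)
        if task == "hog":
            array.enqueue("hog", 0)  # the hog is always runnable again
    return order


if __name__ == "__main__":
    print(simulate(promote_every=2))       # "low" gets a turn
    print(simulate(promote_every=10**9))   # effectively no promotion
```

[ With promotion disabled the low-priority task never runs in this model, which is the starvation case under discussion; with promotion it is picked after a bounded number of rounds. How dynamic priorities are assigned, which Peter names as the place where fairness and interactivity are managed, is deliberately left out. ]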
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas 2007-04-15 5:16 ` Bill Huey 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall 2007-04-16 5:16 ` Con Kolivas 2 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 15:05 UTC (permalink / raw) To: Con Kolivas Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Con Kolivas <kernel@kolivas.org> wrote: [ i'm quoting this bit out of order: ] > 2. Since then I've been thinking/working on a cpu scheduler design > that takes away all the guesswork out of scheduling and gives very > predictable, as fair as possible, cpu distribution and latency while > preserving as solid interactivity as possible within those confines. yeah. I think you were right on target with this call. I've applied the sched.c change attached at the bottom of this mail to the CFS patch, if you dont mind. (or feel free to suggest some other text instead.) > 1. I tried in vain some time ago to push a working extensable > pluggable cpu scheduler framework (based on wli's work) for the linux > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > didn't like it) as being absolutely the wrong approach and that we > should never do that. [...] i partially replied to that point to Will already, and i'd like to make it clear again: yes, i rejected plugsched 2-3 years ago (which already drifted away from wli's original codebase) and i would still reject it today. First and foremost, please dont take such rejections too personally - i had my own share of rejections (and in fact, as i mentioned it in a previous mail, i had a fair number of complete project throwaways: 4g:4g, in-kernel Tux, irqrate and many others). 
I know that they can hurt and can demoralize, but if i dont like something it's my job to tell that. Can i sum up your argument as: "you rejected plugsched, but then why on earth did you modularize portions of the scheduler in CFS? Isnt your position thus woefully inconsistent?" (i'm sure you would never put it this impolitely though, but i guess i can flame myself with impunity ;) While having an inconsistent position isnt a terminal sin in itself, please realize that the scheduler classes code in CFS is quite different from plugsched: it was a result of what i saw to be technological pressure for _internal modularization_. (This internal/policy modularization aspect is something that Will said was present in his original plugsched code, but which aspect i didnt see in the plugsched patches that i reviewed.) That possibility never even occured to me to until 3 days ago. You never raised it either AFAIK. No patches to simplify the scheduler that way were ever sent. Plugsched doesnt even touch the core load-balancer for example, and most of the time i spent with the modularization was to get the load-balancing details right. So it's really apples to oranges. My view about plugsched: first please take a look at the latest plugsched code: http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch 26 files changed, 8951 insertions(+), 1495 deletions(-) As an experiment i've removed all the add-on schedulers (both the core and the include files, only kept the vanilla one) from the plugsched patch (and the makefile and kconfig complications, etc), to see the 'infrastructure cost', and it still gave: 12 files changed, 1933 insertions(+), 1479 deletions(-) that's the extra complication i didnt like 3 years ago and which i still dont like today. What the current plugsched code does is that it simplifies the adding of new experimental schedulers, but it doesnt really do what i wanted: to simplify the _scheduler itself_. 
Personally i'm still not primarily interested in having a large
selection of schedulers, i'm mainly interested in a good and
maintainable scheduler that works for people.

so the rejection was on these grounds, and i still very much stand by
that position here and today: i didnt want to see the Linux scheduler
landscape balkanized and i saw no technological reasons for the
complication that external modularization brings.

the new scheduling classes code in the CFS patch was not a result of "oh,
i want to write a new scheduler, lets make schedulers pluggable" kind of
thinking. That result was just a side-effect of it. (and as you
correctly noted it, the CFS related modularization is incomplete).

Btw., the thing that triggered the scheduling classes code wasnt even
plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he
fixed it within the framework of the existing scheduler, and the end
result behaved quite well when i threw various testloads on it.

But i felt a bit uncomfortable that it added another few hundred lines
of code to an already complex sched.c. This felt unnatural so i mailed
Mike that i'd attempt to clean these infrastructure aspects of sched.c
up a bit so that it becomes more hackable to him. Thus 3 days ago,
without having made up my mind about anything, i started this experiment
(which ended up in the modularization and in the CFS scheduler) to
simplify the code and to enable Mike to fix such itches in an easier
way. By your logic Mike should in fact be quite upset about this: if the
new code works out and proves to be useful then it obsoletes a whole lot
of code of his!

> For weeks now, Ingo has said that the interactivity regressions were
> showstoppers and we should address them, never mind the fact that the
> so-called regressions were purely "it slows down linearly with load"
> which to me is perfectly desirable behaviour. [...]

yes.
For me the first thing when considering a large scheduler patch is: "does a patch do what it claims" and "does it work". If those goals are met (and if it's a complete scheduler i actually try it quite extensively) then i look at the code cleanliness issues. Mike's patch was the first one that seemed to meet that threshold in my own humble testing, and CFS was a direct result of that. note that i tried the same workloads with CFS and while it wasnt as good as mainline, it handled them better than SD. Mike reported the same, and Mark Lord (who too reported SD interactivity problems) reported success yesterday too. (but .. CFS is a mere 2 days old so we cannot really tell anything with certainty yet.) > [...] However at one stage I virtually begged for support with my > attempts and help with the code. Dmitry Adamushko is the only person > who actually helped me with the code in the interim, while others > poked sticks at it. Sure the sticks helped at times but the sticks > always seemed to have their ends kerosene doused and flaming for > reasons I still don't get. No other help was forthcoming. i'm really sorry you got that impression. in 2004 i had a good look at the staircase scheduler and said: http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html "But in general i'm quite positive about the staircase scheduler." and even tested it and gave you feedback: http://lwn.net/Articles/96562/ i think i even told Andrew that i dont really like pluggable schedulers and if there's any replacement for the current scheduler then that would be a full replacement, and it would be the staircase scheduler. Hey, i told this to you as recently as 1 month ago as well: http://lkml.org/lkml/2007/3/8/54 "cool! 
I like this even more than i liked your original staircase scheduler from 2 years ago :)" Ingo -----------> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -16,6 +16,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) */ ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 15:05 ` Ingo Molnar
@ 2007-04-15 20:05 ` Matt Mackall
  2007-04-15 20:48 ` Ingo Molnar
  2007-04-16  5:16 ` Con Kolivas
  1 sibling, 1 reply; 713+ messages in thread
From: Matt Mackall @ 2007-04-15 20:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote:
> so the rejection was on these grounds, and i still very much stand by
> that position here and today: i didnt want to see the Linux scheduler
> landscape balkanized and i saw no technological reasons for the
> complication that external modularization brings.

But "balkanization" is a good thing. "Monoculture" is a bad thing.

Look at what happened with I/O scheduling. Opening things up to some
new ideas by making it possible to select your I/O scheduler took us
from 10 years of stagnation to healthy, competitive development, which
gave us a substantially better I/O scheduler.

Look at what's happening right now with TCP congestion algorithms.
We've had decades of tweaking Reno slightly now turned into a vibrant
research area with lots of radical alternatives. A winner will
eventually emerge and it will probably look quite a bit different than
Reno.

Similar things have gone on since the beginning with filesystems on
Linux. Being able to easily compare filesystems head to head has been
immensely valuable in improving our 'core' Linux filesystems.

And what we've had up to now is a scheduler monoculture. Until Andrew
put RSDL in -mm, if people wanted to experiment with other schedulers,
they had to go well off the beaten path to do it. So all the people
who've been hopelessly frustrated with the mainline scheduler go off to
the -ck ghetto, or worse, stick with 2.4.

Whether your motivations have been protectionist or merely
shortsighted, you've stomped pretty heavily on alternative scheduler
development by completely rejecting the whole plugsched concept. If
we'd opened up mainline to a variety of schedulers _3 years ago_, we'd
probably have gotten to where we are today much sooner.

Hopefully, the next time Rik suggests pluggable page replacement
algorithms, folks will actually seriously consider it.

--
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 20:05 ` Matt Mackall
@ 2007-04-15 20:48 ` Ingo Molnar
  2007-04-15 21:31 ` Matt Mackall
  2007-04-15 23:39 ` William Lee Irwin III
  0 siblings, 2 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-15 20:48 UTC (permalink / raw)
To: Matt Mackall
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Matt Mackall <mpm@selenic.com> wrote:

> Look at what happened with I/O scheduling. Opening things up to some
> new ideas by making it possible to select your I/O scheduler took us
> from 10 years of stagnation to healthy, competitive development, which
> gave us a substantially better I/O scheduler.

actually, 2-3 years ago we already had IO schedulers, and my opinion
against plugsched back then (also shared by Nick and Linus) was very
much considering them. There are at least 4 reasons why I/O schedulers
are different from CPU schedulers:

1) CPUs are a non-persistent resource shared by _all_ tasks and
   workloads in the system. Disks are _persistent_ resources very much
   attached to specific workloads. (If tasks had to be 'persistent' to
   the CPU they were started on we'd have much different scheduling
   technology, and there would be much less complexity.) More analogous
   to CPU schedulers would perhaps be VM/MM schedulers, and those tend
   to be hard to modularize in a technologically sane way too. (and
   unlike disks there's no good generic way to attach VM/MM schedulers
   to particular workloads.) So it's apples to oranges.

   in practice it comes down to having one good scheduler that runs all
   workloads on a system reasonably well. And given that a very large
   portion of systems run mixed workloads, the demand for one good
   scheduler is pretty high. While i can run with mixed IO schedulers
   just fine.

2) plugsched did not allow on the fly selection of schedulers, nor did
   it allow a per CPU selection of schedulers. IO schedulers you can
   change per disk, on the fly, making them much more useful in
   practice. Also, IO schedulers (while definitely not being slow!) are
   a lot less performance sensitive than CPU schedulers.

3) I/O schedulers are pretty damn clean code, and plugsched, at least
   the last version i saw of it, didnt come even close.

4) the good thing that happened to I/O, after years of stagnation isnt
   I/O schedulers. The good thing that happened to I/O is called Jens
   Axboe. If you care about the I/O subsystem then print that name out
   and hang it on the wall. That and only that is what mattered.

all in all, while there are definitely uses (embedded would like to have
a smaller/different scheduler, etc.), the technical case for
modularization for the sake of selectability is a lot lower for CPU
schedulers than it is for I/O schedulers.

nor was the non-modularity of some piece of code ever an impediment to
competition. May i remind you of the pretty competitive SLAB allocator
landscape, resulting in things like the SLOB allocator, written by
yourself? ;-)

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
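[ Archive note: point 2 above relies on I/O schedulers being selectable per disk at run time. On Linux that selection is exposed through the sysfs file /sys/block/<dev>/queue/scheduler, which lists the available schedulers with the active one in square brackets. The helper below is a minimal sketch for reading that file; the function names are hypothetical, and the sample string in the usage lines is illustrative, not output captured from a real machine. ]

```python
def parse_io_schedulers(text):
    """Split a sysfs scheduler listing into (active, available).

    The listing is a whitespace-separated set of names with the active
    scheduler wrapped in square brackets.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_io_scheduler(device, sysfs_root="/sys/block"):
    """Read the scheduler listing for one block device, e.g. 'sda'."""
    path = f"{sysfs_root}/{device}/queue/scheduler"
    with open(path) as handle:
        return parse_io_schedulers(handle.read())


if __name__ == "__main__":
    # Illustrative input only; real contents depend on kernel and device.
    sample = "noop anticipatory deadline [cfq]\n"
    print(parse_io_schedulers(sample))
```

[ Switching is done by writing one of the listed names back to the same file with sufficient privileges, which is what makes the selection per disk and on the fly. The thread notes that plugsched offered no equivalent run-time, per-CPU switch for CPU schedulers. ]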
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 20:48 ` Ingo Molnar
@ 2007-04-15 21:31 ` Matt Mackall
  2007-04-16  3:03 ` Nick Piggin
  2007-04-16 15:45 ` William Lee Irwin III
  2007-04-15 23:39 ` William Lee Irwin III
  1 sibling, 2 replies; 713+ messages in thread
From: Matt Mackall @ 2007-04-15 21:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>
> * Matt Mackall <mpm@selenic.com> wrote:
>
> > Look at what happened with I/O scheduling. Opening things up to some
> > new ideas by making it possible to select your I/O scheduler took us
> > from 10 years of stagnation to healthy, competitive development, which
> > gave us a substantially better I/O scheduler.
>
> actually, 2-3 years ago we already had IO schedulers, and my opinion
> against plugsched back then (also shared by Nick and Linus) was very
> much considering them. There are at least 4 reasons why I/O schedulers
> are different from CPU schedulers:

...

> 3) I/O schedulers are pretty damn clean code, and plugsched, at least
>    the last version i saw of it, didnt come even close.

That's irrelevant. Plugsched was an attempt to get alternative
schedulers exposure in mainline. I know, because I remember
encouraging Bill to pursue it. Not only did you veto plugsched (which
may have been a perfectly reasonable thing to do), but you also vetoed
the whole concept of multiple schedulers in the tree too. "We don't
want to balkanize the scheduling landscape".

And that latter part is what I'm claiming has set us back for years.
It's not a technical argument but a strategic one. And it's just not a
good strategy.

> 4) the good thing that happened to I/O, after years of stagnation isnt
>    I/O schedulers. The good thing that happened to I/O is called Jens
>    Axboe. If you care about the I/O subsystem then print that name out
>    and hang it on the wall. That and only that is what mattered.

Disagree. Things didn't actually get interesting until Nick showed up
with AS and got it in-tree to demonstrate the huge amount of room we
had for improvement. It took several iterations of AS and CFQ (with a
couple complete rewrites) before CFQ began to look like the winner.
The resulting time-sliced CFQ was fairly heavily influenced by the
ideas in AS.

Similarly, things in scheduler land had been pretty damn boring until
Con finally got Andrew to take one of his schedulers for a spin.

> nor was the non-modularity of some piece of code ever an impediment to
> competition. May i remind you of the pretty competitive SLAB allocator
> landscape, resulting in things like the SLOB allocator, written by
> yourself? ;-)

Thankfully no one came out and said "we don't want to balkanize the
allocator landscape" when I submitted it or I probably would have just
dropped it, rather than painfully dragging it along out of tree for
years.

I'm not nearly the glutton for punishment that Con is. :-P

--
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-15 21:31 ` Matt Mackall
@ 2007-04-16  3:03 ` Nick Piggin
  2007-04-16 14:28 ` Matt Mackall
  2007-04-16 15:45 ` William Lee Irwin III
  1 sibling, 1 reply; 713+ messages in thread
From: Nick Piggin @ 2007-04-16 3:03 UTC (permalink / raw)
To: Matt Mackall
Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote:
> On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote:
>
> > 4) the good thing that happened to I/O, after years of stagnation isnt
> >    I/O schedulers. The good thing that happened to I/O is called Jens
> >    Axboe. If you care about the I/O subsystem then print that name out
> >    and hang it on the wall. That and only that is what mattered.
>
> Disagree. Things didn't actually get interesting until Nick showed up
> with AS and got it in-tree to demonstrate the huge amount of room we
> had for improvement. It took several iterations of AS and CFQ (with a
> couple complete rewrites) before CFQ began to look like the winner.
> The resulting time-sliced CFQ was fairly heavily influenced by the
> ideas in AS.

Well to be fair, Jens had just implemented deadline, which got me
interested ;)

Actually, I would still like to be able to deprecate deadline for AS,
because AS has a tunable that you can switch to turn off read
anticipation and revert to deadline behaviour (or very close to).

It would have been nice if CFQ were then a layer on top of AS that
implemented priorities (or vice versa). And then AS could be deprecated
and we'd be back to 1 primary scheduler. Well CFQ seems to be going in
the right direction with that, however some large users still find AS
faster for some reason...

Anyway, moral of the story is that I think it would have been nice if we
hadn't proliferated IO schedulers, however in practice it isn't easy to
just layer features on top of each other, and also keeping deadline
helped a lot to be able to debug and examine performance regressions and
actually get code upstream. And this was true even when it was globally
boot-time switchable only.

I'd prefer if we kept a single CPU scheduler in mainline, because I
think that simplifies analysis and focuses testing. I think we can have
one that is good enough for everyone. But if the only other option for
progress is that Linus or Andrew just pull one out of a hat, then I
would rather merge all of them. Yes I think Con's scheduler should get a
fair go, ditto for Ingo's, mine, and anyone else's.

> > nor was the non-modularity of some piece of code ever an impediment to
> > competition. May i remind you of the pretty competitive SLAB allocator
> > landscape, resulting in things like the SLOB allocator, written by
> > yourself? ;-)
>
> Thankfully no one came out and said "we don't want to balkanize the
> allocator landscape" when I submitted it or I probably would have just
> dropped it, rather than painfully dragging it along out of tree for
> years.
>
> I'm not nearly the glutton for punishment that Con is. :-P

I don't think this is a fault of the people or the code involved.
We just didn't have much collective drive to replace the scheduler,
and even less an idea of how to decide between any two of them.

I've kept nicksched around since 2003 or so and no hard feelings ;)

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 14:28 ` Matt Mackall 2007-04-17 3:31 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Matt Mackall @ 2007-04-16 14:28 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > I'd prefer if we kept a single CPU scheduler in mainline, because I > think that simplifies analysis and focuses testing. I think you'll find something like 80-90% of the testing will be done on the default choice, even if other choices exist. So you really won't have much of a problem here. But when the only choice for other schedulers is to go out-of-tree, then only 1% of the people will try it out and those people are guaranteed to be the ones who saw scheduling problems in mainline. So the alternative won't end up getting any testing on many of the workloads that work fine in mainstream so their feedback won't tell you very much at all. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 14:28 ` Matt Mackall @ 2007-04-17 3:31 ` Nick Piggin 2007-04-17 17:35 ` Matt Mackall 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 3:31 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > think that simplifies analysis and focuses testing. > > I think you'll find something like 80-90% of the testing will be done > on the default choice, even if other choices exist. So you really > won't have much of a problem here. > > But when the only choice for other schedulers is to go out-of-tree, > then only 1% of the people will try it out and those people are > guaranteed to be the ones who saw scheduling problems in mainline. > So the alternative won't end up getting any testing on many of the > workloads that work fine in mainstream so their feedback won't tell > you very much at all. Yeah I concede that perhaps it is the only way to get things going any further. But how do we decide if and when the current scheduler should be demoted from default, and which should replace it? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:31 ` Nick Piggin @ 2007-04-17 17:35 ` Matt Mackall 0 siblings, 0 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-17 17:35 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > > think that simplifies analysis and focuses testing. > > > > I think you'll find something like 80-90% of the testing will be done > > on the default choice, even if other choices exist. So you really > > won't have much of a problem here. > > > > But when the only choice for other schedulers is to go out-of-tree, > > then only 1% of the people will try it out and those people are > > guaranteed to be the ones who saw scheduling problems in mainline. > > So the alternative won't end up getting any testing on many of the > > workloads that work fine in mainstream so their feedback won't tell > > you very much at all. > > Yeah I concede that perhaps it is the only way to get things going > any further. But how do we decide if and when the current scheduler > should be demoted from default, and which should replace it? Step one is ship both in -mm. If that doesn't give us enough confidence, ship both in mainline. If that doesn't give us enough confidence, wait until vendors ship both. Eventually a clear picture should emerge. If it doesn't, either the change is not significant or no one cares. But it really is important to be able to do controlled experiments on this stuff with little effort. That's the recipe for getting lots of valid feedback. -- Mathematics is the supreme nostalgia of our time. 
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 21:31 ` Matt Mackall 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 15:45 ` William Lee Irwin III 1 sibling, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-16 15:45 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote: > That's irrelevant. Plugsched was an attempt to get alternative > schedulers exposure in mainline. I know, because I remember > encouraging Bill to pursue it. Not only did you veto plugsched (which > may have been a perfectly reasonable thing to do), but you also vetoed > the whole concept of multiple schedulers in the tree too. "We don't > want to balkanize the scheduling landscape". > And that latter part is what I'm claiming has set us back for years. > It's not a technical argument but a strategic one. And it's just not a > good strategy. [... excellent post trimmed...] These are some rather powerful arguments. I think I'll actually start looking at plugsched again. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:48 ` Ingo Molnar 2007-04-15 21:31 ` Matt Mackall @ 2007-04-15 23:39 ` William Lee Irwin III 2007-04-16 1:06 ` Peter Williams 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 2) plugsched did not allow on the fly selection of schedulers, nor did > it allow a per CPU selection of schedulers. IO schedulers you can > change per disk, on the fly, making them much more useful in > practice. Also, IO schedulers (while definitely not being slow!) are > alot less performance sensitive than CPU schedulers. One of the reasons I never posted my own code is that it never met its own design goals, which absolutely included switching on the fly. I think Peter Williams may have done something about that. It was my hope to be able to do insmod sched_foo.ko until it became clear that the effort it was intended to assist wasn't going to get even the limited hardware access required, at which point I largely stopped working on it. On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 3) I/O schedulers are pretty damn clean code, and plugsched, at least > the last version i saw of it, didnt come even close. I'm not sure what happened there. It wasn't a big enough patch to take hits in this area due to getting overwhelmed by the programming burden like some other efforts of mine. Maybe things started getting ugly once on-the-fly switching entered the picture. My guess is that Peter Williams will have to chime in here, since things have diverged enough from my one-time contribution 4 years ago. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
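For comparison, the per-disk, on-the-fly I/O scheduler selection that Ingo refers to is exposed through sysfs: reading /sys/block/<dev>/queue/scheduler lists the available elevators with the active one in square brackets, and writing a name to that file switches it. Below is a minimal Python sketch of parsing that format. The helper name parse_elevators and the sample string are illustrative assumptions, not something posted in this thread.

```python
def parse_elevators(text):
    """Split the contents of /sys/block/<dev>/queue/scheduler into
    (available, active). The active elevator is the one in [brackets]."""
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


# Illustrative sample of what the file might contain on a 2.6-era kernel.
sample = "noop anticipatory deadline [cfq]\n"
available, active = parse_elevators(sample)
print(available, active)
```

Nothing comparable existed for CPU schedulers at the time, which is the gap the on-the-fly switching discussion above is about.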
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:39 ` William Lee Irwin III @ 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 17:22 ` Chris Friesen 0 siblings, 2 replies; 713+ messages in thread From: Peter Williams @ 2007-04-16 1:06 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 2) plugsched did not allow on the fly selection of schedulers, nor did >> it allow a per CPU selection of schedulers. IO schedulers you can >> change per disk, on the fly, making them much more useful in >> practice. Also, IO schedulers (while definitely not being slow!) are >> a lot less performance sensitive than CPU schedulers. > > One of the reasons I never posted my own code is that it never met its > own design goals, which absolutely included switching on the fly. I > think Peter Williams may have done something about that. I didn't, but some students did. In a previous life, I did implement a runtime configurable CPU scheduling mechanism (implemented on Tru64, Solaris and Linux) that allowed schedulers to be loaded as modules at run time. This was released commercially on Tru64 and Solaris. So I know that it can be done. I have thought about doing something similar for the SPA schedulers, which differ from each other in only small ways, but I lack the motivation. > It was my hope > to be able to do insmod sched_foo.ko until it became clear that the > effort it was intended to assist wasn't going to get even the limited > hardware access required, at which point I largely stopped working on > it.
> > > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 3) I/O schedulers are pretty damn clean code, and plugsched, at least >> the last version i saw of it, didn't come even close. > > I'm not sure what happened there. It wasn't a big enough patch to take > hits in this area due to getting overwhelmed by the programming burden > like some other efforts of mine. Maybe things started getting ugly once > on-the-fly switching entered the picture. My guess is that Peter Williams > will have to chime in here, since things have diverged enough from my > one-time contribution 4 years ago. From my POV, the current version of plugsched is considerably simpler than it was when I took the code over from Con, as I put considerable effort into minimizing code overlap in the various schedulers. I also put considerable effort into minimizing any changes to the load balancing code (something Ingo seems to think is a deficiency) and the result is that plugsched allows "intra run queue" scheduling to be easily modified WITHOUT affecting load balancing. To my mind scheduling and load balancing are orthogonal and keeping them that way simplifies things. As Ingo correctly points out, plugsched does not allow different schedulers to be used per CPU, but it would not be difficult to modify it so that they could be. Although I've considered doing this over the years, I decided not to as it would just increase the complexity and the amount of work required to keep the patch set going. About six months ago I decided to reduce the amount of work I was doing on plugsched (as it was obviously never going to be accepted) and now only publish patches against the vanilla kernel's major releases (and the only reason that I kept doing that is that the download figures indicated that about 80 users were interested in the experiment). Peter PS I no longer read LKML (due to time constraints) and would appreciate it if I could be CC'd on any e-mails suggesting scheduler changes.
PPS I'm just happy to see that Ingo has finally accepted that the vanilla scheduler was badly in need of fixing and don't really care who fixes it. PPPS Different schedulers for different aims (i.e. server or work station) do make a difference. E.g. the spa_svr scheduler in plugsched does about 1% better on kernbench than the next best scheduler in the bunch. PPPPS Con, fairness isn't always best as humans aren't very altruistic and we need to give unfair preference to interactive tasks in order to stop the users flinging their PCs out the window. But the current scheduler doesn't do this very well and is also not very good at fairness, so it needs to change. But the changes need to address interactive response and fairness, not just fairness. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
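The separation Peter describes in the message above, where the intra run queue policy can be swapped without touching load balancing, can be sketched roughly as follows. This is a toy Python model, not plugsched's actual interface: Task, RunQueue, fifo_policy, prio_policy, balance and demo are all hypothetical names, and the "lower prio value runs first" convention is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    prio: int      # lower value = more urgent (assumed convention)
    arrival: int   # enqueue order


# An intra-queue policy only decides which queued task runs next.
Policy = Callable[[List[Task]], Task]


def fifo_policy(tasks):
    return min(tasks, key=lambda t: t.arrival)


def prio_policy(tasks):
    return min(tasks, key=lambda t: (t.prio, t.arrival))


@dataclass
class RunQueue:
    policy: Policy
    tasks: List[Task] = field(default_factory=list)

    def pick_next(self):
        task = self.policy(self.tasks)
        self.tasks.remove(task)
        return task


def balance(queues):
    """Move tasks from the longest queue to the shortest until their
    lengths differ by at most one. Only lengths are inspected; the
    balancer never looks at the queues' policies."""
    while True:
        busiest = max(queues, key=lambda q: len(q.tasks))
        idlest = min(queues, key=lambda q: len(q.tasks))
        if len(busiest.tasks) - len(idlest.tasks) <= 1:
            return
        idlest.tasks.append(busiest.tasks.pop())


def demo(policy):
    """Same initial load and same balancer, different intra-queue policy."""
    queues = [RunQueue(policy), RunQueue(policy)]
    queues[0].tasks = [Task("a", 5, 0), Task("b", 1, 1),
                       Task("c", 3, 2), Task("d", 2, 3)]
    balance(queues)
    lengths = [len(q.tasks) for q in queues]
    return lengths, queues[0].pick_next().name


print(demo(fifo_policy))   # ([2, 2], 'a')
print(demo(prio_policy))   # ([2, 2], 'b')
```

In this model the balancer produces the same queue lengths whichever policy is plugged in, and only the choice of next task changes. A policy that needs to influence placement across CPUs does not fit this shape.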
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams @ 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 5:09 ` Peter Williams 2007-04-16 17:22 ` Chris Friesen 1 sibling, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-16 3:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> One of the reasons I never posted my own code is that it never met its >> own design goals, which absolutely included switching on the fly. I >> think Peter Williams may have done something about that. >> It was my hope >> to be able to do insmod sched_foo.ko until it became clear that the >> effort it was intended to assist wasn't going to get even the limited >> hardware access required, at which point I largely stopped working on >> it. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > I didn't but some students did. > In a previous life, I did implement a runtime configurable CPU > scheduling mechanism (implemented on True64, Solaris and Linux) that > allowed schedulers to be loaded as modules at run time. This was > released commercially on True64 and Solaris. So I know that it can be done. > I have thought about doing something similar for the SPA schedulers > which differ in only small ways from each other but lack motivation. Driver models for scheduling are not so far out. AFAICS it's largely a tug-of-war over design goals, e.g. maintaining per-cpu runqueues and switching out intra-queue policies vs. switching out whole-system policies, SMP handling and all. Whether this involves load balancing depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x scheduler module, for instance, would not have a load balancer at all, as it has only one global runqueue. 
There are other sorts of policies wanting significant changes to SMP handling vs. the stock load balancing. William Lee Irwin III wrote: >> I'm not sure what happened there. It wasn't a big enough patch to take >> hits in this area due to getting overwhelmed by the programming burden >> like some other efforts of mine. Maybe things started getting ugly once >> on-the-fly switching entered the picture. My guess is that Peter Williams >> will have to chime in here, since things have diverged enough from my >> one-time contribution 4 years ago. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > From my POV, the current version of plugsched is considerably simpler > than it was when I took the code over from Con as I put considerable > effort into minimizing code overlap in the various schedulers. > I also put considerable effort into minimizing any changes to the load > balancing code (something Ingo seems to think is a deficiency) and the > result is that plugsched allows "intra run queue" scheduling to be > easily modified WITHOUT effecting load balancing. To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. ISTR rearranging things for con in such a fashion that it no longer worked out of the box (though that wasn't the intention; restructuring it to be more suited to his purposes was) and that's what he worked off of afterward. I don't remember very well what changed there as I clearly invested less effort there than the prior versions. Now that I think of it, that may have been where the sample policy demonstrating scheduling classes was lost. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > As Ingo correctly points out, plugsched does not allow different > schedulers to be used per CPU but it would not be difficult to modify it > so that they could. 
Although I've considered doing this over the years > I decided not to as it would just increase the complexity and the amount > of work required to keep the patch set going. About six months ago I > decided to reduce the amount of work I was doing on plugsched (as it was > obviously never going to be accepted) and now only publish patches > against the vanilla kernel's major releases (and the only reason that I > kept doing that is that the download figures indicated that about 80 > users were interested in the experiment). That's a rather different goal from what I was going on about with it, so it's all diverged quite a bit. Where I had a significant need for mucking with the entire concept of how SMP was handled, this is rather different. At this point I'm questioning the relevance of my own work, though it was already relatively marginal as it started life as an attempt at a sort of debug patch to help gang scheduling (which is in itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > PS I no longer read LKML (due to time constraints) and would appreciate > it if I could be CC'd on any e-mails suggesting scheduler changes. > PPS I'm just happy to see that Ingo has finally accepted that the > vanilla scheduler was badly in need of fixing and don't really care who > fixes it. > PPS Different schedulers for different aims (i.e. server or work > station) do make a difference. E.g. the spa_svr scheduler in plugsched > does about 1% better on kernbench than the next best scheduler in the bunch. > PPPS Con, fairness isn't always best as humans aren't very altruistic > and we need to give unfair preference to interactive tasks in order to > stop the users flinging their PCs out the window. But the current > scheduler doesn't do this very well and is also not very good at > fairness so needs to change. But the changes need to address > interactive response and fairness not just fairness. 
Kernel compiles are not so useful a benchmark. SDET, OAST, AIM7, etc. are better ones. I'd not bother citing kernel compile results. In any event, I'm not sure what to say about different schedulers for different aims. My intentions with plugsched were not centered around production usage or intra-queue policy. I'm relatively indifferent to the notion of having pluggable CPU schedulers, intra-queue or otherwise, in mainline. I don't see any particular harm in it, but neither am I particularly motivated to have it in. I had a rather strong sense of instrumentality about it, and since it became useless to me (at a conceptual level; the implementation was never finished to the point of dynamic loading of scheduler modules) for assisting development on large systems via reboot avoidance by dint of it becoming clear that access to such was never going to happen, I've stopped looking at it. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 5:09 ` Peter Williams 2007-04-16 11:04 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-16 5:09 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> One of the reasons I never posted my own code is that it never met its >>> own design goals, which absolutely included switching on the fly. I >>> think Peter Williams may have done something about that. >>> It was my hope >>> to be able to do insmod sched_foo.ko until it became clear that the >>> effort it was intended to assist wasn't going to get even the limited >>> hardware access required, at which point I largely stopped working on >>> it. > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> I didn't but some students did. >> In a previous life, I did implement a runtime configurable CPU >> scheduling mechanism (implemented on True64, Solaris and Linux) that >> allowed schedulers to be loaded as modules at run time. This was >> released commercially on True64 and Solaris. So I know that it can be done. >> I have thought about doing something similar for the SPA schedulers >> which differ in only small ways from each other but lack motivation. > > Driver models for scheduling are not so far out. AFAICS it's largely a > tug-of-war over design goals, e.g. maintaining per-cpu runqueues and > switching out intra-queue policies vs. switching out whole-system > policies, SMP handling and all. Whether this involves load balancing > depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x > scheduler module, for instance, would not have a load balancer at all, > as it has only one global runqueue. 
There are other sorts of policies > wanting significant changes to SMP handling vs. the stock load > balancing. Well a single run queue removes the need for load balancing but has scalability issues on large systems. Personally, I think something in between would be the best solution i.e. multiple run queues but more than one CPU per run queue. I think that this would be a particularly good solution to the problems introduced by hyper threading and multi core systems and also NUMA systems. E.g. if all CPUs in a hyper thread package are using the one queue then the case where one CPU is trying to run a high priority task and the other a low priority task (i.e. the cases that the sleeping dependent mechanism tried to address) is less likely to occur. By the way, I think that it's a very bad idea for the scheduling mechanism and the load balancing mechanism to be coupled. The anomalies that will be experienced and the attempts to make ad hoc fixes for them will lead to complexity spiralling out of control. > > > William Lee Irwin III wrote: >>> I'm not sure what happened there. It wasn't a big enough patch to take >>> hits in this area due to getting overwhelmed by the programming burden >>> like some other efforts of mine. Maybe things started getting ugly once >>> on-the-fly switching entered the picture. My guess is that Peter Williams >>> will have to chime in here, since things have diverged enough from my >>> one-time contribution 4 years ago. > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> From my POV, the current version of plugsched is considerably simpler >> than it was when I took the code over from Con as I put considerable >> effort into minimizing code overlap in the various schedulers. 
>> I also put considerable effort into minimizing any changes to the load >> balancing code (something Ingo seems to think is a deficiency) and the >> result is that plugsched allows "intra run queue" scheduling to be >> easily modified WITHOUT effecting load balancing. To my mind scheduling >> and load balancing are orthogonal and keeping them that way simplifies >> things. > > ISTR rearranging things for con in such a fashion that it no longer > worked out of the box (though that wasn't the intention; restructuring it > to be more suited to his purposes was) and that's what he worked off of > afterward. I don't remember very well what changed there as I clearly > invested less effort there than the prior versions. Now that I think of > it, that may have been where the sample policy demonstrating scheduling > classes was lost. I can't comment here as (as far as I can recall) I never saw your code and only became involved when Con posted his version of cpusched. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> As Ingo correctly points out, plugsched does not allow different >> schedulers to be used per CPU but it would not be difficult to modify it >> so that they could. Although I've considered doing this over the years >> I decided not to as it would just increase the complexity and the amount >> of work required to keep the patch set going. About six months ago I >> decided to reduce the amount of work I was doing on plugsched (as it was >> obviously never going to be accepted) and now only publish patches >> against the vanilla kernel's major releases (and the only reason that I >> kept doing that is that the download figures indicated that about 80 >> users were interested in the experiment). > > That's a rather different goal from what I was going on about with it, > so it's all diverged quite a bit. Yes, pragmatic considerations dictated a change of tack. 
> Where I had a significant need for > mucking with the entire concept of how SMP was handled, this is rather > different. Yes, I went with the idea of intra run queue scheduling being orthogonal to load balancing for two reasons: 1. I think that coupling them is a bad idea from the complexity POV, and 2. it's enough of a battle fighting for modifications to one bit of the code without trying to do it to two simultaneously. > At this point I'm questioning the relevance of my own work, > though it was already relatively marginal as it started life as an > attempt at a sort of debug patch to help gang scheduling (which is in > itself a rather marginally relevant feature to most users) code along. The main commercial plug in scheduler used with the run time loadable module scheduler that I mentioned earlier did gang scheduling (at the insistence of the Tru64 kernel folks). As this scheduler was a hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" ("unfairly" really in according to an allocation policy) among higher level entities such as users, groups and applications as well as processes; it was fairly easy to make it a gang scheduler by modifying it to give all of a process's threads the same priority based on the process's CPU usage rather than different priorities based on the threads' usage rates. In fact, it would have been possible to select between gang and non gang on a per process basis if that was considered desirable. The fact that threads and processes are distinct entities on Tru64 and Solaris made this easier to do on them than on Linux. My experience with this scheduler leads me to believe that to achieve gang scheduling and fairness, etc. you need (usage) statistics based schedulers. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> PS I no longer read LKML (due to time constraints) and would appreciate >> it if I could be CC'd on any e-mails suggesting scheduler changes. 
>> PPS I'm just happy to see that Ingo has finally accepted that the >> vanilla scheduler was badly in need of fixing and don't really care who >> fixes it. >> PPS Different schedulers for different aims (i.e. server or work >> station) do make a difference. E.g. the spa_svr scheduler in plugsched >> does about 1% better on kernbench than the next best scheduler in the bunch. >> PPPS Con, fairness isn't always best as humans aren't very altruistic >> and we need to give unfair preference to interactive tasks in order to >> stop the users flinging their PCs out the window. But the current >> scheduler doesn't do this very well and is also not very good at >> fairness so needs to change. But the changes need to address >> interactive response and fairness not just fairness. > > Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are > better ones. I'd not bother citing kernel compile results. spa_svr actually does its best work when the system isn't fully loaded as the type of improvement it strives to achieve (minimizing on queue wait time) hasn't got much room to manoeuvre when the system is fully loaded. Therefore, the fact that it's 1% better even in these circumstances is a good result and also indicates that the overhead for keeping the scheduling statistics it uses for its decision making is well spent. Especially, when you consider that the total available room for improvement on this benchmark is less than 3%. To elaborate, the motivation for this scheduler was acquired from the observation of scheduling statistics (in particular, on queue wait time) on systems running at about 30% to 50% load. Theoretically, at these load levels there should be no such waiting but the statistics show that there is considerable waiting (sometimes as high as 30% to 50%). I put this down to "lack of serendipity" e.g. everyone sleeping at the same time and then trying to run at the same time would be complete lack of serendipity. 
On the other hand, if everyone is synced then there would be total serendipity. Obviously, from the POV of a client, the time a server task spends waiting on the queue adds to the response time for any request that has been made, so reduction of this time on a server is a good thing(tm). Equally obviously, trying to achieve this synchronization by asking the tasks to cooperate with each other is not a feasible solution and some external influence needs to be exerted, and this is what spa_svr does -- it nudges the scheduling order of the tasks in a way that makes them become well synced. Unfortunately, this is not a good scheduler for an interactive system as it minimizes the response times for ALL tasks (and the system as a whole) and this can result in increased response time for some interactive tasks (clunkiness), which annoys interactive users. When you start fiddling with this scheduler to bring back "interactive unfairness" you kill a lot of its superior low overall wait time performance. So this is why I think "horses for courses" schedulers are worthwhile. > > In any event, I'm not sure what to say about different schedulers for > different aims. My intentions with plugsched were not centered around > production usage or intra-queue policy. I'm relatively indifferent to > the notion of having pluggable CPU schedulers, intra-queue or otherwise, > in mainline. I don't see any particular harm in it, but neither am I > particularly motivated to have it in. If you look at the struct sched_spa_child in the file include/linux/sched_spa.h you'll see that the interface for switching between the various SPA schedulers is quite small and making them runtime switchable would be easy. (I haven't done this in cpusched as I wanted to keep the same interface for switching schedulers for all schedulers, i.e. all run time switchable or none run time switchable, as the main aim of plugsched had become a mechanism for evaluating different intra queue scheduling designs.)
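Returning to the "lack of serendipity" observation earlier in this message, the arithmetic behind it can be illustrated with a toy model. The Python sketch below is not spa_svr and uses no measured data: it assumes one CPU, one period, non-preemptive service in wake-up order, and made-up numbers (four tasks needing 1 time unit each per 10-unit period, i.e. 40% load). The function name total_queue_wait is hypothetical.

```python
def total_queue_wait(wakeups, burst):
    """Total time tasks spend runnable but waiting on a single CPU.

    wakeups: wake-up time of each task within one period
    burst:   CPU time each task needs once woken
    Tasks are served in wake-up order without preemption.
    """
    cpu_free_at = 0.0
    waited = 0.0
    for wake in sorted(wakeups):
        start = max(wake, cpu_free_at)
        waited += start - wake
        cpu_free_at = start + burst
    return waited


PERIOD, BURST, NTASKS = 10.0, 1.0, 4   # 4 * 1 / 10 = 40% load

in_step = [0.0] * NTASKS                                   # all wake together
interleaved = [i * PERIOD / NTASKS for i in range(NTASKS)]  # evenly spread

print(total_queue_wait(in_step, BURST))      # 0 + 1 + 2 + 3 = 6.0
print(total_queue_wait(interleaved, BURST))  # 0.0
```

The load is identical in both cases. Only the relative timing of the wake-ups differs, which is the quantity a scheduler in the spirit of the description above would try to nudge.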
> I had a rather strong sense of > instrumentality about it, and since it became useless to me (at a > conceptual level; the implementation was never finished ot the point of > dynamic loading of scheduler modules) for assisting development on > large systems via reboot avoidance by dint of it becoming clear that > access to such was never going to happen, I've stopped looking at it. I'll probably stop looking at this problem as well at least for the time being until all this new code has settled. Peter PS As I no longer read LKML, I haven't yet seen Ingo's or Con's or Nick's new schedulers yet so am unable to comment on their technical merits with respect to my comments above. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:09 ` Peter Williams @ 2007-04-16 11:04 ` William Lee Irwin III 2007-04-16 12:55 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-16 11:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> Driver models for scheduling are not so far out. AFAICS it's largely a >> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >> switching out intra-queue policies vs. switching out whole-system >> policies, SMP handling and all. Whether this involves load balancing >> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >> scheduler module, for instance, would not have a load balancer at all, >> as it has only one global runqueue. There are other sorts of policies >> wanting significant changes to SMP handling vs. the stock load >> balancing. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Well a single run queue removes the need for load balancing but has > scalability issues on large systems. Personally, I think something in > between would be the best solution i.e. multiple run queues but more > than one CPU per run queue. I think that this would be a particularly > good solution to the problems introduced by hyper threading and multi > core systems and also NUMA systems. E.g. if all CPUs in a hyper thread > package are using the one queue then the case where one CPU is trying to > run a high priority task and the other a low priority task (i.e. the > cases that the sleeping dependent mechanism tried to address) is less > likely to occur. 
This wasn't meant to sing the praises of the 2.4.x scheduler; it was rather meant to point out that the 2.4.x scheduler, among others, is unimplementable within the framework if it assumes per-cpu runqueues. More plausibly useful single-queue schedulers would likely use a vastly different policy and attempt to carry out all queue manipulations via lockless operations. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > By the way, I think that it's a very bad idea for the scheduling > mechanism and the load balancing mechanism to be coupled. The anomalies > that will be experienced and the attempts to make ad hoc fixes for them > will lead to complexity spiralling out of control. This is clearly unavoidable in the case of gang scheduling. There is simply no other way to schedule N tasks which must all be run simultaneously when they run at all on N cpus of the system without such coupling and furthermore at an extremely intimate level, particularly when multiple competing gangs must be scheduled in such a fashion. William Lee Irwin III wrote: >> Where I had a significant need for >> mucking with the entire concept of how SMP was handled, this is rather >> different. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Yes, I went with the idea of intra run queue scheduling being orthogonal > to load balancing for two reasons: > 1. I think that coupling them is a bad idea from the complexity POV, and > 2. it's enough of a battle fighting for modifications to one bit of the > code without trying to do it to two simultaneously. As nice as that sounds, such a code structure would've precluded the entire raison d'etre of the patch, i.e. gang scheduling. 
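The coupling argued for above can be made concrete with a small sketch: to run a gang of N threads at once, the scheduler has to pick N CPUs and clear them, which is itself a load-balancing decision. The Python below is only an illustration under simplifying assumptions (a single gang, "least loaded" as the selection rule, displaced tasks moved to the least-loaded remaining CPU). The function place_gang and its data layout are hypothetical and do not come from any patch in this thread.

```python
def place_gang(cpu_tasks, gang):
    """Choose CPUs for a gang and migrate other work off them.

    cpu_tasks: dict mapping cpu id -> list of task names queued there
               (modified in place to reflect the migrations)
    gang:      list of thread names that must all run at the same time
    Returns (placement, migrations), or None if there are fewer CPUs
    than gang members.
    """
    n = len(gang)
    if n > len(cpu_tasks):
        return None
    # Pick the n least-loaded CPUs, breaking ties by cpu id.
    chosen = sorted(cpu_tasks, key=lambda c: (len(cpu_tasks[c]), c))[:n]
    others = [c for c in cpu_tasks if c not in chosen]
    migrations = []
    for cpu in chosen:
        for task in list(cpu_tasks[cpu]):
            if not others:
                break  # gang covers every CPU; displaced tasks simply wait
            target = min(others, key=lambda c: (len(cpu_tasks[c]), c))
            cpu_tasks[cpu].remove(task)
            cpu_tasks[target].append(task)
            migrations.append((task, cpu, target))
    placement = dict(zip(gang, chosen))
    return placement, migrations


cpus = {0: ["a", "b"], 1: [], 2: ["c"], 3: ["d", "e", "f"]}
print(place_gang(cpus, ["g0", "g1"]))
# ({'g0': 1, 'g1': 2}, [('c', 2, 0)])
```

Even in this toy form, placing the gang forces a migration (task c moves from CPU 2 to CPU 0). With several competing gangs the choice of CPUs and time slots becomes a joint problem.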
William Lee Irwin III wrote: >> At this point I'm questioning the relevance of my own work, >> though it was already relatively marginal as it started life as an >> attempt at a sort of debug patch to help gang scheduling (which is in >> itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > The main commercial plug in scheduler used with the run time loadable > module scheduler that I mentioned earlier did gang scheduling (at the > insistence of the Tru64 kernel folks). As this scheduler was a > hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" > ("unfairly" really in according to an allocation policy) among higher > level entities such as users, groups and applications as well as > processes; it was fairly easy to make it a gang scheduler by modifying > it to give all of a process's threads the same priority based on the > process's CPU usage rather than different priorities based on the > threads' usage rates. In fact, it would have been possible to select > between gang and non gang on a per process basis if that was considered > desirable. > The fact that threads and processes are distinct entities on Tru64 and > Solaris made this easier to do on them than on Linux. > My experience with this scheduler leads me to believe that to achieve > gang scheduling and fairness, etc. you need (usage) statistics based > schedulers. This does not appear to make sense unless it's based on an incorrect use of the term "gang scheduling." I'm referring to a gang as a set of tasks (typically restricted to threads of the same process) which must all be considered runnable or unrunnable simultaneously, and are for the sake of performance required to all actually be run simultaneously. This means a gang of N threads, when run, must run on N processors at once. A time and a set of processors must be chosen for any time interval where the gang is running. 
This interacts with load balancing by needing to choose the cpus to run the gang on, and also arranging for a set of cpus available for the gang to use to exist by means of migrating tasks off the chosen cpus. William Lee Irwin III wrote: >> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >> better ones. I'd not bother citing kernel compile results. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > spa_svr actually does its best work when the system isn't fully loaded > as the type of improvement it strives to achieve (minimizing on queue > wait time) hasn't got much room to manoeuvre when the system is fully > loaded. Therefore, the fact that it's 1% better even in these > circumstances is a good result and also indicates that the overhead for > keeping the scheduling statistics it uses for its decision making is > well spent. Especially, when you consider that the total available room > for improvement on this benchmark is less than 3%. None of these benchmarks require the system to be fully loaded. They are, on the other hand, vastly more realistic simulated workloads than kernel compiles, and furthermore are actually developed as benchmarks, with in some cases even measurements of variance, iteration to convergence, and similar such things that make them actually scientific. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > To elaborate, the motivation for this scheduler was acquired from the > observation of scheduling statistics (in particular, on queue wait time) > on systems running at about 30% to 50% load. Theoretically, at these > load levels there should be no such waiting but the statistics show that > there is considerable waiting (sometimes as high as 30% to 50%). I put > this down to "lack of serendipity" e.g. everyone sleeping at the same > time and then trying to run at the same time would be complete lack of > serendipity. 
On the other hand, if everyone is synced then there would > be total serendipity. > Obviously, from the POV of a client, time the server task spends waiting > on the queue adds to the response time for any request that has been > made so reduction of this time on a server is a good thing(tm). Equally > obviously, trying to achieve this synchronization by asking the tasks to > cooperate with each other is not a feasible solution and some external > influence needs to be exerted and this is what spa_svr does -- it nudges > the scheduling order of the tasks in a way that makes them become well > synced. This all sounds like a relatively good idea. So it's good for throughput vs. latency or otherwise not particularly interactive. No big deal, just use it where it makes sense. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Unfortunately, this is not a good scheduler for an interactive system as > it minimizes the response times for ALL tasks (and the system as a > whole) and this can result in increased response time for some > interactive tasks (clunkiness) which annoys interactive users. When you > start fiddling with this scheduler to bring back "interactive > unfairness" you kill a lot of its superior low overall wait time > performance. > So this is why I think "horses for courses" schedulers are worth while. I have no particular objection to using an appropriate scheduler for the system's workload. I also have little or no preference as to how that's accomplished overall. But I really think that if we want to push pluggable scheduling it should load schedulers as kernel modules on the fly and so on versus pure /proc/ tunables and a compiled-in set of alternatives. William Lee Irwin III wrote: >> In any event, I'm not sure what to say about different schedulers for >> different aims. My intentions with plugsched were not centered around >> production usage or intra-queue policy. 
I'm relatively indifferent to >> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >> in mainline. I don't see any particular harm in it, but neither am I >> particularly motivated to have it in. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > If you look at the struct sched_spa_child in the file > include/linux/sched_spa.h you'll see that the interface for switching > between the various SPA schedulers is quite small and making them > runtime switchable would be easy (I haven't done this in cpusched as I > wanted to keep the same interface for switching schedulers for all > schedulers: i.e. all run time switchable or none run time switchable; as > the main aim of plugsched had become a mechanism for evaluating > different intra queue scheduling designs.) I remember actually looking at this, and I would almost characterize the differences between the SPA schedulers as a tunable parameter. I have a different concept of what pluggability means from how the SPA schedulers were switched, but no particular objection to the method given the commonalities between them. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
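wli's definition of gang scheduling in the message above (a gang of N threads runs on N processors at once, and tasks are migrated off the chosen cpus) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from plugsched or from any patch in this thread: `pick_gang_cpus` and `tasks_to_migrate` are hypothetical names, load is simplified to a count of runnable tasks per cpu, and the sketch handles at most 64 cpus.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from any patch in this thread):
 * choose the gang_size least-loaded cpus for a gang.
 * load[i] is the number of runnable tasks currently on cpu i.
 * Writes the chosen cpu ids to out[] and returns 0, or returns -1
 * when the gang has more threads than there are cpus.
 */
int pick_gang_cpus(const unsigned int *load, size_t ncpus,
                   size_t gang_size, size_t *out)
{
    unsigned char taken[64] = {0};  /* sketch limit: at most 64 cpus */
    size_t picked, cpu;

    assert(ncpus <= 64);
    if (gang_size > ncpus)
        return -1;

    for (picked = 0; picked < gang_size; picked++) {
        size_t best = ncpus;        /* sentinel: nothing chosen yet */

        for (cpu = 0; cpu < ncpus; cpu++) {
            if (taken[cpu])
                continue;
            /* strict '<' keeps the lowest-numbered cpu on ties */
            if (best == ncpus || load[cpu] < load[best])
                best = cpu;
        }
        taken[best] = 1;
        out[picked] = best;
    }
    return 0;
}

/*
 * Number of tasks the load balancer would have to migrate away so
 * that the gang has the chosen cpus to itself.
 */
unsigned int tasks_to_migrate(const unsigned int *load,
                              const size_t *cpus, size_t gang_size)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < gang_size; i++)
        total += load[cpus[i]];
    return total;
}
```

Even in this toy form the placement step reads per-cpu load and implies migrations, which is the coupling between scheduling and load balancing that wli describes. A real implementation would also have to pick the time interval and arbitrate between competing gangs, which this sketch leaves out.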
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:04 ` William Lee Irwin III @ 2007-04-16 12:55 ` Peter Williams 2007-04-16 23:10 ` Michael K. Edwards [not found] ` <20070416135915.GK8915@holomorphy.com> 0 siblings, 2 replies; 713+ messages in thread From: Peter Williams @ 2007-04-16 12:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Driver models for scheduling are not so far out. AFAICS it's largely a >>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >>> switching out intra-queue policies vs. switching out whole-system >>> policies, SMP handling and all. Whether this involves load balancing >>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >>> scheduler module, for instance, would not have a load balancer at all, >>> as it has only one global runqueue. There are other sorts of policies >>> wanting significant changes to SMP handling vs. the stock load >>> balancing. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Well a single run queue removes the need for load balancing but has >> scalability issues on large systems. Personally, I think something in >> between would be the best solution i.e. multiple run queues but more >> than one CPU per run queue. I think that this would be a particularly >> good solution to the problems introduced by hyper threading and multi >> core systems and also NUMA systems. E.g. if all CPUs in a hyper thread >> package are using the one queue then the case where one CPU is trying to >> run a high priority task and the other a low priority task (i.e. the >> cases that the sleeping dependent mechanism tried to address) is less >> likely to occur. 
> > This wasn't meant to sing the praises of the 2.4.x scheduler; it was > rather meant to point out that the 2.4.x scheduler, among others, is > unimplementable within the framework if it assumes per-cpu runqueues. > More plausibly useful single-queue schedulers would likely use a vastly > different policy and attempt to carry out all queue manipulations via > lockless operations. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> By the way, I think that it's a very bad idea for the scheduling >> mechanism and the load balancing mechanism to be coupled. The anomalies >> that will be experienced and the attempts to make ad hoc fixes for them >> will lead to complexity spiralling out of control. > > This is clearly unavoidable in the case of gang scheduling. There is > simply no other way to schedule N tasks which must all be run > simultaneously when they run at all on N cpus of the system without > such coupling and furthermore at an extremely intimate level, > particularly when multiple competing gangs must be scheduled in such > a fashion. I can't see the logic here or why you would want to do such a thing. It certainly doesn't coincide with what I interpret "gang scheduling" to mean. > > > William Lee Irwin III wrote: >>> Where I had a significant need for >>> mucking with the entire concept of how SMP was handled, this is rather >>> different. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Yes, I went with the idea of intra run queue scheduling being orthogonal >> to load balancing for two reasons: >> 1. I think that coupling them is a bad idea from the complexity POV, and >> 2. it's enough of a battle fighting for modifications to one bit of the >> code without trying to do it to two simultaneously. > > As nice as that sounds, such a code structure would've precluded the > entire raison d'etre of the patch, i.e. gang scheduling. Not for what I understand "gang scheduling" to mean. 
As I understand it the constraints of gang scheduling are nowhere near as strict as you seem to think they are. And for what it's worth I don't think that what you think it means is in any sense a reasonable target. > > > William Lee Irwin III wrote: >>> At this point I'm questioning the relevance of my own work, >>> though it was already relatively marginal as it started life as an >>> attempt at a sort of debug patch to help gang scheduling (which is in >>> itself a rather marginally relevant feature to most users) code along. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> The main commercial plug in scheduler used with the run time loadable >> module scheduler that I mentioned earlier did gang scheduling (at the >> insistence of the Tru64 kernel folks). As this scheduler was a >> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" >> ("unfairly" really in according to an allocation policy) among higher >> level entities such as users, groups and applications as well as >> processes; it was fairly easy to make it a gang scheduler by modifying >> it to give all of a process's threads the same priority based on the >> process's CPU usage rather than different priorities based on the >> threads' usage rates. In fact, it would have been possible to select >> between gang and non gang on a per process basis if that was considered >> desirable. >> The fact that threads and processes are distinct entities on Tru64 and >> Solaris made this easier to do on them than on Linux. >> My experience with this scheduler leads me to believe that to achieve >> gang scheduling and fairness, etc. you need (usage) statistics based >> schedulers. > > This does not appear to make sense unless it's based on an incorrect > use of the term "gang scheduling." It's become obvious that we mean different things.
> I'm referring to a gang as a set of > tasks (typically restricted to threads of the same process) which must > all be considered runnable or unrunnable simultaneously, and are for > the sake of performance required to all actually be run simultaneously. > This means a gang of N threads, when run, must run on N processors at > once. A time and a set of processors must be chosen for any time > interval where the gang is running. This interacts with load balancing > by needing to choose the cpus to run the gang on, and also arranging > for a set of cpus available for the gang to use to exist by means of > migrating tasks off the chosen cpus. Sounds like a job for the load balancer NOT the scheduler. Also I can't see you meeting such strict constraints without making the tasks all SCHED_FIFO. > > > William Lee Irwin III wrote: >>> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >>> better ones. I'd not bother citing kernel compile results. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> spa_svr actually does its best work when the system isn't fully loaded >> as the type of improvement it strives to achieve (minimizing on queue >> wait time) hasn't got much room to manoeuvre when the system is fully >> loaded. Therefore, the fact that it's 1% better even in these >> circumstances is a good result and also indicates that the overhead for >> keeping the scheduling statistics it uses for its decision making is >> well spent. Especially, when you consider that the total available room >> for improvement on this benchmark is less than 3%. > > None of these benchmarks require the system to be fully loaded. They > are, on the other hand, vastly more realistic simulated workloads than > kernel compiles, and furthermore are actually developed as benchmarks, > with in some cases even measurements of variance, iteration to > convergence, and similar such things that make them actually scientific. 
> > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> To elaborate, the motivation for this scheduler was acquired from the >> observation of scheduling statistics (in particular, on queue wait time) >> on systems running at about 30% to 50% load. Theoretically, at these >> load levels there should be no such waiting but the statistics show that >> there is considerable waiting (sometimes as high as 30% to 50%). I put >> this down to "lack of serendipity" e.g. everyone sleeping at the same >> time and then trying to run at the same time would be complete lack of >> serendipity. On the other hand, if everyone is synced then there would >> be total serendipity. >> Obviously, from the POV of a client, time the server task spends waiting >> on the queue adds to the response time for any request that has been >> made so reduction of this time on a server is a good thing(tm). Equally >> obviously, trying to achieve this synchronization by asking the tasks to >> cooperate with each other is not a feasible solution and some external >> influence needs to be exerted and this is what spa_svr does -- it nudges >> the scheduling order of the tasks in a way that makes them become well >> synced. > > This all sounds like a relatively good idea. So it's good for throughput > vs. latency or otherwise not particularly interactive. No big deal, just > use it where it makes sense. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Unfortunately, this is not a good scheduler for an interactive system as >> it minimizes the response times for ALL tasks (and the system as a >> whole) and this can result in increased response time for some >> interactive tasks (clunkiness) which annoys interactive users. When you >> start fiddling with this scheduler to bring back "interactive >> unfairness" you kill a lot of its superior low overall wait time >> performance. >> So this is why I think "horses for courses" schedulers are worth while. 
> > I have no particular objection to using an appropriate scheduler for the > system's workload. I also have little or no preference as to how that's > accomplished overall. But I really think that if we want to push > pluggable scheduling it should load schedulers as kernel modules on the > fly and so on versus pure /proc/ tunables and a compiled-in set of > alternatives. > > > William Lee Irwin III wrote: >>> In any event, I'm not sure what to say about different schedulers for >>> different aims. My intentions with plugsched were not centered around >>> production usage or intra-queue policy. I'm relatively indifferent to >>> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >>> in mainline. I don't see any particular harm in it, but neither am I >>> particularly motivated to have it in. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> If you look at the struct sched_spa_child in the file >> include/linux/sched_spa.h you'll see that the interface for switching >> between the various SPA schedulers is quite small and making them >> runtime switchable would be easy (I haven't done this in cpusched as I >> wanted to keep the same interface for switching schedulers for all >> schedulers: i.e. all run time switchable or none run time switchable; as >> the main aim of plugsched had become a mechanism for evaluating >> different intra queue scheduling designs.) > > I remember actually looking at this, and I would almost characterize > the differences between the SPA schedulers as a tunable parameter. I > have a different concept of what pluggability means from how the SPA > schedulers were switched, but no particular objection to the method > given the commonalities between them. Yes, that's the way I look at them (in fact, in Zaphod that's exactly what they were -- i.e. Zaphod could be made to behave like various SPA schedulers by fiddling its run time parameters). 
They illustrate (to my mind) that once you get rid of the O(1) scheduler and replace it with a simple mechanism such as SPA (where there's a small number of points where the scheduling discipline gets to do its thing rather than being interspersed willy nilly throughout the rest of the code) adding run time switchable "horses for courses" scheduler disciplines becomes simple. I think that the simplifications in Ingo's new scheduler (whose scheduling classes now look a lot like Solaris's and its predecessor OSes' scheduler classes) may make it possible to have switchable scheduling disciplines within a scheduling class. I think that something similar (i.e. switchability) could be done for load balancing so that different load balancers could be used when required. By keeping this load balancing functionality orthogonal to the intra run queue scheduling disciplines you increase the number of options available. As I see it, if the scheduling discipline in use does its job properly within a run queue and the load balancer does its job of keeping the weighted load/demand on each run queue roughly equal (except where it has to do otherwise for your version of "gang scheduling") then the overall outcome will meet expectations. Note that I talk of run queues not CPUs as I think a shift to multiple CPUs per run queue may be a good idea. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
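Peter's point above, that a scheduler with only a few well-defined hook points makes run time switchable disciplines simple, can be sketched as a table of function pointers. This is only an illustration under stated assumptions: `struct discipline_ops`, `switch_discipline` and the two example disciplines are hypothetical names invented here. It is not the real `struct sched_spa_child` from include/linux/sched_spa.h, and the "longest wait" policy is a loose stand-in rather than spa_svr's actual algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Toy task: a lower prio value means a higher priority. */
struct task {
    int prio;
    unsigned long wait_ns;   /* time spent waiting on the run queue */
};

/* Hypothetical hook table; NOT the real struct sched_spa_child. */
struct discipline_ops {
    const char *name;
    /* return the index of the task to run next (n must be > 0) */
    size_t (*pick_next)(const struct task *tasks, size_t n);
};

static size_t pick_highest_prio(const struct task *t, size_t n)
{
    size_t i, best = 0;

    for (i = 1; i < n; i++)
        if (t[i].prio < t[best].prio)
            best = i;
    return best;
}

static size_t pick_longest_wait(const struct task *t, size_t n)
{
    size_t i, best = 0;

    for (i = 1; i < n; i++)
        if (t[i].wait_ns > t[best].wait_ns)
            best = i;
    return best;
}

const struct discipline_ops prio_ops = { "prio", pick_highest_prio };
const struct discipline_ops wait_ops = { "wait", pick_longest_wait };

/* The discipline currently in use; swapped at run time. */
static const struct discipline_ops *current_ops = &prio_ops;

void switch_discipline(const struct discipline_ops *ops)
{
    assert(ops != NULL);
    current_ops = ops;
}

/* The single point where the core defers to the active discipline. */
size_t schedule_next(const struct task *tasks, size_t n)
{
    assert(n > 0);
    return current_ops->pick_next(tasks, n);
}
```

Because the core calls the discipline only through `schedule_next`, swapping the pointer is all a switch needs in this sketch. A real kernel would also need locking and some way to migrate per-task state between disciplines.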
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 12:55 ` Peter Williams @ 2007-04-16 23:10 ` Michael K. Edwards 2007-04-17 3:55 ` Nick Piggin [not found] ` <20070416135915.GK8915@holomorphy.com> 1 sibling, 1 reply; 713+ messages in thread From: Michael K. Edwards @ 2007-04-16 23:10 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > Note that I talk of run queues > not CPUs as I think a shift to multiple CPUs per run queue may be a good > idea. This observation of Peter's is the best thing to come out of this whole foofaraw. Looking at what's happening in CPU-land, I think it's going to be necessary, within a couple of years, to replace the whole idea of "CPU scheduling" with "run queue scheduling" across a complex, possibly dynamic mix of CPU-ish resources. Ergo, there's not much point in churning the mainline scheduler through a design that isn't significantly more flexible than any of those now under discussion. For instance, there are architectures where several "CPUs" (instruction stream decoders feeding execution pipelines) share parts of a cache hierarchy ("chip-level multitasking"). On these machines, you may want to co-schedule a "real" processing task on one pipeline with a "cache warming" task on the other pipeline -- but only for tasks whose memory access patterns have been sufficiently analyzed to write the "cache warming" task code. Some other tasks may want to idle the second pipeline so they can use the full cache-to-RAM bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O bound but so context-heavy that it's not worth yielding the CPU during quick I/Os), and hence perfectly happy to run concurrently with an unrelated task on the other pipeline. 
There are other architectures where several "hardware threads" fight over parts of a cache hierarchy (sometimes bizarrely described as "sharing" the cache, kind of the way most two-year-olds "share" toys). On these machines, one instruction pipeline can't help the other along cache-wise, but it sure can hurt. A scheduler designed, tested, and tuned principally on one of these architectures (hint: "hyperthreading") will probably leave a lot of performance on the floor on processors in the former category. In the not-so-distant future, we're likely to see architectures with dynamically reconfigurable interconnect between instruction issue units and execution resources. (This is already quite feasible on, say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with as many Nios II cores as fit on the chip.) Restoring task context may involve not just MMU swaps and FPU instructions (with state-dependent hidden costs) but processor reconfiguration. Achieving "fairness" according to any standard that a platform integrator cares about (let alone an end user) will require a fairly detailed model of the hidden costs associated with different sorts of task switch. So if you are interested in schedulers for some reason other than a paycheck, let the distros worry about 5% improvements on x86[_64]. Get hold of some different "hardware" -- say: - a Xilinx ML410 if you've got $3K to blow and want to explore reconfigurable processors; - a SunFire T2000 if you've got $11K and want to mess with a CMT system that's actually shipping; - a QEMU-simulated massively SMP x86 if you're poor but clever enough to implement funky cross-core cache effects yourself; or - a cycle-accurate simulator from Gaisler or Virtio if you want a real research project. Then go explore some more interesting regions of parameter space and see what the demands on mainline Linux will look like in a few years. Cheers, - Michael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 23:10 ` Michael K. Edwards @ 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams 2007-04-17 8:24 ` William Lee Irwin III 0 siblings, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 3:55 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >Note that I talk of run queues > >not CPUs as I think a shift to multiple CPUs per run queue may be a good > >idea. > > This observation of Peter's is the best thing to come out of this > whole foofaraw. Looking at what's happening in CPU-land, I think it's > going to be necessary, within a couple of years, to replace the whole > idea of "CPU scheduling" with "run queue scheduling" across a complex, > possibly dynamic mix of CPU-ish resources. Ergo, there's not much > point in churning the mainline scheduler through a design that isn't > significantly more flexible than any of those now under discussion. Why? If you do that, then your load balancer just becomes less flexible because it is harder to have tasks run on one CPU or the other. You can have single-runqueue-per-domain behaviour (or close to) just by relaxing all restrictions on idle load balancing within that domain. It is harder to go the other way and place any per-cpu affinity or restrictions with multiple cpus on a single runqueue. > For instance, there are architectures where several "CPUs" > (instruction stream decoders feeding execution pipelines) share parts > of a cache hierarchy ("chip-level multitasking").
On these machines, > you may want to co-schedule a "real" processing task on one pipeline > with a "cache warming" task on the other pipeline -- but only for > tasks whose memory access patterns have been sufficiently analyzed to > write the "cache warming" task code. Some other tasks may want to > idle the second pipeline so they can use the full cache-to-RAM > bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O > bound but so context-heavy that it's not worth yielding the CPU during > quick I/Os), and hence perfectly happy to run concurrently with an > unrelated task on the other pipeline. We can do all that now with load balancing, affinities or by shutting down threads dynamically. > There are other architectures where several "hardware threads" fight > over parts of a cache hierarchy (sometimes bizarrely described as > "sharing" the cache, kind of the way most two-year-olds "share" toys). > On these machines, one instruction pipeline can't help the other > along cache-wise, but it sure can hurt. A scheduler designed, tested, > and tuned principally on one of these architectures (hint: > "hyperthreading") will probably leave a lot of performance on the > floor on processors in the former category. > > In the not-so-distant future, we're likely to see architectures with > dynamically reconfigurable interconnect between instruction issue > units and execution resources. (This is already quite feasible on, > say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with > as many Nios II cores as fit on the chip.) Restoring task context may > involve not just MMU swaps and FPU instructions (with state-dependent > hidden costs) but processsor reconfiguration. Achieving "fairness" > according to any standard that a platform integrator cares about (let > alone an end user) will require a fairly detailed model of the hidden > costs associated with different sorts of task switch. 
> > So if you are interested in schedulers for some reason other than a > paycheck, let the distros worry about 5% improvements on x86[_64]. > Get hold of some different "hardware" -- say: > - a Xilinx ML410 if you've got $3K to blow and want to explore > reconfigurable processors; > - a SunFire T2000 if you've got $11K and want to mess with a CMT > system that's actually shipping; > - a QEMU-simulated massively SMP x86 if you're poor but clever > enough to implement funky cross-core cache effects yourself; or > - a cycle-accurate simulator from Gaisler or Virtio if you want a > real research project. > Then go explore some more interesting regions of parameter space and > see what the demands on mainline Linux will look like in a few years. There are no doubt improvements to be made, but they are generally meant to be doable within the sched-domains framework. I am not aware of a particular need that would be impossible to meet using that topology hierarchy and per-CPU runqueues, and there are added complications involved with multiple CPUs per runqueue. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin @ 2007-04-17 4:25 ` Peter Williams 2007-04-17 4:34 ` Nick Piggin 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 4:25 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>> Note that I talk of run queues >>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>> idea. >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. > > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. > > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. Allowing N (where N can be one or greater) CPUs per run queue actually increases flexibility as you can still set N to 1 to get the current behaviour. One advantage of allowing multiple CPUs per run queue would be at the smaller end of the system scale i.e. 
a PC with a single hyper threading chip (i.e. 2 CPUs) would not need to worry about load balancing at all if both CPUs used the one runqueue and all the nasty side effects that come with hyper threading would be minimized at the same time. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
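Peter's proposal above (N CPUs per run queue, with N = 1 reproducing the current per-CPU behaviour) can be illustrated with a simple mapping from CPU number to run queue. This is a sketch under an assumption of my own, namely that CPUs are grouped into consecutive blocks of `cpus_per_rq`; all function names are hypothetical and nothing here comes from an actual patch.

```c
#include <assert.h>

/*
 * Hypothetical mapping: CPUs are grouped into consecutive blocks of
 * cpus_per_rq, and each block shares one run queue.
 * With cpus_per_rq == 1 every CPU gets its own run queue.
 */
unsigned int cpu_to_rq(unsigned int cpu, unsigned int cpus_per_rq)
{
    assert(cpus_per_rq > 0);
    return cpu / cpus_per_rq;
}

/* Number of run queues needed for ncpus CPUs (rounded up). */
unsigned int nr_runqueues(unsigned int ncpus, unsigned int cpus_per_rq)
{
    assert(cpus_per_rq > 0);
    return (ncpus + cpus_per_rq - 1) / cpus_per_rq;
}

/* Balancing between run queues only matters when there are several. */
int needs_load_balancing(unsigned int ncpus, unsigned int cpus_per_rq)
{
    return nr_runqueues(ncpus, cpus_per_rq) > 1;
}
```

For the two-CPU hyper threading PC in Peter's example, `needs_load_balancing(2, 2)` is 0 because both logical CPUs share a single queue, while `needs_load_balancing(2, 1)` is 1. The sketch says nothing about the locking cost of a shared queue, which is the scalability concern raised earlier in the thread.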
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 4:34 ` Nick Piggin 2007-04-17 6:03 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 4:34 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > >>On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >>>Note that I talk of run queues > >>>not CPUs as I think a shift to multiple CPUs per run queue may be a good > >>>idea. > >>This observation of Peter's is the best thing to come out of this > >>whole foofaraw. Looking at what's happening in CPU-land, I think it's > >>going to be necessary, within a couple of years, to replace the whole > >>idea of "CPU scheduling" with "run queue scheduling" across a complex, > >>possibly dynamic mix of CPU-ish resources. Ergo, there's not much > >>point in churning the mainline scheduler through a design that isn't > >>significantly more flexible than any of those now under discussion. > > > >Why? If you do that, then your load balancer just becomes less flexible > >because it is harder to have tasks run on one or the other. > > > >You can have single-runqueue-per-domain behaviour (or close to) just by > >relaxing all restrictions on idle load balancing within that domain. It > >is harder to go the other way and place any per-cpu affinity or > >restirctions with multiple cpus on a single runqueue. > > Allowing N (where N can be one or greater) CPUs per run queue actually > increases flexibility as you can still set N to 1 to get the current > behaviour. 
But you add extra code for that on top of what we have, and are also prevented from making per-cpu assumptions. And you can get N-CPUs-per-runqueue behaviour by having them in a domain with no restrictions on idle balancing. So where does your increased flexibility come from? > One advantage of allowing multiple CPUs per run queue would be at the > smaller end of the system scale i.e. a PC with a single hyper threading > chip (i.e. 2 CPUs) would not need to worry about load balancing at all > if both CPUs used the one runqueue and all the nasty side effects that > come with hyper threading would be minimized at the same time. I don't know about that -- the current load balancer already minimises the nasty multi threading effects. SMT is very important for IBM's chips for example, and they've never had any problem with that side of it since it was introduced and the bugs were ironed out (at least, none that I've heard of). ^ permalink raw reply [flat|nested] 713+ messages in thread
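Nick's alternative above (keep per-CPU runqueues, but let any idle CPU in a domain pull work without restriction) might look roughly like the following. This is a deliberately simplified sketch, not the kernel's sched-domains load balancer: `idle_pull` is a hypothetical name, a runqueue is reduced to a count of runnable tasks, and affinity, cache cost and locking are ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * nr_running[cpu] is the number of runnable tasks on that CPU's
 * runqueue; domain[] lists the CPU ids that belong to one domain.
 * If this_cpu is idle, pull one task from the busiest CPU in the
 * domain that has more than one runnable task.
 * Returns the CPU the task was pulled from, or -1 if nothing moved.
 */
int idle_pull(unsigned int *nr_running, const size_t *domain,
              size_t domain_size, size_t this_cpu)
{
    size_t i, busiest = this_cpu;
    unsigned int max = 1;   /* a CPU with one task has nothing to spare */

    assert(nr_running != NULL && domain != NULL);

    if (nr_running[this_cpu] != 0)
        return -1;          /* not idle, nothing to do */

    for (i = 0; i < domain_size; i++) {
        size_t cpu = domain[i];

        if (cpu != this_cpu && nr_running[cpu] > max) {
            max = nr_running[cpu];
            busiest = cpu;
        }
    }
    if (busiest == this_cpu)
        return -1;          /* no CPU in the domain is overloaded */

    nr_running[busiest]--;
    nr_running[this_cpu]++;
    return (int)busiest;
}
```

With unrestricted pulling, no CPU in the domain stays idle while another has a task waiting, which approximates the behaviour of one shared queue. Whether that approximation is close enough is the point on which Nick and Peter disagree.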
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:34 ` Nick Piggin @ 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Peter Williams @ 2007-04-17 6:03 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >>>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>>>> Note that I talk of run queues >>>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>>>> idea. >>>> This observation of Peter's is the best thing to come out of this >>>> whole foofaraw. Looking at what's happening in CPU-land, I think it's >>>> going to be necessary, within a couple of years, to replace the whole >>>> idea of "CPU scheduling" with "run queue scheduling" across a complex, >>>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >>>> point in churning the mainline scheduler through a design that isn't >>>> significantly more flexible than any of those now under discussion. >>> Why? If you do that, then your load balancer just becomes less flexible >>> because it is harder to have tasks run on one or the other. >>> >>> You can have single-runqueue-per-domain behaviour (or close to) just by >>> relaxing all restrictions on idle load balancing within that domain. It >>> is harder to go the other way and place any per-cpu affinity or >>> restirctions with multiple cpus on a single runqueue. >> Allowing N (where N can be one or greater) CPUs per run queue actually >> increases flexibility as you can still set N to 1 to get the current >> behaviour. 
>
> But you add extra code for that on top of what we have, and are also
> prevented from making per-cpu assumptions.
>
> And you can get N CPUs per runqueue behaviour by having them in a domain
> with no restrictions on idle balancing. So where does your increased
> flexibility come from?
>
>> One advantage of allowing multiple CPUs per run queue would be at the
>> smaller end of the system scale i.e. a PC with a single hyper threading
>> chip (i.e. 2 CPUs) would not need to worry about load balancing at all
>> if both CPUs used the one runqueue and all the nasty side effects that
>> come with hyper threading would be minimized at the same time.
>
> I don't know about that -- the current load balancer already minimises
> the nasty multi threading effects. SMT is very important for IBM's chips
> for example, and they've never had any problem with that side of it
> since it was introduced and bugs ironed out (at least, none that I've
> heard).
>

There's a lot of ugly code in the load balancer that is only there to
overcome the side effects of SMT and dual core. A lot of it was put
there by Intel employees trying to make load balancing more friendly to
their systems. What I'm suggesting is that N CPUs per runqueue is a
better way of achieving that end. I may (of course) be wrong but I
think that the idea deserves more consideration than you're willing to
give it.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams @ 2007-04-17 6:14 ` William Lee Irwin III 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:14 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Michael K. Edwards, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to > their systems. What I'm suggesting is that an N CPUs per runqueue is a > better way of achieving that end. I may (of course) be wrong but I > think that the idea deserves more consideration than you're willing to > give it. This may be a good one to ask Ingo about, as he did significant performance work on per-core runqueues for SMT. While I did write per-node runqueue code for NUMA at some point in the past, I did no tuning or other performance work on it, only functionality. I've actually dealt with kernels using elder versions of Ingo's code for per-core runqueues on SMT, but was never called upon to examine that particular code for either performance or stability, so I'm largely ignorant of what the perceived outcome of it was. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III @ 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 6:23 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > Nick Piggin wrote: > > > >But you add extra code for that on top of what we have, and are also > >prevented from making per-cpu assumptions. > > > >And you can get N CPUs per runqueue behaviour by having them in a domain > >with no restrictions on idle balancing. So where does your increased > >flexibilty come from? > > > >>One advantage of allowing multiple CPUs per run queue would be at the > >>smaller end of the system scale i.e. a PC with a single hyper threading > >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all > >>if both CPUs used the one runqueue and all the nasty side effects that > >>come with hyper threading would be minimized at the same time. > > > >I don't know about that -- the current load balancer already minimises > >the nasty multi threading effects. SMT is very important for IBM's chips > >for example, and they've never had any problem with that side of it > >since it was introduced and bugs ironed out (at least, none that I've > >heard). > > > > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to I agree that some of that has exploded complexity. I have some thoughts about better approaches for some of those things, but basically been stuck working on VM problems for a while. 
> their systems. What I'm suggesting is that an N CPUs per runqueue is a
> better way of achieving that end. I may (of course) be wrong but I
> think that the idea deserves more consideration than you're willing to
> give it.

Put it this way: it is trivial to group the load balancing stats of N
CPUs with their own runqueues. Just put them under a domain and take the
sum. The domain essentially takes on the same function as a single queue
with N CPUs under it. Anything _further_ you can do with individual
runqueues (like naturally adding an affinity pressure ranging from
nothing to absolute) is something that you don't trivially get with a
1:N approach. AFAIKS.

So I will definitely give any idea consideration, but I just need to be
shown where the benefit comes from.
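
[ A minimal sketch of the point made above: grouping per-CPU runqueues
under a domain and summing their statistics gives the same aggregate view
as one shared queue, while the per-CPU detail stays available. This is
not kernel code. The `RunQueue` and `Domain` classes and the
`weighted_load` field are hypothetical names chosen only for
illustration. ]

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunQueue:
    """Hypothetical per-CPU runqueue; only the load statistic is modelled."""
    cpu: int
    weighted_load: int = 0


@dataclass
class Domain:
    """Hypothetical domain that groups several per-CPU runqueues."""
    runqueues: List[RunQueue] = field(default_factory=list)

    def total_load(self) -> int:
        # The domain-level statistic is just the sum over the members,
        # which is what a single shared queue would report.
        return sum(rq.weighted_load for rq in self.runqueues)

    def busiest(self) -> RunQueue:
        # Per-CPU information is still there when the balancer wants it.
        return max(self.runqueues, key=lambda rq: rq.weighted_load)


domain = Domain([RunQueue(cpu=0, weighted_load=3),
                 RunQueue(cpu=1, weighted_load=5)])
print(domain.total_load(), domain.busiest().cpu)
```

[ The aggregate is derived from the per-CPU data, so nothing is lost by
keeping the queues separate; going the other way (recovering per-CPU
detail from a shared queue) would need extra bookkeeping. ]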
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17  6:03 ` Peter Williams
  2007-04-17  6:14   ` William Lee Irwin III
  2007-04-17  6:23   ` Nick Piggin
@ 2007-04-17  9:36   ` Ingo Molnar
  2 siblings, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:36 UTC (permalink / raw)
  To: Peter Williams
  Cc: Nick Piggin, Michael K. Edwards, William Lee Irwin III, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Peter Williams <pwil3058@bigpond.net.au> wrote:

> There's a lot of ugly code in the load balancer that is only there to
> overcome the side effects of SMT and dual core. A lot of it was put
> there by Intel employees trying to make load balancing more friendly
> to their systems. What I'm suggesting is that an N CPUs per runqueue
> is a better way of achieving that end. I may (of course) be wrong but
> I think that the idea deserves more consideration than you're willing
> to give it.

i actually implemented that some time ago and i'm afraid it was ugly as
hell and pretty fragile. Load-balancing gets simpler, but task picking
gets a lot uglier.

	Ingo
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 0 replies; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:24 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, Peter Williams, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. The big sticking point here is order-sensitivity. One can point to stringent sched_yield() ordering but that's not so important in and of itself. The more significant case is RT applications which are order- sensitive. Per-cpu runqueues rather significantly disturb the ordering requirements of applications that care about it. 
In terms of a plugging framework, the per-cpu arrangement precludes or
makes extremely awkward scheduling policies that don't have per-cpu
runqueues, for instance, the 2.4.x policy. There is also the alternate
SMP scalability strategy of a lockless scheduler with a single global
queue, which is more performance-oriented.

-- wli
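
[ The order-sensitivity concern can be illustrated with a toy model. It
assumes equal-priority FIFO tasks, no migration, and one dispatch round;
the function names are hypothetical and nothing here reflects the actual
kernel data structures. ]

```python
from collections import deque


def global_queue_order(tasks, ncpus):
    """First dispatch round when all CPUs pull from one shared FIFO."""
    queue = deque(tasks)
    return [queue.popleft() for _ in range(min(ncpus, len(queue)))]


def per_cpu_order(placement, ncpus):
    """First dispatch round when each CPU has its own FIFO.

    `placement` is a list of (task, cpu) pairs in arrival order.
    """
    queues = [deque() for _ in range(ncpus)]
    for task, cpu in placement:
        queues[cpu].append(task)
    # Each CPU runs the head of its own queue, ignoring the others.
    return [q.popleft() for q in queues if q]


# Tasks arrive in the order A, B, C on a 2-CPU system.
shared = global_queue_order(["A", "B", "C"], ncpus=2)
split = per_cpu_order([("A", 0), ("B", 0), ("C", 1)], ncpus=2)
print(shared, split)
```

[ With the shared queue the two earliest arrivals run first. With per-CPU
queues, C runs ahead of B merely because of where the tasks were placed,
which is the kind of reordering an order-sensitive RT application would
notice. ]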
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
       [not found] ` <20070417064109.GP8915@holomorphy.com>
@ 2007-04-17  8:00 ` Peter Williams
  2007-04-17 10:41   ` William Lee Irwin III
  0 siblings, 1 reply; 713+ messages in thread
From: Peter Williams @ 2007-04-17 8:00 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Linux Kernel Mailing List

William Lee Irwin III wrote:
> On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote:
>> This doesn't make any sense to me.
>> For a start, exact simultaneous operation would be impossible to achieve
>> except with highly specialized architecture such as the long departed
>> transputer. And secondly, I can't see why it's necessary.
>
> We're not going to make any headway here, so we might as well drop the
> thread.

Yes, we were starting to go around in circles weren't we?

>
> There are other things to talk about anyway, for instance I'm seeing
> interest in plugsched come about from elsewhere and am taking an
> interest in getting it into shape wrt. various design goals therefore.
>
> Probably the largest issue of note is getting scheduler drivers
> loadable as kernel modules. Addressing the points Ingo made that can
> be addressed are also lined up for this effort.
>
> Comments on which directions you'd like this to go in these respects
> would be appreciated, as I regard you as the current "project owner."

I'd do a scan through LKML from about 18 months ago looking for mention
of a runtime configurable version of plugsched. Some students at a
university (in Germany, I think) posted some patches adding this feature
to plugsched around about then.

I never added them to plugsched proper as I knew (from previous
experience when the company I worked for posted patches with similar
functionality) that Linus would like this idea less than he did the
current plugsched mechanism.

Unfortunately, my own cache of the relevant e-mails got overwritten
during a Fedora Core upgrade (I've since moved /var onto a separate
drive to avoid a repetition) or I would dig them out and send them to
you. I'd provided them with copies of the company's patches to use as a
guide to how to overcome the problems associated with changing
schedulers on a running system (a few non-trivial locking issues pop up).

Maybe if one of the students still reads LKML he will provide a pointer.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:00 ` Peter Williams @ 2007-04-17 10:41 ` William Lee Irwin III 2007-04-17 13:48 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-17 10:41 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Comments on which directions you'd like this to go in these respects >> would be appreciated, as I regard you as the current "project owner." On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I'd do scan through LKML from about 18 months ago looking for mention of > runtime configurable version of plugsched. Some students at a > university (in Germany, I think) posted some patches adding this feature > to plugsched around about then. Excellent. I'll go hunting for that. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I never added them to plugsched proper as I knew (from previous > experience when the company I worked for posted patches with similar > functionality) that Linux would like this idea less than he did the > current plugsched mechanism. Odd how the requirements ended up including that. Fickleness abounds. If only we knew up-front what the end would be. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > Unfortunately, my own cache of the relevant e-mails got overwritten > during a Fedora Core upgrade (I've since moved /var onto a separate > drive to avoid a repetition) or I would dig them out and send them to > you. I'd provided with copies of the company's patches to use as a > guide to how to overcome the problems associated with changing > schedulers on a running system (a few non trivial locking issues pop up). > Maybe if one of the students still reads LKML he will provide a pointer. 
I was tempted to restart from scratch given Ingo's comments, but I reconsidered and I'll be working with your code (and the German students' as well). If everything has to change, so be it, but it'll still be a derived work. It would be ignoring precedent and failure to properly attribute if I did otherwise. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 10:41 ` William Lee Irwin III @ 2007-04-17 13:48 ` Peter Williams 2007-04-18 0:27 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 13:48 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Comments on which directions you'd like this to go in these respects >>> would be appreciated, as I regard you as the current "project owner." > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I'd do scan through LKML from about 18 months ago looking for mention of >> runtime configurable version of plugsched. Some students at a >> university (in Germany, I think) posted some patches adding this feature >> to plugsched around about then. > > Excellent. I'll go hunting for that. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I never added them to plugsched proper as I knew (from previous >> experience when the company I worked for posted patches with similar >> functionality) that Linux would like this idea less than he did the >> current plugsched mechanism. > > Odd how the requirements ended up including that. Fickleness abounds. > If only we knew up-front what the end would be. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> Unfortunately, my own cache of the relevant e-mails got overwritten >> during a Fedora Core upgrade (I've since moved /var onto a separate >> drive to avoid a repetition) or I would dig them out and send them to >> you. I'd provided with copies of the company's patches to use as a >> guide to how to overcome the problems associated with changing >> schedulers on a running system (a few non trivial locking issues pop up). >> Maybe if one of the students still reads LKML he will provide a pointer. 
>
> I was tempted to restart from scratch given Ingo's comments, but I
> reconsidered and I'll be working with your code (and the German
> students' as well). If everything has to change, so be it, but it'll
> still be a derived work. It would be ignoring precedent and failure to
> properly attribute if I did otherwise.

I can give you a patch (or set of patches) against the latest git
vanilla kernel version if that would help. There have been changes to
the vanilla scheduler code since 2.6.20 so the latest patch on
sourceforge won't apply cleanly. I've found that implementing this as a
series of patches rather than one big patch makes it easier for me to
cope with changes to the underlying code.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:48 ` Peter Williams @ 2007-04-18 0:27 ` Peter Williams 2007-04-18 2:03 ` William Lee Irwin III 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-18 0:27 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List Peter Williams wrote: > William Lee Irwin III wrote: >> I was tempted to restart from scratch given Ingo's comments, but I >> reconsidered and I'll be working with your code (and the German >> students' as well). If everything has to change, so be it, but it'll >> still be a derived work. It would be ignoring precedent and failure to >> properly attribute if I did otherwise. > > I can give you a patch (or set of patches) against the latest git > vanilla kernel version if that would help. There have been changes to > the vanilla scheduler code since 2.6.20 so the latest patch on > sourceforge won't apply cleanly. I've found that implementing this as a > series of patches rather than one big patch makes it easier fro me to > cope with changes to the underlying code. I've just placed a single patch for plugsched against 2.6.21-rc7 updated to Linus's git tree as of an hour or two ago on sourceforge: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> This should at least enable you to get it to apply cleanly to the latest kernel sources. Let me know if you'd also like this as a quilt/mq friendly patch series? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 0:27 ` Peter Williams @ 2007-04-18 2:03 ` William Lee Irwin III 2007-04-18 2:31 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: William Lee Irwin III @ 2007-04-18 2:03 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List > Peter Williams wrote: > >William Lee Irwin III wrote: > >>I was tempted to restart from scratch given Ingo's comments, but I > >>reconsidered and I'll be working with your code (and the German > >>students' as well). If everything has to change, so be it, but it'll > >>still be a derived work. It would be ignoring precedent and failure to > >>properly attribute if I did otherwise. > > > >I can give you a patch (or set of patches) against the latest git > >vanilla kernel version if that would help. There have been changes to > >the vanilla scheduler code since 2.6.20 so the latest patch on > >sourceforge won't apply cleanly. I've found that implementing this as a > >series of patches rather than one big patch makes it easier fro me to > >cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: > I've just placed a single patch for plugsched against 2.6.21-rc7 updated > to Linus's git tree as of an hour or two ago on sourceforge: > <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> > This should at least enable you to get it to apply cleanly to the latest > kernel sources. Let me know if you'd also like this as a quilt/mq > friendly patch series? A quilt-friendly series would be most excellent if you could arrange it. Thanks. -- wli ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 2:03 ` William Lee Irwin III @ 2007-04-18 2:31 ` Peter Williams 0 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-18 2:31 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Peter Williams wrote: >>> William Lee Irwin III wrote: >>>> I was tempted to restart from scratch given Ingo's comments, but I >>>> reconsidered and I'll be working with your code (and the German >>>> students' as well). If everything has to change, so be it, but it'll >>>> still be a derived work. It would be ignoring precedent and failure to >>>> properly attribute if I did otherwise. >>> I can give you a patch (or set of patches) against the latest git >>> vanilla kernel version if that would help. There have been changes to >>> the vanilla scheduler code since 2.6.20 so the latest patch on >>> sourceforge won't apply cleanly. I've found that implementing this as a >>> series of patches rather than one big patch makes it easier fro me to >>> cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: >> I've just placed a single patch for plugsched against 2.6.21-rc7 updated >> to Linus's git tree as of an hour or two ago on sourceforge: >> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> >> This should at least enable you to get it to apply cleanly to the latest >> kernel sources. Let me know if you'd also like this as a quilt/mq >> friendly patch series? > > A quilt-friendly series would be most excellent if you could arrange it. Done: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch-series.tar.gz> Just untar this in the base directory of your Linux kernel source and Bob's your uncle. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 17:22 ` Chris Friesen 2007-04-17 0:54 ` Peter Williams 1 sibling, 1 reply; 713+ messages in thread From: Chris Friesen @ 2007-04-16 17:22 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. Scuse me if I jump in here, but doesn't the load balancer need some way to figure out a) when to run, and b) which tasks to pull and where to push them? I suppose you could abstract this into a per-scheduler API, but to me at least these are the hard parts of the load balancer... Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 17:22 ` Chris Friesen @ 2007-04-17 0:54 ` Peter Williams 2007-04-17 15:52 ` Chris Friesen 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 0:54 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: > >> To my mind scheduling and load balancing are orthogonal and keeping >> them that way simplifies things. > > Scuse me if I jump in here, but doesn't the load balancer need some way > to figure out a) when to run, and b) which tasks to pull and where to > push them? Yes but both of these are independent of the scheduler discipline in force. > > I suppose you could abstract this into a per-scheduler API, but to me at > least these are the hard parts of the load balancer... Load balancing needs to be based on the static priorities (i.e. nice or real time priority) of the runnable tasks not the dynamic priorities. If the load balancer manages to keep the weighted (according to static priority) load and distribution of priorities within the loads on the CPUs roughly equal and the scheduler does a good job of ensuring fairness, interactive responsiveness etc. for the tasks within a CPU then the result will be good system performance within the constraints set by the sys admins use of real time priorities and nice. The smpnice modifications to the load balancer were meant to give it the appropriate behaviour and what we need to fix now is the intra CPU scheduling. Even if the load balancer isn't yet perfect perfecting it can be done separately to fixing the scheduler preferably with as little interdependency as possible. 

Probably the only contribution to load balancing that the scheduler
really needs to make is calculating the average weighted load on each of
the CPUs (or run queues if there's more than one CPU per runqueue).

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
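
[ A rough sketch of what "weighted load according to static priority"
could look like. The linear `weight()` mapping below is an assumption
made purely for illustration; it is not the table used by smpnice or by
any kernel version, and the function names are hypothetical. ]

```python
def weight(nice):
    """Hypothetical weight: nice -20 -> 40, nice 0 -> 20, nice 19 -> 1."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in the range -20..19")
    return 20 - nice


def weighted_load(nice_values):
    """Sum of static-priority weights of the runnable tasks on one CPU."""
    return sum(weight(n) for n in nice_values)


def imbalance(cpus):
    """Gap between the most and least loaded CPU.

    `cpus` is a list with one list of nice values per CPU.
    """
    loads = [weighted_load(tasks) for tasks in cpus]
    return max(loads) - min(loads)


# Two nice-0 tasks on one CPU weigh the same as one nice -20 task here.
balanced = imbalance([[0, 0], [-20]])
skewed = imbalance([[0, 0, 0], [19]])
print(balanced, skewed)
```

[ Because the weights depend only on nice (static priority), the balancer
in this model never needs to know how the per-CPU scheduler orders tasks
internally, which is the separation argued for above. ]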
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:54 ` Peter Williams @ 2007-04-17 15:52 ` Chris Friesen 2007-04-17 23:50 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Chris Friesen @ 2007-04-17 15:52 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Scuse me if I jump in here, but doesn't the load balancer need some >> way to figure out a) when to run, and b) which tasks to pull and where >> to push them? > Yes but both of these are independent of the scheduler discipline in force. It is not clear to me that this is always the case, especially once you mix in things like resource groups. > If > the load balancer manages to keep the weighted (according to static > priority) load and distribution of priorities within the loads on the > CPUs roughly equal and the scheduler does a good job of ensuring > fairness, interactive responsiveness etc. for the tasks within a CPU > then the result will be good system performance within the constraints > set by the sys admins use of real time priorities and nice. Suppose I have a really high priority task running. Another very high priority task wakes up and would normally preempt the first one. However, there happens to be another cpu available. It seems like it would be a win if we moved one of those tasks to the available cpu immediately so they can both run simultaneously. This would seem to require some communication between the scheduler and the load balancer. Certainly the above design could introduce a lot of context switching. But if my goal is a scheduler that minimizes latency (even at the cost of throughput) then that's an acceptable price to pay. Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 15:52 ` Chris Friesen @ 2007-04-17 23:50 ` Peter Williams 2007-04-18 5:43 ` Chris Friesen 0 siblings, 1 reply; 713+ messages in thread From: Peter Williams @ 2007-04-17 23:50 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: >>> Scuse me if I jump in here, but doesn't the load balancer need some >>> way to figure out a) when to run, and b) which tasks to pull and >>> where to push them? > >> Yes but both of these are independent of the scheduler discipline in >> force. > > It is not clear to me that this is always the case, especially once you > mix in things like resource groups. > >> If >> the load balancer manages to keep the weighted (according to static >> priority) load and distribution of priorities within the loads on the >> CPUs roughly equal and the scheduler does a good job of ensuring >> fairness, interactive responsiveness etc. for the tasks within a CPU >> then the result will be good system performance within the constraints >> set by the sys admins use of real time priorities and nice. > > Suppose I have a really high priority task running. Another very high > priority task wakes up and would normally preempt the first one. > However, there happens to be another cpu available. It seems like it > would be a win if we moved one of those tasks to the available cpu > immediately so they can both run simultaneously. This would seem to > require some communication between the scheduler and the load balancer. Not really the load balancer can do this on its own AND the decision should be based on the STATIC priority of the task being woken. > > Certainly the above design could introduce a lot of context switching. 
> But if my goal is a scheduler that minimizes latency (even at the cost > of throughput) then that's an acceptable price to pay. It would actually probably reduce context switching as putting the woken task on the best CPU at wake up means you don't have to move it later on. The wake up code already does a little bit in this direction when it chooses which CPU to put a newly woken task on but could do more -- the only real cost would be the cost of looking at more candidate CPUs than it currently does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
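[ Editorial sketch, not part of the thread: Peter's point that the wake-up path could pick "the best CPU" by looking at more candidates, using only the static priority of the woken task, might look roughly like the C below. Everything here is hypothetical: struct cpu_state, CPU_IDLE_PRIO and select_wake_cpu() are illustrative names and do not exist in the kernel, and the convention that a larger number means a less important task is an assumption borrowed from the kernel's static_prio. ]

```c
/* Hypothetical sentinel: an idle CPU is "running" something less
 * important than any real task. */
#define CPU_IDLE_PRIO 1000

/* Hypothetical per-CPU snapshot; not a kernel structure. */
struct cpu_state {
    /* Static priority of the task currently running on this CPU.
     * Larger number = less important. CPU_IDLE_PRIO when idle. */
    int running_static_prio;
};

/*
 * Scan all candidate CPUs and return the index of the one running the
 * least important task, provided that task is less important than the
 * task being woken. Idle CPUs therefore always win. Returns -1 when no
 * CPU qualifies, in which case the caller would queue the task normally.
 */
int select_wake_cpu(const struct cpu_state *cpus, int ncpus,
                    int woken_static_prio)
{
    int best = -1;
    int best_prio = woken_static_prio;
    int i;

    for (i = 0; i < ncpus; i++) {
        if (cpus[i].running_static_prio > best_prio) {
            best = i;
            best_prio = cpus[i].running_static_prio;
        }
    }
    return best;
}
```

[ The only cost over a narrower search is the linear scan, which matches Peter's remark that the real price is "looking at more candidate CPUs". ]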
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:50 ` Peter Williams @ 2007-04-18 5:43 ` Chris Friesen 2007-04-18 13:00 ` Peter Williams 0 siblings, 1 reply; 713+ messages in thread From: Chris Friesen @ 2007-04-18 5:43 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Suppose I have a really high priority task running. Another very high >> priority task wakes up and would normally preempt the first one. >> However, there happens to be another cpu available. It seems like it >> would be a win if we moved one of those tasks to the available cpu >> immediately so they can both run simultaneously. This would seem to >> require some communication between the scheduler and the load balancer. > > > Not really the load balancer can do this on its own AND the decision > should be based on the STATIC priority of the task being woken. I guess I don't follow. How would the load balancer know that it needs to run? Running on every task wake-up seems expensive. Also, static priority isn't everything. What about the gang-scheduler concept where certain tasks must be scheduled simultaneously on different cpus? What about a resource-group scenario where you have per-cpu resource limits, so that for good latency/fairness you need to force a high priority task to migrate to another cpu once it has consumed the cpu allocation of that group on the current cpu? I can see having a generic load balancer core code, but it seems to me that the scheduler proper needs to have some way of triggering the load balancer to run, and some kind of goodness functions to indicate a) which tasks to move, and b) where to move them. Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:43 ` Chris Friesen @ 2007-04-18 13:00 ` Peter Williams 0 siblings, 0 replies; 713+ messages in thread From: Peter Williams @ 2007-04-18 13:00 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: > >>> Suppose I have a really high priority task running. Another very >>> high priority task wakes up and would normally preempt the first one. >>> However, there happens to be another cpu available. It seems like it >>> would be a win if we moved one of those tasks to the available cpu >>> immediately so they can both run simultaneously. This would seem to >>> require some communication between the scheduler and the load balancer. >> >> >> Not really the load balancer can do this on its own AND the decision >> should be based on the STATIC priority of the task being woken. > > I guess I don't follow. How would the load balancer know that it needs > to run? Running on every task wake-up seems expensive. Also, static > priority isn't everything. What about the gang-scheduler concept where > certain tasks must be scheduled simultaneously on different cpus? What > about a resource-group scenario where you have per-cpu resource limits, > so that for good latency/fairness you need to force a high priority task > to migrate to another cpu once it has consumed the cpu allocation of > that group on the current cpu? > > I can see having a generic load balancer core code, but it seems to me > that the scheduler proper needs to have some way of triggering the load > balancer to run, It doesn't have to be closely coupled with the load balancer to do this. It just needs to know where the trigger is. 
> and some kind of goodness functions to indicate a) > which tasks to move, and b) where to move them. That's the load balancer's job and even if you use dynamic priority for load balancing it still wouldn't need to be closely coupled. The load balancer would just need to know how to find a process's dynamic priority. In fact, in the current setup, the load balancer decides how much load needs to be moved based on the static load on the CPUs but uses dynamic priority (to a large degree) to decide which tasks to move. This is due more to computational efficiency considerations than any deliberate design (I suspect): tasks are stored on the runqueue in dynamic priority order, which makes looking at processes in dynamic priority order the most efficient strategy. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall @ 2007-04-16 5:16 ` Con Kolivas 2007-04-16 5:48 ` Gene Heskett 1 sibling, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-16 5:16 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:05, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > 2. Since then I've been thinking/working on a cpu scheduler design > > that takes away all the guesswork out of scheduling and gives very > > predictable, as fair as possible, cpu distribution and latency while > > preserving as solid interactivity as possible within those confines. > > yeah. I think you were right on target with this call. Yay thank goodness :) It's time to fix the damn cpu scheduler once and for all. Everyone uses this; it's no minor driver or $bigsmp or $bigram or $small_embedded_RT_hardware feature. > I've applied the > sched.c change attached at the bottom of this mail to the CFS patch, if > you dont mind. (or feel free to suggest some other text instead.) > * 2003-09-03 Interactivity tuning by Con Kolivas. > * 2004-04-02 Scheduler domains code by Nick Piggin > + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) LOL that's awful. I'd prefer something meaningful like "Work begun on replacing all interactivity tuning with a fair virtual-deadline design by Con Kolivas". While you're at it, it's worth getting rid of a few slightly pointless name changes too. Don't rename SCHED_NORMAL yet again, and don't call all your things sched_fair blah_fair __blah_fair and so on. It means that anything else is by proxy going to be considered unfair. Leave SCHED_NORMAL as is, replace the use of the word _fair with _cfs. 
I don't really care how many copyright notices you put into our already noisy bootup but it's redundant since there is no choice; we all get the same cpu scheduler. > > 1. I tried in vain some time ago to push a working extensable > > pluggable cpu scheduler framework (based on wli's work) for the linux > > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > > didn't like it) as being absolutely the wrong approach and that we > > should never do that. [...] > > i partially replied to that point to Will already, and i'd like to make > it clear again: yes, i rejected plugsched 2-3 years ago (which already > drifted away from wli's original codebase) and i would still reject it > today. No that was just me being flabbergasted by what appeared to be you posting your own plugsched. Note nowhere in the 40 iterations of rsdl->sd did I ask/suggest for plugsched. I said in my first announcement my aim was to create a scheduling policy robust enough for all situations rather than fantastic a lot of the time and awful sometimes. There are plenty of people ready to throw out arguments for plugsched now and I don't have the energy to continue that fight (I never did really). But my question still stands about this comment: > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] What exactly would be the purpose of such a module that governs nothing in particular? Since there'll be no pluggable scheduler by your admission it has no control over SCHED_NORMAL, and would require another scheduling policy for it to govern which there is no express way to use at the moment and people tend to just use the default without great effort. > First and foremost, please dont take such rejections too personally - i > had my own share of rejections (and in fact, as i mentioned it in a > previous mail, i had a fair number of complete project throwaways: > 4g:4g, in-kernel Tux, irqrate and many others). 
I know that they can > hurt and can demoralize, but if i dont like something it's my job to > tell that. Hmm? No that's not what this is about. Remember dynticks which was not originally my code but I tried to bring it up to mainline standard which I fought with for months? You came along with yet another rewrite from scratch and the flaws in the design I was working with were obvious so I instantly bowed down to that and never touched my code again. I didn't ask for credit back then, but obviously brought the requirement for a no idle tick implementation to the table. > My view about plugsched: first please take a look at the latest > plugsched code: > > http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch > > 26 files changed, 8951 insertions(+), 1495 deletions(-) > > As an experiment i've removed all the add-on schedulers (both the core > and the include files, only kept the vanilla one) from the plugsched > patch (and the makefile and kconfig complications, etc), to see the > 'infrastructure cost', and it still gave: > > 12 files changed, 1933 insertions(+), 1479 deletions(-) I do not see extra code per-se as being a bad thing. I've heard said a few times before "ever notice how when the correct solution is done it is a lot more code than the quick hack that ultimately fails?". Insert long winded discussion of perfect is the enemy of good here, _but_ I'm not arguing perfect versus good, I'm talking about solid code versus quick fix. Again, none of this comment is directed specifically at this implementation of plugsched, its code quality or intent, but using "extra code is bad" as an argument is not enough. > By your logic Mike should in fact be quite upset about this: if the > new code works out and proves to be useful then it obsoletes a whole lot > of code of him! > > [...] However at one stage I virtually begged for support with my > > attempts and help with the code. 
Dmitry Adamushko is the only person > > who actually helped me with the code in the interim, while others > > poked sticks at it. Sure the sticks helped at times but the sticks > > always seemed to have their ends kerosene doused and flaming for > > reasons I still don't get. No other help was forthcoming. > Hey, i told this to you as recently as 1 month ago as well: > > http://lkml.org/lkml/2007/3/8/54 > > "cool! I like this even more than i liked your original staircase > scheduler from 2 years ago :)" Email has an awful knack of disguising intent so I took that on face value that you did like the idea :). Above when I said "no other help was forthcoming" all I was hoping for was really simple obvious bugfixes to help me along while I was laid up in bed such as "I like what you're doing but oh your use of memset here is bogus, here is a one line patch". I wasn't specifically expecting you to fix my code; you've got truckloads of things you need to do. It just reminds me that the concept of "release early, release often" doesn't actually work in the kernel. What is far more obvious is "release code only when it's so close to perfect that noone can argue against it" since most of the work is done by one person, otherwise someone will come out with a counterpatch that is _complete_ earlier but in all possibility not as good, it's just ready sooner. *NOTE* In no way am I saying your code is not as good as mine; I would have to say exactly the opposite is true pretty much always (<sarcasm>conversely then I doubt if I dropped you in my work environment you'd do as good a job as I do</sarcasm>). At one stage wli (again at my request) put together a quick hack to check for non-preemptible regions within the kernel. From that quick hack you adopted it and turned it into that beautiful latency tracer that is the cornerstone of the -rt tree testing. 
However, there are many instances I've seen good evolving code in the linux kernel be trumped by not-as-good but already-working alternatives written from scratch with no reference to the original work. This is the NIH (not invented here) mechanism I see happening that is worth objecting to. What you may find most amusing is the very first iterations of RSDL looked _nothing_ like the mainline scheduler. There were all sorts of different structures, mechanisms, one priority array, plans to remove scheduler_tick entirely and so on. Most of those were never made for public consumption. I spent about half a dozen iterations of RSDL removing all of that and making it as close to the mainline design as possible, thus minimising the size of the patch, and to make it readily readable for most people familiar with the scheduler policy code in sched.c (all 5 of them). I should have just said bugger it and started everything from scratch with little to no reference to the original scheduler but found myself obliged to try to do things the minimal code patch size readable difference thingy that was valued in linux kernel development. I think the radically different approach would have been better in the long run. Trying to play ball I ruined it. Either way I've decided for myself, my family, my career and my sanity I'm abandoning SD. I will shelve SD and try to have fond memories of SD as an intellectual prompting exercise only > Ingo -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:16 ` Con Kolivas @ 2007-04-16 5:48 ` Gene Heskett 0 siblings, 0 replies; 713+ messages in thread From: Gene Heskett @ 2007-04-16 5:48 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007, Con Kolivas wrote: And I snipped, Sorry fellas. Con's original submission was, to me, quite an improvement. But I have to say it, and no denigration of your efforts is intended Con, but you did 'pull the trigger' and get this thing rolling by scratching the itch & drawing attention to an ugly lack of user interactivity that had crept into the 2.6 family. So from me to Con, a tip of the hat, and a deep bow in your direction, thank you. Now, you have done what you aimed to do, so please get well. I've now been through most of an amanda session using Ingo's "CFS" and I have to say that it is another improvement over your 0.40 that is just as obvious as your first patch was against the stock scheduler. No other scheduler yet has allowed the full utilization of the cpu, and maintained user interactivity as well as this one has, my cpu is running about 5 degrees F hotter just from this effect alone. gzip, if the rest of the system is in between tasks, is consistently showing around 95%, but let anything else stick up its hand, like procmail etc, and gzip now dutifully steps aside, dropping into the 40% range until procmail and spamd are done, at which point there is no rest for the wicked and the cpu never gets a chance to cool. There was, just now, a pause of about 2 seconds, while amanda moved a tarball from the holding disk area on /dev/hda to the vtapes disk on /dev/hdd, so that would have been an I/O bound situation. 
This one Ingo, even without any other patches and I think I did see one go by in this thread which I didn't apply, is a definite keeper. Sweet even. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) A word to the wise is enough. -- Miguel de Cervantes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (9 preceding siblings ...) 2007-04-15 3:27 ` [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Con Kolivas @ 2007-04-15 12:29 ` Esben Nielsen 2007-04-15 13:04 ` Ingo Molnar 2007-04-15 22:49 ` Ismail Dönmez ` (3 subsequent siblings) 14 siblings, 1 reply; 713+ messages in thread From: Esben Nielsen @ 2007-04-15 12:29 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [...] I took a brief look at it. Have you tested priority inheritance? As far as I can see rt_mutex_setprio doesn't have much effect on SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task change scheduler class when boosted in rt_mutex_setprio(). Esben ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 13:04 ` Ingo Molnar 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-15 13:04 UTC (permalink / raw) To: Esben Nielsen Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > I took a brief look at it. Have you tested priority inheritance? yeah, you are right, it's broken at the moment, i'll fix it. But the good news is that i think PI could become cleaner via scheduling classes. > As far as I can see rt_mutex_setprio doesn't have much effect on > SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task > change scheduler class when boosted in rt_mutex_setprio(). i think via scheduling classes we dont have to do the p->policy and p->prio based gymnastics anymore, we can just have a clean look at p->sched_class and stack the original scheduling class into p->real_sched_class. It would probably also make sense to 'privatize' p->prio into the scheduling class. That way PI would be a pure property of sched_rt, and the PI scheduler would be driven purely by p->rt_priority, not by p->prio. That way all the normal_prio() kind of complications and interactions with SCHED_OTHER/SCHED_FAIR would be eliminated as well. What do you think? Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
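[ Editorial sketch, not part of the thread: Ingo's suggestion of stacking the original scheduling class into p->real_sched_class while a task is PI-boosted could be pictured roughly as below. This is a user-space illustration under stated assumptions, not code from the CFS patch: struct task, pi_boost() and pi_unboost() are hypothetical names, real_rt_priority is an invented helper field, and only sched_class, real_sched_class and rt_priority come from the mail itself. As in the kernel, a larger rt_priority is taken to mean a more important task. ]

```c
#include <stddef.h>

/* Minimal stand-in for a scheduling class descriptor. */
struct sched_class {
    const char *name;
};

const struct sched_class rt_sched_class = { "rt" };
const struct sched_class fair_sched_class = { "fair" };

/* Hypothetical, heavily reduced task structure. */
struct task {
    const struct sched_class *sched_class;      /* class in effect now */
    const struct sched_class *real_sched_class; /* saved class while boosted, else NULL */
    int rt_priority;                            /* used only by the rt class */
    int real_rt_priority;                       /* saved value while boosted (invented field) */
};

/* Boost p into the rt class at (at least) the donor's rt_priority.
 * The original class is stacked away once, on the first boost. */
void pi_boost(struct task *p, int donor_rt_priority)
{
    if (p->real_sched_class == NULL) {
        p->real_sched_class = p->sched_class;
        p->real_rt_priority = p->rt_priority;
    }
    p->sched_class = &rt_sched_class;
    if (donor_rt_priority > p->rt_priority)
        p->rt_priority = donor_rt_priority;
}

/* Drop the boost and restore whatever class the task really belongs to. */
void pi_unboost(struct task *p)
{
    if (p->real_sched_class == NULL)
        return; /* not boosted */
    p->sched_class = p->real_sched_class;
    p->rt_priority = p->real_rt_priority;
    p->real_sched_class = NULL;
}
```

[ In this picture the boost never touches p->prio or p->policy, which is the simplification Ingo is after; a real implementation would also have to requeue the task and handle chains of waiters. ]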
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:04 ` Ingo Molnar @ 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 0 replies; 713+ messages in thread From: Esben Nielsen @ 2007-04-16 7:16 UTC (permalink / raw) To: Ingo Molnar Cc: Esben Nielsen, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Ingo Molnar wrote: > > * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > >> I took a brief look at it. Have you tested priority inheritance? > > yeah, you are right, it's broken at the moment, i'll fix it. But the > good news is that i think PI could become cleaner via scheduling > classes. > >> As far as I can see rt_mutex_setprio doesn't have much effect on >> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task >> change scheduler class when boosted in rt_mutex_setprio(). > > i think via scheduling classes we dont have to do the p->policy and > p->prio based gymnastics anymore, we can just have a clean look at > p->sched_class and stack the original scheduling class into > p->real_sched_class. It would probably also make sense to 'privatize' > p->prio into the scheduling class. That way PI would be a pure property > of sched_rt, and the PI scheduler would be driven purely by > p->rt_priority, not by p->prio. That way all the normal_prio() kind of > complications and interactions with SCHED_OTHER/SCHED_FAIR would be > eliminated as well. What do you think? > Now I have not read your patch into detail. But agree it would be nice to have it more "OO" and remove cross references between schedulers. But first one should consider wether PI between SCHED_FAIR tasks or not is usefull or not. Does PI among dynamic priorities make sense at all? I think it does: On heavy loaded systems where a nice 19 might not get the CPU for very long, a nice -20 task can be priority inverted for a very long time. 
But I see no need for it to take the dynamic part of the effective priorities into account. The current/old solution of mapping the static nice values into a global priority index which can incorporate the two scheduler classes is probably good enough - it just has to be "switched on" again :-)

But what about other scheduler classes which some people want to add in the future? What about having a "cleaner design"? My thought was to generalize the concept of 'priority' to be an object (a struct prio) to be interpreted with help from a scheduler class instead of a globally interpreted integer.

  int compare_prio(struct prio *a, struct prio *b)
  {
          if (a->sched_class->class_prio < b->sched_class->class_prio)
                  return -1;
          if (a->sched_class->class_prio > b->sched_class->class_prio)
                  return +1;
          return a->sched_class->compare_prio(a, b);
  }

Problem 1: Performance.

Problem 2: Operations on a plist with these generalized priorities are not bounded because the number of different priorities is not bounded.

Problem 2 could be solved by using a combined plist (for rt priorities) and rbtree (for fair priorities) - making operations logarithmic just as the fair scheduler itself. But that would take more memory for every rtmutex.

I conclude that is too complicated and go on to the obvious idea: Use a global priority index where each scheduler class gets its own range (rt: 0-99, fair: 100-139 :-). Let the scheduler class have a function returning it instead of reading it directly from task_struct such that new scheduler classes can return their own numbers.

Esben

> Ingo > ^ permalink raw reply [flat|nested] 713+ messages in thread
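[ Editorial sketch, not part of the thread: the two ideas in Esben's mail, a class-aware compare_prio() and a per-class function returning a global priority index, can be put side by side in a small self-contained C example. The assumptions are labelled in the comments: the level field, the global_index callback, global_prio() and the two class instances are hypothetical, the ranges (rt: 0-99, fair: 100-139) are the ones Esben names, and the mappings 99 - rt_priority and 120 + nice follow the usual kernel convention that a smaller index is more important. ]

```c
struct prio;

/* Hypothetical scheduling-class descriptor, following the mail. */
struct sched_class {
    int class_prio; /* rank of the class; smaller sorts first */
    int (*compare_prio)(const struct prio *a, const struct prio *b);
    int (*global_index)(const struct prio *p); /* assumption: per-class mapping */
};

struct prio {
    const struct sched_class *sched_class;
    int level; /* class-private: rt_priority (1..99) or nice (-20..19) */
};

/* Three-way integer compare: -1, 0 or +1. */
static int cmp_int(int x, int y)
{
    return (x > y) - (x < y);
}

/* rt: a larger rt_priority is more important, so it sorts first. */
int rt_compare(const struct prio *a, const struct prio *b)
{
    return cmp_int(b->level, a->level);
}

int rt_index(const struct prio *p)
{
    return 99 - p->level; /* lands in 0..98 for rt_priority 1..99 */
}

/* fair: a smaller nice value is more important, so it sorts first. */
int fair_compare(const struct prio *a, const struct prio *b)
{
    return cmp_int(a->level, b->level);
}

int fair_index(const struct prio *p)
{
    return 120 + p->level; /* lands in 100..139 for nice -20..19 */
}

const struct sched_class rt_sched_class = { 0, rt_compare, rt_index };
const struct sched_class fair_sched_class = { 1, fair_compare, fair_index };

/* Class rank decides first; within one class the class itself decides. */
int compare_prio(const struct prio *a, const struct prio *b)
{
    if (a->sched_class->class_prio < b->sched_class->class_prio)
        return -1;
    if (a->sched_class->class_prio > b->sched_class->class_prio)
        return +1;
    return a->sched_class->compare_prio(a, b);
}

/* The "obvious idea": ask the class for a number in its own range. */
int global_prio(const struct prio *p)
{
    return p->sched_class->global_index(p);
}
```

[ With the integer index a plist stays bounded at 140 levels, which is the property Esben's Problem 2 says the generalized comparison loses. ]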
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (10 preceding siblings ...) 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 22:00 ` Andi Kleen ` (2 subsequent siblings) 14 siblings, 2 replies; 713+ messages in thread From: Ismail Dönmez @ 2007-04-15 22:49 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 573 bytes --] Hi, On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Tested this on top of Linus' GIT tree but the system gets very unresponsive during high disk i/o using ext3 as filesystem but even writing a 300mb file to a usb disk (iPod actually) has the same affect. Regards, ismail [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-15 23:23 ` Arjan van de Ven 2007-04-15 23:33 ` Ismail Dönmez 2007-04-16 11:58 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Arjan van de Ven @ 2007-04-15 23:23 UTC (permalink / raw) To: Ismail Dönmez Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > Hi, > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > Tested this on top of Linus' GIT tree but the system gets very unresponsive > during high disk i/o using ext3 as filesystem but even writing a 300mb file > to a usb disk (iPod actually) has the same affect. just to make sure; this exact same workload but with the stock scheduler does not have this effect? if so, then it could well be that the scheduler is too fair for it's own good (being really fair inevitably ends up not batching as much as one should, and batching is needed to get any kind of decent performance out of disks nowadays) -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-15 23:33 ` Ismail Dönmez 0 siblings, 0 replies; 713+ messages in thread From: Ismail Dönmez @ 2007-04-15 23:33 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Monday 16 April 2007 02:23:08 Arjan van de Ven wrote: > On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > > Hi, > > > > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > > [CFS] > > > > > > i'm pleased to announce the first release of the "Modular Scheduler > > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same affect. > > just to make sure; this exact same workload but with the stock scheduler > does not have this effect? > > if so, then it could well be that the scheduler is too fair for it's own > good (being really fair inevitably ends up not batching as much as one > should, and batching is needed to get any kind of decent performance out > of disks nowadays) Tried with make install in kdepim (which made system sluggish with CFS) and the system is just fine (using CFQ). Regards, ismail ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 12:02 ` Ismail Dönmez 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-16 11:58 UTC (permalink / raw) To: Ismail Dönmez Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ismail Dönmez <ismail@pardus.org.tr> wrote: > Tested this on top of Linus' GIT tree but the system gets very > unresponsive during high disk i/o using ext3 as filesystem but even > writing a 300mb file to a usb disk (iPod actually) has the same > affect. hm. Is this an SMP system+kernel by any chance? Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:58 ` Ingo Molnar @ 2007-04-16 12:02 ` Ismail Dönmez 0 siblings, 0 replies; 713+ messages in thread From: Ismail Dönmez @ 2007-04-16 12:02 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 14:58:54 Ingo Molnar wrote: > * Ismail Dönmez <ismail@pardus.org.tr> wrote: > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same > > affect. > > hm. Is this an SMP system+kernel by any chance? Nope the kernel and the system is UP. Regards, ismail ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (11 preceding siblings ...) 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-16 22:00 ` Andi Kleen 2007-04-16 21:05 ` Ingo Molnar 2007-04-17 7:56 ` Andy Whitcroft 2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse 14 siblings, 1 reply; 713+ messages in thread From: Andi Kleen @ 2007-04-16 22:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> writes: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch I would suggest to drop the tsc.c change. The "small errors" can be really large on some systems and you can also see large backward jumps. I have a proper (but complicated) solution pending in ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/sched-clock-share BTW with all this CPU time measurement it would be really nice to report it to the user too. It seems a bit bizarre that the scheduler keeps track of ns, but top only knows jiffies with large sampling errors. -Andi ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 22:00 ` Andi Kleen @ 2007-04-16 21:05 ` Ingo Molnar 2007-04-16 21:21 ` Andi Kleen 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-16 21:05 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel * Andi Kleen <andi@firstfloor.org> wrote: > > i'm pleased to announce the first release of the "Modular Scheduler > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > I would suggest to drop the tsc.c change. The "small errors" can be > really large on some systems and you can also see large backward > jumps. actually, i designed the CFS code assuming a per-CPU TSC (with no global synchronization), not assuming any globally synchronized TSC. In fact i wrote it on such systems: a CoreDuo2 box that stops the TSC in C3 and whose cores have wildly different TSC values, and a dual-core Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() change for now. > BTW with all this CPU time measurement it would be really nice to > report it to the user too. It seems a bit bizarre that the scheduler > keeps track of ns, but top only knows jiffies with large sampling > errors. yeah - i'll fix that too if someone doesnt beat me at it. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 21:05 ` Ingo Molnar @ 2007-04-16 21:21 ` Andi Kleen 0 siblings, 0 replies; 713+ messages in thread From: Andi Kleen @ 2007-04-16 21:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andi Kleen, linux-kernel > actually, i designed the CFS code assuming a per-CPU TSC (with no global > synchronization), not assuming any globally sync TSC. In fact i wrote it That already worked in the old scheduler (just in a hackish way) > on such systems: a CoreDuo2 box that has stops the TSC in C3 and the > different cores have wildly different TSC values and a dual-core > Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() > change for now. The problem is not CPU synchronized TSC, but TSC with varying frequency on a single CPU like on the A64. The old implementation can lose really badly on that because it mixes measurements at different frequencies together without individual scaling. The error gets worse the longer the system runs. >> BTW with all this CPU time measurement it would be really nice to >> report it to the user too. It seems a bit bizarre that the scheduler >> keeps track of ns, but top only knows jiffies with large sampling >> errors. > yeah - i'll fix that too if someone doesnt beat me at it. I've been pondering for some time if doubling the NMI watchdog as a ring 0 counter for this is worth it. So far I'm still undecided (and it's moot now since it's disabled by default :/) -Andi ^ permalink raw reply [flat|nested] 713+ messages in thread
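Andi's point about a TSC whose frequency varies on a single CPU can be illustrated with a toy model. This is only a sketch: it is not the kernel's sched_clock() code, and the function names and the sample trace are made up for illustration. It compares converting each interval's cycle count at the frequency that was in effect during that interval against lumping all cycles together and converting them at one assumed frequency.

```python
NSEC_PER_SEC = 1_000_000_000


def elapsed_ns_scaled(intervals):
    """Convert each (cycles, freq_hz) interval using its own frequency."""
    return sum(cycles * NSEC_PER_SEC // freq_hz for cycles, freq_hz in intervals)


def elapsed_ns_unscaled(intervals, assumed_freq_hz):
    """Add up all cycles first, then convert with one assumed frequency."""
    total_cycles = sum(cycles for cycles, _ in intervals)
    return total_cycles * NSEC_PER_SEC // assumed_freq_hz


# Hypothetical trace: one second at 2 GHz followed by one second at 1 GHz.
trace = [(2_000_000_000, 2_000_000_000), (1_000_000_000, 1_000_000_000)]

scaled = elapsed_ns_scaled(trace)
unscaled = elapsed_ns_unscaled(trace, 1_000_000_000)
error_ns = unscaled - scaled

print("scaled per interval:", scaled, "ns")
print("single frequency:   ", unscaled, "ns")
print("accumulated error:  ", error_ns, "ns")
```

In this model the error accumulates with every frequency change, which matches the remark that it gets worse the longer the system runs.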
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (12 preceding siblings ...) 2007-04-16 22:00 ` Andi Kleen @ 2007-04-17 7:56 ` Andy Whitcroft 2007-04-17 9:32 ` Nick Piggin 2007-04-18 10:22 ` Ingo Molnar 2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse 14 siblings, 2 replies; 713+ messages in thread From: Andy Whitcroft @ 2007-04-17 7:56 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. > > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. 
> > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! > > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. > > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. > > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. 
> > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. > > the core scheduler got smaller by more than 700 lines: > > kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------ > 1 file changed, 372 insertions(+), 1082 deletions(-) > > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS > > (this is mainly due to the benefit of getting rid of the expired array > and its data structure overhead.) > > thanks go to Thomas Gleixner and Arjan van de Ven for review of this > patchset. > > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, Pushed this through the test.kernel.org and nothing new blew up. Notably the kernbench figures are within expectations even on the bigger numa systems, commonly badly affected by balancing problems in the schedular. I see there is a second one out, I'll push that one through too. -apw ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Andy Whitcroft @ 2007-04-17 9:32 ` Nick Piggin 2007-04-17 9:59 ` Ingo Molnar 2007-04-18 10:22 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-17 9:32 UTC (permalink / raw) To: Andy Whitcroft Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote: > > > > as usual, any sort of feedback, bugreports, fixes and suggestions are > > more than welcome, > > Pushed this through the test.kernel.org and nothing new blew up. > Notably the kernbench figures are within expectations even on the bigger > numa systems, commonly badly affected by balancing problems in the > schedular. > > I see there is a second one out, I'll push that one through too. Well I just sent some feedback on cfs-v2, but realised it went off-list, so I'll resend here because others may find it interesting too. Sorry about jamming it in here, but it is relevant to performance... Anyway, roughly in the context of good cfs-v2 interactivity, I wrote: Well I'm not too surprised. I am disappointed that it uses such small timeslices (or whatever they are called) as the default. Using small timeslices is actually a pretty easy way to ensure everything stays smooth even under load, but is bad for efficiency. Sure you can say you'll have desktop and server tunings, but... With nicksched I'm testing a default timeslice of *300ms* even on the desktop, whereas Ingo's seems to be effectively 3ms :P So if you compare default tunings, it isn't exactly fair!
Kbuild times on a 2x Xeon:

2.6.21-rc7
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

2.6.21-rc7-cfs-v2
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU

2.6.21-rc7-nicksched
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU

^ permalink raw reply [flat|nested] 713+ messages in thread
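For readers who want to put a number on the Kbuild figures above, here is a minimal sketch that averages the six runs per kernel and reports user and elapsed time relative to mainline. The timings are copied from the message; the helper names (summarize, pct_over_mainline) are made up for this example, and a plain mean over six runs is an assumption about how to compare them, not something the thread prescribes.

```python
import re
from statistics import mean

# Raw `time` output copied from the message above, one run per line.
RUNS = {
    "2.6.21-rc7": """
508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU
""",
    "2.6.21-rc7-cfs-v2": """
534.80user 30.92system 2:23.64elapsed 393%CPU
534.75user 31.01system 2:23.70elapsed 393%CPU
534.66user 31.07system 2:23.76elapsed 393%CPU
534.56user 30.91system 2:23.76elapsed 393%CPU
534.66user 31.07system 2:23.67elapsed 393%CPU
535.43user 30.62system 2:23.72elapsed 393%CPU
""",
    "2.6.21-rc7-nicksched": """
505.60user 32.31system 2:17.91elapsed 390%CPU
506.55user 32.42system 2:17.66elapsed 391%CPU
506.41user 32.30system 2:17.85elapsed 390%CPU
506.48user 32.36system 2:17.77elapsed 391%CPU
506.10user 32.40system 2:17.81elapsed 390%CPU
506.69user 32.16system 2:17.78elapsed 391%CPU
""",
}

LINE_RE = re.compile(r"([\d.]+)user ([\d.]+)system (\d+):([\d.]+)elapsed (\d+)%CPU")


def summarize(text):
    """Return (mean user seconds, mean elapsed seconds) for one kernel."""
    rows = LINE_RE.findall(text)
    user = mean(float(r[0]) for r in rows)
    elapsed = mean(int(r[2]) * 60 + float(r[3]) for r in rows)
    return user, elapsed


summary = {name: summarize(text) for name, text in RUNS.items()}
base_user, base_elapsed = summary["2.6.21-rc7"]


def pct_over_mainline(name):
    """Percentage change in (user, elapsed) time relative to mainline."""
    user, elapsed = summary[name]
    return 100 * (user / base_user - 1), 100 * (elapsed / base_elapsed - 1)


for name in RUNS:
    d_user, d_elapsed = pct_over_mainline(name)
    print(f"{name:22s} user {d_user:+.2f}%  elapsed {d_elapsed:+.2f}%")
```

On these numbers cfs-v2 at its default granularity comes out roughly 4% slower than mainline in elapsed time, while nicksched is within a fraction of a percent.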
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:32 ` Nick Piggin @ 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-17 9:59 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > 2.6.21-rc7-cfs-v2 > 534.80user 30.92system 2:23.64elapsed 393%CPU > 534.75user 31.01system 2:23.70elapsed 393%CPU > 534.66user 31.07system 2:23.76elapsed 393%CPU > 534.56user 30.91system 2:23.76elapsed 393%CPU > 534.66user 31.07system 2:23.67elapsed 393%CPU > 535.43user 30.62system 2:23.72elapsed 393%CPU Thanks for testing this! Could you please try this also with: echo 100000000 > /proc/sys/kernel/sched_granularity on the same system, so that we can get a complete set of numbers? Just to make sure that lowering the preemption frequency indeed has the expected result of moving kernbench numbers back to mainline levels. (if not then that would indicate some CFS buglet) could you maybe even try a more extreme setting of: echo 500000000 > /proc/sys/kernel/sched_granularity for kicks? This would allow us to see how much kernbench we lose due to preemption granularity. Thanks! Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar @ 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 1 sibling, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-17 11:11 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > Thanks for testing this! Could you please try this also with: > > echo 100000000 > /proc/sys/kernel/sched_granularity > > on the same system, so that we can get a complete set of numbers? Just > to make sure that lowering the preemption frequency indeed has the > expected result of moving kernbench numbers back to mainline levels. (if > not then that would indicate some CFS buglet) > > could you maybe even try a more extreme setting of: > > echo 500000000 > /proc/sys/kernel/sched_granularity > > for kicks? This would allow us to see how much kernbench we lose due to > preemption granularity. Thanks! Yeah but I just powered down the test-box, so I'll have to get onto that tomorrow. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin @ 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 2 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-18 8:55 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan

On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > 2.6.21-rc7-cfs-v2
> > 534.80user 30.92system 2:23.64elapsed 393%CPU
> > 534.75user 31.01system 2:23.70elapsed 393%CPU
> > 534.66user 31.07system 2:23.76elapsed 393%CPU
> > 534.56user 30.91system 2:23.76elapsed 393%CPU
> > 534.66user 31.07system 2:23.67elapsed 393%CPU
> > 535.43user 30.62system 2:23.72elapsed 393%CPU
>
> Thanks for testing this! Could you please try this also with:
>
>     echo 100000000 > /proc/sys/kernel/sched_granularity

507.68user 31.87system 2:18.05elapsed 390%CPU
507.99user 31.93system 2:18.09elapsed 390%CPU
507.46user 31.78system 2:18.03elapsed 390%CPU
507.68user 31.93system 2:18.11elapsed 390%CPU
507.63user 31.98system 2:18.01elapsed 390%CPU
507.83user 31.94system 2:18.28elapsed 390%CPU

> could you maybe even try a more extreme setting of:
>
>     echo 500000000 > /proc/sys/kernel/sched_granularity

504.87user 32.13system 2:18.03elapsed 389%CPU
505.94user 32.29system 2:17.87elapsed 390%CPU
506.10user 31.90system 2:17.96elapsed 389%CPU
505.02user 32.02system 2:17.96elapsed 389%CPU
506.69user 31.96system 2:17.82elapsed 390%CPU
505.70user 31.84system 2:17.90elapsed 389%CPU

Again, for comparison 2.6.21-rc7 mainline:

508.87user 32.47system 2:17.82elapsed 392%CPU
509.05user 32.25system 2:17.84elapsed 392%CPU
508.75user 32.26system 2:17.83elapsed 392%CPU
508.63user 32.17system 2:17.88elapsed 392%CPU
509.01user 32.26system 2:17.90elapsed 392%CPU
509.08user 32.20system 2:17.95elapsed 392%CPU

So looking at elapsed time, a granularity of 100ms is just behind the mainline score. However it is using slightly less user time and slightly more idle time, which indicates that balancing might have got a bit less aggressive.

But anyway, it conclusively shows the efficiency impact of such tiny timeslices.

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin @ 2007-04-18 9:33 ` Con Kolivas 2007-04-18 12:14 ` Nick Piggin 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-18 9:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! 
Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > 507.46user 31.78system 2:18.03elapsed 390%CPU > 507.68user 31.93system 2:18.11elapsed 390%CPU > 507.63user 31.98system 2:18.01elapsed 390%CPU > 507.83user 31.94system 2:18.28elapsed 390%CPU > > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > > 504.87user 32.13system 2:18.03elapsed 389%CPU > 505.94user 32.29system 2:17.87elapsed 390%CPU > 506.10user 31.90system 2:17.96elapsed 389%CPU > 505.02user 32.02system 2:17.96elapsed 389%CPU > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > > > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU > 508.75user 32.26system 2:17.83elapsed 392%CPU > 508.63user 32.17system 2:17.88elapsed 392%CPU > 509.01user 32.26system 2:17.90elapsed 392%CPU > 509.08user 32.20system 2:17.95elapsed 392%CPU > > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. See test.kernel.org for how (the now defunct) SD was performing on kernbench. It had low latency _and_ equivalent throughput to mainline. Set the standard appropriately on both counts please. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 12:14 ` Nick Piggin 2007-04-18 12:33 ` Con Kolivas 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-18 12:14 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > Again, for comparison 2.6.21-rc7 mainline: > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > See test.kernel.org for how (the now defunct) SD was performing on kernbench. > It had low latency _and_ equivalent throughput to mainline. Set the standard > appropriately on both counts please. I can give it a run. Got an updated patch against -rc7? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:14 ` Nick Piggin @ 2007-04-18 12:33 ` Con Kolivas 2007-04-18 21:49 ` Con Kolivas 0 siblings, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-18 12:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > See test.kernel.org for how (the now defunct) SD was performing on > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > Set the standard appropriately on both counts please. > > I can give it a run. Got an updated patch against -rc7? I said I wasn't pursuing it but since you're offering, the rc6 patch should apply ok. http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:33 ` Con Kolivas @ 2007-04-18 21:49 ` Con Kolivas 0 siblings, 0 replies; 713+ messages in thread From: Con Kolivas @ 2007-04-18 21:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:33, Con Kolivas wrote: > On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > > mainline score. However it is using slightly less user time and > > > > slightly more idle time, which indicates that balancing might have > > > > got a bit less aggressive. > > > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > > timeslices. > > > > > > See test.kernel.org for how (the now defunct) SD was performing on > > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > > Set the standard appropriately on both counts please. > > > > I can give it a run. Got an updated patch against -rc7? > > I said I wasn't pursuing it but since you're offering, the rc6 patch should > apply ok. > > http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch Oh and if you go to the effort of trying you may as well try the timeslice tweak to see what effect it has on SD as well. 
/proc/sys/kernel/rr_interval 100 is the highest. -- -ck ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 9:53 ` Ingo Molnar 2007-04-18 12:13 ` Nick Piggin 1 sibling, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 9:53 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU thanks for testing this! > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is not unexpected when going to really frequent preemption. Clearly, the default preemption granularity needs to be tuned up. I think you said you measured ~3msec average preemption rate per CPU? That would suggest the average cache-trashing cost was 120 usecs per every 3 msec window. 
Taking that as a ballpark figure, to get the difference back into the noise range we'd have to either use ~5 msec:

    echo 5000000 > /proc/sys/kernel/sched_granularity

or 15 msec:

    echo 15000000 > /proc/sys/kernel/sched_granularity

(depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i correctly understood your 3msec value. I'd have to know your kernbench workload's approximate 'steady state' context-switch rate to do a more accurate calculation.)

Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
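The back-of-the-envelope estimate in the message above can be written out as a small calculation. This is a sketch under a stated simplification: it assumes a fixed cache-refill cost per preemption that does not depend on the granularity, which is only a rough model. The 4% slowdown and the 3 msec interval come from the message; the function names are made up for illustration.

```python
def cost_per_preemption_us(slowdown, interval_ms):
    """Time lost per preemption if `slowdown` of every interval is overhead."""
    return slowdown * interval_ms * 1000.0


def overhead_fraction(cost_us, interval_ms):
    """Fraction of each interval lost, assuming a fixed cost per preemption."""
    return cost_us / (interval_ms * 1000.0)


# 4% drop at an assumed ~3 msec preemption interval.
cost_us = cost_per_preemption_us(0.04, 3.0)
print(f"estimated cost per preemption: {cost_us:.0f} usecs")

# Same fixed cost spread over longer intervals.
for interval_ms in (3.0, 5.0, 15.0):
    pct = 100 * overhead_fraction(cost_us, interval_ms)
    print(f"{interval_ms:5.1f} msec interval -> {pct:.1f}% overhead")
```

Under this model, stretching the interval by a factor of five cuts the overhead by the same factor, which is the reasoning behind the "5x" in the suggested settings.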
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:53 ` Ingo Molnar @ 2007-04-18 12:13 ` Nick Piggin 2007-04-18 12:49 ` Con Kolivas 0 siblings, 1 reply; 713+ messages in thread From: Nick Piggin @ 2007-04-18 12:13 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > not unexpected when going to really frequent preemption. Clearly, the > default preemption granularity needs to be tuned up. > > I think you said you measured ~3msec average preemption rate per CPU? This was just looking at ctxsw numbers from running 2 cpu hogs on the same runqueue. > That would suggest the average cache-trashing cost was 120 usecs per > every 3 msec window. Taking that as a ballpark figure, to get the > difference back into the noise range we'd have to either use ~5 msec: > > echo 5000000 > /proc/sys/kernel/sched_granularity > > or 15 msec: > > echo 15000000 > /proc/sys/kernel/sched_granularity > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > correctly understood your 3msec value. I'd have to know your kernbench > workload's approximate 'steady state' context-switch rate to do a more > accurate calculation.) The kernel compile (make -j8 on 4 thread system) is doing 1800 total context switches per second (450/s per runqueue) for cfs, and 670 for mainline. 
Going up to 20ms granularity for cfs brings the context switch numbers to similar levels, but user time is still a percent or so higher. I'd be more worried about compute heavy threads which naturally don't do much context switching.

Some other numbers on the same system

Hackbench:            2.6.21-rc7   cfs-v2 1ms[*]   nicksched
 10 groups: Time:        1.332         0.743         0.607
 20 groups: Time:        1.197         1.100         1.241
 30 groups: Time:        1.754         2.376         1.834
 40 groups: Time:        3.451         2.227         2.503
 50 groups: Time:        3.726         3.399         3.220
 60 groups: Time:        3.548         4.567         3.668
 70 groups: Time:        4.206         4.905         4.314
 80 groups: Time:        4.551         6.324         4.879
 90 groups: Time:        7.904         6.962         5.335
100 groups: Time:        7.293         7.799         5.857
110 groups: Time:       10.595         8.728         6.517
120 groups: Time:        7.543         9.304         7.082
130 groups: Time:        8.269        10.639         8.007
140 groups: Time:       11.867         8.250         8.302
150 groups: Time:       14.852         8.656         8.662
160 groups: Time:        9.648         9.313         9.541

Mainline seems pretty inconsistent here.

lmbench 0K ctxsw latency bound to CPU0:
tasks
  2                      2.59          3.42          2.50
  4                      3.26          3.54          3.09
  8                      3.01          3.64          3.22
 16                      3.00          3.66          3.50
 32                      2.99          3.70          3.49
 64                      3.09          4.17          3.50
128                      4.80          5.58          4.74
256                      5.79          6.37          5.76

cfs is noticeably disadvantaged.

[*] 500ms didn't make much difference in either test.

^ permalink raw reply [flat|nested] 713+ messages in thread
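The cfs context-switch figure quoted in this message (1800 per second in total on a system with 4 runqueues) can be turned into an average interval between switches, which is the quantity the earlier "~3msec" question was about. This is a rough sketch that assumes switches are spread evenly across runqueues; the function names are made up for illustration. The mainline figure of 670 is left out because the message does not say whether it is a total or a per-runqueue rate.

```python
def per_runqueue_rate(total_per_sec, runqueues):
    """Context switches per second on each runqueue, assuming an even spread."""
    return total_per_sec / runqueues


def mean_interval_ms(rate_per_sec):
    """Average time between context switches on one runqueue, in msec."""
    return 1000.0 / rate_per_sec


cfs_rate = per_runqueue_rate(1800, 4)
cfs_interval_ms = mean_interval_ms(cfs_rate)

print(f"cfs: {cfs_rate:.0f} switches/s per runqueue")
print(f"cfs: one switch every {cfs_interval_ms:.2f} msec on average")
```

Note that this counts all context switches during the compile, voluntary ones included, so it is only an upper bound on how often cfs actually preempts.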
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:13 ` Nick Piggin @ 2007-04-18 12:49 ` Con Kolivas 2007-04-19 3:28 ` Nick Piggin 0 siblings, 1 reply; 713+ messages in thread From: Con Kolivas @ 2007-04-18 12:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > > not unexpected when going to really frequent preemption. Clearly, the > > default preemption granularity needs to be tuned up. > > > > I think you said you measured ~3msec average preemption rate per CPU? > > This was just looking at ctxsw numbers from running 2 cpu hogs on the > same runqueue. > > > That would suggest the average cache-trashing cost was 120 usecs per > > every 3 msec window. Taking that as a ballpark figure, to get the > > difference back into the noise range we'd have to either use ~5 msec: > > > > echo 5000000 > /proc/sys/kernel/sched_granularity > > > > or 15 msec: > > > > echo 15000000 > /proc/sys/kernel/sched_granularity > > > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > > correctly understood your 3msec value. I'd have to know your kernbench > > workload's approximate 'steady state' context-switch rate to do a more > > accurate calculation.) 
> > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > context switches per second (450/s per runqueue) for cfs, and 670 > for mainline. Going up to 20ms granularity for cfs brings the context > switch numbers similar, but user time is still a % or so higher. I'd > be more worried about compute heavy threads which naturally don't do > much context switching. While kernel compiles are nice and easy to do I've seen enough criticism of them in the past to wonder about their usefulness as a standard benchmark on their own. > > Some other numbers on the same system > Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched > 10 groups: Time: 1.332 0.743 0.607 > 20 groups: Time: 1.197 1.100 1.241 > 30 groups: Time: 1.754 2.376 1.834 > 40 groups: Time: 3.451 2.227 2.503 > 50 groups: Time: 3.726 3.399 3.220 > 60 groups: Time: 3.548 4.567 3.668 > 70 groups: Time: 4.206 4.905 4.314 > 80 groups: Time: 4.551 6.324 4.879 > 90 groups: Time: 7.904 6.962 5.335 > 100 groups: Time: 7.293 7.799 5.857 > 110 groups: Time: 10.595 8.728 6.517 > 120 groups: Time: 7.543 9.304 7.082 > 130 groups: Time: 8.269 10.639 8.007 > 140 groups: Time: 11.867 8.250 8.302 > 150 groups: Time: 14.852 8.656 8.662 > 160 groups: Time: 9.648 9.313 9.541 Hackbench even more so. A prolonged discussion with Rusty Russell on this issue he suggested hackbench was more a pass/fail benchmark to ensure there was no starvation scenario that never ended, and very little value should be placed on the actual results returned from it. Wli's concerns regarding some sort of standard framework for a battery of accepted meaningful benchmarks comes to mind as important rather than ones that highlight one over the other. So while interesting for their own endpoints, I certainly wouldn't put either benchmark as some sort of yardstick for a "winner". 
Note I'm not saying that we shouldn't be looking at them per se, but since
the whole drive for a new scheduler is trying to be more objective we need to
start expanding the range of benchmarks. Even though I don't feel the need to
have SD in the "race" I guess it stands for more data to compare what is
possible/where as well.

> Mainline seems pretty inconsistent here.
>
> lmbench 0K ctxsw latency bound to CPU0:
> tasks
> 2      2.59   3.42   2.50
> 4      3.26   3.54   3.09
> 8      3.01   3.64   3.22
> 16     3.00   3.66   3.50
> 32     2.99   3.70   3.49
> 64     3.09   4.17   3.50
> 128    4.80   5.58   4.74
> 256    5.79   6.37   5.76
>
> cfs is noticeably disadvantaged.
>
> [*] 500ms didn't make much difference in either test.

--
-ck

^ permalink raw reply [flat|nested] 713+ messages in thread
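The ballpark estimate quoted earlier in this message (a 4% kernbench drop at
roughly one preemption per 3 msec, hence about 120 usecs lost per window) can
be reproduced with a few lines of integer arithmetic. This is only a sketch:
the helper names are made up for this note, and it assumes, as the estimate
does, that the whole slowdown is cache-trashing cost paid once per preemption
window.

```c
#include <assert.h>

/* Parts per million, so that 4% == 40000 ppm and everything stays integer. */
#define PPM 1000000LL

/*
 * Cost of one preemption in nanoseconds, if a slowdown of `slowdown_ppm`
 * is attributed entirely to one preemption per `window_ns`.
 * (Hypothetical helper, not part of any scheduler patch.)
 */
static long long preempt_cost_ns(long long slowdown_ppm, long long window_ns)
{
	assert(slowdown_ppm >= 0 && window_ns > 0);
	return window_ns * slowdown_ppm / PPM;
}

/*
 * Expected slowdown in ppm if that same per-preemption cost is paid
 * only once per `granularity_ns`.
 */
static long long slowdown_ppm_at(long long cost_ns, long long granularity_ns)
{
	assert(cost_ns >= 0 && granularity_ns > 0);
	return cost_ns * PPM / granularity_ns;
}
```

With 40000 ppm and a 3,000,000 ns window the first helper gives 120,000 ns,
i.e. the 120 usecs mentioned above. Spreading that cost over a 15 msec
granularity gives 8000 ppm (0.8%); starting from a 1 msec window and going to
5 msec gives the same 0.8%, which is why both suggested values are a 5x
increase.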
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:49 ` Con Kolivas @ 2007-04-19 3:28 ` Nick Piggin 0 siblings, 0 replies; 713+ messages in thread From: Nick Piggin @ 2007-04-19 3:28 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 10:49:45PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > > > > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > > context switches per second (450/s per runqueue) for cfs, and 670 > > for mainline. Going up to 20ms granularity for cfs brings the context > > switch numbers similar, but user time is still a % or so higher. I'd > > be more worried about compute heavy threads which naturally don't do > > much context switching. > > While kernel compiles are nice and easy to do I've seen enough criticism of > them in the past to wonder about their usefulness as a standard benchmark on > their own. Actually it is a real workload for most kernel developers including you no doubt :) The criticism's of kernbench for the kernel are probably fair in that kernel compiles don't exercise a lot of kernel functionality (page allocator and fault paths mostly, IIRC). However as far as I'm concerned, they're great for testing the CPU scheduler, because it doesn't actually matter whether you're running in userspace or kernel space for a context switch to blow your caches. The results are quite stable. You could actually make up a benchmark that hurts a whole lot more from context switching, but I figure that kernbench is a real world thing that shows it up quite well. 
> > Some other numbers on the same system
> > Hackbench:         2.6.21-rc7  cfs-v2 1ms[*]  nicksched
> > 10 groups: Time:   1.332       0.743          0.607
> > 20 groups: Time:   1.197       1.100          1.241
> > 30 groups: Time:   1.754       2.376          1.834
> > 40 groups: Time:   3.451       2.227          2.503
> > 50 groups: Time:   3.726       3.399          3.220
> > 60 groups: Time:   3.548       4.567          3.668
> > 70 groups: Time:   4.206       4.905          4.314
> > 80 groups: Time:   4.551       6.324          4.879
> > 90 groups: Time:   7.904       6.962          5.335
> > 100 groups: Time:  7.293       7.799          5.857
> > 110 groups: Time:  10.595      8.728          6.517
> > 120 groups: Time:  7.543       9.304          7.082
> > 130 groups: Time:  8.269       10.639         8.007
> > 140 groups: Time:  11.867      8.250          8.302
> > 150 groups: Time:  14.852      8.656          8.662
> > 160 groups: Time:  9.648       9.313          9.541
>
> Hackbench even more so. A prolonged discussion with Rusty Russell on this
> issue he suggested hackbench was more a pass/fail benchmark to ensure there
> was no starvation scenario that never ended, and very little value should be
> placed on the actual results returned from it.

Yeah, cfs seems to do a little worse than nicksched here, but I
include the numbers not because I think that is significant, but to
show mainline's poor characteristics.

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-17 7:56 ` Andy Whitcroft
  2007-04-17 9:32 ` Nick Piggin
@ 2007-04-18 10:22 ` Ingo Molnar
  1 sibling, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-18 10:22 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
      Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox,
      Nishanth Aravamudan

* Andy Whitcroft <apw@shadowen.org> wrote:

> > as usual, any sort of feedback, bugreports, fixes and suggestions
> > are more than welcome,
>
> Pushed this through the test.kernel.org and nothing new blew up.
> Notably the kernbench figures are within expectations even on the
> bigger numa systems, commonly badly affected by balancing problems in
> the scheduler.

thanks! Given the really low preemption latency/granularity default
(roughly equivalent to 'timeslice length'), and that basically all of my
focus was on interactivity characteristics, this is a pretty good
result. I suspect it will be necessary to increase the default to 10
msecs (or more) to be on the safe side. (Nick has reported a 4%
kernbench drop so for his kernbench workload it's needed.)

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) 2007-04-13 20:21 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Ingo Molnar ` (13 preceding siblings ...) 2007-04-17 7:56 ` Andy Whitcroft @ 2007-04-18 15:58 ` Christian Hesse 2007-04-18 16:46 ` Ingo Molnar 14 siblings, 1 reply; 713+ messages in thread From: Christian Hesse @ 2007-04-18 15:58 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel [-- Attachment #1: Type: text/plain, Size: 591 bytes --] Hi Ingo and all, On Friday 13 April 2007, Ingo Molnar wrote: > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, I just gave CFS a try on my system. From a user's point of view it looks good so far. Thanks for your work. However I found a problem: When trying to suspend a system patched with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the ESC key results in a message that it tries to abort suspend, but then still hangs. I cced suspend2 devel list, perhaps Nigel is interested as well. -- Regards, Chris [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) 2007-04-18 15:58 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Christian Hesse @ 2007-04-18 16:46 ` Ingo Molnar 2007-04-18 20:45 ` CFS and suspend2: hang in atomic copy Christian Hesse 2007-04-19 9:32 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 16:46 UTC (permalink / raw) To: Christian Hesse Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel * Christian Hesse <mail@earthworm.de> wrote: > Hi Ingo and all, > > On Friday 13 April 2007, Ingo Molnar wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions are > > more than welcome, > > I just gave CFS a try on my system. From a user's point of view it > looks good so far. Thanks for your work. you are welcome! > However I found a problem: When trying to suspend a system patched > with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the > ESC key results in a message that it tries to abort suspend, but then > still hangs. i took a quick look at suspend2 and it makes some use of yield(). There's a bug in CFS's yield code, i've attached a patch that should fix it, does it make any difference to the hang? 
	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq
 
 /*
  * sched_yield() support is very simple via the rbtree, we just
- * dequeue and enqueue the task, which causes the task to
- * roundrobin to the end of the tree:
+ * dequeue the task and move it to the rightmost position, which
+ * causes the task to roundrobin to the end of the tree.
  */
 static void requeue_task_fair(struct rq *rq, struct task_struct *p)
 {
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
-	enqueue_task_fair(rq, p);
+	/*
+	 * Temporarily insert at the last position of the tree:
+	 */
+	p->fair_key = LLONG_MAX;
+	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
+
+	/*
+	 * Update the key to the real value, so that when all other
+	 * tasks from before the rightmost position have executed,
+	 * this task is picked up again:
+	 */
+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
@@ -380,7 +391,10 @@ static void task_tick_fair(struct rq *rq
 	 * Dequeue and enqueue the task to update its
 	 * position within the tree:
 	 */
-	requeue_task_fair(rq, curr);
+	dequeue_task_fair(rq, curr);
+	curr->on_rq = 0;
+	enqueue_task_fair(rq, curr);
+	curr->on_rq = 1;
 
 	/*
 	 * Reschedule if another task tops the current one.

^ permalink raw reply [flat|nested] 713+ messages in thread
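The ordering trick in the yield fix above can be illustrated outside the
kernel. The following is a minimal userspace sketch, not the CFS code: it
swaps the rbtree for a small sorted array, and every toy_* name is
hypothetical. It shows only the idea from the patch: insert with fair_key set
to LLONG_MAX so the task lands rightmost, then store the real key without
moving the task.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_TASKS 16

struct toy_task {
	long long fair_key;	/* sort key; a smaller key runs earlier */
	int on_rq;
};

/*
 * Stand-in for the rbtree (an assumption of this sketch):
 * slots[0] is leftmost (runs next), slots[nr - 1] is rightmost.
 */
struct toy_timeline {
	struct toy_task *slots[TOY_MAX_TASKS];
	size_t nr;
};

/* Insert by the task's current fair_key, after entries with an equal key. */
static void toy_enqueue(struct toy_timeline *tl, struct toy_task *p)
{
	size_t i;

	assert(tl->nr < TOY_MAX_TASKS);
	i = tl->nr;
	while (i > 0 && tl->slots[i - 1]->fair_key > p->fair_key) {
		tl->slots[i] = tl->slots[i - 1];
		i--;
	}
	tl->slots[i] = p;
	tl->nr++;
	p->on_rq = 1;
}

/* Remove a queued task and close the gap it leaves. */
static void toy_dequeue(struct toy_timeline *tl, struct toy_task *p)
{
	size_t i, j;

	for (i = 0; i < tl->nr; i++) {
		if (tl->slots[i] == p)
			break;
	}
	assert(i < tl->nr);	/* the task must be queued */
	for (j = i + 1; j < tl->nr; j++)
		tl->slots[j - 1] = tl->slots[j];
	tl->nr--;
	p->on_rq = 0;
}

/* Yield: park the task rightmost, then restore the key it should carry. */
static void toy_requeue(struct toy_timeline *tl, struct toy_task *p,
			long long real_key)
{
	toy_dequeue(tl, p);
	p->fair_key = LLONG_MAX;	/* sorts after every other queued task */
	toy_enqueue(tl, p);
	p->fair_key = real_key;		/* slot is kept; only the stored key changes */
}
```

With three queued tasks, calling toy_requeue() on the leftmost one moves it
to the last slot while the other two keep their order. Note that after the
key is restored, the task's slot no longer matches its key until it is
dequeued again; the sketch makes no attempt to handle later insertions.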
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 16:46 ` Ingo Molnar @ 2007-04-18 20:45 ` Christian Hesse 2007-04-18 21:16 ` Ingo Molnar 2007-04-19 9:32 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen 1 sibling, 1 reply; 713+ messages in thread From: Christian Hesse @ 2007-04-18 20:45 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel [-- Attachment #1: Type: text/plain, Size: 905 bytes --] On Wednesday 18 April 2007, Ingo Molnar wrote: > * Christian Hesse <mail@earthworm.de> wrote: > > On Friday 13 April 2007, Ingo Molnar wrote: > > > as usual, any sort of feedback, bugreports, fixes and suggestions are > > > more than welcome, > > > > When trying to suspend a system patched > > with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the > > ESC key results in a message that it tries to abort suspend, but then > > still hangs. > > i took a quick look at suspend2 and it makes some use of yield(). > There's a bug in CFS's yield code, i've attached a patch that should fix > it, does it make any difference to the hang? This patch should apply cleanly against what? The second hunk is ignored as it has already been applied. Is this correct? But no, it does not change anything. Let me know if you have any other patches to test. -- Regards, Chris [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 20:45 ` CFS and suspend2: hang in atomic copy Christian Hesse @ 2007-04-18 21:16 ` Ingo Molnar 2007-04-18 21:57 ` Christian Hesse 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 21:16 UTC (permalink / raw) To: Christian Hesse Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel * Christian Hesse <mail@earthworm.de> wrote: > > i took a quick look at suspend2 and it makes some use of yield(). > > There's a bug in CFS's yield code, i've attached a patch that should > > fix it, does it make any difference to the hang? > > This patch should apply cleanly against what? The second hunk is > ignored as it has already been applied. Is this correct? hm, i think you might have had one of the earlier CFS patches. > But no, it does not change anything. Let me know if you have any other > patches to test. could you try the -v3 patch i released a few hours ago: http://redhat.com/~mingo/cfs-scheduler/ although probably your suspend2 problem is still not fixed, it's worth a try nevertheless. Which suspend2 patch did you apply, and was it against -rc6 or -rc7? Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 21:16 ` Ingo Molnar @ 2007-04-18 21:57 ` Christian Hesse 2007-04-18 22:02 ` Ingo Molnar 2007-04-18 22:16 ` CFS and suspend2: hang in atomic copy Ingo Molnar 0 siblings, 2 replies; 713+ messages in thread From: Christian Hesse @ 2007-04-18 21:57 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel [-- Attachment #1: Type: text/plain, Size: 1098 bytes --] On Wednesday 18 April 2007, Ingo Molnar wrote: > * Christian Hesse <mail@earthworm.de> wrote: > > > i took a quick look at suspend2 and it makes some use of yield(). > > > There's a bug in CFS's yield code, i've attached a patch that should > > > fix it, does it make any difference to the hang? > > > > This patch should apply cleanly against what? The second hunk is > > ignored as it has already been applied. Is this correct? > > hm, i think you might have had one of the earlier CFS patches. You are right. > > But no, it does not change anything. Let me know if you have any other > > patches to test. > > could you try the -v3 patch i released a few hours ago: > > http://redhat.com/~mingo/cfs-scheduler/ > > although probably your suspend2 problem is still not fixed, it's worth a > try nevertheless. Which suspend2 patch did you apply, and was it against > -rc6 or -rc7? You are right again. ;-) Linux 2.6.21-rc7 Suspend2 2.2.9.11 (applies cleanly to -rc7) CFS v3 (without any additional patches) And it still hangs on suspend. -- Regards, Chris [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 21:57 ` Christian Hesse @ 2007-04-18 22:02 ` Ingo Molnar 2007-04-18 22:22 ` Christian Hesse ` (2 more replies) 2007-04-18 22:16 ` CFS and suspend2: hang in atomic copy Ingo Molnar 1 sibling, 3 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 22:02 UTC (permalink / raw) To: Christian Hesse Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel * Christian Hesse <mail@earthworm.de> wrote: > > although probably your suspend2 problem is still not fixed, it's > > worth a try nevertheless. Which suspend2 patch did you apply, and > > was it against -rc6 or -rc7? > > You are right again. ;-) > > Linux 2.6.21-rc7 > Suspend2 2.2.9.11 (applies cleanly to -rc7) > CFS v3 (without any additional patches) > > And it still hangs on suspend. what's the easiest way for me to try suspend2? Apply the patch, reboot into the kernel, then execute what command to suspend? (there's a confusing mismash of initiators of all the suspend variants. Can i drive this by echoing to /sys/power/state?) Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:02 ` Ingo Molnar
@ 2007-04-18 22:22 ` Christian Hesse
  2007-04-19 1:37 ` [Suspend2-devel] " Nigel Cunningham
  2007-04-18 22:56 ` Bob Picco
  2007-04-19 1:52 ` [Suspend2-devel] " Nigel Cunningham
  2 siblings, 1 reply; 713+ messages in thread
From: Christian Hesse @ 2007-04-18 22:22 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 1088 bytes --]

On Thursday 19 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > > although probably your suspend2 problem is still not fixed, it's
> > > worth a try nevertheless. Which suspend2 patch did you apply, and
> > > was it against -rc6 or -rc7?
> >
> > You are right again. ;-)
> >
> > Linux 2.6.21-rc7
> > Suspend2 2.2.9.11 (applies cleanly to -rc7)
> > CFS v3 (without any additional patches)
> >
> > And it still hangs on suspend.
>
> what's the easiest way for me to try suspend2? Apply the patch, reboot
> into the kernel, then execute what command to suspend? (there's a
> confusing mismash of initiators of all the suspend variants. Can i drive
> this by echoing to /sys/power/state?)

Perhaps you have to install suspend2-userui as well for the output (I'm not
sure whether it works without). Then you can trigger the suspend by echoing
to /sys/power/suspend2/do_suspend.
Useful information can be found in the Howto:

http://www.suspend2.net/HOWTO

I dropped some ccs to not abuse Linus and friends.
--
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy 2007-04-18 22:22 ` Christian Hesse @ 2007-04-19 1:37 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-19 1:37 UTC (permalink / raw) To: Christian Hesse; +Cc: Ingo Molnar, linux-kernel, suspend2-devel [-- Attachment #1: Type: text/plain, Size: 1249 bytes --] Hi. On Thu, 2007-04-19 at 00:22 +0200, Christian Hesse wrote: > On Thursday 19 April 2007, Ingo Molnar wrote: > > * Christian Hesse <mail@earthworm.de> wrote: > > > > although probably your suspend2 problem is still not fixed, it's > > > > worth a try nevertheless. Which suspend2 patch did you apply, and > > > > was it against -rc6 or -rc7? > > > > > > You are right again. ;-) > > > > > > Linux 2.6.21-rc7 > > > Suspend2 2.2.9.11 (applies cleanly to -rc7) > > > CFS v3 (without any additional patches) > > > > > > And it still hangs on suspend. > > > > what's the easiest way for me to try suspend2? Apply the patch, reboot > > into the kernel, then execute what command to suspend? (there's a > > confusing mismash of initiators of all the suspend variants. Can i drive > > this by echoing to /sys/power/state?) > > Perhaps you have to install suspend2-userui as well for the output (I'm not > shure whether it works without). Then you can trigger the suspend by echoing > to /sys/power/suspend2/do_suspend. > Useful informations can be found in the Howto: > > http://www.suspend2.net/HOWTO > > I dropped some ccs to not abuse Linus and friends. You can suspend and resume without it. Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 22:02 ` Ingo Molnar 2007-04-18 22:22 ` Christian Hesse @ 2007-04-18 22:56 ` Bob Picco 2007-04-19 1:43 ` [Suspend2-devel] " Nigel Cunningham 2007-04-19 6:29 ` Ingo Molnar 2007-04-19 1:52 ` [Suspend2-devel] " Nigel Cunningham 2 siblings, 2 replies; 713+ messages in thread From: Bob Picco @ 2007-04-18 22:56 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel Ingo Molnar wrote: [Wed Apr 18 2007, 06:02:28PM EDT] > > * Christian Hesse <mail@earthworm.de> wrote: > > > > although probably your suspend2 problem is still not fixed, it's > > > worth a try nevertheless. Which suspend2 patch did you apply, and > > > was it against -rc6 or -rc7? > > > > You are right again. ;-) > > > > Linux 2.6.21-rc7 > > Suspend2 2.2.9.11 (applies cleanly to -rc7) > > CFS v3 (without any additional patches) > > > > And it still hangs on suspend. > > what's the easiest way for me to try suspend2? Apply the patch, reboot > into the kernel, then execute what command to suspend? (there's a > confusing mismash of initiators of all the suspend variants. Can i drive > this by echoing to /sys/power/state?) > > Ingo I had hoped to collect more data with CFS V2. It crashes in scale_nice_down for s2ram when attempting to disable_nonboot_cpus. So part of traceback looks like (typed by hand with obvious omissions): scale_nice_down update_stats_wait_end - not shown in traceback because inlined pick_next_task_fair migration_call task_rq_lock notifier_call_chain _cpu_down disable_nonboot_cpus ... This is standard -rc7 with V2 CFS applied. It could be a completely unrelated issue. I'll attempt to debug further tomorrow. bob ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy 2007-04-18 22:56 ` Bob Picco @ 2007-04-19 1:43 ` Nigel Cunningham 2007-04-19 6:29 ` Ingo Molnar 1 sibling, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-19 1:43 UTC (permalink / raw) To: Bob Picco Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1649 bytes --] Hi. On Wed, 2007-04-18 at 18:56 -0400, Bob Picco wrote: > Ingo Molnar wrote: [Wed Apr 18 2007, 06:02:28PM EDT] > > > > * Christian Hesse <mail@earthworm.de> wrote: > > > > > > although probably your suspend2 problem is still not fixed, it's > > > > worth a try nevertheless. Which suspend2 patch did you apply, and > > > > was it against -rc6 or -rc7? > > > > > > You are right again. ;-) > > > > > > Linux 2.6.21-rc7 > > > Suspend2 2.2.9.11 (applies cleanly to -rc7) > > > CFS v3 (without any additional patches) > > > > > > And it still hangs on suspend. > > > > what's the easiest way for me to try suspend2? Apply the patch, reboot > > into the kernel, then execute what command to suspend? (there's a > > confusing mismash of initiators of all the suspend variants. Can i drive > > this by echoing to /sys/power/state?) > > > > Ingo > I had hoped to collect more data with CFS V2. It crashes in > scale_nice_down for s2ram when attempting to disable_nonboot_cpus. > So part of traceback looks like (typed by hand with obvious omissions): > > scale_nice_down > update_stats_wait_end - not shown in traceback because inlined > pick_next_task_fair > migration_call > task_rq_lock > notifier_call_chain > _cpu_down > disable_nonboot_cpus > ... > > This is standard -rc7 with V2 CFS applied. It could be a completely > unrelated issue. I'll attempt to debug further tomorrow. 
That - and Christian's other reply with the jpg - look to me more like this is an interaction between CFS and cpu hotplugging than Suspend2 itself. Can you also reproduce this with swsusp? Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:56 ` Bob Picco
  2007-04-19 1:43 ` [Suspend2-devel] " Nigel Cunningham
@ 2007-04-19 6:29 ` Ingo Molnar
  2007-04-19 11:10 ` Bob Picco
  1 sibling, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-19 6:29 UTC (permalink / raw)
  To: Bob Picco
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
      Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
      Thomas Gleixner, suspend2-devel

* Bob Picco <bob.picco@hp.com> wrote:

> I had hoped to collect more data with CFS V2. It crashes in
> scale_nice_down for s2ram when attempting to disable_nonboot_cpus. So
> part of traceback looks like (typed by hand with obvious omissions):
>
> scale_nice_down
> update_stats_wait_end - not shown in traceback because inlined
> pick_next_task_fair
> migration_call
> task_rq_lock
> notifier_call_chain
> _cpu_down
> disable_nonboot_cpus

ok, this looks similar to the jpeg Christian did. Does the patch below
fix the crash for you?

	Ingo

---
 kernel/sched.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned
 	struct task_struct *next;
 
 	for (;;) {
+		if (!rq->nr_running)
+			break;
 		next = pick_next_task(rq, rq->curr);
 		if (!next)
 			break;

^ permalink raw reply [flat|nested] 713+ messages in thread
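The guard added to migrate_dead_tasks() above can be sketched in userspace as
follows. This is an illustration only: the toy_* types and functions are
hypothetical, and the toy pick routine simply counts calls made on an empty
queue rather than crashing, on the assumption that the real pick path is not
safe to enter when nothing is runnable.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RQ_SLOTS 8

struct toy_task {
	int id;
};

struct toy_rq {
	struct toy_task *tasks[TOY_RQ_SLOTS];
	unsigned int nr_running;
	unsigned int unsafe_picks;	/* picks attempted on an empty queue */
};

/*
 * Toy pick routine: records (instead of crashing on) a call made while
 * the queue is empty, otherwise hands out the last queued task.
 */
static struct toy_task *toy_pick_next(struct toy_rq *rq)
{
	assert(rq != NULL);
	if (rq->nr_running == 0) {
		rq->unsafe_picks++;
		return NULL;
	}
	return rq->tasks[--rq->nr_running];
}

/* Drain loop shaped like the patched one: check nr_running before picking. */
static unsigned int toy_drain(struct toy_rq *rq)
{
	struct toy_task *next;
	unsigned int moved = 0;

	for (;;) {
		if (!rq->nr_running)
			break;
		next = toy_pick_next(rq);
		if (!next)
			break;
		moved++;	/* stand-in for migrating the task elsewhere */
	}
	return moved;
}
```

Draining a queue of three tasks returns 3 and leaves unsafe_picks at zero,
because the loop exits on the nr_running check before the pick routine is
ever called on an empty queue.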
* Re: CFS and suspend2: hang in atomic copy 2007-04-19 6:29 ` Ingo Molnar @ 2007-04-19 11:10 ` Bob Picco 0 siblings, 0 replies; 713+ messages in thread From: Bob Picco @ 2007-04-19 11:10 UTC (permalink / raw) To: Ingo Molnar Cc: Bob Picco, Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel Ingo Molnar wrote: [Thu Apr 19 2007, 02:29:36AM EDT] > > * Bob Picco <bob.picco@hp.com> wrote: > > > I had hoped to collect more data with CFS V2. It crashes in > > scale_nice_down for s2ram when attempting to disable_nonboot_cpus. So > > part of traceback looks like (typed by hand with obvious omissions): > > > > scale_nice_down > > update_stats_wait_end - not shown in traceback because inlined > > pick_next_task_fair > > migration_call > > task_rq_lock > > notifier_call_chain > > _cpu_down > > disable_nonboot_cpus > > ok, this looks similar to the jpeg Christian did. Does the patch below > fix the crash for you? > > Ingo > > --- > kernel/sched.c | 2 ++ > 1 file changed, 2 insertions(+) > > Index: linux/kernel/sched.c > =================================================================== > --- linux.orig/kernel/sched.c > +++ linux/kernel/sched.c > @@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned > struct task_struct *next; > > for (;;) { > + if (!rq->nr_running) > + break; > next = pick_next_task(rq, rq->curr); > if (!next) > break; This patch repairs s2ram issue. Thanks. bob ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy 2007-04-18 22:02 ` Ingo Molnar 2007-04-18 22:22 ` Christian Hesse 2007-04-18 22:56 ` Bob Picco @ 2007-04-19 1:52 ` Nigel Cunningham 2007-04-19 7:04 ` Ingo Molnar 2 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-19 1:52 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1160 bytes --] Hi. On Thu, 2007-04-19 at 00:02 +0200, Ingo Molnar wrote: > * Christian Hesse <mail@earthworm.de> wrote: > > > > although probably your suspend2 problem is still not fixed, it's > > > worth a try nevertheless. Which suspend2 patch did you apply, and > > > was it against -rc6 or -rc7? > > > > You are right again. ;-) > > > > Linux 2.6.21-rc7 > > Suspend2 2.2.9.11 (applies cleanly to -rc7) > > CFS v3 (without any additional patches) > > > > And it still hangs on suspend. > > what's the easiest way for me to try suspend2? Apply the patch, reboot > into the kernel, then execute what command to suspend? (there's a > confusing mismash of initiators of all the suspend variants. Can i drive > this by echoing to /sys/power/state?) From subsequent emails, I think you already got your answer, but just in case... Yes, if you enabled "Replace swsusp by default" and you already had it set up for getting swsusp to resume. If not, and you're using an initrd/ramfs, you'll need to modify it to echo > /sys/power/suspend2/do_resume after /sys and /proc are mounted but prior to mounting / and so on. Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy 2007-04-19 1:52 ` [Suspend2-devel] " Nigel Cunningham @ 2007-04-19 7:04 ` Ingo Molnar 2007-04-19 9:05 ` Nigel Cunningham 2007-04-24 20:23 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek 0 siblings, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-19 7:04 UTC (permalink / raw) To: Nigel Cunningham Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven * Nigel Cunningham <nigel@nigel.suspend2.net> wrote: > From subsequent emails, I think you already got your answer, but just > in case... > > Yes, if you enabled "Replace swsusp by default" and you already had it > set up for getting swsusp to resume. If not, and you're using an > initrd/ramfs, you'll need to modify it to echo > > /sys/power/suspend2/do_resume after /sys and /proc are mounted but > prior to mounting / and so on. yeah, went with the default suggested by your patch: CONFIG_SUSPEND2_REPLACE_SWSUSP=y and it was pretty easy to set things up. I used "echo disk > /sys/power/state" to trigger it. In hindsight it was all pretty straightforward and suspend2 worked beautifully on an UP and on an SMP system i tried. So in exchange for suspend2 folks debugging a bug in CFS here's some suspend2 review feedback ;) Any plans about moving suspend2 to the upstream kernel? It should be pretty easy for it to co-exist with the current swsuspend code. The patch has quite some size: 89 files changed, 16452 insertions(+), 69 deletions(-) that should obviously be split up into more than a dozen sub-patches, and fed to lkml with the small ones first. (unless it already is split up?) 
i cannot comment on the kernel/power/ bits (they are way too large
anyway), other than that they look pretty clean visually, but the
lowlevel arch and generic kernel bits look sane in detail too, sans a
few mostly trivial cleanliness issues:

+int suspend2_faulted = 0;
+EXPORT_SYMBOL(suspend2_faulted);

should be done via the pagefault notifier chain mechanism. Also, all the
exports you added should be EXPORT_SYMBOL_GPL().

this:

-	ClearPageReserved(virt_to_page(addr));
-	init_page_count(virt_to_page(addr));
+	//ClearPageReserved(virt_to_page(addr));
+	//init_page_count(virt_to_page(addr));

looks like there's a buglet in there still somewhere?

+	if(PageHighMem(page))
+		return 0;

coding style.

+	BUG_ON( test_suspend_state(SUSPEND_RUNNING) && /* Suspend2, that is */

make this a WARN_ON() or a WARN_ON_ONCE() - that way you have a chance
to even get feedback from users, instead of a 'uhm, X froze' report.

+#define FREEZER_OFF 0
+#define FREEZER_USERSPACE_FROZEN 1
+#define FREEZER_FULLY_ON 2

should be:

+#define FREEZER_OFF			0
+#define FREEZER_USERSPACE_FROZEN	1
+#define FREEZER_FULLY_ON		2

(you want your reviewers to have a pleasant time reading your code :)

+#define NETLINK_SUSPEND2_USERUI	20	/* For suspend2's userui */

IIRC userui was at the center of suspend2 merge flames, right? So you
might want to layer it on top of a less flashy suspend2-core and thus get
90% of your patch upstream?

+++ linux/mm/vmscan.c

the MM impact looks quite nontrivial. But i suspect this is unavoidable,
because you zap portions of the pagecache on the way to disk, so when it
comes back it results in a different pagecache (new lru lists, etc.),
right?

+++ linux/lib/dyn_pageflags.c

shouldn't this be in mm/dyn_pageflags.c? Plus it would be nice to use
some other core kernel user for this infrastructure. (but it's not a
necessity i guess)

but ... again, the patch looks sane all around.

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
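The BUG_ON() versus WARN_ON_ONCE() point can be shown with a userspace toy.
The macro below is a simplified stand-in written for this note, not the
kernel's implementation, and all toy_* names are hypothetical. It only
demonstrates the behaviour being asked for: report the first time a condition
trips at a call site, then keep running instead of halting.

```c
#include <assert.h>
#include <stdio.h>

static int toy_warnings;	/* number of warnings actually printed */

/* Print one warning line and count it. */
static void toy_report(const char *expr, const char *file, int line)
{
	assert(expr != NULL && file != NULL);
	fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	toy_warnings++;
}

/* Report the first time `cond` is true at this call site, then stay quiet. */
#define TOY_WARN_ON_ONCE(cond)					\
	do {							\
		static int toy_warned;				\
		if ((cond) && !toy_warned) {			\
			toy_warned = 1;				\
			toy_report(#cond, __FILE__, __LINE__);	\
		}						\
	} while (0)

/* Hypothetical check that trips on every call while `suspend_running` is set. */
static int toy_check(int suspend_running)
{
	TOY_WARN_ON_ONCE(suspend_running);
	return suspend_running;	/* execution continues; nothing is halted */
}
```

Calling toy_check(1) repeatedly prints a single warning and returns normally
each time, which is the property that lets a user keep working and send in a
report.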
* Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy 2007-04-19 7:04 ` Ingo Molnar @ 2007-04-19 9:05 ` Nigel Cunningham 2007-04-24 20:23 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-19 9:05 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 6210 bytes --] Hi Ingo. On Thu, 2007-04-19 at 09:04 +0200, Ingo Molnar wrote: > * Nigel Cunningham <nigel@nigel.suspend2.net> wrote: > > > From subsequent emails, I think you already got your answer, but just > > in case... > > > > Yes, if you enabled "Replace swsusp by default" and you already had it > > set up for getting swsusp to resume. If not, and you're using an > > initrd/ramfs, you'll need to modify it to echo > > > /sys/power/suspend2/do_resume after /sys and /proc are mounted but > > prior to mounting / and so on. > > yeah, went with the default suggested by your patch: > > CONFIG_SUSPEND2_REPLACE_SWSUSP=y > > and it was pretty easy to set things up. I used "echo disk > > /sys/power/state" to trigger it. > > In hindsight it was all pretty straightforward and suspend2 worked > beautifully on an UP and on an SMP system i tried. So in exchange for > suspend2 folks debugging a bug in CFS here's some suspend2 review > feedback ;) Any plans about moving suspend2 to the upstream kernel? It > should be pretty easy for it to co-exist with the current swsuspend > code. I really would like to get it into Linus' tree but Pavel doesn't want it (obviously!) and I haven't got together enough of a case yet to convince Andrew. I yet another here's-why-I-think-it-should-be-merged email in the works (poor Andrew!) but there are too many other things on my plate at the mo. 
> The patch has quite some size: > > 89 files changed, 16452 insertions(+), 69 deletions(-) > > that should obviously be split up into more than a dozen sub-patches, > and fed to lkml with the small ones first. (unless it already is split > up?) Right. A good portion (~2000 lines) of that is documentation. > i cannot comment on the kernel/power/ bits (they are way too large > anyway), other than that they look pretty clean visually, but the > lowlevel arch and generic kernel bits look sane in detail too, sans a > few mostly trivial cleanliness issues: > > +int suspend2_faulted = 0; > +EXPORT_SYMBOL(suspend2_faulted); > > should be done via the pagefault notifier chain mechanism. Also, all the > exports you added should be EXPORT_SYMBOL_GPL(). I'll look at that, but I'm not sure if it's a good idea - this is for during the atomic copy & restore, when DEBUG_PAGEALLOC is enabled on x86. Other things might touch memory in ways we don't want. It's only needed for slab pages that get unmapped but not freed. As far as the module exports go, I'm not expecting them to get merged. I like building Suspend2 as modules (it helps speed the development cycle), and see it as potentially useful for embedded but IMO there are too many export symbols to make merging that code a possibility. This is why they're all in one file rather than sprinkled through the files that define the symbols. > this: > > - ClearPageReserved(virt_to_page(addr)); > - init_page_count(virt_to_page(addr)); > + //ClearPageReserved(virt_to_page(addr)); > + //init_page_count(virt_to_page(addr)); > > looks like there's a buglet in there still somewhere? Yeah. When I was recently debugging, I found that cpu hotplugging is using something marked __init which is causing the machine to spontaneously reboot when cpus are replugged if DEBUG_PAGEALLOC is enabled. 
Haven't had the time to get back to it, and also need some help with the approach (what makes the machine reboot in this case instead of oopsing, and how do I stop it?). > + if(PageHighMem(page)) > + return 0; > > coding style. Oh. The space missing after the if? Ok. > + BUG_ON( test_suspend_state(SUSPEND_RUNNING) && /* Suspend2, that is */ > > make this a WARN_ON() or a WARN_ON_ONCE() - that way you have a chance > to even get feedback from users, instead of a 'uhm, X froze' report. > > +#define FREEZER_OFF 0 > +#define FREEZER_USERSPACE_FROZEN 1 > +#define FREEZER_FULLY_ON 2 > > should be: > > +#define FREEZER_OFF 0 > +#define FREEZER_USERSPACE_FROZEN 1 > +#define FREEZER_FULLY_ON 2 > > (you want your reviewers have an pleasant time reading your code :) Ok. > +#define NETLINK_SUSPEND2_USERUI 20 /* For suspend2's userui */ > > IIRC userui was at the center of suspend2 merge flames, right? So you > might want to layer it ontop a less flashy suspend2-core and thus get > 90% of your patch upstream? Ok. I've just separated that into its own file/module, so that will be straightforward to do. > +++ linux/mm/vmscan.c > > the MM impact looks quite nontrivial. But i suspect this is unavoidable, > because you zap portions of the pagecache on the way to disk, so when it > comes back it results in a different pagecache (new lru lists, etc.), > right? The modifications do three things. First, we're seeking to keep the LRU static while we're suspending. I originally sought to achieve that by avoiding entering the vmscan.c logic (not as drastic as it sounds - Suspend2 is the only thing running!). I think it was Nick who said he'd rather see the pages unlinked and kept safe that way, so now I do that. Oh, as part of this, I separated out the code from shrink_inactive_list that returns isolated pages, since relink_lru_lists uses it too. The other part is (prior to the above) seeking to get in a situation where we have enough memory available to do the cycle.
I used to use shrink_all_zones, but some users with lots of Highmem were finding issues that led me to take a more per-zone based approach, hence shrink_one_zone. The last thing is a patch Rafael recently posted to improve kswapd freezing. > +++ linux/lib/dyn_pageflags.c > > shouldnt this be in mm/dyn_pageflags.c? Plus it would be nice to use > some other core kernel user for this infrastructure. (but it's not a > necessity i guess) Yeah, I guess mm makes more sense. > but ... again, the patch looks sane all around. Thanks for the feedback. Although I'd like to see Suspend2 merged, I've been feeling a bit like it's a lost cause. It's nice to get some encouragement. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-19 7:04 ` Ingo Molnar 2007-04-19 9:05 ` Nigel Cunningham @ 2007-04-24 20:23 ` Pavel Machek 2007-04-24 20:41 ` Linus Torvalds 1 sibling, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-24 20:23 UTC (permalink / raw) To: Ingo Molnar Cc: Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Hi! > > From subsequent emails, I think you already got your answer, but just > > in case... > > > > Yes, if you enabled "Replace swsusp by default" and you already had it > > set up for getting swsusp to resume. If not, and you're using an > > initrd/ramfs, you'll need to modify it to echo > > > /sys/power/suspend2/do_resume after /sys and /proc are mounted but > > prior to mounting / and so on. > > yeah, went with the default suggested by your patch: > > CONFIG_SUSPEND2_REPLACE_SWSUSP=y > > and it was pretty easy to set things up. I used "echo disk > > /sys/power/state" to trigger it. > > In hindsight it was all pretty straightforward and suspend2 worked > beautifully on an UP and on an SMP system i tried. So in exchange for > suspend2 folks debugging a bug in CFS here's some suspend2 review > feedback ;) Any plans about moving suspend2 to the upstream kernel? It > should be pretty easy for it to co-exist with the current swsuspend > code. Well, current uswsusp code can do most of stuff suspend2 can do, with 20% (or so) of kernel code. "Major feature" that is missing is ability to save 100% of memory if it is all the pagecache. I think that is not that important; we have 200 line patch to do that, but noone was able to verify it is correct. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 20:23 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek @ 2007-04-24 20:41 ` Linus Torvalds 2007-04-24 20:51 ` Hua Zhong ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-24 20:41 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Tue, 24 Apr 2007, Pavel Machek wrote: > > Well, current uswsusp code can do most of stuff suspend2 can do, with > 20% (or so) of kernel code. Btw, this is a totally inane argument. If the code just moved somewhere else, it's not "less code". You compare complete subsystems against complete subsystems, OR YOU DON'T COMPARE THEM AT ALL! This whole notion that "kernel lines of code" is somehow different is a stupid and idiotic _disease_ that is spread by microkernel people and people who have been brainwashed by them. Code is code, and sometimes it's better in the kernel, sometimes it's better in user space, and you cannot say "we only have 10 lines of kernel code", if that is then combined with a million lines of user space code that actually is the only reason for the 10 lines of code in the first place. Separation of code often makes things *harder* to understand and debug. A few prime examples of this f*cking idiotic stupid disease of discounting user level code because it somehow "doesn't matter" is: - the old 16-bit pcmcia layer that did all the "policy" in user space, and only the "device access" in kernel space, and as a result _neither_ actually knew what the hell they were actually doing, and debugging was a nightmare. 
We've become a *lot* better off with a device layer that actually knows and understands what it is doing, and having the code in one place, rather than having two broken pieces. And we became better exactly by doing *more* in the kernel, and having a *higher* level of abstraction. This is a BIG ISSUE. Abstraction is good, but abstraction is good only if it is at a high enough level to make sense and matter, and give the abstraction level a choice in how to implement the lower layers. - the old module loader was also split into user/kernel space, and yes, we made the kernel part "larger" by moving some parts into the kernel, but in doing so, we actually made the *combined* code smaller, and a hell of a lot more maintainable. It also automatically (again, because of a higher level of abstraction) meant that the new module loader infrastructure was not only more maintainable, but could actually *do* more. Suddenly you can do things like check for cryptographic signatures etc, because you know what you're actually doing, as opposed to getting a ready-made "binary blob" that you don't know anything about, that has been pre-linked etc. So stop blathering about "less kernel code". That's the *least* of any worries. The only thing that matters is the end result, and trying to say that magically only one part counts is just dishonest and stupid. In general, the kernel should be self-sufficient and *understand* what it is doing. If the kernel cannot understand the bigger picture, nobody can ever maintain the kernel, because the kernel is just a broken piece bobbing around in a mindless churning sea during a thunderstorm. You cannot have purpose, and you cannot improve yourself if you don't actually understand your lot in life. That's as true of kernels as it is of people. User-space should set high-level policy, but if the kernel doesn't know what it's all about, the kernel can never do anything smarter and can never *fix* itself.
That was the case in both PCMCIA and in kernel module loaders. I have not a frigging clue whether that is the case in suspend2 vs uswsusp, but I object to this idiotic argument of counting "kernel code". That's simply not a valid argument. It never was. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* RE: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 20:41 ` Linus Torvalds @ 2007-04-24 20:51 ` Hua Zhong 2007-04-24 20:54 ` Ingo Molnar 2007-04-24 21:24 ` Pavel Machek 2 siblings, 0 replies; 713+ messages in thread From: Hua Zhong @ 2007-04-24 20:51 UTC (permalink / raw) To: 'Linus Torvalds', 'Pavel Machek' Cc: 'Ingo Molnar', 'Nigel Cunningham', 'Christian Hesse', 'Nick Piggin', 'Mike Galbraith', linux-kernel, 'Con Kolivas', suspend2-devel, 'Andrew Morton', 'Thomas Gleixner', 'Arjan van de Ven' > This whole notion that "kernel lines of code" is somehow different is a > stupid and idiotic _disease_ that is spread by microkernel people and > people who have been brainwashed by them. I think a lot of people are tired of this argument, but I am glad you speak up (as you did last year wrt s2ram). > The only thing that matters is the end result Amen to that. The end result is not just code size, but quality and whether it actually *works reliably*. Cheers, Hua ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 20:41 ` Linus Torvalds 2007-04-24 20:51 ` Hua Zhong @ 2007-04-24 20:54 ` Ingo Molnar 2007-04-24 21:29 ` Pavel Machek 2007-04-24 21:24 ` Pavel Machek 2 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-24 20:54 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven * Linus Torvalds <torvalds@linux-foundation.org> wrote: > I have not a frigging clue whether that is the case in suspend2 vs > uswsusp, but I object to this idiotic argument of counting "kernel > code". That's simply not a valid argument. It never was. the raw linecount appears to be the following: suspend2-2.2.9.12-for-2.6.21-rc6.patch 89 files changed, 16452 insertions(+), 69 deletions(-) $ suspend-0.5> countlines 32260 so, while it's probably apples to oranges, uswsusp seems to be larger, while there's at least one feature that it is missing. also, from the structure of the suspend2 patch it seemed to me that they could peacefully coexist in the kernel without stepping on each other's toes - why not do that? Users will then pick the winner. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 20:54 ` Ingo Molnar @ 2007-04-24 21:29 ` Pavel Machek 2007-04-24 22:24 ` Ray Lee 2007-04-25 21:41 ` Matt Mackall 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-24 21:29 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > I have not a frigging clue whether that is the case in suspend2 vs > > uswsusp, but I object to this idiotic argument of counting "kernel > > code". That's simply not a valid argument. It never was. > > the raw linecount appears to be the following: > > suspend2-2.2.9.12-for-2.6.21-rc6.patch > > 89 files changed, 16452 insertions(+), 69 deletions(-) > > $ suspend-0.5> countlines > 32260 (I'm getting very different numbers here. Userland part:) pavel@amd:~/sf/suspend$ wc -l *.c 125 bootsplash.c 12 breakit.c 119 config.c 136 delme.c 207 dmidecode.c 92 encrypt.c 222 keygen.c 447 md5.c 286 radeontool.c 870 resume.c 434 s2ram.c 78 splash.c 73 splashy_funcs.c 1481 suspend.c 117 swap-offset.c 11 vfork_test.c 123 vt.c 413 whitelist.c 136 whitelist2.c 5382 total pavel@amd:~/sf/suspend$ wc -l *.h 23 bootsplash.h 26 config.h 62 encrypt.h 106 md5.h 1764 radeon_reg.h 20 s2ram.h 26 splash.h 25 splashy_funcs.h 217 swsusp.h 10 vt.h 2279 total pavel@amd:~/sf/suspend$ > so, while it's probably apples to oranges, uswsusp seems to be larger, > while there's at least one feature that it is missing. (We are talking "save 100% memory" here). As I said, that one feature is doable in uswsusp, too. It is 200 lines. It also makes mm <-> swsusp interaction _way_ more complex, and noone was able to review it. It will corrupt memory if we got it wrong. (Suspend2 has the same problem. It includes that same feature, and noone is able to review it. It has few more problems.). 
> also, from the structure of the suspend2 patch it seemed to me that they > could peacefully coexist in the kernel without stepping on each other's > toes - why not do that? Users will then pick the winner. We do not want to fragment the testing base, and suspend2 does not really have any interesting features over uswsusp. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 21:29 ` Pavel Machek @ 2007-04-24 22:24 ` Ray Lee 2007-04-25 21:41 ` Matt Mackall 1 sibling, 0 replies; 713+ messages in thread From: Ray Lee @ 2007-04-24 22:24 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On 4/24/07, Pavel Machek <pavel@ucw.cz> wrote: > > so, while it's probably apples to oranges, uswsusp seems to be larger, > > while there's at least one feature that it is missing. > > (We are talking "save 100% memory" here). > > As I said, that one feature is doable in uswsusp, too. It is 200 > lines. It also makes mm <-> swsusp interaction _way_ more complex, and Sounds like the perfect reason to put that in kernel space. Ray ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 21:29 ` Pavel Machek 2007-04-24 22:24 ` Ray Lee @ 2007-04-25 21:41 ` Matt Mackall 2007-04-26 11:27 ` Pavel Machek 2007-04-26 19:04 ` Bill Davidsen 1 sibling, 2 replies; 713+ messages in thread From: Matt Mackall @ 2007-04-25 21:41 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Tue, Apr 24, 2007 at 11:29:56PM +0200, Pavel Machek wrote: > We do not want to fragment the testing base, and suspend2 does not > really have any interesting features over uswsusp. The testing base is already fragmented! What the current situation means is that you simply never hear from the people who get fed up with suspend but who manage to get suspend2 working. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:41 ` Matt Mackall @ 2007-04-26 11:27 ` Pavel Machek 2007-04-26 19:04 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:27 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > We do not want to fragment the testing base, and suspend2 does not > > really have any interesting features over uswsusp. > > The testing base is already fragmented! > > What the current situation means is that you simply never hear from > the people who get fed up with suspend but who manage to get suspend2 > working. Well, and what can I do? I can wait for suspend2 to slowly disappear on their own. Merging suspend2 would just make testing base _more_ fragmented then it is today. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:41 ` Matt Mackall 2007-04-26 11:27 ` Pavel Machek @ 2007-04-26 19:04 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-26 19:04 UTC (permalink / raw) To: Matt Mackall Cc: Pavel Machek, Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Matt Mackall wrote: > On Tue, Apr 24, 2007 at 11:29:56PM +0200, Pavel Machek wrote: >> We do not want to fragment the testing base, and suspend2 does not >> really have any interesting features over uswsusp. > > The testing base is already fragmented! > > What the current situation means is that you simply never hear from > the people who get fed up with suspend but who manage to get suspend2 > working. > I have to feel that having a *working resume* capability is "any interesting features" enough. What you say about "simply never hear from" is unfortunately true. On 04/25/2007 05:30 PM EDT, Pavel Machek wrote: >It is not Rafael's fault. Actually it is quite hard to work with >Nigel, because he implements every feature someone asks for, and wants >to merge them all :-( . I don't expect to ever agree with Nigel on >anything important, sorry. The fact that Pavel thinks giving the users what they want is a problem certainly defines the difference between them, the populist "give them what they want" and the elitist "let's them make do with what I think they should have." I do respect Pavel for all the stuff he has done and is doing, I wish I could have found a nicer way to say that. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 20:41 ` Linus Torvalds 2007-04-24 20:51 ` Hua Zhong 2007-04-24 20:54 ` Ingo Molnar @ 2007-04-24 21:24 ` Pavel Machek 2007-04-24 23:41 ` Linus Torvalds 2007-04-26 10:17 ` Johannes Berg 2 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-24 21:24 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > Well, current uswsusp code can do most of stuff suspend2 can do, with > > 20% (or so) of kernel code. > > Btw, this is a totally inane argument. > > If the code just moved somewhere else, it's not "less code". It is not "just moved". It is in userspace, where we can use liblzf / gcrypt / ( and vbetool for s2ram/s2both) as libraries. We have about 7000 LoC of userland code (that is not libraries). > You compare complete subsystems against complete subsystems, OR YOU DON'T > COMPARE THEM AT ALL! Ok, I do not know how big suspend2 user code is, but kernel uswsusp (4 kLoC) + userland support (7 kLoC) is still smaller than suspend2 kernel code (+ ? kLoC suspend2 userland support). > This whole notion that "kernel lines of code" is somehow different is a > stupid and idiotic _disease_ that is spread by microkernel people and > people who have been brainwashed by them. Yep, sorry about that. > Separation of code often makes things *harder* to understand and debug. A > few prime examples of this f*cking idiotic stupid disease of discounting > user level code because it somehow "doesn't matter" is: I believe uswsusp user/kernel separation is clean enough. Kernel provides "snapshot image" and "resume image". (Thanks go to Rafael for very clean interface). 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 21:24 ` Pavel Machek @ 2007-04-24 23:41 ` Linus Torvalds 2007-04-25 1:06 ` Olivier Galibert ` (2 more replies) 2007-04-26 10:17 ` Johannes Berg 1 sibling, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-24 23:41 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Tue, 24 Apr 2007, Pavel Machek wrote: > > > > If the code just moved somewhere else, it's not "less code". > > It is not "just moved". It is in userspace, where we can use liblzf / > gcrypt / ( and vbetool for s2ram/s2both) as libraries. We have about > 7000 LoC of userland code (that is not libraries). If it's in user land, we also have - communication difficulties between two parts, and all the *crap* that tends to entail (ie legacy interfaces forever, and upgrading one without the other etc) - people who work on the kernel part are working "blind" (ie they are at the mercy of whatever userland does, and it's not a "contained" subsystem). This just ends up becoming worse when you then interact with ten different versions of the user-land stuff, thanks to small tweaks by five different vendors, and a hundred random people. And don't tell me that doesn't happen. Maybe it doesn't happen _now_, because people who use it all get the patches from one place, but the moment we start talking about integration into the standard kernel, that means that the kernel needs to work regardless of whether somebody uses SuSE, RH, Fedora, Ubuntu or cooked his own distro entirely using some development version of the suspend user-space tools. This is why I don't believe in the whole kernel-line-counting thing. 
I'm personally 100% convinced that it's better to have ten times as many lines in the kernel, if it means that you can just forget about version skew and bad user-space interfaces etc. So if you want to enumerate "good" points, you'd damn well also face the _problems_. This is why there's a lot to be said for echo mem > /sys/power/state and being able to follow the path through _one_ object (the kernel) over trying to figure out the interaction between many different parts with different versions. > I believe uswsusp user/kernel separation is clean enough. Kernel > provides "snapshot image" and "resume image". (Thanks go to Rafael for > very clean interface). Now, *that* is the kind of argument that matters. Quite frankly, if you want to convince me, it's not by "lines of kernel code", but by talking about easy-to-understand interfaces that actually do one thing and do it well (and by "one thing", I mean "one _whole_ thing"). Because I care a lot less about lines of code than about "maintainable interfaces that people can think about and debug". I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the whole thing. I think they've _all_ caused problems for the "true" suspend (suspend-to-ram), and the last thing I want to see is three or four different suspend-to-disk implementations. So unlike Ingo, I don't think "let's just integrate them all side-by-side and maintain them and look who wins" is really a good idea. How many different magic ioctl's does the thing introduce? Is it really just *two* entry-points (and how simple are they, interface-wise), and nothing else? Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 23:41 ` Linus Torvalds @ 2007-04-25 1:06 ` Olivier Galibert 2007-04-25 6:41 ` Ingo Molnar 2007-04-25 7:23 ` Pavel Machek 2 siblings, 0 replies; 713+ messages in thread From: Olivier Galibert @ 2007-04-25 1:06 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Tue, Apr 24, 2007 at 04:41:58PM -0700, Linus Torvalds wrote: > How many different magic ioctl's does the thing introduce? Is it really > just *two* entry-points (and how simple are they, interface-wise), and > nothing else? Aren't you a little late to the party here? The userland version is the one that currently is in the kernel, after all the people who said "doing it in userland is not necessarily a good idea" got happily ignored. Suspend2 which is the continuity of the fully-in-kernel one is the one that has been constantly rejected by Pavel, lately by saying "it should be done in userspace", and hence never merged. Incidentally, it's 13 ioctls, and it's documented in Documentation/power/userland-swsusp.txt in a hard drive near you. I especially like the "get the available swap space in bytes" one that can only handle 32 bits. OG. ^ permalink raw reply [flat|nested] 713+ messages in thread
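The complaint above about a "swap space in bytes" value that can only handle 32 bits can be illustrated with a small sketch. It assumes, as the message describes, that the byte count travels through a 32-bit unsigned field; the actual ioctl definition is not reproduced here, and both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: what a caller would see if a byte count were
 * returned through a 32-bit unsigned field. Only the low 32 bits survive,
 * so anything at or above 4 GiB wraps around.
 */
static uint32_t swap_bytes_as_u32(uint64_t bytes)
{
    return (uint32_t)bytes;
}

/*
 * One possible alternative (an assumption, not the interface's design):
 * report pages instead of bytes, which keeps the number far smaller.
 */
static uint64_t swap_pages(uint64_t bytes, uint64_t page_size)
{
    return bytes / page_size;
}
```

With 6 GiB of free swap, `swap_bytes_as_u32()` reports 2 GiB, while `swap_pages()` with 4096-byte pages reports 1572864 pages, a value that fits comfortably in 32 bits.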
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 23:41 ` Linus Torvalds 2007-04-25 1:06 ` Olivier Galibert @ 2007-04-25 6:41 ` Ingo Molnar 2007-04-25 7:29 ` Pavel Machek 2007-04-25 7:23 ` Pavel Machek 2 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-25 6:41 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven * Linus Torvalds <torvalds@linux-foundation.org> wrote: > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate > the whole thing. I think they've _all_ caused problems for the "true" > suspend (suspend-to-ram), and the last thing I want to see is three or > four different suspend-to-disk implementations. So unlike Ingo, I > don't think "let's just integrate them all side-by-side and maintain > them and look who wins" is really a good idea. > > How many different magic ioctl's does the thing introduce? Is it > really just *two* entry-points (and how simple are they, > interface-wise), and nothing else? userspace-driven-suspend is already in the kernel, today. So it's not really "two versions side by side doing the same thing", but more of: A B C + D E F G H where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". (uswsusp of course redoes 'DEFGH' in user-space its own way, and there is the inevitable "+" glue code as well, but it's at least not two parallel versions of the same thing in the kernel, which would be gross.) 
My original mail was about the following thing: i tried the suspend2 patch (which just makes "echo disk > /sys/power/state" work as expected, as long as you give the booting up kernel image an idea about where the swap partition we suspended to is, via a single boot option) and that it was pretty straightforward and worked well, and that i think its way of reusing the existing suspend infrastructure and doing the add-ons cleanly while keeping the existing user-hibernate code intact looked viable to me. I.e. to me it looked like while there are apparent conflicts of personalities suspend2 did not really seem to be a hostile reimplementation of 'A B C', but that it tries to build upon 'A B C' and just has a different technical opinion about whether 'DEFGH' should be in the kernel or outside of it. Ingo ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 6:41 ` Ingo Molnar @ 2007-04-25 7:29 ` Pavel Machek 2007-04-25 7:48 ` Dumitru Ciobarcianu ` (3 more replies) 0 siblings, 4 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 7:29 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate > > the whole thing. I think they've _all_ caused problems for the "true" > > suspend (suspend-to-ram), and the last thing I want to see is three or > > four different suspend-to-disk implementations. So unlike Ingo, I > > don't think "let's just integrate them all side-by-side and maintain > > them and look who wins" is really a good idea. > > > > How many different magic ioctl's does the thing introduce? Is it > > really just *two* entry-points (and how simple are they, > > interface-wise), and nothing else? > > userspace-driven-suspend is already in the kernel, today. So it's not > really "two versions side by side doing the same thing", but more of: > > A B C + D E F G H > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". Actually, we have 'D H' in kernel, today. It is called swsusp... (Encryption, swapFile support and Graphical progress are missing from today's kernel.) > My original mail was about the following thing: i tried the suspend2 > patch (which just makes "echo disk > /sys/power/state" work as expected, > as long as you give the booting up kernel image an idea about where the ..and it means that 'echo disk > ...' should work w/o suspend2 patch, too. (Just try it). You'll miss compression part, but that provides only small speedup. 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:29 ` Pavel Machek @ 2007-04-25 7:48 ` Dumitru Ciobarcianu 2007-04-25 8:10 ` Pavel Machek 2007-04-25 8:48 ` Nigel Cunningham ` (2 subsequent siblings) 3 siblings, 1 reply; 713+ messages in thread From: Dumitru Ciobarcianu @ 2007-04-25 7:48 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 2007-04-25 at 07:29 +0000, Pavel Machek wrote: > > userspace-driven-suspend is already in the kernel, today. So it's not > > really "two versions side by side doing the same thing", but more of: > > > > A B C + D E F G H > > > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". > > Actually, we have 'D H' in kernel, today. It is called swsusp... > (Encryption, swapFile support and Graphical progress are missing from > today's kernel.) Please stop using FUD. Graphical progress it's not in the kernel, even with suspend2. > > > My original mail was about the following thing: i tried the suspend2 > > patch (which just makes "echo disk > /sys/power/state" work as expected, > > as long as you give the booting up kernel image an idea about where the > > ..and it means that 'echo disk > ...' should work w/o suspend2 patch, > too. (Just try it). You'll miss compression part, but that provides > only small speedup. I beg to differ: Compressed 904687616 bytes into 418828687 (53 percent compression). Almost 500mb less to write (did I mention it writes the full image?). Now imagine the time it takes to write that with those pesky 4200rpm laptop hdds. -- Cioby "Mr Linus, how do you debug the kernel, what tools do you use?" "Ever heard of prinf ?" 
(From a presentation at the "Politehnica" University of Bucharest, 1995)
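The compression figures quoted in the message above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are invented for the example, and the throughput passed to write_seconds() is an assumption, since the thread gives no number for a 4200rpm laptop drive.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that no longer have to be written once the image is compressed. */
uint64_t bytes_saved(uint64_t raw, uint64_t compressed)
{
    assert(compressed <= raw);
    return raw - compressed;
}

/* Percentage of the raw size removed by compression, rounded down. */
unsigned percent_saved(uint64_t raw, uint64_t compressed)
{
    assert(raw > 0 && compressed <= raw);
    return (unsigned)((raw - compressed) * 100 / raw);
}

/* Seconds needed to write `bytes` at an ASSUMED sustained throughput. */
double write_seconds(uint64_t bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return (double)bytes / bytes_per_second;
}
```

With the quoted sizes, bytes_saved(904687616, 418828687) is 485858929 bytes (roughly 463 MiB) and percent_saved() is 53, which matches the "53 percent compression" log line. How much wall-clock time that saves depends entirely on the throughput you assume for the disk.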
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:48 ` Dumitru Ciobarcianu @ 2007-04-25 8:10 ` Pavel Machek 2007-04-25 8:22 ` Dumitru Ciobarcianu 2007-04-26 11:12 ` Pekka Enberg 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 8:10 UTC (permalink / raw) To: Dumitru Ciobarcianu Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > > userspace-driven-suspend is already in the kernel, today. So it's not > > > really "two versions side by side doing the same thing", but more of: > > > > > > A B C + D E F G H > > > > > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by > > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". > > > > Actually, we have 'D H' in kernel, today. It is called swsusp... > > (Encryption, swapFile support and Graphical progress are missing from > > today's kernel.) > > Please stop using FUD. > Graphical progress it's not in the kernel, even with suspend2. It was ascii-art, but still 'graphical', last time I checked. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 8:10 ` Pavel Machek @ 2007-04-25 8:22 ` Dumitru Ciobarcianu 2007-04-26 11:12 ` Pekka Enberg 1 sibling, 0 replies; 713+ messages in thread From: Dumitru Ciobarcianu @ 2007-04-25 8:22 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 2007-04-25 at 08:10 +0000, Pavel Machek wrote: > Hi! > > > > > userspace-driven-suspend is already in the kernel, today. So it's not > > > > really "two versions side by side doing the same thing", but more of: > > > > > > > > A B C + D E F G H > > > > > > > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by > > > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". > > > > > > Actually, we have 'D H' in kernel, today. It is called swsusp... > > > (Encryption, swapFile support and Graphical progress are missing from > > > today's kernel.) > > > > Please stop using FUD. > > Graphical progress it's not in the kernel, even with suspend2. > > It was ascii-art, but still 'graphical', last time I checked. That would be suspend2ui_text , an userspace app. It also works without it. -- Cioby ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 8:10 ` Pavel Machek 2007-04-25 8:22 ` Dumitru Ciobarcianu @ 2007-04-26 11:12 ` Pekka Enberg 2007-04-26 14:48 ` Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: Pekka Enberg @ 2007-04-26 11:12 UTC (permalink / raw) To: Pavel Machek Cc: Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven, rjw On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote: > > Please stop using FUD. > > Graphical progress it's not in the kernel, even with suspend2. > > It was ascii-art, but still 'graphical', last time I checked. Suspend2 talks to a userspace client via netlink. While I find the name of the message ("redraw UI") rather appalling, there's nothing wrong in principle that userspace starts the suspend process and the kernel keeps feeding back progress information ("I froze all processes now") so it can display a graphical progress bar. The real question here is what to do with compression and encryption. However, if you settle for one compression algorithm (such as LZF in the case of suspend2) and use the _existing_ in-kernel crypto API for encryption, suddenly the benefits of userspace suspend are not clear. As you and Rafael seem to be mostly interested in uswsusp, why don't we replace the old in-kernel implementation with suspend2? Pekka
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:12 ` Pekka Enberg @ 2007-04-26 14:48 ` Rafael J. Wysocki 2007-04-26 16:10 ` Pekka Enberg 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 14:48 UTC (permalink / raw) To: Pekka Enberg Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thursday, 26 April 2007 13:12, Pekka Enberg wrote: > On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote: > > > Please stop using FUD. > > > Graphical progress it's not in the kernel, even with suspend2. > > > > It was ascii-art, but still 'graphical', last time I checked. > > Suspend2 talks to an userspace client via netlink. While I find the > name of the message ("redraw UI") rather appaling, there's nothing > wrong in principle that userspace starts the suspend process and the > kernel keeps feeding back progress information ("I froze all processes > now") so it can display a graphical progress bar. > > The real question here is what to do with compression and encryption. > However, if you settle for one compression algorithm (such as LZF in > the case of suspend2) and use the _existing_ in-kernel crypto API for > encryption, suddenly the benefits of userspace suspend are not clear. > > As you and Rafael seem to be mostly interested in uswsusp, why don't > we replace the old in-kernel implementation with suspend2? It has a lot of common code with uswsusp. Practically, the saving of the image is the only part of it that could be removed, but this is simple and _really_ helps with debugging. In principle, we could add suspend2 as an alternative (in analogy with the I/O schedulers, for example), but I think for this purpose it should be reviewed properly. There also is a real problem with how it uses the LRU pages. 
It _seems_ to work, but at least to me it seems to be potentially dangerous. Greetings, Rafael
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 14:48 ` Rafael J. Wysocki @ 2007-04-26 16:10 ` Pekka Enberg 2007-04-26 19:28 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Pekka Enberg @ 2007-04-26 16:10 UTC (permalink / raw) To: rjw Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > In principle, we could add suspend2 as an alternative (in analogy with the I/O > schedulers, for example), but I think for this purpose it should be reviewed > properly. Yeah, this makes sense. On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > There also is a real problem with how it uses the LRU pages. It _seems_ to > work, but at least to me it seems to be potentially dangerous. I am new to suspend2 so can you please explain what exactly is dangerous about it? Pekka ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 16:10 ` Pekka Enberg @ 2007-04-26 19:28 ` Rafael J. Wysocki 2007-04-26 20:16 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 19:28 UTC (permalink / raw) To: Pekka Enberg Cc: Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > schedulers, for example), but I think for this purpose it should be reviewed > > properly. > > Yeah, this makes sense. > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > work, but at least to me it seems to be potentially dangerous. > > I am new to suspend2 so can you please explain what exactly is dangerous > about it? After freezing tasks, it first saves the contents of the LRU pages, freezes devices and then uses the LRU pages for storing the suspend image (if more memory is needed, it's allocated, but that's irrelevant here). Now, we have no warranty that the LRU pages are not updated after we've saved their contents (first potential problem here). After the image has been created, we have to unfreeze devices and save the image. Now, we have no warranty that no one will be writing to the LRU pages that we have used to store the image, for whatever reasons known to him, so the image can potentially get corrupted while it's being saved. In principle, device drivers can do this and there are some kernel threads that also can do this (we don't freeze them, because they're needed for the image saving). 
The design is conceptually really really complicated and it makes strong assumptions about the behavior of different subsystems. While these assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction of them in the future if suspend2 were merged. Greetings, Rafael
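The first concern above (a page being updated after its contents were saved) can at least be detected after the fact by checksumming. The following is a minimal sketch, assuming a flat buffer of 4096-byte pages and an FNV-1a hash; none of these names come from suspend2, and detection of this kind does not provide the guarantee being asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumed page size, for illustration only */

/* FNV-1a hash of one page; any reasonable checksum would serve here. */
uint32_t page_checksum(const unsigned char *page)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < PAGE_BYTES; i++) {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/* Record a checksum per page at the moment the contents are saved. */
void record_checksums(const unsigned char *pages, size_t n, uint32_t *sums)
{
    assert(pages != NULL && sums != NULL);
    for (size_t i = 0; i < n; i++)
        sums[i] = page_checksum(pages + i * PAGE_BYTES);
}

/* Later: count the pages whose contents no longer match the record. */
size_t count_modified(const unsigned char *pages, size_t n,
                      const uint32_t *sums)
{
    assert(pages != NULL && sums != NULL);
    size_t changed = 0;
    for (size_t i = 0; i < n; i++)
        if (page_checksum(pages + i * PAGE_BYTES) != sums[i])
            changed++;
    return changed;
}
```

A nonzero result from count_modified() means something wrote to a page after it was saved. A zero result only says nothing changed on this particular run, which is exactly the gap between "seems to work" and a guarantee.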
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 19:28 ` Rafael J. Wysocki @ 2007-04-26 20:16 ` Nigel Cunningham 2007-04-26 20:37 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 20:16 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 2986 bytes --] Hi. On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote: > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > > schedulers, for example), but I think for this purpose it should be reviewed > > > properly. > > > > Yeah, this makes sense. > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > > work, but at least to me it seems to be potentially dangerous. > > > > I am new to suspend2 so can you please explain what exactly is dangerous > > about it? > > After freezing tasks, it first saves the contents of the LRU pages, freezes > devices and then uses the LRU pages for storing the suspend image (if more > memory is needed, it's allocated, but that's irrelevant here). Now, we have no > warranty that the LRU pages are not updated after we've saved their contents > (first potential problem here). > > After the image has been created, we have to unfreeze devices and save the > image. Now, we have no warranty that no one will be writing to the LRU pages > that we have used to store the image, for whatever reasons known to him, so the > image can potentially get corrupted while it's being saved. 
> > In principle, device drivers can do this and there are some kernel threads that > also can do this (we don't freeze them, because they're needed for the image > saving). > > The design is conceptually really really complicated and it makes strong > assumptions about the behavior of different subsystems. While these > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction > of them in the future if suspend2 were merged. That's a good description of the issue, although I think _may_ and _seems_ are stating things a bit more pessimistically than is necessary. You see, we need to remember that the pages which are saved separately are LRU pages. Because userspace is frozen, their contents are going to be static. The only possibilities for modifying them come from timer routines, improperly frozen filesystems and device drivers. We have code to check that the LRU isn't changing, and I've only seen one report of modifications to about 20 LRU pages. I haven't had the time yet to chase down the cause, but hope to do so soon. The general scheme has been working for four or five years - if there was a fundamental issue, we would have found it by now. The scheme isn't complicated. The algo for figuring out whether to save the page in an atomic copy just says: Iterate through all LRU pages. For each page, ask: Is this used by the thread suspending, or by userui? No? Save separately. Yes? Save in the atomic copy.... oh, and save everything else that needs to be saved in the atomic copy. Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
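The selection rule described at the end of the message above can be sketched as follows. The struct page_info fields and the function names are hypothetical stand-ins for whatever suspend2 actually tracks; only the decision logic is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum save_set { SAVE_SEPARATELY, SAVE_ATOMIC_COPY };

/* Hypothetical per-page facts; not a real kernel structure. */
struct page_info {
    bool on_lru;                 /* page is on an LRU list */
    bool used_by_suspend_thread; /* touched by the thread doing the suspend */
    bool used_by_userui;         /* touched by the userspace UI helper */
};

/* LRU pages not used by the suspending thread or userui are saved
 * separately; everything else goes into the atomic copy. */
enum save_set classify_page(const struct page_info *p)
{
    assert(p != NULL);
    if (p->on_lru && !p->used_by_suspend_thread && !p->used_by_userui)
        return SAVE_SEPARATELY;
    return SAVE_ATOMIC_COPY;
}

/* Count how many of n pages would be saved separately. */
size_t count_separately_saved(const struct page_info *pages, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (classify_page(&pages[i]) == SAVE_SEPARATELY)
            count++;
    return count;
}
```

The rule itself is short. The objection raised in the thread is about whether the separately saved pages really stay unchanged afterwards, not about the rule.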
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 20:16 ` Nigel Cunningham @ 2007-04-26 20:37 ` Rafael J. Wysocki 2007-04-26 20:49 ` David Lang 2007-04-26 20:55 ` Nigel Cunningham 0 siblings, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 20:37 UTC (permalink / raw) To: nigel Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote: > Hi. > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote: > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > > > schedulers, for example), but I think for this purpose it should be reviewed > > > > properly. > > > > > > Yeah, this makes sense. > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > > > work, but at least to me it seems to be potentially dangerous. > > > > > > I am new to suspend2 so can you please explain what exactly is dangerous > > > about it? > > > > After freezing tasks, it first saves the contents of the LRU pages, freezes > > devices and then uses the LRU pages for storing the suspend image (if more > > memory is needed, it's allocated, but that's irrelevant here). Now, we have no > > warranty that the LRU pages are not updated after we've saved their contents > > (first potential problem here). > > > > After the image has been created, we have to unfreeze devices and save the > > image. 
Now, we have no warranty that no one will be writing to the LRU pages > > that we have used to store the image, for whatever reasons known to him, so the > > image can potentially get corrupted while it's being saved. > > > > In principle, device drivers can do this and there are some kernel threads that > > also can do this (we don't freeze them, because they're needed for the image > > saving). > > > > The design is conceptually really really complicated and it makes strong > > assumptions about the behavior of different subsystems. While these > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction > > of them in the future if suspend2 were merged. > > That's a good description of the issue, although I think _may_ and > _seems_ are stating things a bit more pessimistically than is > necessary. I've used them to express my personal concerns. > You see, we need to remember that the pages which are saved separately > are LRU pages. Because userspace is frozen, their contents are going to > be static. The only possibilities for modifying them come from timer > routines, improperly frozen filesystems and device drivers. And kernel threads that we don't freeze deliberately. Currently, these are all worker threads, dm-related kernel threads and some others. > We have code to check that the LRU isn't changing, and I've only seen > one report of modifications to about 20 LRU pages. I haven't had the > time yet to chase down the cause, but hope to do so soon. I didn't say that would be common. If it had been, you'd have seen problems with it. To me the problem is the lack of warranty that it won't happen. > The general scheme has been working for four or five years - if there > was a fundamental issue, we would have found it by now. > > The scheme isn't complicated. Conceptually, it is complicated just because you're using the LRU. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 20:37 ` Rafael J. Wysocki @ 2007-04-26 20:49 ` David Lang 2007-04-26 20:55 ` Nigel Cunningham 1 sibling, 0 replies; 713+ messages in thread From: David Lang @ 2007-04-26 20:49 UTC (permalink / raw) To: Rafael J. Wysocki Cc: nigel, Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thu, 26 Apr 2007, Rafael J. Wysocki wrote: >> The general scheme has been working for four or five years - if there >> was a fundamental issue, we would have found it by now. >> >> The scheme isn't complicated. > > Conceptually, it is complicated just because you're using the LRU.

I know that I've seen many projects that are working on, or claim to have succeeded in, being able to do live migration of processes from one system to another. Has anyone looked at using any of these mechanisms for snapshotting the user processes for the std situation? If you can do this a process at a time, you may be able to avoid the massive blob of a write. Instead of what Linus was saying:

    buff = snapshot()
    write(buff)

it would be:

    start_snapshot() /* stops all userspace schedulers except for this process */
    foreach(pid) {
        buff = snapshotpid(pid)
        write(buff)
    }

David Lang
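To see why a per-process loop might avoid one massive write buffer, compare the peak buffer size of the two approaches. This is a toy model under loud assumptions: each process has a known image size, a buffer is released as soon as it has been written, and shared pages and kernel memory are ignored. The function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Peak buffer when everything is snapshotted in one go:
 * the whole image must exist at once, so the sizes add up. */
size_t peak_buffer_single_blob(const size_t *sizes, size_t n)
{
    assert(sizes != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += sizes[i];
    return total;
}

/* Peak buffer when each process is snapshotted and written in turn:
 * only one process image is held at a time, so the largest one wins. */
size_t peak_buffer_per_process(const size_t *sizes, size_t n)
{
    assert(sizes != NULL || n == 0);
    size_t largest = 0;
    for (size_t i = 0; i < n; i++)
        if (sizes[i] > largest)
            largest = sizes[i];
    return largest;
}
```

For three processes of 300, 120 and 80 units, the single-blob approach needs 500 units at peak, while the per-process loop needs 300. The model says nothing about consistency between processes that share memory, which a real implementation would have to address.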
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 20:37 ` Rafael J. Wysocki 2007-04-26 20:49 ` David Lang @ 2007-04-26 20:55 ` Nigel Cunningham 2007-04-26 21:22 ` Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 20:55 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 3955 bytes --] Hi. On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote: > On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote: > > Hi. > > > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote: > > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > > > > schedulers, for example), but I think for this purpose it should be reviewed > > > > > properly. > > > > > > > > Yeah, this makes sense. > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > > > > work, but at least to me it seems to be potentially dangerous. > > > > > > > > I am new to suspend2 so can you please explain what exactly is dangerous > > > > about it? > > > > > > After freezing tasks, it first saves the contents of the LRU pages, freezes > > > devices and then uses the LRU pages for storing the suspend image (if more > > > memory is needed, it's allocated, but that's irrelevant here). Now, we have no > > > warranty that the LRU pages are not updated after we've saved their contents > > > (first potential problem here). 
> > > > > > After the image has been created, we have to unfreeze devices and save the > > > image. Now, we have no warranty that no one will be writing to the LRU pages > > > that we have used to store the image, for whatever reasons known to him, so the > > > image can potentially get corrupted while it's being saved. > > > > > > In principle, device drivers can do this and there are some kernel threads that > > > also can do this (we don't freeze them, because they're needed for the image > > > saving). > > > > > > The design is conceptually really really complicated and it makes strong > > > assumptions about the behavior of different subsystems. While these > > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction > > > of them in the future if suspend2 were merged. > > > > That's a good description of the issue, although I think _may_ and > > _seems_ are stating things a bit more pessimistically than is > > necessary. > > I've used them to express my personal concerns. > > > You see, we need to remember that the pages which are saved separately > > are LRU pages. Because userspace is frozen, their contents are going to > > be static. The only possibilities for modifying them come from timer > > routines, improperly frozen filesystems and device drivers. > > And kernel threads that we don't freeze deliberately. Currently, these are > all worker threads, dm-related kernel threads and some others. > > > We have code to check that the LRU isn't changing, and I've only seen > > one report of modifications to about 20 LRU pages. I haven't had the > > time yet to chase down the cause, but hope to do so soon. > > I didn't say that would be common. If it had been, you'd have seen problems > with it. To me the problem is the lack of warranty that it won't happen. > > > The general scheme has been working for four or five years - if there > > was a fundamental issue, we would have found it by now. > > > > The scheme isn't complicated. 
> > Conceptually, it is complicated just because you're using the LRU. Well, I'm willing to look at other ideas. I actually selected LRU because it was simple. Prior to this, we did have a try at just iterating over the pages of frozen processes, but it didn't yield enough pages to be viable. I wouldn't be surprised if hunting down the cause of these changing pages leads to doing the opposite - starting with LRU pages and then removing all the ones owned by processes. (Am I right in thinking the remainder would be anonymous pages? I must learn more mm innards :>). Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 20:55 ` Nigel Cunningham @ 2007-04-26 21:22 ` Rafael J. Wysocki 2007-04-26 22:08 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 21:22 UTC (permalink / raw) To: nigel Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thursday, 26 April 2007 22:55, Nigel Cunningham wrote: > Hi. > > On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote: > > On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote: > > > Hi. > > > > > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote: > > > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > > > > > schedulers, for example), but I think for this purpose it should be reviewed > > > > > > properly. > > > > > > > > > > Yeah, this makes sense. > > > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > > > > > work, but at least to me it seems to be potentially dangerous. > > > > > > > > > > I am new to suspend2 so can you please explain what exactly is dangerous > > > > > about it? > > > > > > > > After freezing tasks, it first saves the contents of the LRU pages, freezes > > > > devices and then uses the LRU pages for storing the suspend image (if more > > > > memory is needed, it's allocated, but that's irrelevant here). Now, we have no > > > > warranty that the LRU pages are not updated after we've saved their contents > > > > (first potential problem here). 
> > > > > > > > After the image has been created, we have to unfreeze devices and save the > > > > image. Now, we have no warranty that no one will be writing to the LRU pages > > > > that we have used to store the image, for whatever reasons known to him, so the > > > > image can potentially get corrupted while it's being saved. > > > > > > > > In principle, device drivers can do this and there are some kernel threads that > > > > also can do this (we don't freeze them, because they're needed for the image > > > > saving). > > > > > > > > The design is conceptually really really complicated and it makes strong > > > > assumptions about the behavior of different subsystems. While these > > > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction > > > > of them in the future if suspend2 were merged. > > > > > > That's a good description of the issue, although I think _may_ and > > > _seems_ are stating things a bit more pessimistically than is > > > necessary. > > > > I've used them to express my personal concerns. > > > > > You see, we need to remember that the pages which are saved separately > > > are LRU pages. Because userspace is frozen, their contents are going to > > > be static. The only possibilities for modifying them come from timer > > > routines, improperly frozen filesystems and device drivers. > > > > And kernel threads that we don't freeze deliberately. Currently, these are > > all worker threads, dm-related kernel threads and some others. > > > > > We have code to check that the LRU isn't changing, and I've only seen > > > one report of modifications to about 20 LRU pages. I haven't had the > > > time yet to chase down the cause, but hope to do so soon. > > > > I didn't say that would be common. If it had been, you'd have seen problems > > with it. To me the problem is the lack of warranty that it won't happen. 
> > > > > The general scheme has been working for four or five years - if there > > > was a fundamental issue, we would have found it by now. > > > > > > The scheme isn't complicated. > > > > Conceptually, it is complicated just because you're using the LRU. > > Well, I'm willing to look at other ideas. I actually selected LRU > because it was simple. Prior to this, we did have a try at just > iterating over the pages of frozen processes, but it didn't yield enough > pages to be viable. I wouldn't be surprised if hunting down the cause of > these changing pages leads to doing the opposite - starting with LRU > pages and then removing all the ones owned by processes. (Am I right in > thinking the remainder would be anonymous pages? I must learn more mm > inards :>). I think we can discuss that, and the other things too. I'm open to cooperation. :-) Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 21:22 ` Rafael J. Wysocki @ 2007-04-26 22:08 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 22:08 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pekka Enberg, Pavel Machek, Dumitru Ciobarcianu, Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 4505 bytes --] Hi. On Thu, 2007-04-26 at 23:22 +0200, Rafael J. Wysocki wrote: > On Thursday, 26 April 2007 22:55, Nigel Cunningham wrote: > > Hi. > > > > On Thu, 2007-04-26 at 22:37 +0200, Rafael J. Wysocki wrote: > > > On Thursday, 26 April 2007 22:16, Nigel Cunningham wrote: > > > > Hi. > > > > > > > > On Thu, 2007-04-26 at 21:28 +0200, Rafael J. Wysocki wrote: > > > > > On Thursday, 26 April 2007 18:10, Pekka Enberg wrote: > > > > > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > > > In principle, we could add suspend2 as an alternative (in analogy with the I/O > > > > > > > schedulers, for example), but I think for this purpose it should be reviewed > > > > > > > properly. > > > > > > > > > > > > Yeah, this makes sense. > > > > > > > > > > > > On 4/26/2007, "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > > > > There also is a real problem with how it uses the LRU pages. It _seems_ to > > > > > > > work, but at least to me it seems to be potentially dangerous. > > > > > > > > > > > > I am new to suspend2 so can you please explain what exactly is dangerous > > > > > > about it? > > > > > > > > > > After freezing tasks, it first saves the contents of the LRU pages, freezes > > > > > devices and then uses the LRU pages for storing the suspend image (if more > > > > > memory is needed, it's allocated, but that's irrelevant here). 
Now, we have no > > > > > warranty that the LRU pages are not updated after we've saved their contents > > > > > (first potential problem here). > > > > > > > > > > After the image has been created, we have to unfreeze devices and save the > > > > > image. Now, we have no warranty that no one will be writing to the LRU pages > > > > > that we have used to store the image, for whatever reasons known to him, so the > > > > > image can potentially get corrupted while it's being saved. > > > > > > > > > > In principle, device drivers can do this and there are some kernel threads that > > > > > also can do this (we don't freeze them, because they're needed for the image > > > > > saving). > > > > > > > > > > The design is conceptually really really complicated and it makes strong > > > > > assumptions about the behavior of different subsystems. While these > > > > > assumptions _may_ be satisfied right now, we'd have to ensure the satisfaction > > > > > of them in the future if suspend2 were merged. > > > > > > > > That's a good description of the issue, although I think _may_ and > > > > _seems_ are stating things a bit more pessimistically than is > > > > necessary. > > > > > > I've used them to express my personal concerns. > > > > > > > You see, we need to remember that the pages which are saved separately > > > > are LRU pages. Because userspace is frozen, their contents are going to > > > > be static. The only possibilities for modifying them come from timer > > > > routines, improperly frozen filesystems and device drivers. > > > > > > And kernel threads that we don't freeze deliberately. Currently, these are > > > all worker threads, dm-related kernel threads and some others. > > > > > > > We have code to check that the LRU isn't changing, and I've only seen > > > > one report of modifications to about 20 LRU pages. I haven't had the > > > > time yet to chase down the cause, but hope to do so soon. > > > > > > I didn't say that would be common. 
If it had been, you'd have seen problems > > > with it. To me the problem is the lack of warranty that it won't happen. > > > > > > > The general scheme has been working for four or five years - if there > > > > was a fundamental issue, we would have found it by now. > > > > > > > > The scheme isn't complicated. > > > > > > Conceptually, it is complicated just because you're using the LRU. > > > > Well, I'm willing to look at other ideas. I actually selected LRU > > because it was simple. Prior to this, we did have a try at just > > iterating over the pages of frozen processes, but it didn't yield enough > > pages to be viable. I wouldn't be surprised if hunting down the cause of > > these changing pages leads to doing the opposite - starting with LRU > > pages and then removing all the ones owned by processes. (Am I right in > > thinking the remainder would be anonymous pages? I must learn more mm > > inards :>). > > I think we can discuss that, and the other things too. I'm open to > cooperation. :-) Great! Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
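[ Editorial sketch: Nigel mentions above that Suspend2 has "code to check that the LRU isn't changing". The thread does not show that code, so the following is only a toy user-space model of how such a check could work: checksum every page when its contents are saved, then compare later. The names snapshot() and changed_pages(), and the dict-of-bytes "pages", are hypothetical and are not Suspend2's API. ]

```python
import hashlib

def snapshot(pages):
    """Return one checksum per page (pages: dict of page id -> bytes)."""
    return {pfn: hashlib.sha1(data).hexdigest() for pfn, data in pages.items()}

def changed_pages(before, pages):
    """List the page ids whose contents differ from the earlier snapshot."""
    after = snapshot(pages)
    return sorted(pfn for pfn in before if after.get(pfn) != before[pfn])

# Toy "LRU": three 4 KiB pages.
lru = {0: bytes(4096), 1: b"\x01" * 4096, 2: b"\x02" * 4096}
saved = snapshot(lru)

# Simulate an unfrozen kernel thread or driver writing to page 1
# after its contents were saved.
lru[1] = b"\xff" + lru[1][1:]

print(changed_pages(saved, lru))  # [1]
```

A check like this can only report that a page changed after the fact. It gives no guarantee that pages stay unchanged, which is the concern Rafael raises.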
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:29 ` Pavel Machek 2007-04-25 7:48 ` Dumitru Ciobarcianu @ 2007-04-25 8:48 ` Nigel Cunningham 2007-04-25 13:07 ` Federico Heinz 2007-04-25 19:38 ` Kenneth Crudup 3 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-25 8:48 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Linus Torvalds, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1911 bytes --] Hi. On Wed, 2007-04-25 at 07:29 +0000, Pavel Machek wrote: > Hi! > > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate > > > the whole thing. I think they've _all_ caused problems for the "true" > > > suspend (suspend-to-ram), and the last thing I want to see is three or > > > four different suspend-to-disk implementations. So unlike Ingo, I > > > don't think "let's just integrate them all side-by-side and maintain > > > them and look who wins" is really a good idea. > > > > > > How many different magic ioctl's does the thing introduce? Is it > > > really just *two* entry-points (and how simple are they, > > > interface-wise), and nothing else? > > > > userspace-driven-suspend is already in the kernel, today. So it's not > > really "two versions side by side doing the same thing", but more of: > > > > A B C + D E F G H > > > > where "ABC" is used by the uswsusp code today, and "ABCDEFGH" is used by > > suspend2. So any "suspend2 merge" would largely be about adding "DEFGH". > > Actually, we have 'D H' in kernel, today. It is called swsusp... > (Encryption, swapFile support and Graphical progress are missing from > today's kernel.) Along with a lot of other things (see my "Reasons to merge Suspend2" email from earlier in the day). 
> > My original mail was about the following thing: i tried the suspend2 > > patch (which just makes "echo disk > /sys/power/state" work as expected, > > as long as you give the booting up kernel image an idea about where the > > ..and it means that 'echo disk > ...' should work w/o suspend2 patch, > too. (Just try it). You'll miss compression part, but that provides > only small speedup. Please don't spread misinformation to support your case. LZF compression (which is what all Suspend2 users use AFAIK) generally doubles the speed of your cycle. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:29 ` Pavel Machek 2007-04-25 7:48 ` Dumitru Ciobarcianu 2007-04-25 8:48 ` Nigel Cunningham @ 2007-04-25 13:07 ` Federico Heinz 2007-04-25 19:38 ` Kenneth Crudup 3 siblings, 0 replies; 713+ messages in thread From: Federico Heinz @ 2007-04-25 13:07 UTC (permalink / raw) To: Pavel Machek Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Pavel Machek wrote: > ..and it means that 'echo disk > ...' should work w/o suspend2 patch, > too. (Just try it). You'll miss compression part, but that provides > only small speedup. > In my experience, the speedup is significant, both in hibernating and in waking up, and since the full image is written to disk, the system wakes up *usable*. It takes forever for a system that wakes up from uswsusp to be usable again, it keeps tripping over page faults for *minutes*. Fede ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-25  7:29 ` Pavel Machek
                   ` (2 preceding siblings ...)
  2007-04-25 13:07 ` Federico Heinz
@ 2007-04-25 19:38 ` Kenneth Crudup
  3 siblings, 0 replies; 713+ messages in thread
From: Kenneth Crudup @ 2007-04-25 19:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven

On Wed, 25 Apr 2007, Pavel Machek wrote:

> You'll miss compression part, but that provides only small speedup.

Not here:

----
fgrep -h Compressed /var/log/rawlog*
Apr 22 13:41:34 vaio kernel: Compressed 85655552 bytes into 46779248 (45 percent compression).
Apr 22 16:09:13 vaio kernel: Compressed 1380552704 bytes into 435656971 (68 percent compression).
Apr 22 17:06:11 vaio kernel: Compressed 1488437248 bytes into 437400026 (70 percent compression).
Apr 22 22:55:41 vaio kernel: Compressed 1875677184 bytes into 623450953 (66 percent compression).
Apr 23 12:30:33 vaio kernel: Compressed 1731796992 bytes into 528194347 (69 percent compression).
Apr 23 18:13:32 vaio kernel: Compressed 1883869184 bytes into 691016832 (63 percent compression).
Apr 24 11:55:07 vaio kernel: Compressed 1795903488 bytes into 703370960 (60 percent compression).
<snip>
----

	-Kenny

--
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181

^ permalink raw reply [flat|nested] 713+ messages in thread
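[ Editorial sketch: the figures in Kenny's log above can be cross-checked with a few lines of Python. The regular expression and the summarize() helper are illustrative only. How the kernel message computes its percentage is an assumption here: taking the floor of the fraction of bytes saved happens to reproduce all seven reported values. ]

```python
import re

LOG = """\
Apr 22 13:41:34 vaio kernel: Compressed 85655552 bytes into 46779248 (45 percent compression).
Apr 22 16:09:13 vaio kernel: Compressed 1380552704 bytes into 435656971 (68 percent compression).
Apr 22 17:06:11 vaio kernel: Compressed 1488437248 bytes into 437400026 (70 percent compression).
Apr 22 22:55:41 vaio kernel: Compressed 1875677184 bytes into 623450953 (66 percent compression).
Apr 23 12:30:33 vaio kernel: Compressed 1731796992 bytes into 528194347 (69 percent compression).
Apr 23 18:13:32 vaio kernel: Compressed 1883869184 bytes into 691016832 (63 percent compression).
Apr 24 11:55:07 vaio kernel: Compressed 1795903488 bytes into 703370960 (60 percent compression).
"""

PATTERN = re.compile(r"Compressed (\d+) bytes into (\d+) \((\d+) percent compression\)")

def summarize(log):
    """Return (raw, packed, reported_pct, computed_pct) for each log line."""
    rows = []
    for match in PATTERN.finditer(log):
        raw, packed, reported = map(int, match.groups())
        computed = 100 * (raw - packed) // raw  # floor of the bytes saved, in percent
        rows.append((raw, packed, reported, computed))
    return rows

rows = summarize(LOG)
for raw, packed, reported, computed in rows:
    print(f"{raw:>11} -> {packed:>10}  reported {reported}%  computed {computed}%")
```

In the six larger images the compressed size is roughly a third of the original, so correspondingly less data has to be written to and read back from disk. The log by itself says nothing about elapsed time.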
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)
  2007-04-24 23:41 ` Linus Torvalds
  2007-04-25  1:06 ` Olivier Galibert
  2007-04-25  6:41 ` Ingo Molnar
@ 2007-04-25  7:23 ` Pavel Machek
  2007-04-25  8:48 ` Xavier Bestel
                   ` (4 more replies)
  2 siblings, 5 replies; 713+ messages in thread
From: Pavel Machek @ 2007-04-25 7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

Hi!

> This is why there's a lot to be said for
>
>	echo mem > /sys/power/state
>
> and being able to follow the path through _one_ object (the kernel) over
> trying to figure out the interaction between many different parts with
> different versions.

The 'promise' is 'if you can get echo disk > /sys/power/state working,
uswsusp will work, too'. IOW it should be ok to debug the in-kernel
parts only.

Even I am running in-kernel swsusp, but my managers were pretty clear
they want a graphical progress bar hiding all the 'ugly' swsusp
messages... and in the end the same uswsusp enables compression, too.

> I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the
> whole thing. I think they've _all_ caused problems for the "true" suspend
> (suspend-to-ram), and the last thing I want to see is three or four

Well, it is a bit more complex than that.

suspend-to-disk is a workaround for

	'suspend-to-ram eats too much power' (plus some details like
	being able to replace battery).

suspend-to-ram is a workaround for

	'idle machine takes way too much power' (plus some details
	like don't spin the disk so that machine is safe to carry).

I'm starting to think that we should fix the idle power consumption
problem.

Cell phones do it right. They pretend to be ready/idle all the time,
yet they have _days_ of standby. OLPC can do something like that, too:
it is capable of entering suspend-to-ram with the screen on and input
devices ready to wake the system up. And with the right network card
(and the right userland) ...

I think normal PCs could enter suspend-to-ram during screensaver, too.
When you are about to turn off the screen, the machine should enable WOL
on any packet, arm RTC wakeup for the next timer, and s2ram happily.
(Obviously we are far away from that on PC.)

								Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:23 ` Pavel Machek @ 2007-04-25 8:48 ` Xavier Bestel 2007-04-25 8:50 ` Nigel Cunningham 2007-04-25 9:02 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti ` (3 subsequent siblings) 4 siblings, 1 reply; 713+ messages in thread From: Xavier Bestel @ 2007-04-25 8:48 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote: > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the > > whole thing. I think they've _all_ caused problems for the "true" suspend > > (suspend-to-ram), and the last thing I want to see is three or four > > Well, it is a bit more complex than that. > > suspend-to-disk is a workaround for > > 'suspend-to-ram eats too much power' (plus some details like > being able to replace battery). > > suspend-to-ram is a workaround for > > 'idle machine takes way too much power' (plus some details > like don't spin the disk so that machine is safe to carry). I think it depends on who you ask. I personally think that suspend-to-$youchoose is a workaround for the slowness of system startup. I never turn off my laptop, I just suspend it. (And guess what, it uses APM and suspend is really faster and way more reliable than each kernel implementation I could try). Xav ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 8:48 ` Xavier Bestel @ 2007-04-25 8:50 ` Nigel Cunningham 2007-04-25 9:07 ` Xavier Bestel 0 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-25 8:50 UTC (permalink / raw) To: Xavier Bestel Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1366 bytes --] Hi. On Wed, 2007-04-25 at 10:48 +0200, Xavier Bestel wrote: > On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote: > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the > > > whole thing. I think they've _all_ caused problems for the "true" suspend > > > (suspend-to-ram), and the last thing I want to see is three or four > > > > Well, it is a bit more complex than that. > > > > suspend-to-disk is a workaround for > > > > 'suspend-to-ram eats too much power' (plus some details like > > being able to replace battery). > > > > suspend-to-ram is a workaround for > > > > 'idle machine takes way too much power' (plus some details > > like don't spin the disk so that machine is safe to carry). > > I think it depends on who you ask. I personally think that suspend-to- > $youchoose is a workaround for the slowness of system startup. I never > turn off my laptop, I just suspend it. > > (And guess what, it uses APM and suspend is really faster and way more > reliable than each kernel implementation I could try). If you tried Suspend2 and had problems with reliability, please send me logs. I'll do all I can to help. (I have to qualify it a bit, because I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and I'll fix it). 
Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 8:50 ` Nigel Cunningham @ 2007-04-25 9:07 ` Xavier Bestel 2007-04-25 9:19 ` Nigel Cunningham 2007-04-26 18:18 ` Bill Davidsen 0 siblings, 2 replies; 713+ messages in thread From: Xavier Bestel @ 2007-04-25 9:07 UTC (permalink / raw) To: nigel Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote: > > (And guess what, it uses APM and suspend is really faster and way more > > reliable than each kernel implementation I could try). > > If you tried Suspend2 and had problems with reliability, please send me > logs. I'll do all I can to help. (I have to qualify it a bit, because > I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and > I'll fix it). Does suspend2 work with APM ? After much trying, I think now the ACPI implementation of my laptop (a vintage Compaq Armada 1700) is busted, only APM works. AFAIR the problem with suspend2 was that it didn't poweroff some parts of the laptop (the led of the wifi pcmcia card was on, and the lcd light was on too), but that was last year. Kernel's suspend kind of worked but didn't resume (no reaction on button press). As I tried all this last year, I may have forgotten some things. Honestly, I like this laptop when it works flawlessly, so I don't see many reasons to try *susp* again. I'll do it when I'm bored, just not today. Thanks, Xav ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 9:07 ` Xavier Bestel @ 2007-04-25 9:19 ` Nigel Cunningham 2007-04-26 18:18 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-25 9:19 UTC (permalink / raw) To: Xavier Bestel Cc: Pavel Machek, Linus Torvalds, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1755 bytes --] Hi. On Wed, 2007-04-25 at 11:07 +0200, Xavier Bestel wrote: > On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote: > > > (And guess what, it uses APM and suspend is really faster and way more > > > reliable than each kernel implementation I could try). > > > > If you tried Suspend2 and had problems with reliability, please send me > > logs. I'll do all I can to help. (I have to qualify it a bit, because > > I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and > > I'll fix it). > > Does suspend2 work with APM ? After much trying, I think now the ACPI > implementation of my laptop (a vintage Compaq Armada 1700) is busted, > only APM works. It should do. If you set the powerdown method to 0, it will use machine_power_off() instead of trying to use acpi, fall back to machine_halt() if that fails and lastly (should not be needed) a while(1) cpu_relax() loop. > AFAIR the problem with suspend2 was that it didn't poweroff some parts > of the laptop (the led of the wifi pcmcia card was on, and the lcd light > was on too), but that was last year. Kernel's suspend kind of worked but > didn't resume (no reaction on button press). As I tried all this last > year, I may have forgotten some things. The code to poweroff those parts will be dependent on the drivers (assuming I'm making the right calls). If it's something where swsusp works and suspend2 doesn't, it will be because I'm doing something wrong. 
If they both don't do the right thing, then it's probably the driver. > Honestly, I like this laptop when it works flawlessly, so I don't see > many reasons to try *susp* again. I'll do it when I'm bored, just not > today. Okay :) Just let me know if I can help. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
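[ Editorial sketch: the fallback order Nigel describes above for powerdown method 0 (machine_power_off(), then machine_halt(), then a while(1) cpu_relax() loop) can be modelled in user space as below. Only those three kernel names come from the message. The Python power_down() helper, the PoweredOff exception and the stand-in step functions are hypothetical, and this is not the Suspend2 code. ]

```python
class PoweredOff(Exception):
    """Signals that a stand-in step really did take the machine down."""

def power_down(steps, spin):
    """Try each (name, step) in order and return the name of the one that worked.

    A step that returns normally is treated as having failed, because a real
    power-off or halt would never return to its caller.
    """
    for name, step in steps:
        try:
            step()
        except PoweredOff:
            return name
    spin()  # last resort, standing in for the while(1) cpu_relax() loop
    return "spin"

def failing_power_off():
    pass  # returns, i.e. the machine is still running

def working_halt():
    raise PoweredOff

spins = []
result = power_down(
    [("machine_power_off", failing_power_off), ("machine_halt", working_halt)],
    spin=lambda: spins.append(1),
)
print(result)  # machine_halt
```

In this model the spin step is reached only when every earlier step returns, which matches Nigel's remark that the final loop "should not be needed".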
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 9:07 ` Xavier Bestel 2007-04-25 9:19 ` Nigel Cunningham @ 2007-04-26 18:18 ` Bill Davidsen 1 sibling, 0 replies; 713+ messages in thread From: Bill Davidsen @ 2007-04-26 18:18 UTC (permalink / raw) To: linux-kernel; +Cc: suspend2-devel Xavier Bestel wrote: > On Wed, 2007-04-25 at 18:50 +1000, Nigel Cunningham wrote: >>> (And guess what, it uses APM and suspend is really faster and way more >>> reliable than each kernel implementation I could try). >> If you tried Suspend2 and had problems with reliability, please send me >> logs. I'll do all I can to help. (I have to qualify it a bit, because >> I'm not able to fix drivers, but if it's a Suspend2 issue, tell me and >> I'll fix it). > > Does suspend2 work with APM ? After much trying, I think now the ACPI > implementation of my laptop (a vintage Compaq Armada 1700) is busted, > only APM works. > > AFAIR the problem with suspend2 was that it didn't poweroff some parts > of the laptop (the led of the wifi pcmcia card was on, and the lcd light > was on too), but that was last year. Kernel's suspend kind of worked but > didn't resume (no reaction on button press). As I tried all this last > year, I may have forgotten some things. > Honestly, I like this laptop when it works flawlessly, so I don't see > many reasons to try *susp* again. I'll do it when I'm bored, just not > today. > Actually on some old laptops I just use the apm command, with -s (or -S, I forget by now), and that works. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang in atomic copy)
  2007-04-25  7:23 ` Pavel Machek
  2007-04-25  8:48 ` Xavier Bestel
@ 2007-04-25  9:02 ` Romano Giannetti
  2007-04-25 19:16 ` suspend2 merge Martin Steigerwald
  2007-04-25 15:18 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 713+ messages in thread
From: Romano Giannetti @ 2007-04-25 9:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven

On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> suspend-to-disk is a workaround for
>
>	'suspend-to-ram eats too much power' (plus some details like
>	being able to replace battery).
>

...and let me add: 'suspend-to-disk' is a workaround for when s2ram does
not work for a gazillion interacting reasons (ACPI, vga bios, drm/dri,
you name it). I am quite happy with s2ram now on my AMD-based vaio, but
it only started to work with 2.6.17 kernels (Ubuntu Edgy, really), and
for the three years before that suspend-to-disk (sometimes Pavel's,
sometimes Nigel's) saved the day (yes, it's quite a bit faster to use
suspend-to-disk than doing shutdown, reboot, and re-open all the
applications).

So, please do not dismiss suspend-to-disk as "crap". It has its place
under the sun.

	Romano

--
Romano Giannetti --- romano.giannetti@gmail.com
Sorry for the following disclaimer, it's attached by our outgoing server
and I cannot shut it up.

--
This communication contains confidential information. It is for the
exclusive use of the intended addressee. If you are not the intended
addressee, please note that any form of distribution, copying or use of
this communication or the information in it is strictly prohibited by
law. If you have received this communication in error, please
immediately notify the sender by reply e-mail and destroy this message.
Thank you for your cooperation.

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge
  2007-04-25  9:02 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti
@ 2007-04-25 19:16 ` Martin Steigerwald
  0 siblings, 0 replies; 713+ messages in thread
From: Martin Steigerwald @ 2007-04-25 19:16 UTC (permalink / raw)
  To: suspend2-devel
  Cc: Romano Giannetti, Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven

On Wednesday 25 April 2007, Romano Giannetti wrote:
> On Wed, 2007-04-25 at 07:23 +0000, Pavel Machek wrote:
> > suspend-to-disk is a workaround for
> >
> >	'suspend-to-ram eats too much power' (plus some details like
> >	being able to replace battery).
>
> ...and let me add 'suspend-to-disk' is a workaround for when s2ram does
> not work for a gazillion interacting reasons (ACPI, vga bios, drm/dri,
> you name it).

Hello Romano,

for me that is not the only reason. I usually do not put the batteries
into my laptops unless I need them, because I have read, and found in
practice, that this extends battery life as long as I make sure they are
always charged to more than 50%. Suspend to RAM thus wouldn't work at
all for me.

I have used suspend2 since 2.6.14 because I never managed to get
userspace software suspend working on my ThinkPad T23 and T42, not even
with standard Debian kernel packages. I didn't try the latest ones,
however; AFAIR my last try was with a 2.6.18 package. And because it's
faster, and the machine is more responsive after resuming, than the
swsusp that I used up to kernel 2.6.13.

With the 1.5 GB of RAM in my T42, suspending to disk with suspend2 takes
quite some time, and so does resuming, but I didn't optimize it and thus
it saves out almost all of that:

martin@shambala:~> free -m
             total       used       free     shared    buffers     cached
Mem:          1518       1219        298          0          0        831
-/+ buffers/cache:        388       1130
Swap:         1908          0       1908

Probably should limit that.

I would like to see suspend2 merged! It's proven technology and just
works, while I couldn't get userspace software suspend to work for me.
Maybe I made a mistake while setting it up, but I think the initial
setup shouldn't be as complicated as I perceived it to be.

I use suspend2 for our workstations at work, too, and my workstation has
an uptime of more than 43 days with more than 17 successful suspend and
resume cycles.

Regards,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:23 ` Pavel Machek 2007-04-25 8:48 ` Xavier Bestel 2007-04-25 9:02 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:hang " Romano Giannetti @ 2007-04-25 15:18 ` Adrian Bunk 2007-04-25 17:34 ` Pavel Machek 2007-04-25 19:43 ` Kenneth Crudup 2007-05-26 17:37 ` Martin Steigerwald 4 siblings, 1 reply; 713+ messages in thread From: Adrian Bunk @ 2007-04-25 15:18 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, Apr 25, 2007 at 07:23:50AM +0000, Pavel Machek wrote: > Hi! > > > This is why there's a lot to be said for > > > > echo mem > /sys/power/state > > > > and being able to follow the path through _one_ object (the kernel) over > > trying to figure out the interaction between many different parts with > > different versions. > > The 'promise' is 'if you can get echo disk > /sys/power/state working, > uswsusp will work. too'. IOW it should be ok to debug the in-kernel > parts, only. > > Even I am running in-kernel swsusp, but my managers were pretty clear > they want graphical progress bar hiding all the 'ugly' swsusp > messages... and in the end the same uswsusp enables compression, too. > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the > > whole thing. I think they've _all_ caused problems for the "true" suspend > > (suspend-to-ram), and the last thing I want to see is three or four > > Well, it is a bit more complex than that. > > suspend-to-disk is a workaround for > > 'suspend-to-ram eats too much power' (plus some details like > being able to replace battery). >... Why does everyone think suspend-to-disk was a laptop-only thing? 
My personal usage of suspend-to-disk is for turning the computer off in the evening and getting the complete FVWM with all programs running, open browser tabs,... back the next morning. All I need for suspending is: - echo "disk" > /sys/power/state All I need for getting a running and usable system back is: - turn on the power at the socket my computer is connected at - swapoff -a; swapon -a [1] - wait a bit until the above commands finished That's much more convenient than a cold boot. And it's working with a plain 2.6.16 kernel and zero userspace support. > Pavel cu Adrian [1] required step: working with 1 or 2 GB swapped out is horrible -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 15:18 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk @ 2007-04-25 17:34 ` Pavel Machek 2007-04-25 18:39 ` Adrian Bunk ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 17:34 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > Even I am running in-kernel swsusp, but my managers were pretty clear > > they want graphical progress bar hiding all the 'ugly' swsusp > > messages... and in the end the same uswsusp enables compression, too. > > > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the > > > whole thing. I think they've _all_ caused problems for the "true" suspend > > > (suspend-to-ram), and the last thing I want to see is three or four > > > > Well, it is a bit more complex than that. > > > > suspend-to-disk is a workaround for > > > > 'suspend-to-ram eats too much power' (plus some details like > > being able to replace battery). > >... > > Why does everyone think suspend-to-disk was a laptop-only thing? > > My personal usage of suspend-to-disk is for turning the computer off in > the evening and getting the complete FVWM with all programs running, > open browser tabs,... back the next morning. Ok ok ok, suspend-to-disk has some other uses, too. But ... you are really using suspend-to-disk as a workaround for "my desktop takes too much power when idle". Imagine pressing "lock screensaver" combination, and your machine going to low power mode (3W?), immediately. (Quiet, too; you can't generate much noise for 3W). In the morning, you'd just press any key, machine would power up, immediately... 
ok, you'd have to ifconfig eth0 down, so that spurious packets on the local net would not wake your machine, with all its fans etc. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 17:34 ` Pavel Machek @ 2007-04-25 18:39 ` Adrian Bunk 2007-04-25 18:50 ` Linus Torvalds 2007-04-25 18:52 ` Alon Bar-Lev 2007-04-25 22:11 ` Kenneth Crudup 2 siblings, 1 reply; 713+ messages in thread From: Adrian Bunk @ 2007-04-25 18:39 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, Apr 25, 2007 at 07:34:05PM +0200, Pavel Machek wrote: > Hi! > > > > Even I am running in-kernel swsusp, but my managers were pretty clear > > > they want graphical progress bar hiding all the 'ugly' swsusp > > > messages... and in the end the same uswsusp enables compression, too. > > > > > > > I absolutely detest all suspend-to-disk crap. Quite frankly, I hate the > > > > whole thing. I think they've _all_ caused problems for the "true" suspend > > > > (suspend-to-ram), and the last thing I want to see is three or four > > > > > > Well, it is a bit more complex than that. > > > > > > suspend-to-disk is a workaround for > > > > > > 'suspend-to-ram eats too much power' (plus some details like > > > being able to replace battery). > > >... > > > > Why does everyone think suspend-to-disk was a laptop-only thing? > > > > My personal usage of suspend-to-disk is for turning the computer off in > > the evening and getting the complete FVWM with all programs running, > > open browser tabs,... back the next morning. > > Ok ok ok, suspend-to-disk has some other uses, too. > > But ... you are really using suspend-to-disk as a workaround for "my > desktop takes too much power when idle". Imagine pressing "lock > screensaver" combination, and your machine going to low power mode > (3W?), immediately. (Quiet, too; you can't generate much noise for > 3W). 
In the morning, you'd just press any key, machine would power up, > immediately... ok, you'd have to ifconfig eth0 down, so that spurious > packets on the local net would wake your machine, with all its fans > etc. 3W for the complete system? In CPU state S1? [1] And even 3W would still be a waste of energy. And what would be the advantage? The socket my computer is connected at is located below my bed so I can turn the power on while still lying in bed (the computer is not reachable from my bed). OK, I could create an external power button for the computer using longer cables connected to the motherboard, but I still haven't understood why this should be better for my use case than suspend-to-disk. > Pavel cu Adrian [1] unless I'm misunderstanding [2], page 9, that's the highest state my processor supports [2] http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24309.pdf -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 18:39 ` Adrian Bunk @ 2007-04-25 18:50 ` Linus Torvalds 2007-04-25 19:02 ` Hua Zhong ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 18:50 UTC (permalink / raw) To: Adrian Bunk Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 25 Apr 2007, Adrian Bunk wrote: > > 3W for the complete system? In CPU state S1? [1] In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) the devices are off, but the motherboard and memory is powered. > And even 3W would still be a waste of energy. .. but if the alternative is a feature that just isn't worth it, and likely to not only have its own bugs, but cause bugs elsewhere? (And yes, I believe STD is both of those. There's a reason it's called "STD". Go to google and type "STD" and press "I'm feeling lucky". Google is God). Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* RE: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 18:50 ` Linus Torvalds @ 2007-04-25 19:02 ` Hua Zhong 2007-04-25 19:25 ` Adrian Bunk 2007-04-25 23:33 ` Olivier Galibert 2 siblings, 0 replies; 713+ messages in thread From: Hua Zhong @ 2007-04-25 19:02 UTC (permalink / raw) To: 'Linus Torvalds', 'Adrian Bunk' Cc: 'Pavel Machek', 'Ingo Molnar', 'Nigel Cunningham', 'Christian Hesse', 'Nick Piggin', 'Mike Galbraith', linux-kernel, 'Con Kolivas', suspend2-devel, 'Andrew Morton', 'Thomas Gleixner', 'Arjan van de Ven' > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) > the devices are off, but the motherboard and memory is powered. > > > And even 3W would still be a waste of energy. > > .. but if the alternative is a feature that just isn't worth it, and > likely to not only have its own bugs, but cause bugs elsewhere? (And > yes, > I believe STD is both of those. There's a reason it's called "STD". Go > to google and type "STD" and press "I'm feeling lucky". Google is God). Linus, the fact that you personally don't use S2D does not mean it's not useful for other people. I've been using solely a laptop for six years and until recently (when my commute is now only two miles) I'd always used hibernate when I go to or leave from work. And even now if I take my laptop to somewhere farther away (like on a vacation) I need hibernation. This is one area where Windows has been doing great for many years, and it's not like Linux has not had a mature implementation for many years either. So don't you think your comments are a bit odd at this point? Hua ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 18:50 ` Linus Torvalds 2007-04-25 19:02 ` Hua Zhong @ 2007-04-25 19:25 ` Adrian Bunk 2007-04-25 19:38 ` Linus Torvalds ` (4 more replies) 2007-04-25 23:33 ` Olivier Galibert 2 siblings, 5 replies; 713+ messages in thread From: Adrian Bunk @ 2007-04-25 19:25 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: > > > On Wed, 25 Apr 2007, Adrian Bunk wrote: > > > > 3W for the complete system? In CPU state S1? [1] > > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) > the devices are off, but the motherboard and memory is powered. As far as I understand it, the CPU isn't off in S1. > > And even 3W would still be a waste of energy. > > .. but if the alternative is a feature that just isn't worth it, and > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > I believe STD is both of those. There's a reason it's called "STD". Go > to google and type "STD" and press "I'm feeling lucky". Google is God). Is there really no use case for STD? No worries if power is completely lost. Some people might boot Windows between suspending and resuming. ... > Linus cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:25 ` Adrian Bunk @ 2007-04-25 19:38 ` Linus Torvalds 2007-04-25 20:08 ` Pavel Machek ` (3 more replies) 2007-04-25 19:41 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Andrew Morton ` (3 subsequent siblings) 4 siblings, 4 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 19:38 UTC (permalink / raw) To: Adrian Bunk Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 25 Apr 2007, Adrian Bunk wrote: > > > > .. but if the alternative is a feature that just isn't worth it, and > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > I believe STD is both of those. There's a reason it's called "STD". Go > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > Is there really no use case for STD? People seem to have reading comprehension problems. The STD code is buggy, and has introduced bugs in STR too, largely thanks to bad design. Some of them have happily gotten fixed. Others did not, and now we have three totally different versions (two of which share some infrastructure), all of which are broken (ie the "suspend2" people will swear up-and-down that swsusp doesn't work for them, but anybody who thinks that "suspend2" will work for everybody is just being a total idiot, and I have a bridge to sell to them). I'd actually be happier *removing* STD support in the sense it is now: it's way too closely integrated with STR, even though it has absolutely nothing in common with it. When you STD, you're actually much closer to a *shutdown* than to STR, yet the STD code continually seems to want to be in the "suspend" path, as shown even by its name. So my objections to STD have nothing to do with saving state and shutting down. 
They have everything to do with the fact that it is not - and will never be - a "suspend", and it shouldn't affect suspend. And that's a *fundamental* problem. If the STD people cannot even realize that they have less to do with "suspend" than to "reboot", how do you ever expect them to get anything to work, and not affect other things negatively? Yeah, I'm down on it. I'm down on it because every person involved with the whole STD thing seems to have basically zero taste, and a total inability to work with anybody else. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:38 ` Linus Torvalds @ 2007-04-25 20:08 ` Pavel Machek 2007-04-25 20:33 ` Rafael J. Wysocki 2007-04-25 22:36 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Manu Abraham 2007-04-25 20:20 ` Rafael J. Wysocki ` (2 subsequent siblings) 3 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 20:08 UTC (permalink / raw) To: Linus Torvalds Cc: Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > > .. but if the alternative is a feature that just isn't worth it, and > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > > I believe STD is both of those. There's a reason it's called "STD". Go > > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > > > Is there really no use case for STD? > > People seem to have reading comprehension problems. > > The STD code is buggy, and has introduced bugs in STR too, largely thanks > to bad design. Some of them have happily gotten fixed. Others did not, and > now we have three totally different versions (two of which share some > infrastructure), all of which are broken (ie the "suspend2" people will > swear up-and-down that swsusp doesn't work for them, but anybody who > thinks that "suspend2" will work for everybody is just being a total > idiot, and I have a bridge to sell to them). Well, lets get some credit to STD... it worked before STR, and it allowed debugging basic driver infrastructure. > So my objections to STD have nothing to do with saving state and shutting > down. They have everything to do with the fact that it is not - and will > never be - a "suspend", and it shouldn't affect suspend. 
STD needs to snapshot system, and then it needs devices to be suspended so that snapshot is consistent. > And that's a *fundamental* problem. If the STD people cannot even realize > that they have less to do with "suspend" than to "reboot", how do you ever > expect them to get anything to work, and not affect other things > negatively? STD worked first ;-). Yes, these days it has little to do with "suspend", it was mostly separated to "snapshot" and "restore". We still keep swsusp in kernel for compatibility (and because it makes debugging very easy). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:08 ` Pavel Machek @ 2007-04-25 20:33 ` Rafael J. Wysocki 2007-04-25 20:31 ` Pavel Machek 2007-04-25 22:36 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Manu Abraham 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-25 20:33 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wednesday, 25 April 2007 22:08, Pavel Machek wrote: > Hi! > > > > > .. but if the alternative is a feature that just isn't worth it, and > > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > > > I believe STD is both of those. There's a reason it's called "STD". Go > > > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > > > > > Is there really no use case for STD? [--snip--] > > So my objections to STD have nothing to do with saving state and shutting > > down. They have everything to do with the fact that it is not - and will > > never be - a "suspend", and it shouldn't affect suspend. > > STD needs to snapshot system, and then it needs devices to be > suspended so that snapshot is consistent. Not suspended. _Frozen_. In fact don't want any DMA transfers or interrupts to take place when we're creating the image. That's all and that's what we're doing (or rather, trying to do). So, the "suspend" and "resume" for the functions being called for that are wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even be that difficult ... Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
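A minimal sketch of what "calling suspend with a freeze message" looks like from a driver's side. This is a userspace model with made-up names (`sketch_dev`, `sketch_event`, `sketch_suspend`); it does not reproduce the kernel's callback signature or its PMSG_FREEZE definition. It only illustrates the distinction drawn above: for image creation the device is quiesced (no DMA, no interrupts) but stays powered, whereas a real suspend also enters a low-power state.

```c
/* Userspace model only; all names are hypothetical stand-ins. */
enum sketch_event {
	SKETCH_EVENT_SUSPEND,	/* real suspend: quiesce and power down */
	SKETCH_EVENT_FREEZE	/* hibernation image: quiesce only */
};

struct sketch_dev {
	int dma_active;
	int irq_enabled;
	int low_power;
};

/* One "suspend" entry point serving both purposes, selected by ev. */
int sketch_suspend(struct sketch_dev *dev, enum sketch_event ev)
{
	/* Both cases need the device quiet so that memory stops changing. */
	dev->dma_active = 0;
	dev->irq_enabled = 0;

	if (ev == SKETCH_EVENT_FREEZE)
		return 0;	/* stay powered: the image still has to be written */

	dev->low_power = 1;
	return 0;
}
```

The awkward part is visible in the name: the function is called "suspend" even when nothing is suspended, which is the naming problem separate .freeze()/.thaw() callbacks would remove.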
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:33 ` Rafael J. Wysocki @ 2007-04-25 20:31 ` Pavel Machek 2007-04-27 10:21 ` driver power operations (was Re: suspend2 merge) Johannes Berg 2007-04-27 10:21 ` Johannes Berg 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 20:31 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > > > > .. but if the alternative is a feature that just isn't worth it, and > > > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > > > > I believe STD is both of those. There's a reason it's called "STD". Go > > > > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > > > > > > > Is there really no use case for STD? > > [--snip--] > > > So my objections to STD have nothing to do with saving state and shutting > > > down. They have everything to do with the fact that it is not - and will > > > never be - a "suspend", and it shouldn't affect suspend. > > > > STD needs to snapshot system, and then it needs devices to be > > suspended so that snapshot is consistent. > > Not suspended. _Frozen_. In fact don't want any DMA transfers or interrupts > to take place when we're creating the image. That's all and that's what we're > doing (or rather, trying to do). Yep, _frozen_. That's the right word. > So, the "suspend" and "resume" for the functions being called for that are > wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add > .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even > be that difficult ... It would be ugly big patch I'm afraid. 
-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* driver power operations (was Re: suspend2 merge) 2007-04-25 20:31 ` Pavel Machek @ 2007-04-27 10:21 ` Johannes Berg 2007-04-27 10:21 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 10:21 UTC (permalink / raw) To: Pavel Machek Cc: Nick Piggin, Andrew Morton, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven

[-- Attachment #1.1: Type: text/plain, Size: 2996 bytes --]

On Wed, 2007-04-25 at 22:31 +0200, Pavel Machek wrote:

> > So, the "suspend" and "resume" for the functions being called for that are
> > wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add
> > .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even
> > be that difficult ...
>
> It would be ugly big patch I'm afraid.

It'd be a lot of code churn, but well worth it. And most of the changes would be trivial too. You need to start looking beyond "this is ugly in the short term" and realise that it's much more maintainable in the long term if driver writers know what they're supposed to do as opposed to just hacking at it until it mostly works or just doing a full device down/up cycle including resetting full driver state.

Look at it now:

 * FREEZE   Quiesce operations so that a consistent image can be saved;
 *          but do NOT otherwise enter a low power device state, and do
 *          NOT emit system wakeup events.
 *
 * PRETHAW  Quiesce as if for FREEZE; additionally, prepare for restoring
 *          the system from a snapshot taken after an earlier FREEZE.
 *          Some drivers will need to reset their hardware state instead
 *          of preserving it, to ensure that it's never mistaken for the
 *          state which that earlier snapshot had set up.

Why is prethaw even necessary? As far as I can tell it's only necessary because resume() can't tell you whether you just want to thaw or need to reset since it doesn't tell you at what point it's invoked.

Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a better name?) that are called at the appropriate places (with freeze/thaw around preparing the image and freeze/restart around restoring) would go a long way towards clearing up the confusion in all the drivers. Of course, it'd have to be documented that freeze/thaw isn't the only valid combination but that freeze/restart is used too, but that's not hard to do nor hard to understand.

And, incidentally, it could possibly make both suspend and hibernate work much faster too. The comments there talk about "minimally power management aware" drivers which always do the wrong thing for suspend, in that they always reset everything... Of course, some drivers will actually need to do that, but if freeze/suspend and thaw/restart/resume have the same prototypes (probably just int <function>(void)) then drivers can trivially assign the same there.

And hibernate would benefit since a lot of drivers could do a lot less work for freeze/thaw.

Or, if we don't want to have five calls and use 40 bytes (on 64-bit) just for these callback pointers for each device we could just as well have a single callback ->pm(what) and make "what" indicate which one of these five things... But then drivers can't make that code depend on the swsusp configuration which would be doable with five callbacks.

johannes

[-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
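The two shapes discussed above (five callback pointers versus a single ->pm(what) entry point) and the two hibernation sequences (freeze/thaw around creating the image, freeze/restart around restoring it) can be sketched roughly as follows. This is a userspace model under stated assumptions: every identifier (`sketch_pm_ops`, `sketch_pm`, `demo_*`, and so on) is made up for illustration, and the callbacks take a device pointer rather than the `void` prototype floated in the mail. It is not the kernel's driver API.

```c
/* Userspace model only; all names are hypothetical stand-ins. */
struct sketch_dev {
	int running;	/* 1 while the device is processing requests */
	int resets;	/* how often hardware state was reinitialised */
};

/* Variant 1: five separate callbacks, one pointer each. */
struct sketch_pm_ops {
	int (*suspend)(struct sketch_dev *dev);
	int (*resume)(struct sketch_dev *dev);
	int (*freeze)(struct sketch_dev *dev);
	int (*thaw)(struct sketch_dev *dev);
	int (*restart)(struct sketch_dev *dev);
};

/* Variant 2: a single entry point plus a selector. */
enum sketch_pm_what {
	SKETCH_SUSPEND, SKETCH_RESUME, SKETCH_FREEZE, SKETCH_THAW, SKETCH_RESTART
};

static int sketch_call(int (*fn)(struct sketch_dev *), struct sketch_dev *dev)
{
	return fn ? fn(dev) : 0;	/* an unset callback is treated as a no-op */
}

/* Shows that variant 2 can be layered on top of variant 1. */
int sketch_pm(const struct sketch_pm_ops *ops, struct sketch_dev *dev,
	      enum sketch_pm_what what)
{
	switch (what) {
	case SKETCH_SUSPEND:	return sketch_call(ops->suspend, dev);
	case SKETCH_RESUME:	return sketch_call(ops->resume, dev);
	case SKETCH_FREEZE:	return sketch_call(ops->freeze, dev);
	case SKETCH_THAW:	return sketch_call(ops->thaw, dev);
	case SKETCH_RESTART:	return sketch_call(ops->restart, dev);
	}
	return -1;
}

/* Creating the image: freeze, take the snapshot, then thaw and carry on. */
int sketch_create_image(const struct sketch_pm_ops *ops, struct sketch_dev *dev)
{
	int err = sketch_pm(ops, dev, SKETCH_FREEZE);

	if (err)
		return err;
	/* ... the memory snapshot would be taken here ... */
	return sketch_pm(ops, dev, SKETCH_THAW);
}

/* Restoring: freeze, load the image, then restart (not thaw). */
int sketch_restore_image(const struct sketch_pm_ops *ops, struct sketch_dev *dev)
{
	int err = sketch_pm(ops, dev, SKETCH_FREEZE);

	if (err)
		return err;
	/* ... the saved image would be loaded here ... */
	return sketch_pm(ops, dev, SKETCH_RESTART);
}

/* Example driver: thaw just continues, restart reinitialises hardware. */
int demo_freeze(struct sketch_dev *dev)  { dev->running = 0; return 0; }
int demo_thaw(struct sketch_dev *dev)    { dev->running = 1; return 0; }
int demo_restart(struct sketch_dev *dev) { dev->resets++; dev->running = 1; return 0; }

const struct sketch_pm_ops demo_ops = {
	.freeze = demo_freeze,
	.thaw = demo_thaw,
	.restart = demo_restart,
};
```

With separate thaw and restart entry points the driver no longer has to guess at which point it is being invoked, which is the job PRETHAW does today. The five-pointer struct is the "40 bytes (on 64-bit)" variant; the selector variant trades that space for a switch inside each driver.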
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 10:21 ` Johannes Berg @ 2007-04-27 12:06 ` Rafael J. Wysocki 2007-04-27 12:40 ` Pavel Machek 2007-04-27 12:40 ` Pavel Machek 2007-04-27 12:06 ` Rafael J. Wysocki ` (5 subsequent siblings) 6 siblings, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 12:06 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel, Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven, linux-pm On Friday, 27 April 2007 12:21, Johannes Berg wrote: > On Wed, 2007-04-25 at 22:31 +0200, Pavel Machek wrote: > > > > So, the "suspend" and "resume" for the functions being called for that are > > > wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add > > > .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even > > > be that difficult ... > > > > It would be ugly big patch I'm afraid. > > It'd be a lot of code churn, but well worth it. And most of the changes > would be trivial too. You need to start looking beyond "this is ugly in > the short term" and realise that it's much more maintainable in the long > term if driver writers know what they're supposed to do as opposed to > just hacking at it until it mostly works or just doing a full device > down/up cycle including resetting full driver state. > > Look at it now: > > * FREEZE Quiesce operations so that a consistent image can be saved; > * but do NOT otherwise enter a low power device state, and do > * NOT emit system wakeup events. > * > * PRETHAW Quiesce as if for FREEZE; additionally, prepare for restoring > * the system from a snapshot taken after an earlier FREEZE. > * Some drivers will need to reset their hardware state instead > * of preserving it, to ensure that it's never mistaken for the > * state which that earlier snapshot had set up. > > Why is prethaw even necessary? 
As far as I can tell it's only necessary > because resume() can't tell you whether you just want to thaw or need to > reset since it doesn't tell you at what point it's invoked. > > Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a > better name?) that are called at the appropriate places (with > freeze/thaw around preparing the image and freeze/restart around > restoring would go a long way of clearing up the confusion in all the > drivers. Of course, it'd have to be documented that freeze/thaw isn't > the only valid combination but that freeze/restart is used too, but > that's not hard to do nor hard to understand. > > And, incidentally, it could possibly make both suspend and hibernate > work much faster too. The comments there talk about "minimally power > management aware" drivers which always do the wrong thing for suspend, > in that they always reset everything... Of course, some drivers will > actually need to do that, but if freeze/suspend and thaw/restart/resume > have the same prototypes (probably just int <function>(void)) then > drivers can trivially assign the same there. > And hibernate would benefit since a lot of drivers could do a lot less > work for freeze/thaw. I violently agree with all of the above. Moreover, for the hibernation we have two special cases that are of no interest for the suspend: 1) drivers compiled as modules and not loaded before we restore the image 2) drivers that need to allocate much memory in .freeze() > Or, if we don't want to have five calls and use 40 bytes (on 64-bit) > just for these callback pointers for each device we could just as well > have a single callback ->pm(what) and make "what" indicate which one of > these five things... But then drivers can't make that code depend on the > swsusp configuration which would be doable with five callbacks. Five callbacks are fine by me, especially if we can define reasonable defaults for the hibernation (and can we?). 
Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 12:06 ` Rafael J. Wysocki @ 2007-04-27 12:40 ` Pavel Machek 2007-04-27 12:40 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-27 12:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nick Piggin, Ingo Molnar, Andrew Morton, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm, Johannes Berg, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Hi! > > And, incidentally, it could possibly make both suspend and hibernate > > work much faster too. The comments there talk about "minimally power > > management aware" drivers which always do the wrong thing for suspend, > > in that they always reset everything... Of course, some drivers will > > actually need to do that, but if freeze/suspend and thaw/restart/resume > > have the same prototypes (probably just int <function>(void)) then > > drivers can trivially assign the same there. > > And hibernate would benefit since a lot of drivers could do a lot less > > work for freeze/thaw. > > I violently agree with all of the above. > > Moreover, for the hibernation we have two special cases that are of no interest > for the suspend: > 1) drivers compiled as modules and not loaded before we restore the image > 2) drivers that need to allocate much memory in .freeze() > > > Or, if we don't want to have five calls and use 40 bytes (on 64-bit) > > just for these callback pointers for each device we could just as well > > have a single callback ->pm(what) and make "what" indicate which one of > > these five things... But then drivers can't make that code depend on the > > swsusp configuration which would be doable with five callbacks. > > Five callbacks are fine by me, especially if we can define reasonable defaults > for the hibernation (and can we?). Well, we still can default to suspend(PMSG_FREEZE) for freeze(), and resume() for thaw(). Anything else is just not sane way forward. 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
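The fallback suggested above (use suspend with a freeze message when a driver has no freeze(), and resume() when it has no thaw()) might look roughly like this on the core side. Again a userspace model: `sketch_driver`, `sketch_do_freeze`, `sketch_do_thaw` and the `legacy_*` driver are hypothetical names, not existing kernel functions, and the message enum is a local stand-in for the real PMSG_FREEZE.

```c
/* Userspace model only; all names are hypothetical stand-ins. */
enum sketch_msg { SKETCH_MSG_SUSPEND, SKETCH_MSG_FREEZE };

struct sketch_dev {
	int quiesced;
	int low_power;
};

struct sketch_driver {
	/* existing-style entry points */
	int (*suspend)(struct sketch_dev *dev, enum sketch_msg msg);
	int (*resume)(struct sketch_dev *dev);
	/* optional new entry points; unconverted drivers leave them unset */
	int (*freeze)(struct sketch_dev *dev);
	int (*thaw)(struct sketch_dev *dev);
};

/* Prefer the dedicated callback, fall back to suspend(FREEZE). */
int sketch_do_freeze(const struct sketch_driver *drv, struct sketch_dev *dev)
{
	if (drv->freeze)
		return drv->freeze(dev);
	if (drv->suspend)
		return drv->suspend(dev, SKETCH_MSG_FREEZE);
	return 0;
}

/* Prefer the dedicated callback, fall back to resume(). */
int sketch_do_thaw(const struct sketch_driver *drv, struct sketch_dev *dev)
{
	if (drv->thaw)
		return drv->thaw(dev);
	if (drv->resume)
		return drv->resume(dev);
	return 0;
}

/* An unconverted driver that only knows suspend/resume. */
int legacy_suspend(struct sketch_dev *dev, enum sketch_msg msg)
{
	dev->quiesced = 1;
	if (msg != SKETCH_MSG_FREEZE)
		dev->low_power = 1;
	return 0;
}

int legacy_resume(struct sketch_dev *dev)
{
	dev->quiesced = 0;
	dev->low_power = 0;
	return 0;
}

const struct sketch_driver legacy_driver = {
	.suspend = legacy_suspend,
	.resume = legacy_resume,
};
```

The point of such a default is that drivers could be converted one at a time: an unconverted driver keeps working through the old path while converted ones get the cheaper dedicated callbacks.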
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 12:40 ` Pavel Machek @ 2007-04-27 12:46 ` Johannes Berg 2007-04-27 12:50 ` Pavel Machek 2007-04-27 12:46 ` Johannes Berg 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-27 12:46 UTC (permalink / raw) To: Pavel Machek Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel, Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven, linux-pm [-- Attachment #1: Type: text/plain, Size: 754 bytes --] On Fri, 2007-04-27 at 14:40 +0200, Pavel Machek wrote: > > Five callbacks are fine by me, especially if we can define reasonable defaults > > for the hibernation (and can we?). > > Well, we still can default to suspend(PMSG_FREEZE) for freeze(), and > resume() for thaw(). Anything else is just not sane way forward. I think we should remove the argument to suspend() in the same patch series. Yes, that would mean porting all drivers that currently use it, but that's not actually all that many since most drivers are dumbed-down wrt. power management. And realistically, resume for thaw makes no sense, nor does suspend for freeze, so we probably want to change those over to suspend/restart and use them. or something. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 12:46 ` Johannes Berg @ 2007-04-27 12:50 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-27 12:50 UTC (permalink / raw) To: Johannes Berg Cc: Rafael J. Wysocki, Nick Piggin, Mike Galbraith, linux-kernel, Adrian Bunk, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven, linux-pm Hi! > > > Five callbacks are fine by me, especially if we can define reasonable defaults > > > for the hibernation (and can we?). > > > > Well, we still can default to suspend(PMSG_FREEZE) for freeze(), and > > resume() for thaw(). Anything else is just not sane way forward. > > I think we should remove the argument to suspend() in the same patch > series. Yes, that would mean porting all drivers that currently use > it, Well, if you can do it in one patch series, go ahead. But I think such massive change all over kernel will take slightly longer, so I'd prefer to keep the dummy argument for a while (so it still compiles) and fix it slowly. Of course, it is up to the person doing the series. And yes, such series would be welcome. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 10:21 ` Johannes Berg 2007-04-27 12:06 ` Rafael J. Wysocki @ 2007-04-27 12:06 ` Rafael J. Wysocki 2007-04-27 14:34 ` Alan Stern ` (4 subsequent siblings) 6 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 12:06 UTC (permalink / raw) To: Johannes Berg Cc: Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, Andrew Morton, Pavel Machek, linux-pm, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven On Friday, 27 April 2007 12:21, Johannes Berg wrote: > On Wed, 2007-04-25 at 22:31 +0200, Pavel Machek wrote: > > > > So, the "suspend" and "resume" for the functions being called for that are > > > wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add > > > .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even > > > be that difficult ... > > > > It would be ugly big patch I'm afraid. > > It'd be a lot of code churn, but well worth it. And most of the changes > would be trivial too. You need to start looking beyond "this is ugly in > the short term" and realise that it's much more maintainable in the long > term if driver writers know what they're supposed to do as opposed to > just hacking at it until it mostly works or just doing a full device > down/up cycle including resetting full driver state. > > Look at it now: > > * FREEZE Quiesce operations so that a consistent image can be saved; > * but do NOT otherwise enter a low power device state, and do > * NOT emit system wakeup events. > * > * PRETHAW Quiesce as if for FREEZE; additionally, prepare for restoring > * the system from a snapshot taken after an earlier FREEZE. > * Some drivers will need to reset their hardware state instead > * of preserving it, to ensure that it's never mistaken for the > * state which that earlier snapshot had set up. > > Why is prethaw even necessary? 
As far as I can tell it's only necessary > because resume() can't tell you whether you just want to thaw or need to > reset since it doesn't tell you at what point it's invoked. > > Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a > better name?) that are called at the appropriate places (with > freeze/thaw around preparing the image and freeze/restart around > restoring would go a long way of clearing up the confusion in all the > drivers. Of course, it'd have to be documented that freeze/thaw isn't > the only valid combination but that freeze/restart is used too, but > that's not hard to do nor hard to understand. > > And, incidentally, it could possibly make both suspend and hibernate > work much faster too. The comments there talk about "minimally power > management aware" drivers which always do the wrong thing for suspend, > in that they always reset everything... Of course, some drivers will > actually need to do that, but if freeze/suspend and thaw/restart/resume > have the same prototypes (probably just int <function>(void)) then > drivers can trivially assign the same there. > And hibernate would benefit since a lot of drivers could do a lot less > work for freeze/thaw. I violently agree with all of the above. Moreover, for the hibernation we have two special cases that are of no interest for the suspend: 1) drivers compiled as modules and not loaded before we restore the image 2) drivers that need to allocate much memory in .freeze() > Or, if we don't want to have five calls and use 40 bytes (on 64-bit) > just for these callback pointers for each device we could just as well > have a single callback ->pm(what) and make "what" indicate which one of > these five things... But then drivers can't make that code depend on the > swsusp configuration which would be doable with five callbacks. Five callbacks are fine by me, especially if we can define reasonable defaults for the hibernation (and can we?). 
Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 10:21 ` Johannes Berg @ 2007-04-27 14:34 ` Alan Stern 2007-04-27 12:06 ` Rafael J. Wysocki ` (5 subsequent siblings) 6 siblings, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-04-27 14:34 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Nick Piggin, Andrew Morton, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven

On Fri, 27 Apr 2007, Johannes Berg wrote:

> Look at it now:
>
>  * FREEZE   Quiesce operations so that a consistent image can be saved;
>  *          but do NOT otherwise enter a low power device state, and do
>  *          NOT emit system wakeup events.
>  *
>  * PRETHAW  Quiesce as if for FREEZE; additionally, prepare for restoring
>  *          the system from a snapshot taken after an earlier FREEZE.
>  *          Some drivers will need to reset their hardware state instead
>  *          of preserving it, to ensure that it's never mistaken for the
>  *          state which that earlier snapshot had set up.
>
> Why is prethaw even necessary? As far as I can tell it's only necessary
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.

I think you're wrong here. It's a little hard to say because the
terminology is confusing and not yet standardized.

For the sake of argument, let's call the stages of STD and STR by these
names (also noted are the current PMSG values):

	Suspend to disk:
		"prepare to create snapshot" (= FREEZE)
		"continue after snapshot" (= RESUME)

	Resume from disk:
		"prepare to restore snapshot" (= PRETHAW)
		"continue after restore" (= RESUME)

	Suspend to RAM:
		"suspend" (= SUSPEND)
		"resume" (= RESUME)

The real reason for adding PRETHAW was that drivers couldn't distinguish
between "continue after restore" and "resume", other than by examining the
device's state -- since the PM core doesn't pass any information to the
resume() method.

I suppose we could have modified the "prepare to create snapshot" code
instead, but doing so would mean that "continue after snapshot" and
"continue after restore" would always do the same thing, which is not
necessarily a good idea.

Anyway, based on this analysis it seems reasonable to have six (6) method
pointers. Suggested names (in the same order as above):

	pre_snapshot()
	post_snapshot()
	pre_restore()
	post_restore()
	suspend()
	resume()

People apparently assume that pre_snapshot() and pre_restore() would
always do the same thing and hence be redundant. I'm not so sure; time
will tell. Doing it this way certainly is more clear.

Then there's the question of having early_ and late_ versions of some of
these things (i.e., one called with interrupts enabled, the other with
interrupts disabled). I don't know to what extent that would be
necessary; perhaps each method call should occur in two phases with
the interrupt-enable status changed in between. Then the interrupt-enable
setting could be passed as an argument.

Alan Stern

^ permalink raw reply [flat|nested] 713+ messages in thread
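As a rough illustration of the six-pointer layout Alan suggests, a driver's table might look like the sketch below. Only the six callback names come from the message; `struct sketch_dev`, `struct sketch_six_ops`, the `demo_*` functions and the `sketch_call()` helper are hypothetical, and the choice to make post_restore() rebuild state is an assumption made purely to show that it need not match resume().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state, just enough to observe the callbacks. */
struct sketch_dev {
	int quiesced;	/* 1 while the (imaginary) hardware is stopped */
	int resets;	/* how many times state was rebuilt from scratch */
};

/* Hypothetical ops table using the six names suggested in the message. */
struct sketch_six_ops {
	int (*pre_snapshot)(struct sketch_dev *dev);
	int (*post_snapshot)(struct sketch_dev *dev);
	int (*pre_restore)(struct sketch_dev *dev);
	int (*post_restore)(struct sketch_dev *dev);
	int (*suspend)(struct sketch_dev *dev);
	int (*resume)(struct sketch_dev *dev);
};

static int demo_quiesce(struct sketch_dev *dev)
{
	dev->quiesced = 1;
	return 0;
}

static int demo_continue(struct sketch_dev *dev)
{
	dev->quiesced = 0;
	return 0;
}

/* Assumption: after a restore this device re-initialises instead of continuing. */
static int demo_reinit(struct sketch_dev *dev)
{
	dev->resets++;
	dev->quiesced = 0;
	return 0;
}

/* A driver that does not care can assign one function to several slots. */
const struct sketch_six_ops demo_ops = {
	.pre_snapshot  = demo_quiesce,
	.post_snapshot = demo_continue,
	.pre_restore   = demo_quiesce,	/* same as pre_snapshot here */
	.post_restore  = demo_reinit,	/* distinct from resume */
	.suspend       = demo_quiesce,
	.resume        = demo_continue,
};

struct sketch_dev demo_dev;

/* Core-side helper: a missing callback counts as success. */
int sketch_call(int (*cb)(struct sketch_dev *), struct sketch_dev *dev)
{
	assert(dev != NULL);
	return cb ? cb(dev) : 0;
}
```

Because "continue after restore" has its own slot, the driver no longer has to inspect device state to tell it apart from an ordinary resume.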
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 14:34 ` Alan Stern (?) @ 2007-04-27 14:39 ` Johannes Berg 2007-04-27 14:49 ` Johannes Berg -1 siblings, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-27 14:39 UTC (permalink / raw) To: Alan Stern Cc: Pavel Machek, Nick Piggin, Andrew Morton, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2168 bytes --]

On Fri, 2007-04-27 at 10:34 -0400, Alan Stern wrote:

> For the sake of argument, let's call the stages of STD and STR by these
> names (also noted are the current PMSG values):
>
>	Suspend to disk:
>		"prepare to create snapshot" (= FREEZE)
>		"continue after snapshot" (= RESUME)
>
>	Resume from disk:
>		"prepare to restore snapshot" (= PRETHAW)
>		"continue after restore" (= RESUME)
>
>	Suspend to RAM:
>		"suspend" (= SUSPEND)
>		"resume" (= RESUME)
>
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.

That's pretty much what I said about prethaw though, no? Anyway,

> Anyway, based on this analysis it seems reasonable to have Six (6) method
> pointers. Suggested names (in the same order as above):
>
>	pre_snapshot()
>	post_snapshot()
>	pre_restore()
>	post_restore()
>	suspend()
>	resume()
>
> People apparently assume that pre_snapshot() and pre_restore() would
> always do the same thing and hence be redundant. I'm not so sure; time
> will tell. Doing it this way certainly is more clear.

Right. I did assume that pre_snapshot and pre_restore would be
effectively the same since they both have to quiesce the device and
assume not much more. I'm not averse to making it explicit, many drivers
that don't care can just assign the same function.

> Then there's the question of having early_ and late_ versions of some of
> these things (i.e., one called with interrupts enabled, the other with
> interrupts disabled). I don't know to what extent that would be
> necessary; perhaps each method call should occur in two phases with
> the interrupt-enable status changed in between. Then the interrupt-enable
> setting could be passed as an argument.

Good point. Though if we go for passing the interrupt-enable setting as
an argument then many drivers will have the same
"if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
strictly necessary.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 14:39 ` [linux-pm] " Johannes Berg @ 2007-04-27 14:49 ` Johannes Berg 0 siblings, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 14:49 UTC (permalink / raw) To: Alan Stern Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, linux-pm, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 439 bytes --] On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote: > Good point. Though if we go for passing the interrupt-enable setting as > an argument then many drivers will have the same > "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even > strictly necessary. Eh, the point I actually wanted to make is that many drivers don't care for the irqs disabled case and would have to add code to exclude it. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 14:49 ` Johannes Berg (?) @ 2007-04-27 15:20 ` Rafael J. Wysocki 2007-04-27 15:27 ` Johannes Berg ` (2 more replies) -1 siblings, 3 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 15:20 UTC (permalink / raw) To: Johannes Berg Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, linux-pm, Arjan van de Ven

On Friday, 27 April 2007 16:49, Johannes Berg wrote:
> On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:
>
> > Good point. Though if we go for passing the interrupt-enable setting as
> > an argument then many drivers will have the same
> > "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> > strictly necessary.
>
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

I think we can use 'stages' and pass them as arguments to the functions.

In that case we can have two callbacks for the hibernation (I'd prefer to say
'hibernation' instead of 'suspend to disk' from now on), one 'quiesce' callback
and one 'activate' callback that can be called many times in one
snapshot/restore cycle with different arguments, for example:

quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
                    before quiescing devices (if any)
...
quiesce(PRE_SNAPSHOT)
...
quiesce(PRE_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT)
...
activate(FINISH)

etc.

Greetings,
Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 15:20 ` [linux-pm] " Rafael J. Wysocki @ 2007-04-27 15:27 ` Johannes Berg 2007-04-27 15:27 ` Johannes Berg 2007-04-27 15:52 ` Linus Torvalds 2 siblings, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 15:27 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, linux-pm, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 942 bytes --] On Fri, 2007-04-27 at 17:20 +0200, Rafael J. Wysocki wrote: > I think we can use 'stages' and pass them as arguments to the functions. > > In that case we can have two callbacks for the hibernation (I'd prefer to say > 'hibernation' instead of 'suspend to disk' from now on), one 'quiesce' callback > and one 'activate' callback that can be called many times in one > snapshot/restore cycle with different arguments, for example: But you're not proposing to add suspend/resume to this interface too, I hope :) > quiesce(PREPARE) -- that may be needed for drivers that allocate much memory > before quiescing devices (if any) > ... > quiesce(PRE_SNAPSHOT) > ... > quiesce(PRE_SNAPSHOT_IRQ_OFF) > ... > activate(POST_SNAPSHOT_IRQ_OFF) > ... > activate(POST_SNAPSHOT) > ... > activate(FINISH) I'm still not sure I like having to switch on the argument for every implementation. Is it really worth it? johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 15:20 ` [linux-pm] " Rafael J. Wysocki @ 2007-04-27 15:52 ` Linus Torvalds 2007-04-27 15:27 ` Johannes Berg 2007-04-27 15:52 ` Linus Torvalds 2 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-27 15:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Johannes Berg, Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, linux-pm, Arjan van de Ven

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> I think we can use 'stages' and pass them as arguments to the functions.

No, no NOOOO!

If you use stages, just describe them in the function name instead.

> quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
>                     before quiescing devices (if any)
> ...
> quiesce(PRE_SNAPSHOT)
> ...
> quiesce(PRE_SNAPSHOT_IRQ_OFF)

There is *no* advantage to this (and _lots_ of disadvantages) compared to
saying

	dev->snapshot_prepare(dev);
	dev->snapshot_freeze(dev);
	dev->snapshot(dev)

The latter is
 - more readable
 - MUCH easier for programmers to write readable code for (if-statements
   and case-statements are *by*definition* more complicated to parse both
   for humans and for CPU's - static information is good)
 - allows for the different stages to have different arguments, and
   somewhat related to that, to have better static C type checking.

Look here, which one is more readable:

	int some_mixed_function(int arg)
	{
		do_one_thing();
		if (arg == SLEEP)
			do_another_thing();
		else
			do_yet_another_thing();
	}

or

	int do_sleep(void)
	{
		do_one_thing();
		do_another_thing();
	}

	int prepare_to_sleep(void)
	{
		do_one_thing();
		do_yet_another_thing();
	}

and quite frankly, while the second case may take more lines of code,
anybody who says that it's not clearer what it does (because it can
"self-document" with function names etc) is either lying, or just a really
bad programmer. The second case is also likely faster and probably not
larger code-size-wise either, since it does static decisions _statically_
(since all callers are realistically going to use a constant argument
anyway, and the argument really is static).

Finally, the second case is *much* easier to fix, exactly because it
doesn't mix up the cases. You can change the arguments, you can have
totally different locking, you don't need things like

	int gfp = (arg == SLEEP) ? GFP_ATOMIC : GFP_KERNEL;

etc, and it's just more logical.

So don't overload a function. That's the *bug* with the current
"dev->suspend()" interface already. Don't re-create it. The current one
overloads two *totally*different* operations onto one function.

Just don't do it. Not in the suspend part, not *ever*.

		Linus

^ permalink raw reply [flat|nested] 713+ messages in thread
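For readers who want to compile the comparison, the two variants from the message can be filled out as below. The function names `some_mixed_function()`, `do_sleep()`, `prepare_to_sleep()` and the three `do_*_thing()` helpers are from the message; everything else (the `sketch_mode` enum, the trace buffer and its helpers, the `return 0` statements) is a hypothetical scaffold added only so that the equivalence of the two styles can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical replacement for the SLEEP constant used in the message. */
enum sketch_mode { SKETCH_SLEEP, SKETCH_PREPARE };

/* Hypothetical trace buffer recording which steps ran, in order. */
char sketch_trace[8];
size_t sketch_trace_len;

void sketch_trace_reset(void)
{
	sketch_trace_len = 0;
	sketch_trace[0] = '\0';
}

int sketch_trace_is(const char *expected)
{
	return strcmp(sketch_trace, expected) == 0;
}

static void trace_add(char c)
{
	assert(sketch_trace_len + 1 < sizeof(sketch_trace));
	sketch_trace[sketch_trace_len++] = c;
	sketch_trace[sketch_trace_len] = '\0';
}

/* Placeholder bodies: each step just records one letter. */
static void do_one_thing(void)         { trace_add('1'); }
static void do_another_thing(void)     { trace_add('A'); }
static void do_yet_another_thing(void) { trace_add('Y'); }

/* Overloaded style: one entry point, behaviour selected by an argument. */
int some_mixed_function(enum sketch_mode arg)
{
	do_one_thing();
	if (arg == SKETCH_SLEEP)
		do_another_thing();
	else
		do_yet_another_thing();
	return 0;
}

/* Split style: the decision is encoded in the function name. */
int do_sleep(void)
{
	do_one_thing();
	do_another_thing();
	return 0;
}

int prepare_to_sleep(void)
{
	do_one_thing();
	do_yet_another_thing();
	return 0;
}
```

Both styles execute the same steps; the split version simply moves the choice to the call site, where each function is then free to grow its own arguments or locking.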
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 15:52 ` Linus Torvalds (?) @ 2007-04-27 18:34 ` Rafael J. Wysocki -1 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 18:34 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Thomas Gleixner, Pavel Machek, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton, suspend2-devel, linux-pm, Johannes Berg, Ingo Molnar, Arjan van de Ven On Friday, 27 April 2007 17:52, Linus Torvalds wrote: > > On Fri, 27 Apr 2007, Rafael J. Wysocki wrote: > > > > I think we can use 'stages' and pass them as arguments to the functions. > > No, no NOOOO! > > If you use stages, just describe them in the function name instead. > > > quiesce(PREPARE) -- that may be needed for drivers that allocate much memory > > before quiescing devices (if any) > > ... > > quiesce(PRE_SNAPSHOT) > > ... > > quiesce(PRE_SNAPSHOT_IRQ_OFF) > > There is *no* advantage to this (and _lots_ of disadvantages) compared to > saying > > dev->snapshot_prepare(dev); > dev->snapshot_freeze(dev); > dev->snapshot(dev) > > The latter is > - more readable > - MUCH easier for programmers to write readable code for (if-statements > and case-statements are *by*definition* more complicated to parse both > for humans and for CPU's - static information is good) > - allows for the different stages to have different arguments, and > somewhat related to that, to have better static C type checking. 
> > Look here, which one is more readable: > > int some_mixed_function(int arg) > { > do_one_thing(); > if (arg == SLEEP) > do_another_thing(); > else > do_yet_another_thing(); > } > > or > > int do_sleep(void) > { > do_one_thing(); > do_another_thing(); > } > > int prepare_to_sleep(void) > { > do_one_thing(); > do_yet_another_thing(); > } > > and quite frankly, while the second case may take more lines of code, > anybody who says that it's not clearer what it does (because it can > "self-document" with function names etc) is either lying, or just a really > bad programmer. The second case is also likely faster and probably not > larger code-size-wise either, since it does static decisions _statically_ > (since all callers are realistically going to use a constant argument > anyway, and the argument really is static). > > Finally, the second case is *much* easier to fix, exactly because it > doesn't mix up the cases. You can change the arguments, you can have > totally different locking, you don't need things like > > int gfp = (arg == SLEEP) ? GFP_ATOMIC : GFP_KERNEL; > > etc, and it's just more logical. > > So don't overload a function. That's the *bug* with the current > "dev->suspend()" interface already. Don't re-create it. The current one > overloads two *totally*different* operations onto one function. > > Just don't do it. Not in the suspend part, not *ever*. OK, I won't. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 14:49 ` Johannes Berg
  (?)
  (?)
@ 2007-04-27 15:20 ` Rafael J. Wysocki
  -1 siblings, 0 replies; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:20 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Nick Piggin, Pavel Machek, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Andrew Morton, suspend2-devel, linux-pm, Ingo Molnar, Linus Torvalds, Thomas Gleixner, Arjan van de Ven

On Friday, 27 April 2007 16:49, Johannes Berg wrote:
> On Fri, 2007-04-27 at 16:39 +0200, Johannes Berg wrote:
>
> > Good point. Though if we go for passing the interrupt-enable setting as
> > an argument then many drivers will have the same
> > "if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
> > strictly necessary.
>
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

I think we can use 'stages' and pass them as arguments to the functions.

In that case we can have two callbacks for the hibernation (I'd prefer to
say 'hibernation' instead of 'suspend to disk' from now on), one 'quiesce'
callback and one 'activate' callback that can be called many times in one
snapshot/restore cycle with different arguments, for example:

quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
                    before quiescing devices (if any)
...
quiesce(PRE_SNAPSHOT)
...
quiesce(PRE_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT_IRQ_OFF)
...
activate(POST_SNAPSHOT)
...
activate(FINISH)

etc.

Greetings,
Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:49 ` Johannes Berg
@ 2007-04-27 15:41 ` Linus Torvalds
  -1 siblings, 0 replies; 713+ messages in thread
From: Linus Torvalds @ 2007-04-27 15:41 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, linux-pm, Arjan van de Ven

On Fri, 27 Apr 2007, Johannes Berg wrote:
>
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

You really *really* want to do a two-phase thing, at least for the case I
care about. Whether that snapshotting thing does or not, I could care less.

There's a damn good reason why the kernel uses

    /* phase 1 */
    for_each_dev()
        dev->suspend(dev);
    cli();
    /* phase 2 */
    for_each_dev()
        dev->suspend_late(dev);

(and the reverse case on resume).

The reason is simply that there are two totally different cases: things
like disks etc want to spin down and do slow and high-level operations,
while things like USB controllers and console devices do *not* want to be
suspended early, because if you do, you lose debuggability.

So some things really *really* want to be done when they know that there
isn't anything else going on any more, and they want to delay the shutdown
to the very end. While other things really *require* that they can send
requests that can take time, and cannot run with interrupts disabled.

I actually think that "snapshot" is totally different, exactly because for
snapshotting, the slow operations like spinning down disks etc probably
don't really even exist, and would always be no-ops. But who knows..

Anyway, I do have a final comment: DO NOT PASS "STATE FLAGS" TO DRIVERS
(or, even worse, assume that drivers would test "implicit" state by
calling the same function under two different states, and then have the
drivers test for "are interrupts disabled? Then I need to do something
else").

If drivers are possibly going to do two different things, make it two
different entry-points. There's absolutely no downsides. It's _clearer_ to
the device writer when he gets called two different ways that it's not the
same case, and in case a particular device can do the same thing for both
cases, he can just set the function pointer to the same entry for both.

Never EVER pass dynamic flags that modify behaviour. It's simply bad
programming. A function should do *one* thing, and do it well.

		Linus

^ permalink raw reply [flat|nested] 713+ messages in thread
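
[ Illustration, not part of the thread: a user-space sketch of the two-phase loop described in the message above. The callback names suspend and suspend_late are from the mail; struct demo_dev, the demo_irqs_off flag (a stand-in for the real interrupt state after cli()) and the disk/console drivers are hypothetical. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with the two entry points named in the mail. */
struct demo_dev {
	int early_calls;	/* times real work ran in phase 1 */
	int late_calls;		/* times real work ran in phase 2 */
	int (*suspend)(struct demo_dev *dev);		/* interrupts on, may be slow */
	int (*suspend_late)(struct demo_dev *dev);	/* interrupts off */
};

/* Stand-in for the CPU interrupt state; a real kernel does not use a flag. */
static int demo_irqs_off;

/* Shared no-op: a driver may point any phase it does not need at this. */
static int demo_nop(struct demo_dev *dev)
{
	(void)dev;
	return 0;
}

/* Disk-like device: does its slow work in phase 1 only. */
static int demo_disk_suspend(struct demo_dev *dev)
{
	if (demo_irqs_off)
		return -1;	/* must not be called with interrupts off */
	dev->early_calls++;
	return 0;
}

/* Console-like device: stays usable for debugging until phase 2. */
static int demo_console_suspend_late(struct demo_dev *dev)
{
	if (!demo_irqs_off)
		return -1;	/* must only be called with interrupts off */
	dev->late_calls++;
	return 0;
}

/* Build a two-device table: devs[0] is the disk, devs[1] the console. */
void demo_devs_init(struct demo_dev devs[2])
{
	devs[0] = (struct demo_dev){ 0, 0, demo_disk_suspend, demo_nop };
	devs[1] = (struct demo_dev){ 0, 0, demo_nop, demo_console_suspend_late };
}

/* Core side: phase 1 over all devices, "cli()", then phase 2 over all. */
int demo_suspend_all(struct demo_dev *devs, size_t n)
{
	size_t i;
	int err;

	assert(devs);
	demo_irqs_off = 0;
	for (i = 0; i < n; i++) {
		err = devs[i].suspend(&devs[i]);
		if (err)
			return err;
	}
	demo_irqs_off = 1;	/* stands in for cli() */
	for (i = 0; i < n; i++) {
		err = devs[i].suspend_late(&devs[i]);
		if (err)
			return err;
	}
	return 0;
}
```

[ Neither driver tests a flag to find out which phase it is in; the checks on demo_irqs_off only guard against the core calling an entry point in the wrong phase. ]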
* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34 ` Alan Stern
  (?)
  (?)
@ 2007-04-27 14:39 ` Johannes Berg
  -1 siblings, 0 replies; 713+ messages in thread
From: Johannes Berg @ 2007-04-27 14:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, linux-pm, Arjan van de Ven

On Fri, 2007-04-27 at 10:34 -0400, Alan Stern wrote:

> For the sake of argument, let's call the stages of STD and STR by these
> names (also noted are the current PMSG values):
>
>     Suspend to disk:
>         "prepare to create snapshot" (= FREEZE)
>         "continue after snapshot" (= RESUME)
>
>     Resume from disk:
>         "prepare to restore snapshot" (= PRETHAW)
>         "continue after restore" (= RESUME)
>
>     Suspend to RAM:
>         "suspend" (= SUSPEND)
>         "resume" (= RESUME)
>
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.

That's pretty much what I said about prethaw though, no? Anyway,

> Anyway, based on this analysis it seems reasonable to have Six (6) method
> pointers. Suggested names (in the same order as above):
>
>     pre_snapshot()
>     post_snapshot()
>     pre_restore()
>     post_restore()
>     suspend()
>     resume()
>
> People apparently assume that pre_snapshot() and pre_restore() would
> always do the same thing and hence be redundant. I'm not so sure; time
> will tell. Doing it this way certainly is more clear.

Right. I did assume that pre_snapshot and pre_restore would be effectively
the same since they both have to quiesce the device and assume not much
more. I'm not averse to making it explicit, many drivers that don't care
can just assign the same function.

> Then there's the question of having early_ and late_ versions of some of
> these things (i.e., one called with interrupts enabled, the other with
> interrupts disabled). I don't know to what extent that would be
> necessary; perhaps each method call should occur in two phases with
> the interrupt-enable status changed in between. Then the interrupt-enable
> setting could be passed as an argument.

Good point. Though if we go for passing the interrupt-enable setting as
an argument then many drivers will have the same
"if (irqs_disabled()) return" code. Hm. I guess passing it isn't even
strictly necessary.

johannes

^ permalink raw reply [flat|nested] 713+ messages in thread
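
[ Illustration, not part of the thread: a sketch of the six method pointers proposed in the quoted text, with a driver that "doesn't care" assigning the same function to pre_snapshot and pre_restore, as the reply suggests. The six member names are from the mail; struct demo_pm_ops, struct demo_dev and the demo_* functions are hypothetical and do not correspond to any kernel structure. ]

```c
#include <assert.h>

/* Hypothetical device state tracked by the example driver. */
struct demo_dev {
	int quiesced;	/* 1 while I/O is stopped */
	int low_power;	/* 1 while in a low power state */
};

/* The six method pointers, in the order listed in the quoted mail. */
struct demo_pm_ops {
	int (*pre_snapshot)(struct demo_dev *dev);
	int (*post_snapshot)(struct demo_dev *dev);
	int (*pre_restore)(struct demo_dev *dev);
	int (*post_restore)(struct demo_dev *dev);
	int (*suspend)(struct demo_dev *dev);
	int (*resume)(struct demo_dev *dev);
};

/* Stop activity without entering a low power state. */
static int demo_quiesce(struct demo_dev *dev)
{
	assert(dev);
	dev->quiesced = 1;
	return 0;
}

/* Undo demo_quiesce(). */
static int demo_activate(struct demo_dev *dev)
{
	assert(dev);
	dev->quiesced = 0;
	return 0;
}

/* Suspend to RAM: quiesce and additionally drop to low power. */
static int demo_suspend(struct demo_dev *dev)
{
	assert(dev);
	dev->quiesced = 1;
	dev->low_power = 1;
	return 0;
}

static int demo_resume(struct demo_dev *dev)
{
	assert(dev);
	dev->low_power = 0;
	dev->quiesced = 0;
	return 0;
}

/*
 * A driver that sees no difference between the snapshot and restore
 * cases simply reuses one function for both slots.
 */
const struct demo_pm_ops demo_simple_ops = {
	.pre_snapshot  = demo_quiesce,
	.post_snapshot = demo_activate,
	.pre_restore   = demo_quiesce,
	.post_restore  = demo_activate,
	.suspend       = demo_suspend,
	.resume        = demo_resume,
};
```

[ A driver that does need to tell the cases apart would supply a distinct function for the slot in question; nothing else in the table changes. ]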
* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 14:34 ` Alan Stern
  ` (2 preceding siblings ...)
  (?)
@ 2007-04-27 15:12 ` Rafael J. Wysocki
  2007-04-27 15:24 ` Johannes Berg
  2007-04-27 15:24 ` Johannes Berg
  -1 siblings, 2 replies; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 15:12 UTC (permalink / raw)
  To: linux-pm
  Cc: Alan Stern, Johannes Berg, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, Arjan van de Ven

On Friday, 27 April 2007 16:34, Alan Stern wrote:
> On Fri, 27 Apr 2007, Johannes Berg wrote:
>
> > Look at it now:
> >
> >  * FREEZE     Quiesce operations so that a consistent image can be saved;
> >  *            but do NOT otherwise enter a low power device state, and do
> >  *            NOT emit system wakeup events.
> >  *
> >  * PRETHAW    Quiesce as if for FREEZE; additionally, prepare for restoring
> >  *            the system from a snapshot taken after an earlier FREEZE.
> >  *            Some drivers will need to reset their hardware state instead
> >  *            of preserving it, to ensure that it's never mistaken for the
> >  *            state which that earlier snapshot had set up.
> >
> > Why is prethaw even necessary? As far as I can tell it's only necessary
> > because resume() can't tell you whether you just want to thaw or need to
> > reset since it doesn't tell you at what point it's invoked.
>
> I think you're wrong here. It's a little hard to say because the
> terminology is confusing and not yet standardized.
>
> For the sake of argument, let's call the stages of STD and STR by these
> names (also noted are the current PMSG values):
>
>     Suspend to disk:
>         "prepare to create snapshot" (= FREEZE)
>         "continue after snapshot" (= RESUME)
>
>     Resume from disk:
>         "prepare to restore snapshot" (= PRETHAW)
>         "continue after restore" (= RESUME)
>
>     Suspend to RAM:
>         "suspend" (= SUSPEND)
>         "resume" (= RESUME)
>
> The real reason for adding PRETHAW was that drivers couldn't distinguish
> between "continue after restore" and "resume", other than by examining the
> device's state -- since the PM core doesn't pass any information to the
> resume() method.
>
> I suppose we could have modified the "prepare to create snapshot" code
> instead, but doing so would mean that "continue after snapshot" and
> "continue after restore" would always do the same thing, which is not
> necessarily a good idea.
>
> Anyway, based on this analysis it seems reasonable to have Six (6) method
> pointers. Suggested names (in the same order as above):
>
>     pre_snapshot()
>     post_snapshot()
>     pre_restore()
>     post_restore()
>     suspend()
>     resume()
>
> People apparently assume that pre_snapshot() and pre_restore() would
> always do the same thing and hence be redundant. I'm not so sure; time
> will tell. Doing it this way certainly is more clear.

How do we differentiate between post_snapshot() and post_restore()?
I mean, after the restore we're entering the same code path as after the
snapshot, so do we use a global var for this purpose?

Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: [linux-pm] driver power operations (was Re: suspend2 merge)
  2007-04-27 15:12 ` [linux-pm] " Rafael J. Wysocki
@ 2007-04-27 15:24 ` Johannes Berg
  2007-04-27 15:24 ` Johannes Berg
  1 sibling, 0 replies; 713+ messages in thread
From: Johannes Berg @ 2007-04-27 15:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-pm, Alan Stern, Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, Kernel development list, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Andrew Morton, Linus Torvalds, Arjan van de Ven

On Fri, 2007-04-27 at 17:12 +0200, Rafael J. Wysocki wrote:

> How do we differentiate between post_snapshot() and post_restore()?
> I mean, after the restore we're entering the same code path as after the
> snapshot, so do we use a global var for this purpose?

That's pretty easy to do though, we already know at which point we are
so we just put an if(...) invoke_post_snapshot() else
invoke_post_restore().

johannes

^ permalink raw reply [flat|nested] 713+ messages in thread
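
[ Illustration, not part of the thread: a sketch of the "if(...) invoke_post_snapshot() else invoke_post_restore()" idea from the message above. The branch lives in the core, so each driver callback still does one thing. The image_restored parameter is a hypothetical stand-in for whatever the core actually uses to know which path it is on; the mail does not say, and struct demo_dev and the demo_* names are made up for this sketch. ]

```c
#include <assert.h>

/* Hypothetical device with the two "continue" entry points. */
struct demo_dev {
	int post_snapshot_calls;
	int post_restore_calls;
	int (*post_snapshot)(struct demo_dev *dev);
	int (*post_restore)(struct demo_dev *dev);
};

/* Hardware state is intact: just carry on. */
static int demo_post_snapshot(struct demo_dev *dev)
{
	dev->post_snapshot_calls++;
	return 0;
}

/* Memory came from an image: the driver may need to reprogram hardware. */
static int demo_post_restore(struct demo_dev *dev)
{
	dev->post_restore_calls++;
	return 0;
}

void demo_dev_init(struct demo_dev *dev)
{
	assert(dev);
	dev->post_snapshot_calls = 0;
	dev->post_restore_calls = 0;
	dev->post_snapshot = demo_post_snapshot;
	dev->post_restore = demo_post_restore;
}

/*
 * Core side. Both paths arrive at the same point in the code, so the
 * core picks the entry point. image_restored is nonzero when execution
 * got here by loading an image rather than by having just created one.
 */
int demo_continue(struct demo_dev *dev, int image_restored)
{
	assert(dev);
	if (image_restored)
		return dev->post_restore(dev);
	return dev->post_snapshot(dev);
}
```

[ The single conditional sits in one place in the core instead of being repeated inside every driver. ]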
* Re: driver power operations (was Re: suspend2 merge)
  2007-04-27 10:21 ` Johannes Berg
  ` (2 preceding siblings ...)
  2007-04-27 14:34 ` Alan Stern
@ 2007-04-27 15:56 ` David Brownell
  2007-04-27 15:56 ` [linux-pm] " David Brownell
  ` (2 subsequent siblings)
  6 siblings, 0 replies; 713+ messages in thread
From: David Brownell @ 2007-04-27 15:56 UTC (permalink / raw)
  To: linux-pm
  Cc: Nick Piggin, Ingo Molnar, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, Thomas Gleixner, Pavel Machek, Johannes Berg, Linus Torvalds, Andrew Morton, Arjan van de Ven

On Friday 27 April 2007, Johannes Berg wrote:
>
>  * FREEZE     Quiesce operations so that a consistent image can be saved;
>  *            but do NOT otherwise enter a low power device state, and do
>  *            NOT emit system wakeup events.
>  *
>  * PRETHAW    Quiesce as if for FREEZE; additionally, prepare for restoring
>  *            the system from a snapshot taken after an earlier FREEZE.
>  *            Some drivers will need to reset their hardware state instead
>  *            of preserving it, to ensure that it's never mistaken for the
>  *            state which that earlier snapshot had set up.
>
> Why is prethaw even necessary?

Read the patch comments for the patch adding that transition. Briefly,
adding that transition to swsusp resume was a significant bugfix for
all drivers that rely on controller state to determine how to resume.

(That's mostly drivers that are intelligent about wakeup events... so
unless you're working with such drivers, the issue may be unclear.)

> As far as I can tell it's only necessary
> because resume() can't tell you whether you just want to thaw or need to
> reset since it doesn't tell you at what point it's invoked.

More like: because swsusp overloaded the suspend()/resume() code paths
to do double duty.

Instead of just putting devices into low power states (just *which* state
is another discussion), they evolved into support for swsusp transitions...
causing trouble (and sometimes breakage) for non-swsusp models.

> Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a
> better name?) that are called at the appropriate places (with
> freeze/thaw around preparing the image and freeze/restart around
> restoring would go a long way of clearing up the confusion in all the
> drivers. Of course, it'd have to be documented that freeze/thaw isn't
> the only valid combination but that freeze/restart is used too, but
> that's not hard to do nor hard to understand.

I suspect that after snapshot resume restart() should always be used.
That shouldn't be hard to understand at all. It'd be sub-optimal in
the same cases today's system resume is sub-optimal: devices that
were in low power states before system suspend wouldn't be that way
after system resume.

> And, incidentally, it could possibly make both suspend and hibernate
> work much faster too. The comments there talk about "minimally power
> management aware" drivers which always do the wrong thing for suspend,
> in that they always reset everything...

That comment was purely about existing practice ... and was mostly
about resume() processing, not suspend() paths.

It's an unfortunate reality that most device drivers are stupid in
terms of power management, so we need to be clear about just how
stupid they're allowed to be without being terminally broken.

Additionally, it would be a Good Thing if changes to clean up the
swsusp-related code paths didn't make "real suspend" more painful.

> Of course, some drivers will
> actually need to do that, but if freeze/suspend and thaw/restart/resume
> have the same prototypes (probably just int <function>(void)) then
> drivers can trivially assign the same there.
> And hibernate would benefit since a lot of drivers could do a lot less
> work for freeze/thaw.

That actually gets into discussions from a while back about wanting
to be able to quiesce() devices, as separate from actually putting
them into low power states.

- Dave

^ permalink raw reply [flat|nested] 713+ messages in thread
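
[ Illustration, not part of the thread: a sketch of the freeze/thaw versus freeze/restart pairing discussed in the message above. The callback names freeze, thaw and restart are from the quoted proposal; struct demo_hw, its fields and the two cycle functions are hypothetical, and "regs_valid = 0" merely simulates hardware state no longer matching what the restored image expects. ]

```c
#include <assert.h>

/* Hypothetical model of a controller whose registers may go stale. */
struct demo_hw {
	int regs_valid;	/* 1 if hardware state matches what the driver expects */
	int resets;	/* how many times the hardware was reinitialised */
};

/* Quiesce only: no low power state, hardware state is left alone. */
int demo_freeze(struct demo_hw *hw)
{
	assert(hw);
	return 0;
}

/* After creating an image: state is still good, so just continue. */
int demo_thaw(struct demo_hw *hw)
{
	assert(hw);
	return hw->regs_valid ? 0 : -1;
}

/* After restoring an image: never trust the state, reinitialise. */
int demo_restart(struct demo_hw *hw)
{
	assert(hw);
	hw->resets++;
	hw->regs_valid = 1;
	return 0;
}

/* freeze/thaw around preparing the image. */
int demo_snapshot_cycle(struct demo_hw *hw)
{
	int err = demo_freeze(hw);

	if (err)
		return err;
	/* ... the image would be created here ... */
	return demo_thaw(hw);
}

/* freeze/restart around restoring the image. */
int demo_restore_cycle(struct demo_hw *hw)
{
	int err = demo_freeze(hw);

	if (err)
		return err;
	/* ... the image would be loaded here; hardware no longer matches it ... */
	hw->regs_valid = 0;
	return demo_restart(hw);
}
```

[ With separate entry points the driver never has to inspect controller state to guess which of the two situations it is in. ]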
* Re: [linux-pm] driver power operations (was Re: suspend2 merge) 2007-04-27 15:56 ` [linux-pm] " David Brownell @ 2007-04-27 18:31 ` Rafael J. Wysocki 2007-04-27 18:31 ` Rafael J. Wysocki 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 18:31 UTC (permalink / raw) To: David Brownell Cc: linux-pm, Johannes Berg, Pavel Machek, Nick Piggin, Andrew Morton, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, suspend2-devel, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven On Friday, 27 April 2007 17:56, David Brownell wrote: > On Friday 27 April 2007, Johannes Berg wrote: > > > > * FREEZE Quiesce operations so that a consistent image can be saved; > > * but do NOT otherwise enter a low power device state, and do > > * NOT emit system wakeup events. > > * > > * PRETHAW Quiesce as if for FREEZE; additionally, prepare for restoring > > * the system from a snapshot taken after an earlier FREEZE. > > * Some drivers will need to reset their hardware state instead > > * of preserving it, to ensure that it's never mistaken for the > > * state which that earlier snapshot had set up. > > > > Why is prethaw even necessary? > > Read the patch comments for the patch adding that transition. Briefly, > adding that transition to swsusp resume was a significant bugfix for > all drivers that rely on controller state to determine how to resume. > > (That's mostly drivers that are intelligent about wakeup events... so > unless you're working with such drivers, the issue may be unclear.) > > > > As far as I can tell it's only necessary > > because resume() can't tell you whether you just want to thaw or need to > > reset since it doesn't tell you at what point it's invoked. > > More like: because swsusp overloaded the suspend()/resume() code paths > to do double duty. > > Instead of just putting devices into low power states (just *which* state > is another discussion), they evolved into support for swsusp transitions... 
> causing trouble (and sometimes breakage) for non-swsusp models. > > > > Having ->freeze(), ->thaw() and ->restart() (can somebody come up with a > > better name?) that are called at the appropriate places (with > > freeze/thaw around preparing the image and freeze/restart around > > restoring would go a long way of clearing up the confusion in all the > > drivers. Of course, it'd have to be documented that freeze/thaw isn't > > the only valid combination but that freeze/restart is used too, but > > that's not hard to do nor hard to understand. > > I suspect that after snapshot resume restart() should always be used. > That shouldn't be hard to understand at all. It'd be sub-optimal in > the same cases today's system resume is sub-optimal: devices that > were in low power states before system suspend wouldn't be that way > after system resume. > > > > And, incidentally, it could possibly make both suspend and hibernate > > work much faster too. The comments there talk about "minimally power > > management aware" drivers which always do the wrong thing for suspend, > > in that they always reset everything... > > That comment was purely about existing practice ... and was mostly > about resume() processing, not suspend() paths. > > It's an unfortunate reality that most device drivers are stupid in > terms of power management, so we need to be clear about just how > stupid they're allowed to be without being terminally broken. > > Additionally, it would be a Good Thing if changes to clean up the > swsusp-related code paths didn't make "real suspend" more painful. > > > > Of course, some drivers will > > actually need to do that, but if freeze/suspend and thaw/restart/resume > > have the same prototypes (probably just int <function>(void)) then > > drivers can trivially assign the same there. > > And hibernate would benefit since a lot of drivers could do a lot less > > work for freeze/thaw. 
> > That actually gets into discussions from a while back about wanting > to be able to quiesce() devices, as separate from actually putting > them into low power states. Yes, exactly. Moreover, I think we should separate the current suspend code from the hibernation (aka STD) code paths we're discussing. I mean, we need hibernation-specific equivalents of drivers/base/power/suspend.c etc. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
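The callback split proposed in this subthread can be pictured with a small user-space sketch. This is only an illustration under stated assumptions: `struct demo_pm_ops`, the `demo_*` helpers and the logging are hypothetical names made up for this example and are not an existing kernel interface. The callback names and the `int <function>(void)` prototype follow Johannes Berg's proposal quoted above.

```c
#include <string.h>

/* Hypothetical callback table. The member names follow the proposal in
 * the thread (->freeze(), ->thaw(), ->restart()); this is not a real
 * kernel structure. */
struct demo_pm_ops {
    int (*suspend)(void);  /* enter a low power state ("real suspend") */
    int (*resume)(void);   /* leave the low power state */
    int (*freeze)(void);   /* quiesce only, before an image is saved or restored */
    int (*thaw)(void);     /* carry on after the image has been saved */
    int (*restart)(void);  /* reinitialize after an image has been restored */
};

/* Records the order of the calls so the sequences can be inspected. */
char demo_log[128];

static void demo_note(const char *step)
{
    if (demo_log[0] != '\0')
        strncat(demo_log, " ", sizeof(demo_log) - strlen(demo_log) - 1);
    strncat(demo_log, step, sizeof(demo_log) - strlen(demo_log) - 1);
}

static int demo_quiesce(void)    { demo_note("quiesce");    return 0; }
static int demo_continue(void)   { demo_note("continue");   return 0; }
static int demo_reset(void)      { demo_note("reset");      return 0; }
static int demo_power_down(void) { demo_note("power_down"); return 0; }
static int demo_power_up(void)   { demo_note("power_up");   return 0; }

/* A driver that distinguishes all five transitions. */
const struct demo_pm_ops demo_smart_ops = {
    .suspend = demo_power_down,
    .resume  = demo_power_up,
    .freeze  = demo_quiesce,
    .thaw    = demo_continue,
    .restart = demo_reset,
};

/* A "minimally power management aware" driver: because the prototypes
 * are identical, it simply reuses its suspend/resume handlers. */
const struct demo_pm_ops demo_simple_ops = {
    .suspend = demo_power_down,
    .resume  = demo_power_up,
    .freeze  = demo_power_down,
    .thaw    = demo_power_up,
    .restart = demo_power_up,
};

/* freeze/thaw around preparing the image.
 * Returns 0 on success or the first non-zero callback result. */
int demo_create_image(const struct demo_pm_ops *ops)
{
    int err = ops->freeze();
    if (err)
        return err;
    demo_note("snapshot");
    return ops->thaw();
}

/* freeze/restart around restoring the image. */
int demo_restore_image(const struct demo_pm_ops *ops)
{
    int err = ops->freeze();
    if (err)
        return err;
    demo_note("restore");
    return ops->restart();
}
```

With this layout the hibernation paths never touch `suspend`/`resume`, which is the separation asked for above, while a simple driver loses nothing by pointing several members at the same function.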
* Re: driver power operations (was Re: suspend2 merge) 2007-04-27 10:21 ` Johannes Berg ` (4 preceding siblings ...) 2007-04-27 15:56 ` [linux-pm] " David Brownell @ 2007-05-07 12:29 ` Pavel Machek 2007-05-07 12:29 ` Pavel Machek 6 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-07 12:29 UTC (permalink / raw) To: Johannes Berg Cc: Nick Piggin, Andrew Morton, Mike Galbraith, linux-kernel, Con Kolivas, Adrian Bunk, suspend2-devel, linux-pm, Thomas Gleixner, Linus Torvalds, Ingo Molnar, Arjan van de Ven Hi! > > > So, the "suspend" and "resume" names for the functions being called for that are > > > wrong, but then we call them with PMSG_FREEZE. ;-) Still, we could add > > > .freeze() and .thaw() callbacks for hibernation just fine. This wouldn't even > > > be that difficult ... > > > > It would be an ugly big patch, I'm afraid. > > It'd be a lot of code churn, but well worth it. And most of the changes > would be trivial too. You need to start looking beyond "this is ugly in > the short term" and realise that it's much more maintainable in the long > term if driver writers know what they're supposed to do as opposed to > just hacking at it until it mostly works or just doing a full device > down/up cycle including resetting full driver state. I do not disagree with you. It will be an ugly big patch, but it is probably worth it, so the patch will be welcome. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:08 ` Pavel Machek 2007-04-25 20:33 ` Rafael J. Wysocki @ 2007-04-25 22:36 ` Manu Abraham 1 sibling, 0 replies; 713+ messages in thread From: Manu Abraham @ 2007-04-25 22:36 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Pavel Machek wrote: > STD needs to snapshot system, and then it needs devices to be > suspended so that snapshot is consistent. One question, though: there are devices with broken suspend support, and restore in such a case wouldn't work at all. Would the only possible way then be to reinitialize the device instead of restoring it? Manu ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:38 ` Linus Torvalds 2007-04-25 20:08 ` Pavel Machek @ 2007-04-25 20:20 ` Rafael J. Wysocki 2007-04-25 20:24 ` Linus Torvalds 2007-04-25 20:23 ` Adrian Bunk 2007-04-27 12:36 ` suspend2 merge Martin Steigerwald 3 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-25 20:20 UTC (permalink / raw) To: Linus Torvalds Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wednesday, 25 April 2007 21:38, Linus Torvalds wrote: > > On Wed, 25 Apr 2007, Adrian Bunk wrote: Well, I told Pavel that I wouldn't take part in this thread, but since you're making some rude and unfounded personal remarks, I feel I have to speak. [--snip--] > And that's a *fundamental* problem. If the STD people cannot even realize > that they have less to do with "suspend" than to "reboot", how do you ever > expect them to get anything to work, and not affect other things > negatively? That's not true. > Yeah, I'm down on it. I'm down on it because every person involved with > the whole STD thing seems to have basically zero taste, and a total > inability to work with anybody else. Please ask anyone who's worked with me if he's had any problem with that. If anyone say I'm unable to work with anybody else, I'd say you're right. Till then, I feel offended. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:20 ` Rafael J. Wysocki @ 2007-04-25 20:24 ` Linus Torvalds 2007-04-25 21:30 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 20:24 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 25 Apr 2007, Rafael J. Wysocki wrote: > > Please ask anyone who's worked with me if he's had any problem with that. > If anyone say I'm unable to work with anybody else, I'd say you're right. Till > then, I feel offended. I'll apologise (and virtually kiss your hairy feet) if you could actually show me a single implementation that people can agree on. But until then, I claim that the suspend-to-disk people cannot work with each other. And no, "three different implementations" doesn't cut it. Even _two_ is too much. We need to get *rid* of something, not add more. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:24 ` Linus Torvalds @ 2007-04-25 21:30 ` Pavel Machek 2007-04-25 21:40 ` Rafael J. Wysocki 2007-04-25 22:22 ` Nigel Cunningham 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Rafael J. Wysocki, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > Please ask anyone who's worked with me if he's had any problem with that. > > If anyone say I'm unable to work with anybody else, I'd say you're right. Till > > then, I feel offended. > > I'll apologise (and virtually kiss your hairy feet) if you could actually > show me a single implementation that people can agree on. > > But until then, I claim that the suspend-to-disk people cannot work with > each other. It is not Rafael's fault. Actually it is quite hard to work with Nigel, because he implements every feature someone asks for, and wants to merge them all :-(. I don't expect to ever agree with Nigel on anything important, sorry. > And no, "three different implementations" doesn't cut it. Even _two_ is > too much. We need to get *rid* of something, not add more. swsusp can be dropped. It is nice -- self contained, extremely easy to setup, Andrew likes it. uswsusp has all the features, and pretty elegant design. With klibc (or some way to ship userland code with kernel, and put it into initramfs or something) we can reasonably drop swsusp. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:30 ` Pavel Machek @ 2007-04-25 21:40 ` Rafael J. Wysocki 2007-04-25 21:46 ` Pavel Machek 2007-04-25 22:22 ` Nigel Cunningham 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-25 21:40 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wednesday, 25 April 2007 23:30, Pavel Machek wrote: > Hi! > > > > Please ask anyone who's worked with me if he's had any problem with that. > > > If anyone say I'm unable to work with anybody else, I'd say you're right. Till > > > then, I feel offended. > > > > I'll apologise (and virtually kiss your hairy feet) if you could actually > > show me a single implementation that people can agree on. > > > > But until then, I claim that the suspend-to-disk people cannot work with > > each other. > > It is not Rafael's fault. Actually it is quite hard to work with > Nigel, because he implements every feature someone asks for, and wants > to merge them all :-(. I don't expect to ever agree with Nigel on > anything important, sorry. > > > And no, "three different implementations" doesn't cut it. Even _two_ is > > too much. We need to get *rid* of something, not add more. > > swsusp can be dropped. It is nice -- self contained, extremely easy to > setup, Andrew likes it. uswsusp has all the features, and pretty > elegant design. With klibc (or some way to ship userland code with > kernel, and put it into initramfs or something) we can reasonably drop > swsusp. Well, I think we still need it and will need it in the future, at least for debugging. Moreover, I think there are many users of it. Let's not drop things that are helping us. :-) Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:40 ` Rafael J. Wysocki @ 2007-04-25 21:46 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 21:46 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Linus Torvalds, Adrian Bunk, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > > And no, "three different implementations" doesn't cut it. Even _two_ is > > > too much. We need to get *rid* of something, not add more. > > > > swsusp can be dropped. It is nice -- self contained, extremely easy to > > setup, Andrew likes it. uswsusp has all the features, and pretty > > elegant design. With klibc (or some way to ship userland code with > > kernel, and put it into initramfs or something) we can reasonably drop > > swsusp. > > Well, I think we still need it and will need it in the future, at least for > debugging. Moreover, I think there are many users of it. > > Let's not drop things that are helping us. :-) Yes, it is very nice for debugging. But if I _had_ to choose, I'd rather remove swsusp than uswsusp. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:30 ` Pavel Machek 2007-04-25 21:40 ` Rafael J. Wysocki @ 2007-04-25 22:22 ` Nigel Cunningham 1 sibling, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-25 22:22 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Rafael J. Wysocki, Adrian Bunk, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1671 bytes --] Hello. On Wed, 2007-04-25 at 23:30 +0200, Pavel Machek wrote: > Hi! > > > > Please ask anyone who's worked with me if he's had any problem with that. > > > If anyone say I'm unable to work with anybody else, I'd say you're right. Till > > > then, I feel offended. > > > > I'll apologise (and virtually kiss your hairy feet) if you could actually > > show me a single implementation that people can agree on. > > > > But until then, I claim that the suspend-to-disk people cannot work with > > each other. > > It is not Rafael's fault. Actually it is quite hard to work with > Nigel, because he implements every feature someone asks for, and wants > to merge them all :-(. I don't expect to ever agree with Nigel on > anything important, sorry. I'm sorry that you feel that way, Pavel. I can agree that I implement features that people ask for, but I think saying "every feature someone asks for" is going a bit far (I won't ask you to prove that). My desire is to provide Linux with hibernation support that does more than just the bare minimum. Different people have different usage scenarios, and this has led to me implementing more and different features. As to wanting to merge them all, this is true. No one wants to put time into something only to have it left out. But I don't see why you think this is a bad thing. Many kernel guys claim the thing follows an evolutionary model. 
Well, here's software that has been developed out of tree - evolved if you like - and which many people would consider more mature ('evolved'?) than [u]swsusp. If evolutionary theory is to be followed, let the fittest survive! Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:38 ` Linus Torvalds 2007-04-25 20:08 ` Pavel Machek 2007-04-25 20:20 ` Rafael J. Wysocki @ 2007-04-25 20:23 ` Adrian Bunk 2007-04-25 22:19 ` Kenneth Crudup 2007-04-27 12:36 ` suspend2 merge Martin Steigerwald 3 siblings, 1 reply; 713+ messages in thread From: Adrian Bunk @ 2007-04-25 20:23 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, Apr 25, 2007 at 12:38:47PM -0700, Linus Torvalds wrote: > > > On Wed, 25 Apr 2007, Adrian Bunk wrote: > > > > > > .. but if the alternative is a feature that just isn't worth it, and > > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > > I believe STD is both of those. There's a reason it's called "STD". Go > > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > > > Is there really no use case for STD? >... > I'd actually be happier *removing* STD support in the sense it is now: > it's way too closely integrated with STR, even though it has absolutely > nothing in common with it. When you STD, you'e actually much closer to a > *shutdown* than to STR, yet the STD code continually seems to want to be > in the "suspend" path, as shown even by its name. > > So my objections to STD have nothing to do with saving state and shutting > down. They have everything to do with the fact that it is not - and will > never be - a "suspend", and it shouldn't affect suspend. >... There are two completely different points: - I say that the feature STD has use cases where STR is not a replacement - you say you dislike the current implementation of STD For me it was a serious regression if STD was removed without any replacement. 
If someone replaced the STD implementation with what you want it to be, I wouldn't care and you would be happy. > Linus cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:23 ` Adrian Bunk @ 2007-04-25 22:19 ` Kenneth Crudup 0 siblings, 0 replies; 713+ messages in thread From: Kenneth Crudup @ 2007-04-25 22:19 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Adrian Bunk wrote: > For me it was a serious regression if STD was removed without any > replacement. Amen. I have even made material donations to the SS2 effort to give the developer what he'd needed to fix an issue with a certain configuration and will do so again if need be, as that expense would be minor compared to the productivity (== billing) loss that arises from having to start over from scratch from each of the 3+ times per day I have to shutdown and physically relocate my machine from place-to-place or client-to-client. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Los Angeles O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809 (888) 454-8181 ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge 2007-04-25 19:38 ` Linus Torvalds ` (2 preceding siblings ...) 2007-04-25 20:23 ` Adrian Bunk @ 2007-04-27 12:36 ` Martin Steigerwald 3 siblings, 0 replies; 713+ messages in thread From: Martin Steigerwald @ 2007-04-27 12:36 UTC (permalink / raw) To: suspend2-devel Cc: Linus Torvalds, Adrian Bunk, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wednesday 25 April 2007, Linus Torvalds wrote: > And that's a *fundamental* problem. If the STD people cannot even > realize that they have less to do with "suspend" than to "reboot", how > do you ever expect them to get anything to work, and not affect other > things negatively? > > Yeah, I'm down on it. I'm down on it because every person involved with > the whole STD thing seems to have basically zero taste, and a total > inability to work with anybody else. Hello Linus! I am no kernel developer. But I understand what you are trying to tell here. I agree that suspend to ram and snapshot should be handled differently by drivers. And unlike schedulers - whether it be I/O or process related ones - I think it should be quite easy to settle and decide on *one* implementation for each feature. At least it doesn't look as difficult to me as deciding on a scheduler which works for all the different workloads. I do not believe that the reasons preventing this from happening until now are of a purely technical nature. I think snapshotting is a very important feature. I would patch it into my kernels if it was removed. But then I am using suspend2 anyway. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:25 ` Adrian Bunk 2007-04-25 19:38 ` Linus Torvalds @ 2007-04-25 19:41 ` Andrew Morton 2007-04-25 19:55 ` Pavel Machek ` (2 subsequent siblings) 4 siblings, 0 replies; 713+ messages in thread From: Andrew Morton @ 2007-04-25 19:41 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Thomas Gleixner, Arjan van de Ven On Wed, 25 Apr 2007 21:25:12 +0200 Adrian Bunk <bunk@stusta.de> wrote: > > > And even 3W would still be a waste of energy. > > > > .. but if the alternative is a feature that just isn't worth it, and > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > I believe STD is both of those. There's a reason it's called "STD". Go > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > Is there really no use case for STD? I use it all the time. The batteries only seem to last a day or so in STR. Plus one is supposed to power off all electrical equipment during takeoff and landing. > No worries if power is completely lost. To change batteries. > Some people might boot Windows between suspending and resuming. I use that often too. (But I won't when I get around to upgrading the X driver to get the VaioOfDeath's external video output working under Linux) I don't think I need a fancy splash screen tho. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:25 ` Adrian Bunk 2007-04-25 19:38 ` Linus Torvalds 2007-04-25 19:41 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Andrew Morton @ 2007-04-25 19:55 ` Pavel Machek 2007-04-25 22:13 ` Kenneth Crudup 2007-04-26 1:25 ` Antonino A. Daplas 4 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-25 19:55 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven Hi! > > > 3W for the complete system? In CPU state S1? [1] > > > > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) > > the devices are off, but the motherboard and memory is powered. > > As far as I understand it, the CPU isn't off in S1. > > > > And even 3W would still be a waste of energy. > > > > .. but if the alternative is a feature that just isn't worth it, and > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > I believe STD is both of those. There's a reason it's called "STD". Go > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > Is there really no use case for STD? Of course there are use cases for STD... lots of them... that's why I'm maintaining it. It has some "interesting" use cases, like suspend on one machine, transfer image to identical one, resume there, dual-boot to windows; there are "normal" use cases, like machines not capable of S3. I hope we are not dropping STD just now... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:25 ` Adrian Bunk ` (2 preceding siblings ...) 2007-04-25 19:55 ` Pavel Machek @ 2007-04-25 22:13 ` Kenneth Crudup 2007-04-26 1:25 ` Antonino A. Daplas 4 siblings, 0 replies; 713+ messages in thread From: Kenneth Crudup @ 2007-04-25 22:13 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Pavel Machek, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Adrian Bunk wrote: > Some people might boot Windows between suspending and resuming. Oh yeah- that, too. Since iTunes doesn't work well with VMWare, I do this all the time. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Los Angeles O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809 (888) 454-8181 ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:25 ` Adrian Bunk ` (3 preceding siblings ...) 2007-04-25 22:13 ` Kenneth Crudup @ 2007-04-26 1:25 ` Antonino A. Daplas 4 siblings, 0 replies; 713+ messages in thread From: Antonino A. Daplas @ 2007-04-26 1:25 UTC (permalink / raw) To: Adrian Bunk Cc: Linus Torvalds, Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, 2007-04-25 at 21:25 +0200, Adrian Bunk wrote: > On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: > > > > > > On Wed, 25 Apr 2007, Adrian Bunk wrote: > > > > > > 3W for the complete system? In CPU state S1? [1] > > > > In STR, 3W is quite realistic. The CPU is off, all (or most - up to you) > > the devices are off, but the motherboard and memory is powered. > > As far as I understand it, the CPU isn't off in S1. > > > > And even 3W would still be a waste of energy. It is, especially if you're living in a place where power infrastructure is unreliable (such as where I live). Currently, because of the summer heat, power demand exceeds power supply so we experience practically daily rotating 4-hour power interruption. That 3W saved multiplied by the total number of computers is a lot. In this perspective, S2D (or shutdown) is preferred over S2RAM. Tony ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 18:50 ` Linus Torvalds 2007-04-25 19:02 ` Hua Zhong 2007-04-25 19:25 ` Adrian Bunk @ 2007-04-25 23:33 ` Olivier Galibert 2007-04-26 1:56 ` Nigel Cunningham 2 siblings, 1 reply; 713+ messages in thread From: Olivier Galibert @ 2007-04-25 23:33 UTC (permalink / raw) To: Linus Torvalds Cc: Adrian Bunk, Pavel Machek, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: > .. but if the alternative is a feature that just isn't worth it, and > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > I believe STD is both of those. There's a reason it's called "STD". Go > to google and type "STD" and press "I'm feeling lucky". Google is God). If it was correctly designed, it would be possible to change the hardware or even the kernel through a STD cycle. And that would be damn interesting on servers. In any case, if I could trust it, I'd use it when I need to move servers around and I don't want to lose what is running. Riding power cuts that way would be nice. OG. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:33 ` Olivier Galibert @ 2007-04-26 1:56 ` Nigel Cunningham 2007-04-26 7:27 ` David Lang 0 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 1:56 UTC (permalink / raw) To: Olivier Galibert Cc: Linus Torvalds, Adrian Bunk, Pavel Machek, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1246 bytes --] Hi. On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote: > On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: > > .. but if the alternative is a feature that just isn't worth it, and > > likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > > I believe STD is both of those. There's a reason it's called "STD". Go > > to google and type "STD" and press "I'm feeling lucky". Google is God). > > If it was correctly designed, it would be possible to change the > hardware or even the kernel through a STD cycle. And that would be > damn interesting on servers. Those are different issues - hardware hot/cold plugging for the first. Changing the kernel through a cycle - that's not a design fault. The problem there is that the kernel and its associated data structures are part of the state. Changing the kernel and keeping the image would require exact correspondence in data structures, memory map and so on. That's why the same kernel is required. > In any case, if I could trust it, I'd use it when I need to move > servers around and I don't want to lose what is running. Riding power > cuts that way would be nice. That's what Rafael and I are working on. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
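The "same kernel is required" constraint described in this message could be enforced roughly as follows. This is a minimal sketch, assuming a hypothetical image header that records the kernel release and build strings at snapshot time. `struct demo_image_header`, `demo_fill_header()` and `demo_image_matches()` are invented for illustration and are not taken from swsusp, uswsusp or suspend2.

```c
#include <string.h>

#define DEMO_STR_LEN 65

/* Hypothetical header written in front of a hibernation image. */
struct demo_image_header {
    char release[DEMO_STR_LEN];  /* kernel release the image was taken on */
    char version[DEMO_STR_LEN];  /* build string of that kernel */
    unsigned long num_pages;     /* number of saved pages */
};

/* Fill the header at snapshot time. The memset plus the "- 1" length
 * keep both strings NUL-terminated even if the inputs are too long. */
void demo_fill_header(struct demo_image_header *hdr, const char *release,
                      const char *version, unsigned long num_pages)
{
    memset(hdr, 0, sizeof(*hdr));
    strncpy(hdr->release, release, DEMO_STR_LEN - 1);
    strncpy(hdr->version, version, DEMO_STR_LEN - 1);
    hdr->num_pages = num_pages;
}

/* At resume time: returns 1 if the image was written by the kernel that
 * is running now, 0 otherwise. A mismatch means the data structures and
 * memory map inside the image cannot be trusted to line up. */
int demo_image_matches(const struct demo_image_header *hdr,
                       const char *running_release,
                       const char *running_version)
{
    return strcmp(hdr->release, running_release) == 0 &&
           strcmp(hdr->version, running_version) == 0;
}
```

A resume path built this way would discard the image and boot normally when `demo_image_matches()` returns 0. Comparing identifying strings is only a proxy for the exact correspondence of data structures that the message talks about.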
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 1:56 ` Nigel Cunningham @ 2007-04-26 7:27 ` David Lang 2007-04-26 9:45 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: David Lang @ 2007-04-26 7:27 UTC (permalink / raw) To: Nigel Cunningham Cc: Olivier Galibert, Linus Torvalds, Adrian Bunk, Pavel Machek, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On Thu, 26 Apr 2007, Nigel Cunningham wrote: > Hi. > > On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote: >> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: >>> .. but if the alternative is a feature that just isn't worth it, and >>> likely to not only have its own bugs, but cause bugs elsewhere? (And yes, >>> I believe STD is both of those. There's a reason it's called "STD". Go >>> to google and type "STD" and press "I'm feeling lucky". Google is God). >> >> If it was correctly designed, it would be possible to change the >> hardware or even the kernel through a STD cycle. And that would be >> damn interesting on servers. > > Those are different issues - hardware hot/cold plugging for the first. > > Changing the kernel through a cycle - that's not a design fault. The > problem there is that the kernel and it's associated data structures are > part of the state. Changing the kernel and keeping the image would > require exactly correspondence in data structures, memory map and so on. > That's why the same kernel is required. that depends on exactly what you save in your snapshot. one approach is to try and save absolutly everything in ram (this is the current approach) if you do this then you do need to use the same kernel for the reasons that you list. however, you could also decide to only save the information about processes on the system (i.e. 
what you absolutely have to) and let the kernel re-initialize itself (along with its devices) then you could use a different kernel safely. doing this should also save you a significant amount of storage when making your snapshot David Lang ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 7:27 ` David Lang @ 2007-04-26 9:45 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 9:45 UTC (permalink / raw) To: David Lang Cc: Olivier Galibert, Linus Torvalds, Adrian Bunk, Pavel Machek, Ingo Molnar, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 2028 bytes --] Hi. On Thu, 2007-04-26 at 00:27 -0700, David Lang wrote: > On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > > Hi. > > > > On Thu, 2007-04-26 at 01:33 +0200, Olivier Galibert wrote: > >> On Wed, Apr 25, 2007 at 11:50:45AM -0700, Linus Torvalds wrote: > >>> .. but if the alternative is a feature that just isn't worth it, and > >>> likely to not only have its own bugs, but cause bugs elsewhere? (And yes, > >>> I believe STD is both of those. There's a reason it's called "STD". Go > >>> to google and type "STD" and press "I'm feeling lucky". Google is God). > >> > >> If it was correctly designed, it would be possible to change the > >> hardware or even the kernel through a STD cycle. And that would be > >> damn interesting on servers. > > > > Those are different issues - hardware hot/cold plugging for the first. > > > > Changing the kernel through a cycle - that's not a design fault. The > > problem there is that the kernel and it's associated data structures are > > part of the state. Changing the kernel and keeping the image would > > require exactly correspondence in data structures, memory map and so on. > > That's why the same kernel is required. > > that depends on exactly what you save in your snapshot. > > one approach is to try and save absolutly everything in ram (this is the current > approach) > > if you do this then you do need to use the same kernel for the reasons that you > list. 
> > however, you could also decide to only save the information about processes on > the system (i.e. what you absolutly have to) and let the kernel re-initialize > itself (along with it's devices) then you could use a different kernel safely. > doing this should also save you a significant amount of storage when makeing > your snapshot Well, there is cryopid for individual processes. I suppose you could potentially try doing a mass cryopiding. That would make things a lot more complicated though. I'm not saying it's not doable. Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 17:34 ` Pavel Machek 2007-04-25 18:39 ` Adrian Bunk @ 2007-04-25 18:52 ` Alon Bar-Lev 2007-04-25 22:11 ` Kenneth Crudup 2 siblings, 0 replies; 713+ messages in thread From: Alon Bar-Lev @ 2007-04-25 18:52 UTC (permalink / raw) To: Pavel Machek Cc: Adrian Bunk, Linus Torvalds, Ingo Molnar, Nigel Cunningham, Christian Hesse, Nick Piggin, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, Andrew Morton, Thomas Gleixner, Arjan van de Ven On 4/25/07, Pavel Machek <pavel@ucw.cz> wrote: > Ok ok ok, suspend-to-disk has some other uses, too. > > But ... you are really using suspend-to-disk as a workaround for "my > desktop takes too much power when idle". Imagine pressing "lock > screensaver" combination, and your machine going to low power mode > (3W?), immediately. (Quiet, too; you can't generate much noise for > 3W). In the morning, you'd just press any key, machine would power up, > immediately... ok, you'd have to ifconfig eth0 down, so that spurious > packets on the local net would wake your machine, with all its fans > etc. > Pavel You are assuming that: 1. You have battery backup, or external power never fails. 2. You don't disconnect the filesystem from the device. 3. The security level of a turned-on device equals that of a turned-off one. 4. You turn on the same device that was turned off. 5. You do not wish to boot another OS on this machine. None of the above is always true... but why assume? Just make this work... If Nigel wishes to maintain this, please let him; you can be in charge of s2ram. Best Regards, Alon Bar-Lev. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 17:34 ` Pavel Machek 2007-04-25 18:39 ` Adrian Bunk 2007-04-25 18:52 ` Alon Bar-Lev @ 2007-04-25 22:11 ` Kenneth Crudup 2 siblings, 0 replies; 713+ messages in thread From: Kenneth Crudup @ 2007-04-25 22:11 UTC (permalink / raw) To: Pavel Machek Cc: Adrian Bunk, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Linus Torvalds, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Pavel Machek wrote: > But ... you are really using suspend-to-disk as a workaround for "my > desktop takes too much power when idle". While rare is the day admittedly, that my machine isn't on, there are days I take a break from loooong days and won't work for 2-5 days at a time. My main revenue machine is a laptop with a fast, but last-generation mobile processor and 2GB of DDR2 SDRAM. I think it's ridiculous to expect that I could resume off battery (and this thing is a behemoth, with a 17" screen and backlight and a lot of little juice-eating peripherals that'll go thru a 4.4A/Hr battery in a little over 90 mins, even with conservative power settings) after that kind of delay. I don't even like "suspend to RAM, then suspend to disk on battery low" 'cause that means when I turn it on again I have a low battery for an hour and a half. The only acceptable power usage when (completely) idle, IMO, is *zero*. -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Los Angeles O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809 (888) 454-8181 ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:23 ` Pavel Machek ` (2 preceding siblings ...) 2007-04-25 15:18 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Adrian Bunk @ 2007-04-25 19:43 ` Kenneth Crudup 2007-04-25 20:08 ` Linus Torvalds 2007-05-26 17:37 ` Martin Steigerwald 4 siblings, 1 reply; 713+ messages in thread From: Kenneth Crudup @ 2007-04-25 19:43 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Pavel Machek wrote: > I'm starting to think that we should fix the idle power consumption > problem. Cell phones do it right. They pretend to be ready/idle all > the time, yet they have _days_ of standby. My laptop goes nearly everywhere I do; I DO NOT want it on when I'm travelling around to clients or between home and office or on a plane, and I lose a lot of productivity the times I have to restart from a cold boot as when I'm working I tend to have up ~10 xterms and while my browsers have "restart", that's not infallible. Any working suspend-to-disk method takes care of that for me. (I'm really not sure why Linus hates S2D so much, though. Back in the day there was a lot more BIOS support, but that's been years now.) -Kenny -- Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Los Angeles O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809 (888) 454-8181 ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 19:43 ` Kenneth Crudup @ 2007-04-25 20:08 ` Linus Torvalds 2007-04-25 20:27 ` Pavel Machek 2007-04-26 0:41 ` Thomas Orgis 0 siblings, 2 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 20:08 UTC (permalink / raw) To: Kenneth Crudup Cc: Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Kenneth Crudup wrote: > > Any working suspend-to-disk method takes care of that for me. (I'm > really not sure why Linus hates S2D so much, though. Back in the day > there was a lot more BIOS support, but that's been years now.) The really sad part is that APM actually did this better.. It's not often you can say that, and APM didn't do diddly-squat for run-time power management, but when it comes to suspend-to-disk, APM actually did ok. I think you could do STD right too, but: - if you think it's about suspending devices, you are immediately disqualified. If you call the device driver "suspend" or "resume" functions, you are doing something wrong. - "suspend" is: snapshot memory, and anything you do *after* the snapshot is totally irrelevant. You MUST NOT suspend devices before, since devices are what that snapshot should be written out to, and you MUST NOT suspend devices afterwards either, because that shows that you are a moron who didn't understand the "machine will be turned off" part. - "resume" is basically: get image into memory, turn *off* every device, put image into its proper location, and call the "startup" function. If you call a device "resume()" function, you again show that you are a moron, because you're not resuming anything at all, you're resetting the device from scratch. You _reinitialize_ the device. You don't resume it, and somebody may have (and indeed, *WILL HAVE*) used the device in between. 
There should be absolutely zero shared code, and the *last* thing you should do is to call the device with the same function, and give it a flag to tell it to do one thing or the other. The important thing to take away from this is that it has nothing to do with "suspend" or "resume" at any level what-so-ever. Not at a device level, not at a system level, and not even when it comes to hardware. But for completely idiotic and wrong reasons, it is currently intimately involved in suspend/resume, and calls the same device management functions as a suspend/resume thing does, and shares a lot of the code. And THAT is why I hate the kernel STD. It is fundamentally confused. In ways that APM was not, I'd like to point out. I'd love to get it fixed. But the first fix is to not call it "suspend", because language *does* matter, and using that term is what I'm convinced has confused so many people. If it had been called "snapshot + restore", I suspect a lot of people wouldn't have been so confused about what it does and how it needs to do it, and wouldn't have tried to shoehorn it into the same corner of the kernel as "suspend-to-ram" (where you really *can* do things like "suspend" devices, and while they might certainly lose power in between, they also really might not, and they certainly won't be *doing* things in between). Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
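[ Editorial sketch, not part of the original message. The ordering Linus describes (snapshot, write the image through devices that are still running, power off; later load the image, turn devices off, put the image in place, reinitialize) can be written down as a toy event log. Every name below is hypothetical and nothing here corresponds to real kernel entry points; it only makes the sequence explicit. ]

```python
def snapshot_and_power_off(memory, devices, storage, log):
    """Toy 'snapshot' half: no device is suspended at any point."""
    # Atomic part: copy memory and note each device's state.
    image = {"memory": dict(memory),
             "device_state": {name: "saved" for name in devices}}
    log.append("snapshot")
    # Devices are still running, so the image can be written out.
    storage.append(image)
    log.append("write_image")
    # Whatever happens after this point is an ordinary power-off.
    log.append("power_off")


def restore(devices, storage, log):
    """Toy 'restore' half: devices are reset, not resumed."""
    image = storage[-1]
    log.append("load_image")
    for name in devices:
        log.append("turn_off:" + name)
    log.append("place_image")
    for name in devices:
        # Reinitialize from scratch: someone used the device in between.
        log.append("reinitialize:" + name)
    return dict(image["memory"])
```

[ Note that the log never contains a "suspend" or "resume" event, which is the whole argument for calling the feature "snapshot + restore". ]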
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:08 ` Linus Torvalds @ 2007-04-25 20:27 ` Pavel Machek 2007-04-25 20:44 ` Linus Torvalds 2007-04-26 0:41 ` Thomas Orgis 1 sibling, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! _Can we get a suspend-to-RAM maintainer_? Noone cares about s2ram these days. I do care a little, seife maintains whitelist, you care for mac mini, Len/Andrew/Intel acpi team helps sometimes... But I feel we should have someone listed in the MAINTAINERS file. Patrick was close, but... > > Any working suspend-to-disk method takes care of that for me. (I'm > > really not sure why Linus hates S2D so much, though. Back in the day > > there was a lot more BIOS support, but that's been years now.) > > The really sad part is that APM actually did this better.. I agree that APM STR worked better than current ACPI STR. I think swsusp already works better than APM STD, but... > I think you could do STD right too, but: > > - if you think it's about suspending devices, you are immediately > disqualified. If you call the device driver "suspend" or "resume" > functions, you are doing something wrong. > > - "suspend" is: snapshot memory, and anything you do *after* the snapshot > is totally irrelevant. You MUST NOT suspend devices before, since > devices are what that snapshot should be written out to, and you MUST > NOT suspend devices afterwards either, because that shows that you are > a moron who didn't understand the "machine will be turned off" part. Can I get you on IRC somewhere? No, I do not think I'm a moron, and yes, I need to suspend^Wsnapshot the devices before, so I have that in the snapshot. Of course, I'll need to resume^Wrestore the devices before writing snapshot. 
That's okay, it does not take long. > - "resume" is basically: get image into memory, turn *off* every device, Exactly. I need to turn devices *off* before restoring image, and I need them *off* before saving image, too -- DMAs are dangerous. I currently do that using "suspend" and "resume" hooks, before they turn DMAs / IRQs off as a sideeffect. > put image into its proper location, and call the "startup" function. If > you call a device "resume()" function, you again show that you are a > moron, because you're not resuming anything at all, you're resetting > the device from scratch. You _reinitialize_ the device. You don't > resume it, and somebody may hve (and indeed, *WILL HAVE* used the > device in between). There should be absolutely zero shared code, and > the *last* thing you should do is to call the device with the same > function, and give it a flag to tell it to do one thing or the other. Well, "startup" function or how you want to call it has to deal with device not initialized (s2disk driver is module, s2ram with hw powered off) and has to deal with device initialized (s2disk driver in kernel case, s2ram device was powered). I fear that "resume"/"restore" functions need to be pretty robust, anyway... > And THAT is why I hate the kernel STD. It is fundamentally confused. In > ways that APM was not, I'd like to point out. Ok, yes, it is confused/confusing. > I'd love to get it fixed. But the first fix is to not call it "suspend", > because language *does* matter, and using that term is what I'm convinced > has confused so many people. > If it had been called "snapshot + restore", I suspect a lot of > people snapshot/restore sounds okay to me. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:27 ` Pavel Machek @ 2007-04-25 20:44 ` Linus Torvalds 2007-04-25 21:07 ` Rafael J. Wysocki 2007-04-25 21:44 ` Pavel Machek 0 siblings, 2 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 20:44 UTC (permalink / raw) To: Pavel Machek Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Pavel Machek wrote: > > Can I get you on IRC somewhere? No, I do not think I'm a moron, and > yes, I need to suspend^Wsnapshot the devices before, so I have that in > the snapshot. Of course, I'll need to resume^Wrestore the devices > before writing snapshot. That's okay, it does not take long. You do NOT need to "suspend" the devices, and that's the whole point. You may want to save the device info somewhere, BUT THAT IS SOMETHING TOTALLY DIFFERENT! This is *exactly* the confusion I'm talking about. The STD and STR codepaths try to use the same function for two TOTALLY DIFFERENT things. STR actually wants to "suspend". STD actually wants to "atomic snapshot", and it must not allow allocations or anything like that, because the whole snapshot image should be done atomically as one event. But it should *not* suspend, because that device may actually be needed afterwards. So not the same thing at all. So here's what "suspend()" wants: - suspend() - preparatory work, can error out, can delay, can park the disk, etc etc. - suspend_late() - called late, with interrupts disabled, should actually suspend if the early suspend didn't do it already And here is what "snapshot()" wants: - prepare_to_snapshot() (for memory allocation) - snapshot() - called late, with interrupts disabled, save state. and there is absolutely _zero_ overlap between them. There just isn't anything in common. 
Yes, both are two-phase (for the simple reason that both want an "atomic" part), but there's really no real overlap. Just trying to *make* them be the same operations is just going to introduce flags that then cause them to be totally different *and* confusing and generate bugs. It also means that people do one of them, and "it works" for that case, and the other case is totally broken, but it's not obvious, because doing one means that the system _thinks_ that you did both! In the very unlikely case that some driver actually *wants* to use the same function for snapshots and suspending, that driver could just go ahead and _use_ the same function pointer. But now, as things are set up, we force a total confusion on drivers by calling them through the same interface for two totally different things. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
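[ Editorial sketch, not part of the original message. The four callback names are the ones Linus proposes above; everything else (the `ToyDriver` class, its fields, the two `system_*` helpers) is an assumption made up to show the two paths side by side. It is a Python toy, so "no allocation in the atomic phase" is only modelled by requiring that the buffer already exists. ]

```python
class ToyDriver:
    """Hypothetical driver with two unrelated two-phase interfaces."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.powered = True
        self._buffer = None  # set up by prepare_to_snapshot()
        self.saved = None

    # --- suspend-to-RAM path -------------------------------------
    def suspend(self):
        """Preparatory work; may sleep, may report failure."""
        return True

    def suspend_late(self):
        """Interrupts off: actually power the device down."""
        self.powered = False

    # --- snapshot path -------------------------------------------
    def prepare_to_snapshot(self):
        """May sleep: reserve the memory the atomic phase will need."""
        self._buffer = {}

    def snapshot(self):
        """Interrupts off: save state only; the device keeps running."""
        if self._buffer is None:
            raise RuntimeError("snapshot() without prepare_to_snapshot()")
        self._buffer.update(self.registers)
        self.saved = self._buffer


def system_suspend(drivers):
    for drv in drivers:
        if not drv.suspend():
            raise RuntimeError("suspend refused")
    for drv in drivers:
        drv.suspend_late()


def system_snapshot(drivers):
    for drv in drivers:
        drv.prepare_to_snapshot()
    for drv in drivers:
        drv.snapshot()
```

[ After `system_snapshot()` every device is still powered and can be used to write the image; after `system_suspend()` none is. A driver that really wants one function for both could point both slots at it, as the message says. ]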
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:44 ` Linus Torvalds @ 2007-04-25 21:07 ` Rafael J. Wysocki 2007-04-25 21:44 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-25 21:07 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wednesday, 25 April 2007 22:44, Linus Torvalds wrote: > > On Wed, 25 Apr 2007, Pavel Machek wrote: > > > > Can I get you on IRC somewhere? No, I do not think I'm a moron, and > > yes, I need to suspend^Wsnapshot the devices before, so I have that in > > the snapshot. Of course, I'll need to resume^Wrestore the devices > > before writing snapshot. That's okay, it does not take long. > > You do NOT need to "suspend" the devices, and that's the whole point. > > You may want to save the device info somewhere, BUT THAT IS SOMETHING > TOTALLY DIFFERENT! > > This is *exactly* the confusion I'm talking about. The STD and STR > codepaths try to use the same function for two TOTALLY DIFFERENT things. > > STR actually wants to "suspend". > > STD actually wants to "atomic snapshot", and it must not allow allocations > or anything like that, because the whole snapshot image should be done > atomically as one event. But it should *not* suspend, because that device > may actually be needed afterwards. > > So not the same thing at all. > > So here's what "suspend()" wants: > - suspend() - preparatory work, can error our, can delay, can park the > disk, etc etc. > - suspend_late() - called late, with interrupts disabled, should actually > suspend if the early suspend didn't do it already > > And here is what "snapshot()" wants: > - prepare_to_snapshot() (for memory allocation) > - snapshot() - called late, with interrupts disabled, save state. 
> > and there is absolutely _zero_ overlap between them. There just isn't > anything in common. Yes, both are two-phase (for the simple reason that > both want an "atomic" part), but there's really no real overlap. > > Just trying to *make* them be the same operations is just going to > introduce flags that then cause them to be totally different *and* > confusing and generate bugs. It also means that people do one of them, and > "it works" for that case, and the other case is totally broken, but it's > not obvious, because doing one means that the system _thinks_ that you did > both! > > In the very unlikely case that some driver actually *wants* to use the > same function for snapshots and suspending, that driver could just go > ahead and _use_ the same function pointer. But now, as things are set up, > we force a total confusion on drivers by calling them through the same > interface for two totally different things. I agree, except there are surprisingly many drivers like that. You're right, we should be doing all of it in a different way, but this means a lot of changes and we can't do them overnight. As I wrote in the reply to Pavel, I think we can introduce .freeze(), .thaw() (and .prethaw() for that matter) callbacks for hibernation and make drivers use them, but that will be a long series of patches. Still, I think it's doable. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:44 ` Linus Torvalds 2007-04-25 21:07 ` Rafael J. Wysocki @ 2007-04-25 21:44 ` Pavel Machek 2007-04-25 22:18 ` Linus Torvalds 1 sibling, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 21:44 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > Can I get you on IRC somewhere? No, I do not think I'm a moron, and > > yes, I need to suspend^Wsnapshot the devices before, so I have that in > > the snapshot. Of course, I'll need to resume^Wrestore the devices > > before writing snapshot. That's okay, it does not take long. > > You do NOT need to "suspend" the devices, and that's the whole point. > > You may want to save the device info somewhere, BUT THAT IS SOMETHING > TOTALLY DIFFERENT! > > This is *exactly* the confusion I'm talking about. The STD and STR > codepaths try to use the same function for two TOTALLY DIFFERENT things. > > STR actually wants to "suspend". > > STD actually wants to "atomic snapshot", and it must not allow allocations > or anything like that, because the whole snapshot image should be done > atomically as one event. But it should *not* suspend, because that device > may actually be needed afterwards. > > So not the same thing at all. Not the same... but they are still related. "freeze" (for atomic snapshot) is actually subset of "suspend"... freeze needs DMAs off and saved state, and you need DMAs off and saved state for "suspend". So it is actually correct to do "suspend" when you want "freeze"; it is just slow. That's why they only differ in parameter these days. > So here's what "suspend()" wants: > - suspend() - preparatory work, can error our, can delay, can park the > disk, etc etc. 
> - suspend_late() - called late, with interrupts disabled, should actually > suspend if the early suspend didn't do it already > > And here is what "snapshot()" wants: > - prepare_to_snapshot() (for memory allocation) Lets call this "freeze"? > - snapshot() - called late, with interrupts disabled, save state. > and there is absolutely _zero_ overlap between them. There just isn't > anything in common. Yes, both are two-phase (for the simple reason that > both want an "atomic" part), but there's really no real overlap. As I tried to explain, you can use suspend() to stop DMAs and save state, and that's enough to get sane snapshot. (You do cli() before doing snapshot, that helps with irqs). > Just trying to *make* them be the same operations is just going to > introduce flags that then cause them to be totally different *and* > confusing and generate bugs. It also means that people do one of them, and > "it works" for that case, and the other case is totally broken, but it's > not obvious, because doing one means that the system _thinks_ that you did > both! > > In the very unlikely case that some driver actually *wants* to use the > same function for snapshots and suspending, that driver could just go > ahead and _use_ the same function pointer. But now, as things are set up, > we force a total confusion on drivers by calling them through the same > interface for two totally different things. Ok ok, we can do suspend(PMSG_SUSPEND) -> suspend() suspend(PMSG_FREEZE) -> freeze() . We'll need to do big search&replace over the kernel etc. But if you think it helps with confusion... I'd still like to keep people using same method for both. It means suspend path gets more testing, even when some stuff it does is not strictly neccessary for freeze. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 21:44 ` Pavel Machek @ 2007-04-25 22:18 ` Linus Torvalds 2007-04-25 22:27 ` Nigel Cunningham ` (4 more replies) 0 siblings, 5 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 22:18 UTC (permalink / raw) To: Pavel Machek Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Pavel Machek wrote: > > Not the same... but they are still related. "freeze" (for atomic > snapshot) is actually subset of "suspend"... freeze needs DMAs off and > saved state, and you need DMAs off and saved state for "suspend". THEY HAVE ABSOLUTELY NOTHING IN COMMON! Nobody in their right mind thinks that "disable DMA" and "suspend" are similar operations. > So it is actually correct to do "suspend" when you want "freeze"; it > is just slow. That's why they only differ in parameter these days. It is *not* correct to "suspend" when you want "freeze". I don't understand how you can even *claim* something like that. Here's a trivial example: - SCSI disk Tell me, what does "suspend" do, and what does "freeze" (snapshot) do? And name *one* thing that they have in common. I'll tell you: Nada. Zero. Zilch. Nothing. "Freeze" for a disk is a total no-op. There is no DMA, there is no nothing. In contrast, "suspend" for a disk is a totally valid operation. Anybody who claims that these two operations are "related" is a moron. I'm sorry Pavel, but that's exactly how it is. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
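[ Editorial sketch, not part of the original message. The disagreement above is about a single driver entry point whose behaviour is selected by a flag (PMSG_SUSPEND versus PMSG_FREEZE earlier in the thread). A toy version of Linus's SCSI disk example, with the hypothetical names `PmEvent` and `ToyScsiDisk`, shows how little the two branches share. ]

```python
from enum import Enum


class PmEvent(Enum):
    """Stand-ins for the two messages discussed in the thread."""
    SUSPEND = "suspend"
    FREEZE = "freeze"


class ToyScsiDisk:
    """Hypothetical disk driver with one flag-driven entry point."""

    def __init__(self):
        self.spinning = True

    def suspend(self, event):
        if event is PmEvent.FREEZE:
            # Snapshot case: nothing to do, the disk must stay
            # usable so the image can be written to it afterwards.
            return
        # Real suspend: park the heads and spin down.
        self.spinning = False
```

[ The function body is effectively two unrelated functions joined by an `if`, which is the design Linus objects to. Pavel's counter-argument, that running the full suspend branch for FREEZE would still work and only be slower, is not modelled here. ]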
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:18 ` Linus Torvalds @ 2007-04-25 22:27 ` Nigel Cunningham 2007-04-25 22:55 ` Linus Torvalds 2007-04-25 22:42 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek ` (3 subsequent siblings) 4 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-25 22:27 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1259 bytes --] Hi. On Wed, 2007-04-25 at 15:18 -0700, Linus Torvalds wrote: > > On Wed, 25 Apr 2007, Pavel Machek wrote: > > > > Not the same... but they are still related. "freeze" (for atomic > > snapshot) is actually subset of "suspend"... freeze needs DMAs off and > > saved state, and you need DMAs off and saved state for "suspend". > > THEY HAVE ABSOLUTELY NOTHING IN COMMON! > > Nobody in their right mind thinks that "disable DMA" and "suspend" are > similar operations. > > > So it is actually correct to do "suspend" when you want "freeze"; it > > is just slow. That's why they only differ in parameter these days. > > It is *not* correct to "suspend" when you want "freeze". > > I don't understand how you can even *claim* something like that. > > Here's a trivial example: > - SCSI disk > > Tell me, what does "suspend" do, and what does "freeze" (snapshot) do? > > And name *one* thing that have in common. Set/reset the scsi transaction id thingy? Hibernation didn't work with SCSI for a long time precisely because that support was missing. Don't get me wrong, I agree on the whole - Suspend2 worked fine on the whole under 2.4 without a driver model. But they do have a bit in common. 
Regards, Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:27 ` Nigel Cunningham @ 2007-04-25 22:55 ` Linus Torvalds 2007-04-25 23:13 ` Pavel Machek ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 22:55 UTC (permalink / raw) To: Nigel Cunningham Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > > > And name *one* thing that have in common. > > Set/reset the scsi transaction id thingy? Hibernation didn't work with > SCSI for a long time precisely because that support was missing. And by "hibernation", you mean what? You mean "snapshot + shutdown", right? Think about it for five seconds, and then ask yourself: at which point in the "snapshot + shutdown" sequence would you actually tell a disk to shut down? If you said "snapshot", then you'd be *wrong*. That's my _point_. The snapshot() function should not (and MUST NOT) tell disks to shut down, because unlike suspend(), we're still going to _use_ those disks afterwards (why? To write out the snapshot image!). In other words, the act of creating a snapshot has *nothing* to do with suspend. Now, after you've created (and written out) the snapshot, what do you actually end up doing? That's right - you end up _shutting down_ the machine, and yes, as part of the _shutdown_ sequence you may actually end up doing a lot of the things that a suspend would do. But that's long *after* you've actually done the "snapshot" part, and has absolutely nothing to do with it. That's where I started: whole "suspend to disk" thing actually has _more_ to do with "shutdown" than with "suspend". Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:55 ` Linus Torvalds @ 2007-04-25 23:13 ` Pavel Machek 2007-04-25 23:29 ` Linus Torvalds 2007-04-26 1:40 ` Nigel Cunningham 2007-04-26 10:39 ` Johannes Berg 2 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 23:13 UTC (permalink / raw) To: Linus Torvalds Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > > And name *one* thing that have in common. > > > > Set/reset the scsi transaction id thingy? Hibernation didn't work with > > SCSI for a long time precisely because that support was missing. > > And by "hibernation", you mean what? You mean "snapshot + shutdown", > right? > > Think about it for five seconds, and then ask yourself: at which point in > the "snapshot + shutdown" sequence would you actually tell a disk to shut > down? Current design is: Twice. Once during snapshot (then we spin it up when the snapshot is done), and once during shutdown. Yep, we optimize away spindown, because it takes too long, so SCSI disks are actually very bad example. > If you said "snapshot", then you'd be *wrong*. > > That's my _point_. The snapshot() function should not (and MUST NOT) tell > disks to shut down, because unlike suspend(), we're still going to _use_ > those disks afterwards (why? To write out the snapshot image!). No, I'd like you to understand that we actually CAN tell the disks to spin down, because we'll call resume and spin them back again before writing the image. We used to do it. We still can do it, but it is slow. Yes, it is quite confusing. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:13 ` Pavel Machek @ 2007-04-25 23:29 ` Linus Torvalds 2007-04-25 23:45 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 23:29 UTC (permalink / raw) To: Pavel Machek Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > Current design is: Broken. Yes. I've tried to tell you. > Twice. Once during snapshot (then we spin it up when the snapshot is > done), and once during shutdown. And nobody can possibly say that is "sane". But it's a direct result of the incorrect thinking that "suspend()" and "snapshot()" have anything what-so-ever to do with each other. > Yep, we optimize away spindown, because it takes too long, so SCSI > disks are actually very bad example. No. SCSI disks are a *good* example. It's an example of how you (incorrectly) call the same function for two totally different things, and then that function is smart enough that it *understands* that they are totally different. But the *confusion* remains. It remains in your head, and you've poisoned people like Alan too, that usually are not confused. And THAT is the main problem (although there are also indirect problems like "fixing one may break the other", but I actually think that the fundamental problem is the confusion it creates, which in turn causes bugs to happen because people are confused and think that they should do the same thing for suspend and for snapshot). > No, I'd like you to understand that we actually CAN tell the disks to > spin down, because we'll call resume and spin them back again before > writing the image. We used to do it. We still can do it, but it is > slow. > > Yes, it is quite confusing. It's worse than just confusing, it's *idiotic*. 
It _can_ work in practice, but - we have pretty damn solid evidence that it doesn't work all that often in practice - the fact that something *can* be done the stupid way is in no way an argument that it *should* be done the stupid way. I claim that the current STD is *stupid*. Yes, it can work. But that doesn't make it less stupid. What's your argument? Your argument seems to be that it's not stupid, because it can work. Can't you see that that simply isn't an argument at all? "stupid and wrong" doesn't mean "cannot work in theory". But it *does* mean that people get confused, and it *does* mean that there are likely more bugs, because confused people do not tend to write very good code. I'm not claiming that the current code cannot work. It clearly *does* work for a lot of people. But I'm claiming that it's STUPID. So don't argue that "it works". Windows works, kind of. That doesn't make it less stupid and badly designed! Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:29 ` Linus Torvalds @ 2007-04-25 23:45 ` Pavel Machek 2007-04-26 1:48 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 23:45 UTC (permalink / raw) To: Linus Torvalds Cc: Nigel Cunningham, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > Current design is: > > Broken. Yes. I've tried to tell you. Ok. ... > It's worse than just confusing, it's *idiotic*. > > It _can_ work in practice, but > - we have pretty damn solid evidence that it doesn't work all that often > in practice > - the fact that something *can* be done the stupid way is in no way an > argument that it *should* be done the stupid way. > > I claim that the current STD is *stupid*. Yes, it can work. But that > doesn't make it less stupid. Good. So you understand how it works. > What's your argument? Your argument seems to be that it's not stupid, > because it can work. Can't you see that that simply isn't an > argument at I tried keeping module_init/thaw/resume similar code, so that driver authors can debug suspend-to-disk, cross their fingers, and have suspend-to-ram work, too. Now, perhaps enough people do std/str these days so this is not important any longer... lets hope so. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:45 ` Pavel Machek @ 2007-04-26 1:48 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 1:48 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 1623 bytes --] Hi. On Thu, 2007-04-26 at 01:45 +0200, Pavel Machek wrote: > > What's your argument? Your argument seems to be that it's not stupid, > > because it can work. Can't you see that that simply isn't an > > argument at > > I tried keeping module_init/thaw/resume similar code, so that driver > authors can debug suspend-to-disk, cross their fingers, and have > suspend-to-ram work, too. > Now, perhaps enough people do std/str these days so this is not > important any longer... lets hope so. Noooo! It's important and getting more important. More and more, people are going to be demanding better power saving (climate change and all that stuff). The best power saving is to have the thing completely off, so STD is more important. The second best power saving is STR, so that's important too. But even more important is good power saving all the time. For that reason, I agree completely with Linus. The current model is far too limited. It shouldn't be so suspend-to-ram/disk centric, and should instead focus on run-time power management, with suspend to ram and disk as particular instances of run-time power management. It should make appropriate differentiation between snapshotting and suspending to ram. I do disagree that the current suspend-to-disk algorithm is broken. We do need a point at which we say "Ok, drivers, record your state." - the current device_suspend and device_resume calls. 
But that doesn't mean they need to be called device_suspend/resume or do what they do now. I'd love to help make this happen, but I'm afraid I just don't have the time. Nigel
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:55 ` Linus Torvalds 2007-04-25 23:13 ` Pavel Machek @ 2007-04-26 1:40 ` Nigel Cunningham 2007-04-26 2:04 ` Linus Torvalds 2007-04-26 10:39 ` Johannes Berg 2 siblings, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 1:40 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi. On Wed, 2007-04-25 at 15:55 -0700, Linus Torvalds wrote: > > On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > > > > > And name *one* thing that have in common. > > > > Set/reset the scsi transaction id thingy? Hibernation didn't work with > > SCSI for a long time precisely because that support was missing. > > And by "hibernation", you mean what? You mean "snapshot + shutdown", > right? > > Think about it for five seconds, and then ask yourself: at which point in > the "snapshot + shutdown" sequence would you actually tell a disk to shut > down? > > If you said "snapshot", then you'd be *wrong*. No, I didn't. I agree with you that they should be separate and distinct. I'm just pointing out that you're overstretching your argument a little. There are some similarities in that in both cases we want to get the driver into some quiet state and out of it again. The difference comes from the fact that the quiet states don't have to be the same and shouldn't be the same. I won't insult your intelligence by describing the differences in more detail. > That's my _point_. The snapshot() function should not (and MUST NOT) tell > disks to shut down, because unlike suspend(), we're still going to _use_ > those disks afterwards (why? To write out the snapshot image!). > > In other words, the act of creating a snapshot has *nothing* to do with > suspend. Absolutely.
It's about getting the data we need to restore it to the same state post-hibernation-cycle (or, more correctly) post-atomic-restore. > Now, after you've created (and written out) the snapshot, what do you > actually end up doing? > > That's right - you end up _shutting down_ the machine, and yes, as part > of the _shutdown_ sequence you may actually end up doing a lot of the > things that a suspend would do. But that's long *after* you've actually > done the "snapshot" part, and has absolutely nothing to do with it. > > That's where I started: whole "suspend to disk" thing actually has _more_ > to do with "shutdown" than with "suspend". That's where I think you're overstretching the argument. Like suspend (to ram), we're concerned at the snapshot point with getting the hardware in the same state at a later stage. Unlike suspend, we don't necessarily want it to enter a low power-usage state as part of that state preservation. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 1:40 ` Nigel Cunningham @ 2007-04-26 2:04 ` Linus Torvalds 2007-04-26 2:13 ` Nigel Cunningham 2007-04-26 2:31 ` Nigel Cunningham 0 siblings, 2 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 2:04 UTC (permalink / raw) To: Nigel Cunningham Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > That's where I think you're overstretching the argument. Like suspend >(to ram), we're concerned at the snapshot point with getting the hardware >in the same state at a later stage. Really, no. "suspend to ram" doesn't _have_ a "snapshot point". I've tried to explain this multiple times, I don't know why it's not apparently sinking in. This is much more fundamental than the fact that you don't want to stop disks for snapshotting, although it really boils down to all the same issues: the operations are simply not at all the same! I agree 100% that "snapshot to disk" is a "snapshot event". You have to create a single point in time when everything is stable. And I'd much rather call it "snapshot to disk" than "suspend to disk" to make it clear that it's something _totally_ different from "suspend". Because the thing is, "suspend to ram" is *not* a snapshot event. At no point do you actually need to "snapshot" the system at all. You can just gradually shut more and more things down, and equally gradually bring them back up. There simply is *never* any "snapshot" time from a device standpoint, because you can just shut down devices in the right order AND YOU ARE DONE. Really. [ Obviously s2ram does have one "magic moment", namely the time when the CPU does the magic read from the northbridge that actually turns off power for the CPU. 
But that's really a total non-event from a device standpoint, so while it's undoubtedly a very interesting moment in the suspend sequence, it's not really relevant in any way for device drivers in general. Not at all like the "snapshot moment" that requires the whole system to be totally quiescent in a "snapshot to disk"! ] And the reason s2ram doesn't have that "snapshot" moment is exactly that the RAM contents are just always there, so there's no need to have a "synchronization event" when ram and devices match. The RAM will *always* match whatever any particular device has done to it, and the proper way to handle things is to just do a simple per-device "save-and-suspend" event. And yes, the _individual_ "save-and-suspend" events obviously need to be "atomic", but it's purely about that particular individual device, so there are never any cross-device issues about that. For example, if you're a USB hub controller, which is just about the most complex issue you can have, you obviously want to "save the state" with the controller in a STOPPED state, but that should just go without saying: if the controller isn't stopped, you simply *cannot* save the state, since the state is changing under you. The difference is that the USB driver needs to just "stop, save, and suspend" as one simple operation for s2ram. In contrast, when doing snapshot to disk, it cannot do that, because while it does want to do the "stop" part, it needs to do so _separately_ from the "save" part because you need to stop everything else *too* before you "save" anything at all. Linus
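The difference in ordering described in the message above can be sketched in plain userspace C. This is only a toy model under stated assumptions: `struct toy_dev`, `toy_suspend_to_ram()` and `toy_snapshot()` are hypothetical names invented for illustration, not kernel interfaces, and the device array is assumed to be ordered root first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy device; none of these names exist in the kernel. */
struct toy_dev {
	const char *name;
	bool stopped;	/* device no longer changes its own state */
	bool saved;	/* device state has been recorded in RAM */
	bool powered;	/* device is still usable afterwards */
};

/* True once every device in the array has been stopped. */
static bool all_stopped(const struct toy_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!devs[i].stopped)
			return false;
	return true;
}

/*
 * s2ram model: each device does "stop, save, and suspend" as one
 * step, leaf devices first. No global synchronization point exists.
 */
static void toy_suspend_to_ram(struct toy_dev *devs, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		devs[i].stopped = true;
		devs[i].saved = true;
		devs[i].powered = false;
	}
}

/*
 * Snapshot model: phase one stops every device, phase two saves
 * state only after everything is quiescent. Power stays on, because
 * the disks are still needed to write the image out.
 */
static void toy_snapshot(struct toy_dev *devs, size_t n)
{
	for (size_t i = n; i-- > 0; )
		devs[i].stopped = true;

	assert(all_stopped(devs, n));	/* the single "snapshot moment" */

	for (size_t i = 0; i < n; i++)
		devs[i].saved = true;
}
```

In this model the only cross-device requirement sits in `toy_snapshot()`: nothing is saved until every device has stopped. `toy_suspend_to_ram()` has no such global point.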
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 2:04 ` Linus Torvalds @ 2007-04-26 2:13 ` Nigel Cunningham 2007-04-26 3:03 ` Linus Torvalds 2007-04-26 2:31 ` Nigel Cunningham 1 sibling, 1 reply; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 2:13 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 3837 bytes --] Hi. On Wed, 2007-04-25 at 19:04 -0700, Linus Torvalds wrote: > > On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > > > That's where I think you're overstretching the argument. Like suspend > >(to ram), we're concerned at the snapshot point with getting the hardware > >in the same state at a later stage. > > Really, no. > > "suspend to ram" doesn't _have_ a "snapshot point". Sorry. I wasn't clear. I wasn't saying that suspend to ram has a snapshot point. I was trying to say it has a point where you're seeking to save information (PCI state / SCSI transaction number or whatever) that you'll need to get the hardware into the same state at a later stage. That (saving information) is the point of similarity. > I've tried to explain this multiple times, I don't know why it's not > apparently sinking in. This is much more fundamental than the fact that > you don't want to stop disks for snapshotting, although it really boils > down to all the same issues: the operations are simply not at all the > same! Miscommunication, I think. Does the above help? > I agree 100% that "snapshot to disk" is a "snapshot event". You have to > create a single point in time when everything is stable. And I'd much > rather call it "snapshot to disk" than "suspend to disk" to make it clear > that it's something _totally_ different from "suspend". > > Because the thing is, "suspend to ram" is *not* a snapshot event. 
At no > point do you actually need to "snapshot" the system at all. You can just > gradually shut more and more things down, and equally gradually bring them > back up. There simply is *never* any "snapshot" time from a device > standpoint, because you can just shut down devices in the right order AND > YOU ARE DONE. > > Really. I suppose that's another point of similarity - for snapshotting, the same ordering is probably needed? > [ Obviously s2ram does have one "magic moment", namely the time when the > CPU does the magic read from the northbridge that actually turns off > power for the CPU. But that's really a total non-event from a device > standpoint, so while it's undoubtedly a very interesting moment in the > suspend sequence, it's not really relevant in any way for device > drivers in general. Not at all like the "snapshot moment" that requires > the whole system to be totally quiescent in a "snapshot to disk"! ] > > And the reason s2ram doesn't have a that "snapshot" moment is exactly that > the RAM contents are just always there, so there's no need to have a > "synchronization event" when ram and devices match. The RAM will *always* > match whatever any particular device has done to it, and the proper way to > handle things is to just do a simple per-device "save-and-suspend" event. Yeah. > And yes, the _individual_ "save-and-suspend" events obviously needs to be > "atomic", but it's purely about that particular individual device, so > there's never any cross-device issues about that. No interdependencies? I'm not sure. > For example, if you're a USB hub controller, which is just about the most > complex issue you can have, you obviously want to "save the state" with > the controller in a STOPPED state, but that should just go without saying: > if the controller isn't stopped, you simply *cannot* save the state, since > the state is changing under you. 
> > The difference is that the USB driver needs to just "stop, save, and > suspend" as one simple operation for s2ram. In contrast, when doing > snapshot to disk, it cannot do that, because while it does want to do the > "stop" part, it needs to do so _separately_ from the "save" part because > you need to stop everything else *too* before you "save" anything at all. Agree. Nigel
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 2:13 ` Nigel Cunningham @ 2007-04-26 3:03 ` Linus Torvalds 2007-04-26 3:34 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 3:03 UTC (permalink / raw) To: Nigel Cunningham Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > Sorry. I wasn't clear. I wasn't saying that suspend to ram has a > snapshot point. I was trying to say it has a point where you're seeking > to save information (PCI state / SCSI transaction number or whatever) > that you'll need to get the hardware into the same state at a later > stage. That (saving information) is the point of similarity. Yes, they do both save information, but I'm not actually convinced they would necessarily even save the *same* information. Let's just take an example of USB, and to make things more interesting, say that the disk you want to suspend to is itself over USB (not necessarily something you _want_ to do, but I think we can all agree that it's something that should potentially work, no?) Now, USB devices actually have per-connection state (at a minimum, the "toggle" bit or whatever), and that's obviously something that will inevitably *change* as a result of the device being used after snapshotting (and even if not used, by the rediscovery by the first kernel to boot), and we fundamentally cannot put the final toggle state in the snapshot. So in the snapshot-to-disk scenario, there are some pieces of data that simply fundamentally *cannot* be snapshotted, because they are not controller state, they are "connection" state. So in that case, you basically know that you *have* to rebuild the connection when you do the "snapshot_resume()" thing. 
So there's no point in even keeping these kinds of connection states (the same is true of keyboards, mice, anything else - it's how USB works). In contrast, in suspend-to-RAM, USB connections might just be things you actually want to keep open and active, and you *can* do so, in ways you simply cannot do with "snapshot to disk". In fact, if you are something like an OLPC and actually go to s2ram very aggressively, you might well want to keep the connection established, because it's conceivable that you might otherwise lose keypresses, etc. See? There are real *technical* reasons to believe that the two "save state" operations are really fundamentally different. There are reasons to believe that a s2ram can actually happen while keeping some connections open that cannot be kept open over a disk snapshot. Do they *have* to be different? Of course not. For many devices the "save" and "freeze" operations will likely all be no-ops, and there would be absolutely no difference between suspending and snapshotting, because the driver state already natively contains all the information needed to get the device going again. Equally, I don't doubt that in many drivers you'll have very similar "save state" logic, but in fact I believe that in many cases that "save state" logic will often just be a simple
	pci_save_state(dev);
call, so it's literally the case that they will not be just shared between the "suspend" and "snapshot" case, they'll be shared across all simple PCI devices too! But that doesn't mean that the functions to do so should be the same. You might have
	/* suspend: record config space, then drop to a low-power state */
	static int mypcidevice_suspend(struct pci_dev *dev)
	{
		pci_save_state(dev);
		pci_set_power_state(dev, PCI_D3hot);
		return 0;
	}

	/* snapshot: record config space only, the device stays powered */
	static int mypcidevice_snapshot(struct pci_dev *dev)
	{
		pci_save_state(dev);
		return 0;
	}
and who cares if they both have that same call to a shared "save state" function?
They're still totally different operations, and the fact that *some* devices may save the same things doesn't make them any more similar! See above why some devices might save totally *different* things for a "snapshot" vs a "suspend" event. > I suppose that's another point of similarity - for snapshotting, the > same ordering is probably needed? I agree that you're likely to walk the device list in the same order. The whole "shut down leaf devices first", "start up root devices first" is pretty fundamental. But that's true of reboot and device discovery too. Should that ordering mean that we should use the "discovery()" function and pass it a flag and say "you shouldn't discover, you should snapshot or suspend now"? No. Everybody agrees that device discovery is something different from device suspend. The fact that it's done in a topological order and thus they bear some kind of inverse relationship to each other doesn't make them "the same". > > And yes, the _individual_ "save-and-suspend" events obviously needs to be > > "atomic", but it's purely about that particular individual device, so > > there's never any cross-device issues about that. > > No interdependencies? I'm not sure. Well, we pretty much count on it, since we will *suspend* the devices at the same time. So if they had interdependencies that aren't described by the ordering we enforce, they are pretty much screwed anyway ;) So yes, the device list needs to be topologically sorted (and you need to walk it in the right direction), but apart from that we'd *better* not have any interdependencies, or we simply cannot suspend at all. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
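The ordering rule from the message above ("shut down leaf devices first", "start up root devices first") can be sketched as a walk over a topologically sorted array. This is a minimal userspace model, assuming a hypothetical `struct toy_node` in which every parent is stored at a lower index than its children. It is not the kernel's device list.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_PARENT ((size_t)-1)

/* Hypothetical device-tree node; the array must list parents first. */
struct toy_node {
	const char *name;
	size_t parent;		/* index of the parent, or TOY_NO_PARENT */
	int suspended;
};

/*
 * Suspend leaf devices first by walking the array backwards.
 * Returns -1 if the list is not sorted parent-before-child or if a
 * parent is already down while its child is still being suspended.
 */
static int toy_suspend_all(struct toy_node *n, size_t count)
{
	assert(n != NULL || count == 0);
	for (size_t i = count; i-- > 0; ) {
		size_t p = n[i].parent;

		if (p != TOY_NO_PARENT && (p >= i || n[p].suspended))
			return -1;
		n[i].suspended = 1;
	}
	return 0;
}

/*
 * Resume root devices first by walking the array forwards.
 * Returns -1 if a child would come up before its parent.
 */
static int toy_resume_all(struct toy_node *n, size_t count)
{
	assert(n != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		size_t p = n[i].parent;

		if (p != TOY_NO_PARENT && (p >= i || n[p].suspended))
			return -1;
		n[i].suspended = 0;
	}
	return 0;
}
```

The only dependency the walk knows about is the parent link. Any interdependency that the ordering does not describe is invisible to it, which matches the remark above that such devices "are pretty much screwed anyway".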
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 3:03 ` Linus Torvalds @ 2007-04-26 3:34 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 3:34 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 5779 bytes --] Hi. On Wed, 2007-04-25 at 20:03 -0700, Linus Torvalds wrote: > > On Thu, 26 Apr 2007, Nigel Cunningham wrote: > > > > Sorry. I wasn't clear. I wasn't saying that suspend to ram has a > > snapshot point. I was trying to say it has a point where you're seeking > > to save information (PCI state / SCSI transaction number or whatever) > > that you'll need to get the hardware into the same state at a later > > stage. That (saving information) is the point of similarity. > > Yes, they do both save information, but I'm not actually convinced they > would necessarily even save the *same* information. > > Let's just take an example of USB, and to make things more interesting, > say that the disk you want to suspend to is itself over USB (not > necessarily something you _want_ to do, but I think we can all agree that > it's something that should potentially work, no?) Agreed - it would be nice. > Now, USB devices actually have per-connection state (at a minimum, the > "toggle" bit or whatever), and that's obviously something that will > inevitably *change* as a result of the device being used after > snapshotting (and even if not used, by the rediscovery by the first kernel > to boot), and we fundamentally cannot put the final toggle state in the > snapshot. > > So in the snapshot-to-disk scenario, there are some pieces of data that > simply fundamentally *cannot* be snapshotted, because they are not > controller state, they are "connection" state. 
> > So in that case, you basically know that you *have* to rebuild the > connection when you do the "snapshot_resume()" thing. So there's no point > in even keeping these kinds of connection states (the same is true of > keyboards, mice, anything else - it's how USB works). Sort of agree - you might want to record some serial number that might let you recognise it as the same thing at resume time when everything is re-hotplugged (assuming it's even there then). Nevertheless, I don't think that diminishes what you're saying. > In contrast, in suspend-to-RAM, USB connections might just be things you > actually want to keep open and active, and you *can* do so, in ways you > simply cannot do with "snapshot to disk". In fact, if you are something > like an OLPC and actually go to s2ram very aggressively, you might well > want to keep the connection established, because it's conceivable that you > might otherwise lose keypresses etc issues) > > See? There are real *technical* reasons to believe that the two "save > state" operations are really fundamentally different. There are reasons to > believe that a s2ram can actually happen while keeping some connections > open that cannot be kept open over a disk snapshot. > > Do they *have* to be different? Of course not. For many devices the "save" > and "freeze" operations will likely all be no-ops, and there would be > absolutely no difference between suspending and snapshotting, because the > driver state already natively contains all the information needed to get > the device going again. > > Equally, I don't doubt that in many drivers you'll have very similar "save > state" logic, but in fact I believe that in many cases that "save state" > logic will often just be a simple > > pci_save_state(dev); > > call, so it's literally the case that they will not be just shared between > the "suspend" and "snapshot" case, they'll be shared across all simple PCI > devices too! 
> > But that doesn't mean that the functions to do so should be the same. You > might have > > static int mypcidevice_suspend(struct pci_dev *dev) > { > pci_save_state(dev); > pci_set_power_state(dev, PCI_D3); > return 0; > } > > static int mupcidevice_snapshot(struct pci_dev *dev) > { > pci_save_state(dev); > return 0; > } > > and who cares if they both have that same call to a shared "save state" > function? They're still totally different operations, and the fact that > *some* devices may save the same things doesn't make them any more > similar! See above why some devices might save totally *different* things > for a "snapshot" vs a "suspend" event. No disagreement here. > > I suppose that's another point of similarity - for snapshotting, the > > same ordering is probably needed? > > I agree that you're likely to walk the device list in the same order. The > whole "shut down leaf devices first", "start up root devices first" is > pretty fundamental. > > But that's true of reboot and device discovery too. Should that ordering > mean that we should use the "discovery()" function and pass it a flag and > say "you shouldn't discover, you should snapshot or suspend now"? No. > Everybody agrees that device discovery is something different from device > suspend. The fact that it's done in a topological order and thus they bear > some kind of inverse relationship to each other doesn't make them "the > same". > > > > And yes, the _individual_ "save-and-suspend" events obviously needs to be > > > "atomic", but it's purely about that particular individual device, so > > > there's never any cross-device issues about that. > > > > No interdependencies? I'm not sure. > > Well, we pretty much count on it, since we will *suspend* the devices at > the same time. 
So if they had interdependencies that aren't described by > the ordering we enforce, they are pretty much screwed anyway ;) > > So yes, the device list needs to be topologically sorted (and you need to > walk it in the right direction), but apart from that we'd *better* not > have any interdependencies, or we simply cannot suspend at all. Thanks for your reply. Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 2:04 ` Linus Torvalds 2007-04-26 2:13 ` Nigel Cunningham @ 2007-04-26 2:31 ` Nigel Cunningham 1 sibling, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 2:31 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi. Hmm. Perhaps I should have added to that last reply that recognising that they store similar information doesn't mean I think they need the same high-level routine for both state transitions. I'd really like to see each driver have some sort of state machine controlling its power management, into which these calls were just another input (an important one, but just another), alongside information about policy, whether we're on battery (UPS or laptop) or AC, whether the device is actually being used, and so on. Nigel
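One possible reading of the per-driver state machine suggested above is sketched here in userspace C. The states, inputs, and transition rules are all assumptions made up for illustration (`toy_pm_state`, `toy_pm_input`, `toy_pm_step()` and `toy_pm_run()` are hypothetical). The message does not specify any of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical power states for one device. */
enum toy_pm_state {
	TOY_ACTIVE,	/* fully powered and in use */
	TOY_LOW_POWER,	/* run-time power saving while idle */
	TOY_QUIESCED,	/* stopped, state recorded, still powered (snapshot) */
	TOY_OFF		/* state recorded and powered down (suspend) */
};

/* Hypothetical inputs: snapshot and suspend are just two among several. */
enum toy_pm_input {
	TOY_IN_IDLE,
	TOY_IN_BUSY,
	TOY_IN_SNAPSHOT,
	TOY_IN_SUSPEND,
	TOY_IN_WAKE
};

/* Compute the next state from the current state, one input, and policy. */
static enum toy_pm_state toy_pm_step(enum toy_pm_state s,
				     enum toy_pm_input in, bool on_battery)
{
	switch (in) {
	case TOY_IN_IDLE:
		/* Assumed policy: only save power at run time on battery. */
		return (s == TOY_ACTIVE && on_battery) ? TOY_LOW_POWER : s;
	case TOY_IN_BUSY:
		return (s == TOY_LOW_POWER) ? TOY_ACTIVE : s;
	case TOY_IN_SNAPSHOT:
		/* Record state but keep the device powered. */
		return (s == TOY_ACTIVE || s == TOY_LOW_POWER) ? TOY_QUIESCED : s;
	case TOY_IN_SUSPEND:
		return TOY_OFF;
	case TOY_IN_WAKE:
		return (s == TOY_QUIESCED || s == TOY_OFF) ? TOY_ACTIVE : s;
	}
	return s;
}

/* Feed a sequence of inputs through the machine and return the final state. */
static enum toy_pm_state toy_pm_run(enum toy_pm_state s,
				    const enum toy_pm_input *in, size_t n,
				    bool on_battery)
{
	assert(in != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		s = toy_pm_step(s, in[i], on_battery);
	return s;
}
```

Here a snapshot request and a suspend request lead to different states (`TOY_QUIESCED` versus `TOY_OFF`), and the `on_battery` flag stands in for the policy information mentioned in the message.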
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:55 ` Linus Torvalds 2007-04-25 23:13 ` Pavel Machek 2007-04-26 1:40 ` Nigel Cunningham @ 2007-04-26 10:39 ` Johannes Berg 2007-04-26 11:30 ` Pavel Machek 2 siblings, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-26 10:39 UTC (permalink / raw) To: Linus Torvalds Cc: Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Pavel Machek, Thomas Gleixner, Ingo Molnar, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 544 bytes --] On Wed, 2007-04-25 at 15:55 -0700, Linus Torvalds wrote: > That's where I started: whole "suspend to disk" thing actually has _more_ > to do with "shutdown" than with "suspend". From looking at pm_ops which I was recently working with a lot, it seems that it was designed by somebody who was reading the ACPI documentation and was otherwise pretty clueless, even at that level std tries to look like suspend. IMHO that is one of the first things that should be ripped out, no pm_ops for STD, it's a pain to work with. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:39 ` Johannes Berg @ 2007-04-26 11:30 ` Pavel Machek 2007-04-26 11:41 ` Johannes Berg 2007-04-26 16:31 ` Johannes Berg 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:30 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven Hi! > > That's where I started: whole "suspend to disk" thing actually has _more_ > > to do with "shutdown" than with "suspend". > > From looking at pm_ops which I was recently working with a lot, it seems > that it was designed by somebody who was reading the ACPI documentation > and was otherwise pretty clueless, even at that level std tries to look > like suspend. IMHO that is one of the first things that should be ripped > out, no pm_ops for STD, it's a pain to work with. That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 low-level enter is pretty similar). Patches would be welcome, as would be "suspend-to-ram maintainer". Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:30 ` Pavel Machek @ 2007-04-26 11:41 ` Johannes Berg 2007-04-26 16:31 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 11:41 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 428 bytes --] On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote: > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 > low-level enter is pretty similar). But that doesn't excuse abusing the same interface, IMHO. > Patches would be welcome :) I'll see what I can do. Shouldn't be too hard to add an interface just for ACPI here and get platform disk-mode into there from a different angle. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:30 ` Pavel Machek @ 2007-04-26 16:31 ` Johannes Berg 2007-04-26 16:31 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 16:31 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote: > > From looking at pm_ops which I was recently working with a lot, it seems > > that it was designed by somebody who was reading the ACPI documentation > > and was otherwise pretty clueless, even at that level std tries to look > > like suspend. IMHO that is one of the first things that should be ripped > > out, no pm_ops for STD, it's a pain to work with. > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 > low-level enter is pretty similar). > > Patches would be welcome That was easier than I thought. This applies on top of a patch that makes kernel/power/user.c optional since I had no idea how to fix it, problems I see: * it surfaces kernel implementation details about pm_ops and thus makes the whole thing very fragile * it has yet another interface (yuck) to determine whether to reboot, shut down etc, doesn't use /sys/power/disk * I generally had no idea wtf it is doing in some places Anyway, this patch is only compile tested, it * introduces include/linux/hibernate.h with hibernate_ops and a new hibernate() function to hibernate the system * rips apart a lot of the suspend code and puts it back together using the hibernate_ops * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode) * might apply/compile against -mm, I have all my and some of Rafael's suspend/hibernate work in my tree. 
* breaks user suspend as I noted above * is incomplete, somewhere pm_suspend_disk() is still defined iirc johannes --- Documentation/power/userland-swsusp.txt | 26 +++---- drivers/acpi/sleep/main.c | 89 ++++++++++++++++++++---- drivers/acpi/sleep/proc.c | 3 drivers/i2c/chips/tps65010.c | 2 include/linux/hibernate.h | 36 +++++++++ include/linux/pm.h | 31 -------- kernel/power/disk.c | 117 +++++++++++++++++++------------- kernel/power/main.c | 47 +++++------- kernel/power/power.h | 13 --- kernel/power/user.c | 28 +------ kernel/sys.c | 3 11 files changed, 231 insertions(+), 164 deletions(-) --- wireless-dev.orig/include/linux/pm.h 2007-04-26 18:15:00.440691185 +0200 +++ wireless-dev/include/linux/pm.h 2007-04-26 18:15:09.410691185 +0200 @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t; #define PM_SUSPEND_ON ((__force suspend_state_t) 0) #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1) #define PM_SUSPEND_MEM ((__force suspend_state_t) 3) -#define PM_SUSPEND_DISK ((__force suspend_state_t) 4) -#define PM_SUSPEND_MAX ((__force suspend_state_t) 5) - -typedef int __bitwise suspend_disk_method_t; - -/* invalid must be 0 so struct pm_ops initialisers can leave it out */ -#define PM_DISK_INVALID ((__force suspend_disk_method_t) 0) -#define PM_DISK_PLATFORM ((__force suspend_disk_method_t) 1) -#define PM_DISK_SHUTDOWN ((__force suspend_disk_method_t) 2) -#define PM_DISK_REBOOT ((__force suspend_disk_method_t) 3) -#define PM_DISK_TEST ((__force suspend_disk_method_t) 4) -#define PM_DISK_TESTPROC ((__force suspend_disk_method_t) 5) -#define PM_DISK_MAX ((__force suspend_disk_method_t) 6) +#define PM_SUSPEND_MAX ((__force suspend_state_t) 4) /** * struct pm_ops - Callbacks for managing platform dependent suspend states. * @valid: Callback to determine whether the given state can be entered. - * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is - * always valid and never passed to this call. If not assigned, - * no suspend states are valid. 
* Valid states are advertised in /sys/power/state but can still * be rejected by prepare or enter if the conditions aren't right. * There is a %pm_valid_only_mem function available that can be assigned @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho * * @finish: Called when the system has left the given state and all devices * are resumed. The return value is ignored. - * - * @pm_disk_mode: The generic code always allows one of the shutdown methods - * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and - * %PM_DISK_TESTPROC. If this variable is set, the mode it is set - * to is allowed in addition to those modes and is also made default. - * When this mode is sent selected, the @prepare call will be called - * before suspending to disk (if present), the @enter call should be - * present and will be called after all state has been saved and the - * machine is ready to be powered off; the @finish callback is called - * after state has been restored. All these calls are called with - * %PM_SUSPEND_DISK as the state. */ struct pm_ops { int (*valid)(suspend_state_t state); int (*prepare)(suspend_state_t state); int (*enter)(suspend_state_t state); int (*finish)(suspend_state_t state); - suspend_disk_method_t pm_disk_mode; }; /** @@ -276,8 +249,6 @@ extern void device_power_up(void); extern void device_resume(void); #ifdef CONFIG_PM -extern suspend_disk_method_t pm_disk_mode; - extern int device_suspend(pm_message_t state); extern int device_prepare_suspend(pm_message_t state); --- wireless-dev.orig/kernel/power/main.c 2007-04-26 18:15:00.790691185 +0200 +++ wireless-dev/kernel/power/main.c 2007-04-26 18:15:09.410691185 +0200 @@ -21,6 +21,7 @@ #include <linux/resume-trace.h> #include <linux/freezer.h> #include <linux/vmstat.h> +#include <linux/hibernate.h> #include "power.h" @@ -30,7 +31,6 @@ DEFINE_MUTEX(pm_mutex); struct pm_ops *pm_ops; -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN; /** * pm_set_ops - Set the global power method table. 
@@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops) { mutex_lock(&pm_mutex); pm_ops = ops; - if (ops && ops->pm_disk_mode != PM_DISK_INVALID) { - pm_disk_mode = ops->pm_disk_mode; - } else - pm_disk_mode = PM_DISK_SHUTDOWN; mutex_unlock(&pm_mutex); } @@ -184,24 +180,12 @@ static void suspend_finish(suspend_state static const char * const pm_states[PM_SUSPEND_MAX] = { [PM_SUSPEND_STANDBY] = "standby", [PM_SUSPEND_MEM] = "mem", - [PM_SUSPEND_DISK] = "disk", }; static inline int valid_state(suspend_state_t state) { - /* Suspend-to-disk does not really need low-level support. - * It can work with shutdown/reboot if needed. If it isn't - * configured, then it cannot be supported. - */ - if (state == PM_SUSPEND_DISK) -#ifdef CONFIG_SOFTWARE_SUSPEND - return 1; -#else - return 0; -#endif - - /* all other states need lowlevel support and need to be - * valid to the lowlevel implementation, no valid callback + /* All states need lowlevel support and need to be valid + * to the lowlevel implementation, no valid callback * implies that none are valid. */ if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state)) return 0; @@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s if (!mutex_trylock(&pm_mutex)) return -EBUSY; - if (state == PM_SUSPEND_DISK) { - error = pm_suspend_disk(); - goto Unlock; - } - pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]); if ((error = suspend_prepare(state))) goto Unlock; @@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s /** * pm_suspend - Externally visible function for suspending system. - * @state: Enumarted value of state to enter. + * @state: Enumerated value of state to enter. * * Determine whether or not value is within range, get state * structure, and enter (above). 
@@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL); static ssize_t state_show(struct subsystem * subsys, char * buf) { int i; - char * s = buf; + char *s = buf; for (i = 0; i < PM_SUSPEND_MAX; i++) { if (pm_states[i] && valid_state(i)) - s += sprintf(s,"%s ", pm_states[i]); + s += sprintf(s, "%s ", pm_states[i]); } - s += sprintf(s,"\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + s += sprintf(s, "%s\n", "disk"); +#else + if (s != buf) + /* convert the last space to a newline */ + *(s-1) = '\n'; +#endif return (s - buf); } @@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys p = memchr(buf, '\n', n); len = p ? p - buf : n; + /* first check hibernate */ + if (!strncmp(buf, "disk", len)) { + error = hibernate(); + return error ? error : n; + } + for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { if (*s && !strncmp(buf, *s, len)) break; --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ wireless-dev/include/linux/hibernate.h 2007-04-26 18:21:38.130691185 +0200 @@ -0,0 +1,36 @@ +#ifndef __LINUX_HIBERNATE +#define __LINUX_HIBERNATE +/* + * hibernate ('suspend to disk') functionality + */ + +/** + * struct hibernate_ops - hibernate platform support + * + * The methods in this structure allow a platform to override what + * happens for shutting down the machine when going into hibernation. + * + * All three methods must be assigned. + * + * @prepare: prepare system for hibernation + * @enter: shut down system after state has been saved to disk + * @finish: finish/clean up after state has been reloaded + */ +struct hibernate_ops { + int (*prepare)(void); + int (*enter)(void); + void (*finish)(void); +}; + +/** + * hibernate_set_ops - set the global hibernate operations + * @ops: the hibernate operations to use from now on.
+ */ +void hibernate_set_ops(struct hibernate_ops *ops); + +/** + * hibernate - hibernate the system + */ +int hibernate(void); + +#endif /* __LINUX_HIBERNATE */ --- wireless-dev.orig/kernel/power/disk.c 2007-04-26 18:15:00.800691185 +0200 +++ wireless-dev/kernel/power/disk.c 2007-04-26 18:15:09.420691185 +0200 @@ -21,45 +21,72 @@ #include <linux/console.h> #include <linux/cpu.h> #include <linux/freezer.h> +#include <linux/hibernate.h> #include "power.h" -static int noresume = 0; +static int noresume; char resume_file[256] = CONFIG_PM_STD_PARTITION; dev_t swsusp_resume_device; sector_t swsusp_resume_block; +static struct hibernate_ops *hibernate_ops; +static int pm_disk_mode; + +enum { + PM_DISK_INVALID, + PM_DISK_PLATFORM, + PM_DISK_TEST, + PM_DISK_TESTPROC, + PM_DISK_SHUTDOWN, + PM_DISK_REBOOT, + /* keep last */ + __PM_DISK_AFTER_LAST +}; +#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1) +#define PM_DISK_FIRST (PM_DISK_INVALID + 1) + +void hibernate_set_ops(struct hibernate_ops *ops) +{ + BUG_ON(!ops->prepare); + BUG_ON(!ops->enter); + BUG_ON(!ops->finish); + mutex_lock(&pm_mutex); + hibernate_ops = ops; + mutex_unlock(&pm_mutex); +} + + /** - * platform_prepare - prepare the machine for hibernation using the - * platform driver if so configured and return an error code if it fails + * hibernate_platform_prepare - prepare the machine for hibernation using + * the platform driver if so configured and return an error code if it + * fails. */ -static inline int platform_prepare(void) +int hibernate_platform_prepare(void) { - int error = 0; - switch (pm_disk_mode) { case PM_DISK_TEST: case PM_DISK_TESTPROC: case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: break; - default: - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); + case PM_DISK_PLATFORM: + if (hibernate_ops) + return hibernate_ops->prepare(); } - return error; + return 0; } /** - * power_down - Shut machine down for hibernate.
+ * hibernate_power_down - Shut machine down for hibernate. * * Use the platform driver, if configured so; otherwise try * to power off or reboot. */ -static void power_down(void) +static void hibernate_power_down(void) { switch (pm_disk_mode) { case PM_DISK_TEST: @@ -70,11 +97,10 @@ static void power_down(void) case PM_DISK_REBOOT: kernel_restart(NULL); break; - default: - if (pm_ops && pm_ops->enter) { + case PM_DISK_PLATFORM: + if (hibernate_ops) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - pm_ops->enter(PM_SUSPEND_DISK); - break; + hibernate_ops->enter(); } } @@ -85,7 +111,7 @@ static void power_down(void) while(1); } -static inline void platform_finish(void) +void hibernate_platform_finish(void) { switch (pm_disk_mode) { case PM_DISK_TEST: @@ -93,9 +119,9 @@ static inline void platform_finish(void) case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: break; - default: - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); + case PM_DISK_PLATFORM: + if (hibernate_ops) + hibernate_ops->finish(); } } @@ -118,13 +144,13 @@ static int prepare_processes(void) } /** - * pm_suspend_disk - The granpappy of hibernation power management. + * hibernate - The granpappy of hibernation power management. * * If not, then call swsusp to do its thing, then figure out how * to power down the system. 
*/ -int pm_suspend_disk(void) +int hibernate(void) { int error; @@ -147,7 +173,7 @@ int pm_suspend_disk(void) if (error) goto Finish; - error = platform_prepare(); + error = hibernate_platform_prepare(); if (error) goto Finish; @@ -175,13 +201,13 @@ int pm_suspend_disk(void) if (in_suspend) { enable_nonboot_cpus(); - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); pr_debug("PM: writing image.\n"); error = swsusp_write(); if (!error) - power_down(); + hibernate_power_down(); else { swsusp_free(); goto Finish; @@ -194,7 +220,7 @@ int pm_suspend_disk(void) Enable_cpus: enable_nonboot_cpus(); Resume_devices: - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); Finish: @@ -211,7 +237,7 @@ int pm_suspend_disk(void) * Called as a late_initcall (so all devices are discovered and * initialized), we call swsusp to see if we have a saved image or not. * If so, we quiesce devices, the restore the saved image. We will - * return above (in pm_suspend_disk() ) if everything goes well. + * return above (in hibernate() ) if everything goes well. * Otherwise, we fail gracefully and return to the normally * scheduled program. * @@ -311,12 +337,13 @@ static const char * const pm_disk_modes[ * * Suspend-to-disk can be handled in several ways. We have a few options * for putting the system to sleep - using the platform driver (e.g. ACPI - * or other pm_ops), powering off the system or rebooting the system - * (for testing) as well as the two test modes. + * or other hibernate_ops), powering off the system or rebooting the + * system (for testing) as well as the two test modes. * * The system can support 'platform', and that is known a priori (and - * encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot' - * as alternatives, as well as the test modes 'test' and 'testproc'. + * encoded by the presence of hibernate_ops). 
However, the user may choose + * 'shutdown' or 'reboot' as alternatives, as well as the test modes 'test' + * and 'testproc'. * * show() will display what the mode is currently set to. * store() will accept one of @@ -328,7 +355,7 @@ static const char * const pm_disk_modes[ * 'testproc' * * It will only change to 'platform' if the system - * supports it (as determined from pm_ops->pm_disk_mode). + * supports it (as determined by having hibernate_ops). */ static ssize_t disk_show(struct subsystem * subsys, char * buf) @@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste int i; char *start = buf; - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { + for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) { if (!pm_disk_modes[i]) continue; switch (i) { @@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste case PM_DISK_TEST: case PM_DISK_TESTPROC: break; - default: - if (pm_ops && pm_ops->enter && - (i == pm_ops->pm_disk_mode)) + case PM_DISK_PLATFORM: + if (hibernate_ops) break; /* not a valid mode, continue with loop */ continue; @@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst int i; int len; char *p; - suspend_disk_method_t mode = 0; + int mode = PM_DISK_INVALID; p = memchr(buf, '\n', n); len = p ? 
p - buf : n; mutex_lock(&pm_mutex); - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { + for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) { if (!strncmp(buf, pm_disk_modes[i], len)) { mode = i; break; } } - if (mode) { + if (mode != PM_DISK_INVALID) { switch (mode) { case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: @@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst case PM_DISK_TESTPROC: pm_disk_mode = mode; break; - default: - if (pm_ops && pm_ops->enter && - (mode == pm_ops->pm_disk_mode)) + case PM_DISK_PLATFORM: + if (hibernate_ops) pm_disk_mode = mode; else error = -EINVAL; } - } else { + } else error = -EINVAL; - } - pr_debug("PM: suspend-to-disk mode set to '%s'\n", - pm_disk_modes[mode]); + if (!error) + pr_debug("PM: suspend-to-disk mode set to '%s'\n", + pm_disk_modes[mode]); mutex_unlock(&pm_mutex); return error ? error : n; } --- wireless-dev.orig/kernel/power/user.c 2007-04-26 18:15:01.130691185 +0200 +++ wireless-dev/kernel/power/user.c 2007-04-26 18:15:09.420691185 +0200 @@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil return res; } -static inline int platform_prepare(void) -{ - int error = 0; - - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); - - return error; -} - -static inline void platform_finish(void) -{ - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); -} - static inline int snapshot_suspend(int platform_suspend) { int error; @@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p goto Finish; if (platform_suspend) { - error = platform_prepare(); + error = hibernate_platform_prepare(); if (error) goto Finish; } @@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p enable_nonboot_cpus(); Resume_devices: if (platform_suspend) - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); @@ -188,7 +172,7 @@ static inline int snapshot_restore(int p mutex_lock(&pm_mutex); pm_prepare_console(); if (platform_suspend) { - error = platform_prepare(); +
error = hibernate_platform_prepare(); if (error) goto Finish; } @@ -204,7 +188,7 @@ static inline int snapshot_restore(int p enable_nonboot_cpus(); Resume_devices: if (platform_suspend) - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); @@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode * case PMOPS_ENTER: if (data->platform_suspend) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - error = pm_ops->enter(PM_SUSPEND_DISK); + error = hibernate_ops->enter(); + /* how can this possibly do the right thing? */ error = 0; } break; case PMOPS_FINISH: if (data->platform_suspend) + /* and why doesn't this invoke anything??? */ error = 0; break; --- wireless-dev.orig/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:02.120691185 +0200 +++ wireless-dev/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:09.440691185 +0200 @@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t to resume the system from RAM if there's enough battery power or restore its state on the basis of the saved suspend image otherwise) -SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and - pmops->finish methods (the in-kernel swsusp knows these as the "platform - method") which are needed on many machines to (among others) speed up - the resume by letting the BIOS skip some steps or to let the system - recognise the correct state of the hardware after the resume (in - particular on many machines this ensures that unplugged AC - adapters get correctly detected and that kacpid does not run wild after - the resume). 
The last ioctl() argument can take one of the three - values, defined in kernel/power/power.h: +SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare, + hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel + swsusp knows these as the "platform method") which are needed on many + machines to (among others) speed up the resume by letting the BIOS skip + some steps or to let the system recognise the correct state of the + hardware after the resume (in particular on many machines this ensures + that unplugged AC adapters get correctly detected and that kacpid does + not run wild after the resume). The last ioctl() argument can take one + of the three values, defined in kernel/power/power.h: PMOPS_PREPARE - make the kernel carry out the - pm_ops->prepare(PM_SUSPEND_DISK) operation + hibernate_ops->prepare() operation PMOPS_ENTER - make the kernel power off the system by calling - pm_ops->enter(PM_SUSPEND_DISK) + hibernate_ops->enter() PMOPS_FINISH - make the kernel carry out the - pm_ops->finish(PM_SUSPEND_DISK) operation + hibernate_ops->finish() operation + Note that the actual constants are misnamed because they surface + internal kernel implementation details that have changed. The device's read() operation can be used to transfer the snapshot image from the kernel. 
It has the following limitations: --- wireless-dev.orig/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:02.150691185 +0200 +++ wireless-dev/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:09.440691185 +0200 @@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp * also needs to get error handling and probably * an #ifdef CONFIG_SOFTWARE_SUSPEND */ - pm_suspend(PM_SUSPEND_DISK); + hibernate(); #endif poll = 1; } --- wireless-dev.orig/kernel/sys.c 2007-04-26 18:15:01.310691185 +0200 +++ wireless-dev/kernel/sys.c 2007-04-26 18:15:09.450691185 +0200 @@ -25,6 +25,7 @@ #include <linux/security.h> #include <linux/dcookies.h> #include <linux/suspend.h> +#include <linux/hibernate.h> #include <linux/tty.h> #include <linux/signal.h> #include <linux/cn_proc.h> @@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i #ifdef CONFIG_SOFTWARE_SUSPEND case LINUX_REBOOT_CMD_SW_SUSPEND: { - int ret = pm_suspend(PM_SUSPEND_DISK); + int ret = hibernate(); unlock_kernel(); return ret; } --- wireless-dev.orig/drivers/acpi/sleep/main.c 2007-04-26 18:15:02.290691185 +0200 +++ wireless-dev/drivers/acpi/sleep/main.c 2007-04-26 18:15:09.630691185 +0200 @@ -15,6 +15,7 @@ #include <linux/dmi.h> #include <linux/device.h> #include <linux/suspend.h> +#include <linux/hibernate.h> #include <acpi/acpi_bus.h> #include <acpi/acpi_drivers.h> #include "sleep.h" @@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = { [PM_SUSPEND_ON] = ACPI_STATE_S0, [PM_SUSPEND_STANDBY] = ACPI_STATE_S1, [PM_SUSPEND_MEM] = ACPI_STATE_S3, - [PM_SUSPEND_DISK] = ACPI_STATE_S4, [PM_SUSPEND_MAX] = ACPI_STATE_S5 }; @@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t do_suspend_lowlevel(); break; - case PM_SUSPEND_DISK: - if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM) - status = acpi_enter_sleep_state(acpi_state); - break; - case PM_SUSPEND_MAX: - acpi_power_off(); - break; - default: return -EINVAL; } @@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state) suspend_state_t states[] = { [1] = PM_SUSPEND_STANDBY, [3] 
= PM_SUSPEND_MEM, - [4] = PM_SUSPEND_DISK, [5] = PM_SUSPEND_MAX }; if (acpi_state < 6 && states[acpi_state]) return pm_suspend(states[acpi_state]); + if (acpi_state == 4) + return hibernate(); return -EINVAL; } @@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = { .finish = acpi_pm_finish, }; +#ifdef CONFIG_SOFTWARE_SUSPEND +static int acpi_hib_prepare(void) +{ + return acpi_sleep_prepare(ACPI_STATE_S4); +} + +static int acpi_hib_enter(void) +{ + acpi_status status = AE_OK; + unsigned long flags = 0; + u32 acpi_state = acpi_suspend_states[pm_state]; + + ACPI_FLUSH_CPU_CACHE(); + + /* Do arch specific saving of state. */ + int error = acpi_save_state_mem(); + if (error) + return error; + + local_irq_save(flags); + acpi_enable_wakeup_device(acpi_state); + status = acpi_enter_sleep_state(acpi_state); + + /* ACPI 3.0 specs (P62) says that it's the responsabilty + * of the OSPM to clear the status bit [ implying that the + * POWER_BUTTON event should not reach userspace ] + */ + if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3)) + acpi_clear_event(ACPI_EVENT_POWER_BUTTON); + + local_irq_restore(flags); + printk(KERN_DEBUG "Back to C!\n"); + + /* restore processor state + * We should only be here if we're coming back from STR or STD. + * And, in the case of the latter, the memory image should have already + * been loaded from disk. + */ + acpi_restore_state_mem(); + + return ACPI_SUCCESS(status) ? 
0 : -EFAULT; +} + +static void acpi_hib_finish(void) +{ + acpi_leave_sleep_state(ACPI_STATE_S4); + acpi_disable_wakeup_device(ACPI_STATE_S4); + + /* reset firmware waking vector */ + acpi_set_firmware_waking_vector((acpi_physical_address) 0); + + if (init_8259A_after_S1) { + printk("Broken toshiba laptop -> kicking interrupts\n"); + init_8259A(0); + } + return 0; +} + +static struct hibernate_ops acpi_hib_ops = { + .prepare = acpi_hib_prepare, + .enter = acpi_hib_enter, + .finish = acpi_hib_finish, +}; +#endif /* CONFIG_SOFTWARE_SUSPEND */ + /* * Toshiba fails to preserve interrupts over S1, reinitialization * of 8259 is needed after S1 resume. @@ -227,13 +285,16 @@ int __init acpi_sleep_init(void) sleep_states[i] = 1; printk(" S%d", i); } - if (i == ACPI_STATE_S4) { - if (sleep_states[i]) - acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM; - } } printk(")\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + if (sleep_states[ACPI_STATE_S4]) + hibernate_set_ops(&acpi_hib_ops); +#else + sleep_states[ACPI_STATE_S4] = 0; +#endif + pm_set_ops(&acpi_pm_ops); return 0; } --- wireless-dev.orig/kernel/power/power.h 2007-04-26 18:15:01.240691185 +0200 +++ wireless-dev/kernel/power/power.h 2007-04-26 18:15:09.630691185 +0200 @@ -13,16 +13,6 @@ struct swsusp_info { -#ifdef CONFIG_SOFTWARE_SUSPEND -extern int pm_suspend_disk(void); - -#else -static inline int pm_suspend_disk(void) -{ - return -EPERM; -} -#endif - extern struct mutex pm_mutex; #define power_attr(_name) \ @@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t struct timeval; extern void swsusp_show_speed(struct timeval *, struct timeval *, unsigned int, char *); + +extern int hibernate_platform_prepare(void); +extern void hibernate_platform_finish(void); --- wireless-dev.orig/drivers/acpi/sleep/proc.c 2007-04-26 18:15:02.720691185 +0200 +++ wireless-dev/drivers/acpi/sleep/proc.c 2007-04-26 18:15:09.630691185 +0200 @@ -1,6 +1,7 @@ #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/suspend.h> +#include 
<linux/hibernate.h> #include <linux/bcd.h> #include <asm/uaccess.h> @@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil state = simple_strtoul(str, NULL, 0); #ifdef CONFIG_SOFTWARE_SUSPEND if (state == 4) { - error = pm_suspend(PM_SUSPEND_DISK); + error = hibernate(); goto Done; } #endif ^ permalink raw reply [flat|nested] 713+ messages in thread
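For readers who want the gist of the patch without reading the whole diff: the central idea appears to be a dedicated three-callback table for hibernation, with the 'platform' mode only selectable once a platform has registered that table. The userspace mock below sketches that flow under stated assumptions. Only the `struct hibernate_ops` layout is taken from the patch; the `mock_*` functions, the `MOCK_MODE_*` constants, the `demo_*` callbacks and the use of -1 in place of -EINVAL are hypothetical stand-ins, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct introduced in include/linux/hibernate.h. */
struct hibernate_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* Hypothetical stand-ins for the PM_DISK_* modes in kernel/power/disk.c. */
enum { MOCK_MODE_PLATFORM, MOCK_MODE_SHUTDOWN, MOCK_MODE_REBOOT };

struct hibernate_ops *mock_ops;
int mock_mode = MOCK_MODE_SHUTDOWN;

/* Register the table; all three callbacks must be set on the argument. */
void mock_set_ops(struct hibernate_ops *ops)
{
	assert(ops && ops->prepare && ops->enter && ops->finish);
	mock_ops = ops;
}

/* 'platform' is only accepted once a table has been registered. */
int mock_set_mode(int mode)
{
	if (mode == MOCK_MODE_PLATFORM && !mock_ops)
		return -1;  /* stands in for -EINVAL */
	mock_mode = mode;
	return 0;
}

/* Only the platform mode consults the registered callbacks. */
int mock_platform_prepare(void)
{
	if (mock_mode == MOCK_MODE_PLATFORM && mock_ops)
		return mock_ops->prepare();
	return 0;
}

/* A trivial demo platform that counts prepare() calls. */
int demo_prepare_calls;
int demo_prepare(void) { demo_prepare_calls++; return 0; }
int demo_enter(void) { return 0; }
void demo_finish(void) { }

struct hibernate_ops demo_ops = { demo_prepare, demo_enter, demo_finish };
```

In the shutdown and reboot modes the platform callbacks are never consulted, which matches how the patch turns the old `default:` branches into explicit `case PM_DISK_PLATFORM:` ones.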
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) @ 2007-04-26 16:31 ` Johannes Berg 0 siblings, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 16:31 UTC (permalink / raw) To: Pavel Machek Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, linux-pm, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote: > > From looking at pm_ops which I was recently working with a lot, it seems > > that it was designed by somebody who was reading the ACPI documentation > > and was otherwise pretty clueless, even at that level std tries to look > > like suspend. IMHO that is one of the first things that should be ripped > > out, no pm_ops for STD, it's a pain to work with. > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 > low-level enter is pretty similar). > > Patches would be welcome That was easier than I thought. This applies on top of a patch that makes kernel/power/user.c optional since I had no idea how to fix it, problems I see: * it surfaces kernel implementation details about pm_ops and thus makes the whole thing very fragile * it has yet another interface (yuck) to determine whether to reboot, shut down etc, doesn't use /sys/power/disk * I generally had no idea wtf it is doing in some places Anyway, this patch is only compile tested, it * introduces include/linux/hibernate.h with hibernate_ops and a new hibernate() function to hibernate the system * rips apart a lot of the suspend code and puts it back together using the hibernate_ops * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode) * might apply/compile against -mm, I have all my and some of Rafael's suspend/hibernate work in my tree. 
* breaks user suspend as I noted above * is incomplete, somewhere pm_suspend_disk() is still defined iirc johannes --- Documentation/power/userland-swsusp.txt | 26 +++---- drivers/acpi/sleep/main.c | 89 ++++++++++++++++++++---- drivers/acpi/sleep/proc.c | 3 drivers/i2c/chips/tps65010.c | 2 include/linux/hibernate.h | 36 +++++++++ include/linux/pm.h | 31 -------- kernel/power/disk.c | 117 +++++++++++++++++++------------- kernel/power/main.c | 47 +++++------- kernel/power/power.h | 13 --- kernel/power/user.c | 28 +------ kernel/sys.c | 3 11 files changed, 231 insertions(+), 164 deletions(-) --- wireless-dev.orig/include/linux/pm.h 2007-04-26 18:15:00.440691185 +0200 +++ wireless-dev/include/linux/pm.h 2007-04-26 18:15:09.410691185 +0200 @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t; #define PM_SUSPEND_ON ((__force suspend_state_t) 0) #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1) #define PM_SUSPEND_MEM ((__force suspend_state_t) 3) -#define PM_SUSPEND_DISK ((__force suspend_state_t) 4) -#define PM_SUSPEND_MAX ((__force suspend_state_t) 5) - -typedef int __bitwise suspend_disk_method_t; - -/* invalid must be 0 so struct pm_ops initialisers can leave it out */ -#define PM_DISK_INVALID ((__force suspend_disk_method_t) 0) -#define PM_DISK_PLATFORM ((__force suspend_disk_method_t) 1) -#define PM_DISK_SHUTDOWN ((__force suspend_disk_method_t) 2) -#define PM_DISK_REBOOT ((__force suspend_disk_method_t) 3) -#define PM_DISK_TEST ((__force suspend_disk_method_t) 4) -#define PM_DISK_TESTPROC ((__force suspend_disk_method_t) 5) -#define PM_DISK_MAX ((__force suspend_disk_method_t) 6) +#define PM_SUSPEND_MAX ((__force suspend_state_t) 4) /** * struct pm_ops - Callbacks for managing platform dependent suspend states. * @valid: Callback to determine whether the given state can be entered. - * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is - * always valid and never passed to this call. If not assigned, - * no suspend states are valid. 
* Valid states are advertised in /sys/power/state but can still * be rejected by prepare or enter if the conditions aren't right. * There is a %pm_valid_only_mem function available that can be assigned @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho * * @finish: Called when the system has left the given state and all devices * are resumed. The return value is ignored. - * - * @pm_disk_mode: The generic code always allows one of the shutdown methods - * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and - * %PM_DISK_TESTPROC. If this variable is set, the mode it is set - * to is allowed in addition to those modes and is also made default. - * When this mode is sent selected, the @prepare call will be called - * before suspending to disk (if present), the @enter call should be - * present and will be called after all state has been saved and the - * machine is ready to be powered off; the @finish callback is called - * after state has been restored. All these calls are called with - * %PM_SUSPEND_DISK as the state. */ struct pm_ops { int (*valid)(suspend_state_t state); int (*prepare)(suspend_state_t state); int (*enter)(suspend_state_t state); int (*finish)(suspend_state_t state); - suspend_disk_method_t pm_disk_mode; }; /** @@ -276,8 +249,6 @@ extern void device_power_up(void); extern void device_resume(void); #ifdef CONFIG_PM -extern suspend_disk_method_t pm_disk_mode; - extern int device_suspend(pm_message_t state); extern int device_prepare_suspend(pm_message_t state); --- wireless-dev.orig/kernel/power/main.c 2007-04-26 18:15:00.790691185 +0200 +++ wireless-dev/kernel/power/main.c 2007-04-26 18:15:09.410691185 +0200 @@ -21,6 +21,7 @@ #include <linux/resume-trace.h> #include <linux/freezer.h> #include <linux/vmstat.h> +#include <linux/hibernate.h> #include "power.h" @@ -30,7 +31,6 @@ DEFINE_MUTEX(pm_mutex); struct pm_ops *pm_ops; -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN; /** * pm_set_ops - Set the global power method table. 
@@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops) { mutex_lock(&pm_mutex); pm_ops = ops; - if (ops && ops->pm_disk_mode != PM_DISK_INVALID) { - pm_disk_mode = ops->pm_disk_mode; - } else - pm_disk_mode = PM_DISK_SHUTDOWN; mutex_unlock(&pm_mutex); } @@ -184,24 +180,12 @@ static void suspend_finish(suspend_state static const char * const pm_states[PM_SUSPEND_MAX] = { [PM_SUSPEND_STANDBY] = "standby", [PM_SUSPEND_MEM] = "mem", - [PM_SUSPEND_DISK] = "disk", }; static inline int valid_state(suspend_state_t state) { - /* Suspend-to-disk does not really need low-level support. - * It can work with shutdown/reboot if needed. If it isn't - * configured, then it cannot be supported. - */ - if (state == PM_SUSPEND_DISK) -#ifdef CONFIG_SOFTWARE_SUSPEND - return 1; -#else - return 0; -#endif - - /* all other states need lowlevel support and need to be - * valid to the lowlevel implementation, no valid callback + /* All states need lowlevel support and need to be valid + * to the lowlevel implementation, no valid callback * implies that none are valid. */ if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state)) return 0; @@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s if (!mutex_trylock(&pm_mutex)) return -EBUSY; - if (state == PM_SUSPEND_DISK) { - error = pm_suspend_disk(); - goto Unlock; - } - pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]); if ((error = suspend_prepare(state))) goto Unlock; @@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s /** * pm_suspend - Externally visible function for suspending system. - * @state: Enumarted value of state to enter. + * @state: Enumerated value of state to enter. * * Determine whether or not value is within range, get state * structure, and enter (above). 
@@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL); static ssize_t state_show(struct subsystem * subsys, char * buf) { int i; - char * s = buf; + char *s = buf; for (i = 0; i < PM_SUSPEND_MAX; i++) { if (pm_states[i] && valid_state(i)) - s += sprintf(s,"%s ", pm_states[i]); + s += sprintf(s, "%s ", pm_states[i]); } - s += sprintf(s,"\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + s += sprintf(s, "%s\n", "disk"); +#else + if (s != buf) + /* convert the last space to a newline */ + *(s-1) = '\n'; +#endif return (s - buf); } @@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys p = memchr(buf, '\n', n); len = p ? p - buf : n; + /* first check hibernate */ + if (!strncmp(buf, "disk", len)) { + error = hibernate(); + return error ? error : n; + } + for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { if (*s && !strncmp(buf, *s, len)) break; --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ wireless-dev/include/linux/hibernate.h 2007-04-26 18:21:38.130691185 +0200 @@ -0,0 +1,36 @@ +#ifndef __LINUX_HIBERNATE +#define __LINUX_HIBERNATE +/* + * hibernate ('suspend to disk') functionality + */ + +/** + * struct hibernate_ops - hibernate platform support + * + * The methods in this structure allow a platform to override what + * happens for shutting down the machine when going into hibernation. + * + * All three methods must be assigned. + * + * @prepare: prepare system for hibernation + * @enter: shut down system after state has been saved to disk + * @finish: finish/clean up after state has been reloaded + */ +struct hibernate_ops { + int (*prepare)(void); + int (*enter)(void); + void (*finish)(void); +}; + +/** + * hibernate_set_ops - set the global hibernate operations + * @ops: the hibernate operations to use from now on.
+ */ +void hibernate_set_ops(struct hibernate_ops *ops); + +/** + * hibernate - hibernate the system + */ +int hibernate(void); + +#endif /* __LINUX_HIBERNATE */ --- wireless-dev.orig/kernel/power/disk.c 2007-04-26 18:15:00.800691185 +0200 +++ wireless-dev/kernel/power/disk.c 2007-04-26 18:15:09.420691185 +0200 @@ -21,45 +21,72 @@ #include <linux/console.h> #include <linux/cpu.h> #include <linux/freezer.h> +#include <linux/hibernate.h> #include "power.h" -static int noresume = 0; +static int noresume; char resume_file[256] = CONFIG_PM_STD_PARTITION; dev_t swsusp_resume_device; sector_t swsusp_resume_block; +static struct hibernate_ops *hibernate_ops; +static int pm_disk_mode; + +enum { + PM_DISK_INVALID, + PM_DISK_PLATFORM, + PM_DISK_TEST, + PM_DISK_TESTPROC, + PM_DISK_SHUTDOWN, + PM_DISK_REBOOT, + /* keep last */ + __PM_DISK_AFTER_LAST +}; +#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1) +#define PM_DISK_FIRST (PM_DISK_INVALID + 1) + +void hibernate_set_ops(struct hibernate_ops *ops) +{ + BUG_ON(!ops->prepare); + BUG_ON(!ops->enter); + BUG_ON(!ops->finish); + mutex_lock(&pm_mutex); + hibernate_ops = ops; + mutex_unlock(&pm_mutex); +} + + /** - * platform_prepare - prepare the machine for hibernation using the - * platform driver if so configured and return an error code if it fails + * hibernate_platform_prepare - prepare the machine for hibernation using + * the platform driver if so configured and return an error code if it + * fails. */ -static inline int platform_prepare(void) +int hibernate_platform_prepare(void) { - int error = 0; - switch (pm_disk_mode) { case PM_DISK_TEST: case PM_DISK_TESTPROC: case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: break; - default: - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); + case PM_DISK_PLATFORM: + if (hibernate_ops) + return hibernate_ops->prepare(); } - return error; + return 0; } /** - * power_down - Shut machine down for hibernate.
+ * hibernate_power_down - Shut machine down for hibernate. * * Use the platform driver, if configured so; otherwise try * to power off or reboot. */ -static void power_down(void) +static void hibernate_power_down(void) { switch (pm_disk_mode) { case PM_DISK_TEST: @@ -70,11 +97,10 @@ static void power_down(void) case PM_DISK_REBOOT: kernel_restart(NULL); break; - default: - if (pm_ops && pm_ops->enter) { + case PM_DISK_PLATFORM: + if (hibernate_ops) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - pm_ops->enter(PM_SUSPEND_DISK); - break; + hibernate_ops->enter(); } } @@ -85,7 +111,7 @@ static void power_down(void) while(1); } -static inline void platform_finish(void) +void hibernate_platform_finish(void) { switch (pm_disk_mode) { case PM_DISK_TEST: @@ -93,9 +119,9 @@ static inline void platform_finish(void) case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: break; - default: - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); + case PM_DISK_PLATFORM: + if (hibernate_ops) + hibernate_ops->finish(); } } @@ -118,13 +144,13 @@ static int prepare_processes(void) } /** - * pm_suspend_disk - The granpappy of hibernation power management. + * hibernate - The granpappy of hibernation power management. * * If not, then call swsusp to do its thing, then figure out how * to power down the system. 
*/ -int pm_suspend_disk(void) +int hibernate(void) { int error; @@ -147,7 +173,7 @@ int pm_suspend_disk(void) if (error) goto Finish; - error = platform_prepare(); + error = hibernate_platform_prepare(); if (error) goto Finish; @@ -175,13 +201,13 @@ int pm_suspend_disk(void) if (in_suspend) { enable_nonboot_cpus(); - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); pr_debug("PM: writing image.\n"); error = swsusp_write(); if (!error) - power_down(); + hibernate_power_down(); else { swsusp_free(); goto Finish; @@ -194,7 +220,7 @@ int pm_suspend_disk(void) Enable_cpus: enable_nonboot_cpus(); Resume_devices: - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); Finish: @@ -211,7 +237,7 @@ int pm_suspend_disk(void) * Called as a late_initcall (so all devices are discovered and * initialized), we call swsusp to see if we have a saved image or not. * If so, we quiesce devices, the restore the saved image. We will - * return above (in pm_suspend_disk() ) if everything goes well. + * return above (in hibernate() ) if everything goes well. * Otherwise, we fail gracefully and return to the normally * scheduled program. * @@ -311,12 +337,13 @@ static const char * const pm_disk_modes[ * * Suspend-to-disk can be handled in several ways. We have a few options * for putting the system to sleep - using the platform driver (e.g. ACPI - * or other pm_ops), powering off the system or rebooting the system - * (for testing) as well as the two test modes. + * or other hibernate_ops), powering off the system or rebooting the + * system (for testing) as well as the two test modes. * * The system can support 'platform', and that is known a priori (and - * encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot' - * as alternatives, as well as the test modes 'test' and 'testproc'. + * encoded by the presence of hibernate_ops). 
However, the user may choose + * 'shutdown' or 'reboot' as alternatives, as well as the test modes 'test' + * and 'testproc'. * * show() will display what the mode is currently set to. * store() will accept one of @@ -328,7 +355,7 @@ static const char * const pm_disk_modes[ * 'testproc' * * It will only change to 'platform' if the system - * supports it (as determined from pm_ops->pm_disk_mode). + * supports it (as determined by having hibernate_ops). */ static ssize_t disk_show(struct subsystem * subsys, char * buf) @@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste int i; char *start = buf; - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { + for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) { if (!pm_disk_modes[i]) continue; switch (i) { @@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste case PM_DISK_TEST: case PM_DISK_TESTPROC: break; - default: - if (pm_ops && pm_ops->enter && - (i == pm_ops->pm_disk_mode)) + case PM_DISK_PLATFORM: + if (hibernate_ops) break; /* not a valid mode, continue with loop */ continue; @@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst int i; int len; char *p; - suspend_disk_method_t mode = 0; + int mode = PM_DISK_INVALID; p = memchr(buf, '\n', n); len = p ? 
p - buf : n; mutex_lock(&pm_mutex); - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { + for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) { if (!strncmp(buf, pm_disk_modes[i], len)) { mode = i; break; } } - if (mode) { + if (mode != PM_DISK_INVALID) { switch (mode) { case PM_DISK_SHUTDOWN: case PM_DISK_REBOOT: @@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst case PM_DISK_TESTPROC: pm_disk_mode = mode; break; - default: - if (pm_ops && pm_ops->enter && - (mode == pm_ops->pm_disk_mode)) + case PM_DISK_PLATFORM: + if (hibernate_ops) pm_disk_mode = mode; else error = -EINVAL; } - } else { + } else error = -EINVAL; - } - pr_debug("PM: suspend-to-disk mode set to '%s'\n", - pm_disk_modes[mode]); + if (!error) + pr_debug("PM: suspend-to-disk mode set to '%s'\n", + pm_disk_modes[mode]); mutex_unlock(&pm_mutex); return error ? error : n; } --- wireless-dev.orig/kernel/power/user.c 2007-04-26 18:15:01.130691185 +0200 +++ wireless-dev/kernel/power/user.c 2007-04-26 18:15:09.420691185 +0200 @@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil return res; } -static inline int platform_prepare(void) -{ - int error = 0; - - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); - - return error; -} - -static inline void platform_finish(void) -{ - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); -} - static inline int snapshot_suspend(int platform_suspend) { int error; @@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p goto Finish; if (platform_suspend) { - error = platform_prepare(); + error = hibernate_platform_prepare(); if (error) goto Finish; } @@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p enable_nonboot_cpus(); Resume_devices: if (platform_suspend) - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); @@ -188,7 +172,7 @@ static inline int snapshot_restore(int p mutex_lock(&pm_mutex); pm_prepare_console(); if (platform_suspend) { - error = platform_prepare(); +
error = hibernate_platform_prepare(); if (error) goto Finish; } @@ -204,7 +188,7 @@ static inline int snapshot_restore(int p enable_nonboot_cpus(); Resume_devices: if (platform_suspend) - platform_finish(); + hibernate_platform_finish(); device_resume(); resume_console(); @@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode * case PMOPS_ENTER: if (data->platform_suspend) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - error = pm_ops->enter(PM_SUSPEND_DISK); + error = hibernate_ops->enter(); + /* how can this possibly do the right thing? */ error = 0; } break; case PMOPS_FINISH: if (data->platform_suspend) + /* and why doesn't this invoke anything??? */ error = 0; break; --- wireless-dev.orig/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:02.120691185 +0200 +++ wireless-dev/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:09.440691185 +0200 @@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t to resume the system from RAM if there's enough battery power or restore its state on the basis of the saved suspend image otherwise) -SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and - pmops->finish methods (the in-kernel swsusp knows these as the "platform - method") which are needed on many machines to (among others) speed up - the resume by letting the BIOS skip some steps or to let the system - recognise the correct state of the hardware after the resume (in - particular on many machines this ensures that unplugged AC - adapters get correctly detected and that kacpid does not run wild after - the resume). 
The last ioctl() argument can take one of the three - values, defined in kernel/power/power.h: +SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare, + hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel + swsusp knows these as the "platform method") which are needed on many + machines to (among others) speed up the resume by letting the BIOS skip + some steps or to let the system recognise the correct state of the + hardware after the resume (in particular on many machines this ensures + that unplugged AC adapters get correctly detected and that kacpid does + not run wild after the resume). The last ioctl() argument can take one + of the three values, defined in kernel/power/power.h: PMOPS_PREPARE - make the kernel carry out the - pm_ops->prepare(PM_SUSPEND_DISK) operation + hibernate_ops->prepare() operation PMOPS_ENTER - make the kernel power off the system by calling - pm_ops->enter(PM_SUSPEND_DISK) + hibernate_ops->enter() PMOPS_FINISH - make the kernel carry out the - pm_ops->finish(PM_SUSPEND_DISK) operation + hibernate_ops->finish() operation + Note that the actual constants are misnamed because they surface + internal kernel implementation details that have changed. The device's read() operation can be used to transfer the snapshot image from the kernel. 
It has the following limitations: --- wireless-dev.orig/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:02.150691185 +0200 +++ wireless-dev/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:09.440691185 +0200 @@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp * also needs to get error handling and probably * an #ifdef CONFIG_SOFTWARE_SUSPEND */ - pm_suspend(PM_SUSPEND_DISK); + hibernate(); #endif poll = 1; } --- wireless-dev.orig/kernel/sys.c 2007-04-26 18:15:01.310691185 +0200 +++ wireless-dev/kernel/sys.c 2007-04-26 18:15:09.450691185 +0200 @@ -25,6 +25,7 @@ #include <linux/security.h> #include <linux/dcookies.h> #include <linux/suspend.h> +#include <linux/hibernate.h> #include <linux/tty.h> #include <linux/signal.h> #include <linux/cn_proc.h> @@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i #ifdef CONFIG_SOFTWARE_SUSPEND case LINUX_REBOOT_CMD_SW_SUSPEND: { - int ret = pm_suspend(PM_SUSPEND_DISK); + int ret = hibernate(); unlock_kernel(); return ret; } --- wireless-dev.orig/drivers/acpi/sleep/main.c 2007-04-26 18:15:02.290691185 +0200 +++ wireless-dev/drivers/acpi/sleep/main.c 2007-04-26 18:15:09.630691185 +0200 @@ -15,6 +15,7 @@ #include <linux/dmi.h> #include <linux/device.h> #include <linux/suspend.h> +#include <linux/hibernate.h> #include <acpi/acpi_bus.h> #include <acpi/acpi_drivers.h> #include "sleep.h" @@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = { [PM_SUSPEND_ON] = ACPI_STATE_S0, [PM_SUSPEND_STANDBY] = ACPI_STATE_S1, [PM_SUSPEND_MEM] = ACPI_STATE_S3, - [PM_SUSPEND_DISK] = ACPI_STATE_S4, [PM_SUSPEND_MAX] = ACPI_STATE_S5 }; @@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t do_suspend_lowlevel(); break; - case PM_SUSPEND_DISK: - if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM) - status = acpi_enter_sleep_state(acpi_state); - break; - case PM_SUSPEND_MAX: - acpi_power_off(); - break; - default: return -EINVAL; } @@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state) suspend_state_t states[] = { [1] = PM_SUSPEND_STANDBY, [3] 
= PM_SUSPEND_MEM, - [4] = PM_SUSPEND_DISK, [5] = PM_SUSPEND_MAX }; if (acpi_state < 6 && states[acpi_state]) return pm_suspend(states[acpi_state]); + if (acpi_state == 4) + return hibernate(); return -EINVAL; } @@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = { .finish = acpi_pm_finish, }; +#ifdef CONFIG_SOFTWARE_SUSPEND +static int acpi_hib_prepare(void) +{ + return acpi_sleep_prepare(ACPI_STATE_S4); +} + +static int acpi_hib_enter(void) +{ + acpi_status status = AE_OK; + unsigned long flags = 0; + u32 acpi_state = ACPI_STATE_S4; + + ACPI_FLUSH_CPU_CACHE(); + + /* Do arch specific saving of state. */ + int error = acpi_save_state_mem(); + if (error) + return error; + + local_irq_save(flags); + acpi_enable_wakeup_device(acpi_state); + status = acpi_enter_sleep_state(acpi_state); + + /* ACPI 3.0 specs (P62) says that it's the responsibility + * of the OSPM to clear the status bit [ implying that the + * POWER_BUTTON event should not reach userspace ] + */ + if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3)) + acpi_clear_event(ACPI_EVENT_POWER_BUTTON); + + local_irq_restore(flags); + printk(KERN_DEBUG "Back to C!\n"); + + /* restore processor state + * We should only be here if we're coming back from STR or STD. + * And, in the case of the latter, the memory image should have already + * been loaded from disk. + */ + acpi_restore_state_mem(); + + return ACPI_SUCCESS(status) ?
0 : -EFAULT; +} + +static void acpi_hib_finish(void) +{ + acpi_leave_sleep_state(ACPI_STATE_S4); + acpi_disable_wakeup_device(ACPI_STATE_S4); + + /* reset firmware waking vector */ + acpi_set_firmware_waking_vector((acpi_physical_address) 0); + + if (init_8259A_after_S1) { + printk("Broken toshiba laptop -> kicking interrupts\n"); + init_8259A(0); + } + return; +} + +static struct hibernate_ops acpi_hib_ops = { + .prepare = acpi_hib_prepare, + .enter = acpi_hib_enter, + .finish = acpi_hib_finish, +}; +#endif /* CONFIG_SOFTWARE_SUSPEND */ + /* * Toshiba fails to preserve interrupts over S1, reinitialization * of 8259 is needed after S1 resume. @@ -227,13 +285,16 @@ int __init acpi_sleep_init(void) sleep_states[i] = 1; printk(" S%d", i); } - if (i == ACPI_STATE_S4) { - if (sleep_states[i]) - acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM; - } } printk(")\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + if (sleep_states[ACPI_STATE_S4]) + hibernate_set_ops(&acpi_hib_ops); +#else + sleep_states[ACPI_STATE_S4] = 0; +#endif + pm_set_ops(&acpi_pm_ops); return 0; } --- wireless-dev.orig/kernel/power/power.h 2007-04-26 18:15:01.240691185 +0200 +++ wireless-dev/kernel/power/power.h 2007-04-26 18:15:09.630691185 +0200 @@ -13,16 +13,6 @@ struct swsusp_info { -#ifdef CONFIG_SOFTWARE_SUSPEND -extern int pm_suspend_disk(void); - -#else -static inline int pm_suspend_disk(void) -{ - return -EPERM; -} -#endif - extern struct mutex pm_mutex; #define power_attr(_name) \ @@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t struct timeval; extern void swsusp_show_speed(struct timeval *, struct timeval *, unsigned int, char *); + +extern int hibernate_platform_prepare(void); +extern void hibernate_platform_finish(void); --- wireless-dev.orig/drivers/acpi/sleep/proc.c 2007-04-26 18:15:02.720691185 +0200 +++ wireless-dev/drivers/acpi/sleep/proc.c 2007-04-26 18:15:09.630691185 +0200 @@ -1,6 +1,7 @@ #include <linux/proc_fs.h> #include <linux/seq_file.h> #include <linux/suspend.h> +#include
<linux/hibernate.h> #include <linux/bcd.h> #include <asm/uaccess.h> @@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil state = simple_strtoul(str, NULL, 0); #ifdef CONFIG_SOFTWARE_SUSPEND if (state == 4) { - error = pm_suspend(PM_SUSPEND_DISK); + error = hibernate(); goto Done; } #endif ^ permalink raw reply [flat|nested] 713+ messages in thread
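The /sys/power/state handling touched by the kernel/power/main.c hunks can be tried out in userspace. What follows is a minimal sketch under stated assumptions, not the kernel code: `model_state_show`, `model_state_store`, `sw_suspend_enabled`, `hibernate_calls` and `last_suspend_state` are hypothetical names, the suspend and hibernate paths are reduced to counters, and the exact-length comparison is an addition of the sketch. It illustrates two details that are easy to get wrong: strncmp() returns 0 on a match, and the trailing separator has to be overwritten with a character constant rather than a string literal.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model; none of these names come from the patch. */
enum { ST_ON = 0, ST_STANDBY = 1, ST_MEM = 3, ST_MAX = 4 };

static const char *const model_states[ST_MAX] = {
	[ST_STANDBY] = "standby",
	[ST_MEM]     = "mem",
};

static int sw_suspend_enabled = 1;  /* stands in for CONFIG_SOFTWARE_SUSPEND */
static int hibernate_calls;         /* counts calls to the hibernate stub */
static int last_suspend_state = -1; /* last state handed to the suspend stub */

/* Assumption: the platform accepts every state that has a name. */
static int model_valid_state(int state)
{
	return model_states[state] != NULL;
}

/* Print the valid states separated by spaces, ending in a single newline. */
static int model_state_show(char *buf)
{
	char *s = buf;
	int i;

	for (i = 0; i < ST_MAX; i++)
		if (model_states[i] && model_valid_state(i))
			s += sprintf(s, "%s ", model_states[i]);

	if (sw_suspend_enabled)
		s += sprintf(s, "%s\n", "disk");
	else if (s != buf)
		*(s - 1) = '\n';	/* a char constant, not "\n" */
	return (int)(s - buf);
}

/* Returns n on success and -1 (standing in for -EINVAL) otherwise. */
static int model_state_store(const char *buf, size_t n)
{
	const char *p = memchr(buf, '\n', n);
	size_t len = p ? (size_t)(p - buf) : n;
	int state;

	/* Hibernation is checked first; strncmp() == 0 means "equal". */
	if (sw_suspend_enabled && len == 4 && !strncmp(buf, "disk", len)) {
		hibernate_calls++;
		return (int)n;
	}

	for (state = 0; state < ST_MAX; state++) {
		const char *name = model_states[state];

		/* The length check is stricter than a bare prefix match. */
		if (name && len == strlen(name) && !strncmp(buf, name, len)) {
			last_suspend_state = state;
			return (int)n;
		}
	}
	return -1;
}
```

Calling `model_state_store("disk\n", 5)` bumps `hibernate_calls` and never reaches the table lookup, which is the ordering the patch description implies.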
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 16:31 ` Johannes Berg @ 2007-04-26 18:40 ` Rafael J. Wysocki -1 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 18:40 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm On Thursday, 26 April 2007 18:31, Johannes Berg wrote: > On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote: > > > > From looking at pm_ops which I was recently working with a lot, it seems > > > that it was designed by somebody who was reading the ACPI documentation > > > and was otherwise pretty clueless, even at that level std tries to look > > > like suspend. IMHO that is one of the first things that should be ripped > > > out, no pm_ops for STD, it's a pain to work with. > > > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 > > low-level enter is pretty similar). > > > > Patches would be welcome > > That was easier than I thought. This applies on top of a patch that > makes kernel/power/user.c optional since I had no idea how to fix it, > problems I see: > * it surfaces kernel implementation details about pm_ops and thus makes > the whole thing very fragile Can you elaborate? > * it has yet another interface (yuck) to determine whether to reboot, > shut down etc, doesn't use /sys/power/disk Yes. In fact it was meant as a replacement for /sys/power/disk at one point. > * I generally had no idea wtf it is doing in some places I could have told you if you had asked. :-) > Anyway, this patch is only compile tested, it > * introduces include/linux/hibernate.h with hibernate_ops and > a new hibernate() function to hibernate the system Do we need hibernate_ops at all? There's only one user anyway and I'm not sure there will be more of them in the future. 
> * rips apart a lot of the suspend code and puts it back together using > the hibernate_ops > * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode) > * might apply/compile against -mm, I have all my and some of Rafael's > suspend/hibernate work in my tree. > * breaks user suspend as I noted above > * is incomplete, somewhere pm_suspend_disk() is still defined iirc I think I can fix it up, just give me some time. The idea is good, I think we should do something like this. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 18:40 ` Rafael J. Wysocki (?) @ 2007-04-26 18:40 ` Johannes Berg 2007-04-26 19:02 ` Rafael J. Wysocki 2007-04-26 19:02 ` Rafael J. Wysocki -1 siblings, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 18:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm [-- Attachment #1: Type: text/plain, Size: 1240 bytes --] On Thu, 2007-04-26 at 20:40 +0200, Rafael J. Wysocki wrote: > > * it surfaces kernel implementation details about pm_ops and thus makes > > the whole thing very fragile > > Can you elaborate? Well it tells userspace about pm_ops->enter/prepare/finish etc. Also, it seems that it needs a "release memory now" operation instead of just releasing it when the fd is closed? > > * it has yet another interface (yuck) to determine whether to reboot, > > shut down etc, doesn't use /sys/power/disk > > Yes. In fact it was meant as a replacement for /sys/power/disk at one point. Heh. > > * I generally had no idea wtf it is doing in some places > > I could have told you if you had asked. :-) I was offline ;) > Do we need hibernate_ops at all? There's only one user anyway and I'm not > sure there will be more of them in the future. I'm pretty sure there won't be, but there's no way to do it cleanly without pm_ops since even acpi doesn't do this all the time but only when some set of conditions is true. Hence, it needs to be able to determine the availability of the platform mode at run time rather than build time (build time => we could use weak symbols, arch hooks, ...) johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
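The run-time argument above can be illustrated with a small userspace sketch. It assumes the `struct hibernate_ops` layout from the patch; everything else (`set_disk_mode`, `platform_prepare`, the `fake_*` callbacks, `fake_acpi_sleep_init` and its `firmware_reports_s4` parameter) is hypothetical and only stands in for the ACPI side. The point it tries to show is that the 'platform' mode becomes selectable only after a platform has decided, at init time, that it can actually do S4.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as struct hibernate_ops in the patch; the rest is hypothetical. */
struct hibernate_ops {
	int  (*prepare)(void);
	int  (*enter)(void);
	void (*finish)(void);
};

enum disk_mode { MODE_INVALID, MODE_PLATFORM, MODE_SHUTDOWN, MODE_REBOOT };

static struct hibernate_ops *hibernate_ops;
static enum disk_mode disk_mode = MODE_SHUTDOWN;

static void hibernate_set_ops(struct hibernate_ops *ops)
{
	/* Validate the argument itself; assert() stands in for BUG_ON(). */
	assert(ops && ops->prepare && ops->enter && ops->finish);
	hibernate_ops = ops;
}

/* 'platform' can be chosen only once some platform has registered ops. */
static int set_disk_mode(enum disk_mode mode)
{
	if (mode == MODE_INVALID)
		return -1;
	if (mode == MODE_PLATFORM && !hibernate_ops)
		return -1;
	disk_mode = mode;
	return 0;
}

/* Dispatch to the platform only in 'platform' mode. */
static int platform_prepare(void)
{
	if (disk_mode == MODE_PLATFORM && hibernate_ops)
		return hibernate_ops->prepare();
	return 0;
}

/* Stand-in for the ACPI side: three callbacks plus a run-time S4 probe. */
static int fake_prepare_calls;

static int  fake_prepare(void) { fake_prepare_calls++; return 0; }
static int  fake_enter(void)   { return 0; }
static void fake_finish(void)  { }

static struct hibernate_ops fake_acpi_ops = {
	.prepare = fake_prepare,
	.enter   = fake_enter,
	.finish  = fake_finish,
};

static void fake_acpi_sleep_init(int firmware_reports_s4)
{
	/* Whether S4 works is known only after probing the firmware. */
	if (firmware_reports_s4)
		hibernate_set_ops(&fake_acpi_ops);
}
```

With `fake_acpi_sleep_init(0)` the 'platform' mode stays unavailable and `set_disk_mode(MODE_PLATFORM)` fails; after `fake_acpi_sleep_init(1)` the same call succeeds and `platform_prepare()` reaches the registered callback.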
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 18:40 ` Johannes Berg @ 2007-04-26 19:02 ` Rafael J. Wysocki 2007-04-27 9:41 ` Johannes Berg 2007-04-27 9:41 ` Johannes Berg 2007-04-26 19:02 ` Rafael J. Wysocki 1 sibling, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 19:02 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm On Thursday, 26 April 2007 20:40, Johannes Berg wrote: > On Thu, 2007-04-26 at 20:40 +0200, Rafael J. Wysocki wrote: > > > > * it surfaces kernel implementation details about pm_ops and thus makes > > > the whole thing very fragile > > > > Can you elaborate? > > Well it tells userspace about pm_ops->enter/prepare/finish etc. > Also, it seems that it needs a "release memory now" operation instead of > just releasing it when the fd is closed? Yes. That's because we want to be able to repeat creating the image without closing the fd in some situations. > > > * it has yet another interface (yuck) to determine whether to reboot, > > > shut down etc, doesn't use /sys/power/disk > > > > Yes. In fact it was meant as a replacement for /sys/power/disk at one point. > > Heh. > > > > * I generally had no idea wtf it is doing in some places > > > > I could have told you if you had asked. :-) > > I was offline ;) > > > Do we need hibernate_ops at all? There's only one user anyway and I'm not > > sure there will be more of them in the future. > > I'm pretty sure there won't be, but there's no way to do it cleanly > without pm_ops since even acpi doesn't do this all the time but only > when some set of conditions is true. Hence, it needs to be able to > determine the availability of the platform mode at run time rather than > build time (build time => we could use weak symbols, arch hooks, ...) 
Still, we could use a global var 'platform_hibernation' or something like this, I think. Then, we can do #define platform_hibernation 0 on the architectures that don't need it and make ACPI use it instead of this "dynamic linking". Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
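A minimal C sketch of the build-time variant suggested above, for illustration only. It assumes a hypothetical DEMO_ARCH_HAS_PLATFORM_HIBERNATION switch in place of a real Kconfig symbol; the enum and choose_power_down_path() are invented here, and only the name platform_hibernation comes from the mail. With the macro form, the platform branch becomes a constant-false test that the compiler can drop.

```c
/*
 * Sketch only. DEMO_ARCH_HAS_PLATFORM_HIBERNATION is a hypothetical
 * build switch, not a real kernel config symbol.
 */
#ifdef DEMO_ARCH_HAS_PLATFORM_HIBERNATION
/* Platform code may set this at run time. */
static int platform_hibernation;
#else
/* Architectures that don't need it: a compile-time constant. */
#define platform_hibernation 0
#endif

/* Hypothetical result type, used only to make the choice observable. */
enum power_down_path {
	POWER_DOWN_SHUTDOWN,
	POWER_DOWN_PLATFORM
};

/* Generic code: take the platform path only if the flag is non-zero. */
static enum power_down_path choose_power_down_path(void)
{
	if (platform_hibernation)
		return POWER_DOWN_PLATFORM;
	return POWER_DOWN_SHUTDOWN;
}
```

Built without the switch, choose_power_down_path() can only return POWER_DOWN_SHUTDOWN. Whether a platform such as ACPI can decide this early enough is exactly what the follow-up messages discuss.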
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 19:02 ` Rafael J. Wysocki @ 2007-04-27 9:41 ` Johannes Berg 2007-04-27 10:09 ` [linux-pm] " Johannes Berg ` (3 more replies) 2007-04-27 9:41 ` Johannes Berg 1 sibling, 4 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 9:41 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm [-- Attachment #1: Type: text/plain, Size: 616 bytes --] On Thu, 2007-04-26 at 21:02 +0200, Rafael J. Wysocki wrote: > Yes. That's because we want to be able to repeat creating the image > without closing the fd in some situations. Oh yeah, I just checked and it's not in fact necessary. I'm just confused. > Still, we could use a global var 'platform_hibernation' or something like this, > I think. Then, we can do > > #define platform_hibernation 0 > > on the architectures that don't need it and make ACPI use it instead of this > "dynamic linking". No, because acpi doesn't know at build time whether it can actually do S4 or not. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
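The run-time argument can be sketched roughly as follows. This is an illustration under stated assumptions, not code from any patch in the thread: every demo_ name is hypothetical, the callback signatures are guesses, and s4_supported stands in for whatever check the ACPI code performs. The point is that the generic code sees a platform hook only if the platform registered one after probing.

```c
#include <stddef.h>

/* Hypothetical ops table; the callback signatures are assumptions. */
struct demo_hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* NULL until some platform registers itself. */
static struct demo_hibernation_ops *demo_ops;

static void demo_set_ops(struct demo_hibernation_ops *ops)
{
	demo_ops = ops;
}

/* Counts calls so the effect of registration can be observed. */
static int demo_prepare_calls;

static int demo_acpi_prepare(void) { demo_prepare_calls++; return 0; }
static int demo_acpi_enter(void) { return 0; }
static void demo_acpi_finish(void) { }

static struct demo_hibernation_ops demo_acpi_ops = {
	.prepare = demo_acpi_prepare,
	.enter   = demo_acpi_enter,
	.finish  = demo_acpi_finish,
};

/* Models the run-time check: register only if S4 is really available. */
static void demo_acpi_init(int s4_supported)
{
	if (s4_supported)
		demo_set_ops(&demo_acpi_ops);
}

/* Generic code: use the platform hook only when one was registered. */
static int demo_platform_prepare(void)
{
	return demo_ops ? demo_ops->prepare() : 0;
}
```

Calling demo_acpi_init(0) leaves demo_ops NULL, so demo_platform_prepare() is a no-op; after demo_acpi_init(1) the same call reaches demo_acpi_prepare().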
* Re: [linux-pm] Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-27 9:41 ` Johannes Berg @ 2007-04-27 10:09 ` Johannes Berg 2007-04-27 10:09 ` Johannes Berg ` (2 subsequent siblings) 3 siblings, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 10:09 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Pavel Machek, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, linux-pm, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 415 bytes --] On Fri, 2007-04-27 at 11:41 +0200, Johannes Berg wrote: > No, because acpi doesn't know at build time whether it can actually do > S4 or not. Actually, you could probably do it by making some weak symbol for it that only ACPI overrides, and then check in the ACPI code if S4 is possible, otherwise somehow invoke the old symbol or copy the code or something. Seems a bit more fragile though. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
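For readers unfamiliar with the weak-symbol approach mentioned above, a small sketch, assuming a GCC or Clang toolchain (the weak attribute is a compiler extension). The names DEMO_WEAK, demo_arch_platform_prepare() and demo_hibernate_prepare() are hypothetical. Only the generic default is shown, because the overriding definition has to live in a separate object file.

```c
/* GCC/Clang extension; assumed to be available in this sketch. */
#define DEMO_WEAK __attribute__((weak))

/*
 * Generic default: a no-op that reports success. A non-weak definition
 * of the same function in another object file replaces this one at
 * link time.
 */
DEMO_WEAK int demo_arch_platform_prepare(void)
{
	return 0;
}

/* Generic caller; it does not know which definition the linker picked. */
static int demo_hibernate_prepare(void)
{
	return demo_arch_platform_prepare();
}
```

An ACPI override would still have to check for S4 at run time and reproduce the default behaviour itself when S4 is unavailable, which is the fragility noted in the message.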
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-27 9:41 ` Johannes Berg 2007-04-27 10:09 ` [linux-pm] " Johannes Berg 2007-04-27 10:09 ` Johannes Berg @ 2007-04-27 10:18 ` Rafael J. Wysocki 2007-04-27 10:18 ` Rafael J. Wysocki 3 siblings, 0 replies; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 10:18 UTC (permalink / raw)
To: Johannes Berg
Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Pavel Machek, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, linux-pm, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven

On Friday, 27 April 2007 11:41, Johannes Berg wrote:
> On Thu, 2007-04-26 at 21:02 +0200, Rafael J. Wysocki wrote:
>
> > Yes. That's because we want to be able to repeat creating the image
> > without closing the fd in some situations.
>
> Oh yeah, I just checked and it's not in fact necessary. I'm just
> confused.
>
> > Still, we could use a global var 'platform_hibernation' or something like this,
> > I think. Then, we can do
> >
> > #define platform_hibernation 0
> >
> > on the architectures that don't need it and make ACPI use it instead of this
> > "dynamic linking".
>
> No, because acpi doesn't know at build time whether it can actually do
> S4 or not.

That's not a problem, I think.

1) We define platform_hibernation if CONFIG_ACPI is set.
2) In the ACPI code we do

	if (can do S4)
		platform_hibernation = 1;

3) We have functions arch_platform_prepare()/finish()/enter() that are defined
   to be noops for anything but ACPI systems and for ACPI systems they are
   defined like this:

	int arch_platform_enter(void)
	{
		if (!platform_hibernation)
			return 0;

		...
	}

I think it should work.

Greetings,
Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
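The three steps in the message above can be put together in a small, self-contained sketch. It is an illustration under assumptions, not the code that was eventually written: demo_acpi_init() and its s4_supported parameter stand in for the "can do S4" check, and demo_enter_calls is a placeholder for the elided body ("...") so that the effect is observable.

```c
/* Step 1: the global flag; zero means "no platform hibernation". */
static int platform_hibernation;

/*
 * Step 2: hypothetical stand-in for the ACPI initialisation code.
 * `s4_supported` models the "can do S4" check from the mail.
 */
static void demo_acpi_init(int s4_supported)
{
	if (s4_supported)
		platform_hibernation = 1;
}

/* Placeholder for the elided platform work: counts real entries. */
static int demo_enter_calls;

/* Step 3: the hook is a no-op unless the flag was set at run time. */
static int arch_platform_enter(void)
{
	if (!platform_hibernation)
		return 0;

	demo_enter_calls++;	/* the "..." in the original goes here */
	return 0;
}
```

Until demo_acpi_init(1) has run, arch_platform_enter() returns 0 without doing any platform work, so the decision is made at run time even though the hook itself is always compiled in.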
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-27 10:18 ` Rafael J. Wysocki @ 2007-04-27 10:19 ` Johannes Berg 2007-04-27 10:19 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 10:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nick Piggin, Nigel Cunningham, Ingo Molnar, Pavel Machek, Mike Galbraith, linux-kernel, Con Kolivas, suspend2-devel, linux-pm, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven [-- Attachment #1.1: Type: text/plain, Size: 1248 bytes --] On Fri, 2007-04-27 at 12:18 +0200, Rafael J. Wysocki wrote: > 1) We define platform_hibernation if CONFIG_ACPI is set. Let's just define it always then in the common code so we don't have even more magic bits platforms need to define even if they don't care at all. And please don't put #ifdef CONFIG_ACPI into the common code ;) Maybe #ifdef CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS or something. > 2) In the ACPI code we do > > if (can do S4) > platform_hibernation = 1; Gotcha. > 3) We have functions arch_platform_prepare()/finish()/enter() that are defined > to be noops for anything but ACPI systems and for ACPI systems they are > defined like this: > > int arch_platform_enter(void) > { > if (!platform_hibernation) > return 0; > > ... > } > > I think it should work. You could reduce code churn in all other platforms by making these weak symbols like the irq hooks I did for pm_ops. It looks like it can work and possibly is even less intrusive than my hibernate_ops patch. Though then again my hibernate_ops patch removed a lot of stuff that is now no longer necessary, and also completely removed the PM_SUSPEND_DISK foo... we probably want that regardless of how we invoke ACPI. 
johannes [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-27 10:19 ` Johannes Berg @ 2007-04-27 12:09 ` Rafael J. Wysocki 2007-04-27 12:07 ` Johannes Berg 2007-04-27 12:07 ` Johannes Berg 2007-04-27 12:09 ` Rafael J. Wysocki 1 sibling, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-27 12:09 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm On Friday, 27 April 2007 12:19, Johannes Berg wrote: > On Fri, 2007-04-27 at 12:18 +0200, Rafael J. Wysocki wrote: > > > 1) We define platform_hibernation if CONFIG_ACPI is set. > > Let's just define it always then in the common code so we don't have > even more magic bits platforms need to define even if they don't care at > all. And please don't put #ifdef CONFIG_ACPI into the common code ;) > Maybe #ifdef CONFIG_ARCH_NEEDS_HIBERNATE_HOOKS or something. > > > 2) In the ACPI code we do > > > > if (can do S4) > > platform_hibernation = 1; > > Gotcha. > > > 3) We have functions arch_platform_prepare()/finish()/enter() that are defined > > to be noops for anything but ACPI systems and for ACPI systems they are > > defined like this: > > > > int arch_platform_enter(void) > > { > > if (!platform_hibernation) > > return 0; > > > > ... > > } > > > > I think it should work. > > You could reduce code churn in all other platforms by making these weak > symbols like the irq hooks I did for pm_ops. It looks like it can work > and possibly is even less intrusive than my hibernate_ops patch. Though > then again my hibernate_ops patch removed a lot of stuff that is now no > longer necessary, and also completely removed the PM_SUSPEND_DISK foo... > we probably want that regardless of how we invoke ACPI. Yes. Still, I'd like to rework your patch to deal with ACPI without introducing hibernate_ops . 
I'm going to do this later today if you don't mind. :-) Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-27 12:09 ` Rafael J. Wysocki @ 2007-04-27 12:07 ` Johannes Berg 2007-04-27 12:07 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-27 12:07 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Thomas Gleixner, Ingo Molnar, Arjan van de Ven, linux-pm [-- Attachment #1: Type: text/plain, Size: 347 bytes --] On Fri, 2007-04-27 at 14:09 +0200, Rafael J. Wysocki wrote: > Yes. Still, I'd like to rework your patch to deal with ACPI without > introducing hibernate_ops . I'm going to do this later today if you don't > mind. :-) Not at all :) That's why I actually sent it out instead of just saying "well I give up it breaks user.c" johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-04-26 16:31 ` Johannes Berg (?) (?) @ 2007-04-29 12:48 ` R. J. Wysocki 2007-04-29 12:53 ` Rafael J. Wysocki 2007-04-30 8:29 ` Johannes Berg -1 siblings, 2 replies; 713+ messages in thread From: R. J. Wysocki @ 2007-04-29 12:48 UTC (permalink / raw) To: Johannes Berg; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek [Trimmed the CC list to a reasonable minimum] On Thursday, 26 April 2007 18:31, Johannes Berg wrote: > On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote: > > > > From looking at pm_ops which I was recently working with a lot, it seems > > > that it was designed by somebody who was reading the ACPI documentation > > > and was otherwise pretty clueless, even at that level std tries to look > > > like suspend. IMHO that is one of the first things that should be ripped > > > out, no pm_ops for STD, it's a pain to work with. > > > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4 > > low-level enter is pretty similar). > > > > Patches would be welcome > > That was easier than I thought. 
> This applies on top of a patch that
> makes kernel/power/user.c optional since I had no idea how to fix it,
> problems I see:
>  * it surfaces kernel implementation details about pm_ops and thus makes
>    the whole thing very fragile
>  * it has yet another interface (yuck) to determine whether to reboot,
>    shut down etc, doesn't use /sys/power/disk
>  * I generally had no idea wtf it is doing in some places
>
> Anyway, this patch is only compile tested, it
>  * introduces include/linux/hibernate.h with hibernate_ops and
>    a new hibernate() function to hibernate the system
>  * rips apart a lot of the suspend code and puts it back together using
>    the hibernate_ops
>  * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
>  * might apply/compile against -mm, I have all my and some of Rafael's
>    suspend/hibernate work in my tree.
>  * breaks user suspend as I noted above
>  * is incomplete, somewhere pm_suspend_disk() is still defined iirc

OK, I reworked it a bit. Main changes:
- IMHO 'hibernation_ops' sounds better than 'hibernate_ops', for example, so
  now the new names start with 'hibernation_' (or 'HIBERNATION_')
- Moved the hibernation-related definitions to include/linux/suspend.h, since
  some hibernation-specific definitions are already there. We can introduce
  hibernation.h in a separate patch (it'll have to #include suspend.h IMO).
- Changed the names starting with 'pm_disk_' (or 'PM_DISK_').
- Cleaned up the new ACPI code (it didn't compile and included some things
  unrelated to hibernation). I'm still not sure about acpi_hibernation_finish()
  (is the code after acpi_disable_wakeup_device() really needed?)
- Made kernel/power/user.c compile (and hopefully work too)

It looks like we'll have to change CONFIG_SOFTWARE_SUSPEND into
CONFIG_HIBERNATION, since some pieces of code now look silly.

The appended patch is against 2.6.21-rc7-mm2 with two freezer patches applied
(should not affect this one). Compilation tested on x86_64.
Greetings, Rafael --- Documentation/power/userland-swsusp.txt | 26 ++-- drivers/acpi/sleep/main.c | 79 +++++++++++-- drivers/acpi/sleep/proc.c | 2 drivers/i2c/chips/tps65010.c | 2 include/linux/pm.h | 31 ----- kernel/power/disk.c | 184 +++++++++++++++++--------------- kernel/power/main.c | 42 ++----- kernel/power/power.h | 7 - kernel/power/user.c | 13 +- kernel/sys.c | 2 10 files changed, 204 insertions(+), 184 deletions(-) Index: linux-2.6.21-rc7-mm2/include/linux/pm.h =================================================================== --- linux-2.6.21-rc7-mm2.orig/include/linux/pm.h 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/include/linux/pm.h 2007-04-29 13:39:17.000000000 +0200 @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t; #define PM_SUSPEND_ON ((__force suspend_state_t) 0) #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1) #define PM_SUSPEND_MEM ((__force suspend_state_t) 3) -#define PM_SUSPEND_DISK ((__force suspend_state_t) 4) -#define PM_SUSPEND_MAX ((__force suspend_state_t) 5) - -typedef int __bitwise suspend_disk_method_t; - -/* invalid must be 0 so struct pm_ops initialisers can leave it out */ -#define PM_DISK_INVALID ((__force suspend_disk_method_t) 0) -#define PM_DISK_PLATFORM ((__force suspend_disk_method_t) 1) -#define PM_DISK_SHUTDOWN ((__force suspend_disk_method_t) 2) -#define PM_DISK_REBOOT ((__force suspend_disk_method_t) 3) -#define PM_DISK_TEST ((__force suspend_disk_method_t) 4) -#define PM_DISK_TESTPROC ((__force suspend_disk_method_t) 5) -#define PM_DISK_MAX ((__force suspend_disk_method_t) 6) +#define PM_SUSPEND_MAX ((__force suspend_state_t) 4) /** * struct pm_ops - Callbacks for managing platform dependent suspend states. * @valid: Callback to determine whether the given state can be entered. - * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is - * always valid and never passed to this call. If not assigned, - * no suspend states are valid. 
* Valid states are advertised in /sys/power/state but can still * be rejected by prepare or enter if the conditions aren't right. * There is a %pm_valid_only_mem function available that can be assigned @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho * * @finish: Called when the system has left the given state and all devices * are resumed. The return value is ignored. - * - * @pm_disk_mode: The generic code always allows one of the shutdown methods - * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and - * %PM_DISK_TESTPROC. If this variable is set, the mode it is set - * to is allowed in addition to those modes and is also made default. - * When this mode is sent selected, the @prepare call will be called - * before suspending to disk (if present), the @enter call should be - * present and will be called after all state has been saved and the - * machine is ready to be powered off; the @finish callback is called - * after state has been restored. All these calls are called with - * %PM_SUSPEND_DISK as the state. */ struct pm_ops { int (*valid)(suspend_state_t state); int (*prepare)(suspend_state_t state); int (*enter)(suspend_state_t state); int (*finish)(suspend_state_t state); - suspend_disk_method_t pm_disk_mode; }; /** @@ -258,8 +231,6 @@ extern void device_power_up(void); extern void device_resume(void); #ifdef CONFIG_PM -extern suspend_disk_method_t pm_disk_mode; - extern int device_suspend(pm_message_t state); extern int device_prepare_suspend(pm_message_t state); Index: linux-2.6.21-rc7-mm2/kernel/power/main.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/kernel/power/main.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/kernel/power/main.c 2007-04-29 13:43:34.000000000 +0200 @@ -30,7 +30,6 @@ DEFINE_MUTEX(pm_mutex); struct pm_ops *pm_ops; -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN; /** * pm_set_ops - Set the global power method table. 
@@ -41,10 +40,6 @@ void pm_set_ops(struct pm_ops * ops) { mutex_lock(&pm_mutex); pm_ops = ops; - if (ops && ops->pm_disk_mode != PM_DISK_INVALID) { - pm_disk_mode = ops->pm_disk_mode; - } else - pm_disk_mode = PM_DISK_SHUTDOWN; mutex_unlock(&pm_mutex); } @@ -196,24 +191,12 @@ static void suspend_finish(suspend_state static const char * const pm_states[PM_SUSPEND_MAX] = { [PM_SUSPEND_STANDBY] = "standby", [PM_SUSPEND_MEM] = "mem", - [PM_SUSPEND_DISK] = "disk", }; static inline int valid_state(suspend_state_t state) { - /* Suspend-to-disk does not really need low-level support. - * It can work with shutdown/reboot if needed. If it isn't - * configured, then it cannot be supported. - */ - if (state == PM_SUSPEND_DISK) -#ifdef CONFIG_SOFTWARE_SUSPEND - return 1; -#else - return 0; -#endif - - /* all other states need lowlevel support and need to be - * valid to the lowlevel implementation, no valid callback + /* All states need lowlevel support and need to be valid + * to the lowlevel implementation, no valid callback * implies that none are valid. */ if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state)) return 0; @@ -241,11 +224,6 @@ static int enter_state(suspend_state_t s if (!mutex_trylock(&pm_mutex)) return -EBUSY; - if (state == PM_SUSPEND_DISK) { - error = pm_suspend_disk(); - goto Unlock; - } - pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]); if ((error = suspend_prepare(state))) goto Unlock; @@ -263,7 +241,7 @@ static int enter_state(suspend_state_t s /** * pm_suspend - Externally visible function for suspending system. - * @state: Enumarted value of state to enter. + * @state: Enumerated value of state to enter. * * Determine whether or not value is within range, get state * structure, and enter (above). 
@@ -301,7 +279,13 @@ static ssize_t state_show(struct subsyst if (pm_states[i] && valid_state(i)) s += sprintf(s,"%s ", pm_states[i]); } - s += sprintf(s,"\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + s += sprintf(s, "%s\n", "disk"); +#else + if (s != buf) + /* convert the last space to a newline */ + *(s-1) = '\n'; +#endif return (s - buf); } @@ -316,6 +300,12 @@ static ssize_t state_store(struct subsys p = memchr(buf, '\n', n); len = p ? p - buf : n; + /* First, check if we are requested to hibernate */ + if (!strncmp(buf, "disk", len)) { + error = hibernate(); + return error ? error : n; + } + for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { if (*s && !strncmp(buf, *s, len)) break; Index: linux-2.6.21-rc7-mm2/kernel/power/disk.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/kernel/power/disk.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/kernel/power/disk.c 2007-04-29 13:54:50.000000000 +0200 @@ -30,30 +30,60 @@ char resume_file[256] = CONFIG_PM_STD_PA dev_t swsusp_resume_device; sector_t swsusp_resume_block; +static int hibernation_mode; + +enum { + HIBERNATION_INVALID, + HIBERNATION_PLATFORM, + HIBERNATION_TEST, + HIBERNATION_TESTPROC, + HIBERNATION_SHUTDOWN, + HIBERNATION_REBOOT, + /* keep last */ + __HIBERNATION_AFTER_LAST +}; +#define HIBERNATION_MAX (__HIBERNATION_AFTER_LAST-1) +#define HIBERNATION_FIRST (HIBERNATION_INVALID + 1) + +struct hibernation_ops *hibernation_ops; + +void hibernation_set_ops(struct hibernation_ops *ops) +{ + mutex_lock(&pm_mutex); + hibernation_ops = ops; + mutex_unlock(&pm_mutex); + if (hibernation_ops) { + BUG_ON(!hibernation_ops->prepare); + BUG_ON(!hibernation_ops->enter); + BUG_ON(!hibernation_ops->finish); + } +} + + /** * platform_prepare - prepare the machine for hibernation using the * platform driver if so configured and return an error code if it fails */ -static inline int platform_prepare(void) +static int platform_prepare(void) { - int
error = 0; + return (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops) ? + hibernation_ops->prepare() : 0; +} - switch (pm_disk_mode) { - case PM_DISK_TEST: - case PM_DISK_TESTPROC: - case PM_DISK_SHUTDOWN: - case PM_DISK_REBOOT: - break; - default: - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); - } - return error; +/** + * platform_finish - switch the machine to the normal mode of operation + * using the platform driver (must be called after platform_prepare()) + */ + +static void platform_finish(void) +{ + if (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops) + hibernation_ops->finish(); } /** - * power_down - Shut machine down for hibernate. + * power_down - Shut the machine down for hibernation. * * Use the platform driver, if configured so; otherwise try * to power off or reboot. @@ -61,20 +91,20 @@ static inline int platform_prepare(void) static void power_down(void) { - switch (pm_disk_mode) { - case PM_DISK_TEST: - case PM_DISK_TESTPROC: + switch (hibernation_mode) { + case HIBERNATION_TEST: + case HIBERNATION_TESTPROC: break; - case PM_DISK_SHUTDOWN: + case HIBERNATION_SHUTDOWN: kernel_power_off(); break; - case PM_DISK_REBOOT: + case HIBERNATION_REBOOT: kernel_restart(NULL); break; - default: - if (pm_ops && pm_ops->enter) { + case HIBERNATION_PLATFORM: + if (hibernation_ops) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - pm_ops->enter(PM_SUSPEND_DISK); + hibernation_ops->enter(); break; } } @@ -87,20 +117,6 @@ static void power_down(void) while(1); } -static inline void platform_finish(void) -{ - switch (pm_disk_mode) { - case PM_DISK_TEST: - case PM_DISK_TESTPROC: - case PM_DISK_SHUTDOWN: - case PM_DISK_REBOOT: - break; - default: - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); - } -} - static void unprepare_processes(void) { thaw_processes(); @@ -120,13 +136,10 @@ static int prepare_processes(void) } /** - * pm_suspend_disk - The granpappy of hibernation power management. 
- * - * If not, then call swsusp to do its thing, then figure out how - * to power down the system. + * hibernate - The granpappy of the built-in hibernation management */ -int pm_suspend_disk(void) +int hibernate(void) { int error; @@ -151,7 +164,7 @@ int pm_suspend_disk(void) if (error) goto Thaw; - if (pm_disk_mode == PM_DISK_TESTPROC) { + if (hibernation_mode == HIBERNATION_TESTPROC) { printk("swsusp debug: Waiting for 5 seconds.\n"); mdelay(5000); goto Thaw; @@ -176,7 +189,7 @@ int pm_suspend_disk(void) if (error) goto Enable_cpus; - if (pm_disk_mode == PM_DISK_TEST) { + if (hibernation_mode == HIBERNATION_TEST) { printk("swsusp debug: Waiting for 5 seconds.\n"); mdelay(5000); goto Enable_cpus; @@ -230,7 +243,7 @@ int pm_suspend_disk(void) * Called as a late_initcall (so all devices are discovered and * initialized), we call swsusp to see if we have a saved image or not. * If so, we quiesce devices, the restore the saved image. We will - * return above (in pm_suspend_disk() ) if everything goes well. + * return above (in hibernate() ) if everything goes well. * Otherwise, we fail gracefully and return to the normally * scheduled program. * @@ -336,25 +349,26 @@ static int software_resume(void) late_initcall(software_resume); -static const char * const pm_disk_modes[] = { - [PM_DISK_PLATFORM] = "platform", - [PM_DISK_SHUTDOWN] = "shutdown", - [PM_DISK_REBOOT] = "reboot", - [PM_DISK_TEST] = "test", - [PM_DISK_TESTPROC] = "testproc", +static const char * const hibernation_modes[] = { + [HIBERNATION_PLATFORM] = "platform", + [HIBERNATION_SHUTDOWN] = "shutdown", + [HIBERNATION_REBOOT] = "reboot", + [HIBERNATION_TEST] = "test", + [HIBERNATION_TESTPROC] = "testproc", }; /** - * disk - Control suspend-to-disk mode + * disk - Control hibernation mode * * Suspend-to-disk can be handled in several ways. We have a few options * for putting the system to sleep - using the platform driver (e.g. 
ACPI - * or other pm_ops), powering off the system or rebooting the system - * (for testing) as well as the two test modes. + * or other hibernation_ops), powering off the system or rebooting the + * system (for testing) as well as the two test modes. * * The system can support 'platform', and that is known a priori (and - * encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot' - * as alternatives, as well as the test modes 'test' and 'testproc'. + * encoded by the presence of hibernation_ops). However, the user may + * choose 'shutdown' or 'reboot' as alternatives, as well as one of the + * test modes, 'test' or 'testproc'. * * show() will display what the mode is currently set to. * store() will accept one of @@ -366,7 +380,7 @@ static const char * const pm_disk_modes[ * 'testproc' * * It will only change to 'platform' if the system - * supports it (as determined from pm_ops->pm_disk_mode). + * supports it (as determined by having hibernation_ops). */ static ssize_t disk_show(struct subsystem * subsys, char * buf) @@ -374,27 +388,26 @@ static ssize_t disk_show(struct subsyste int i; char *start = buf; - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { - if (!pm_disk_modes[i]) + for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) { + if (!hibernation_modes[i]) continue; switch (i) { - case PM_DISK_SHUTDOWN: - case PM_DISK_REBOOT: - case PM_DISK_TEST: - case PM_DISK_TESTPROC: + case HIBERNATION_SHUTDOWN: + case HIBERNATION_REBOOT: + case HIBERNATION_TEST: + case HIBERNATION_TESTPROC: break; - default: - if (pm_ops && pm_ops->enter && - (i == pm_ops->pm_disk_mode)) + case HIBERNATION_PLATFORM: + if (hibernation_ops) break; /* not a valid mode, continue with loop */ continue; } - if (i == pm_disk_mode) - buf += sprintf(buf, "[%s]", pm_disk_modes[i]); + if (i == hibernation_mode) + buf += sprintf(buf, "[%s]", hibernation_modes[i]); else - buf += sprintf(buf, "%s", pm_disk_modes[i]); - if (i+1 != PM_DISK_MAX) + buf += sprintf(buf, "%s", 
hibernation_modes[i]); + if (i+1 != HIBERNATION_MAX) buf += sprintf(buf, " "); } buf += sprintf(buf, "\n"); @@ -408,39 +421,38 @@ static ssize_t disk_store(struct subsyst int i; int len; char *p; - suspend_disk_method_t mode = 0; + int mode = HIBERNATION_INVALID; p = memchr(buf, '\n', n); len = p ? p - buf : n; mutex_lock(&pm_mutex); - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { - if (!strncmp(buf, pm_disk_modes[i], len)) { + for (i = HIBERNATION_FIRST; i < HIBERNATION_MAX; i++) { + if (!strncmp(buf, hibernation_modes[i], len)) { mode = i; break; } } - if (mode) { + if (mode != HIBERNATION_INVALID) { switch (mode) { - case PM_DISK_SHUTDOWN: - case PM_DISK_REBOOT: - case PM_DISK_TEST: - case PM_DISK_TESTPROC: - pm_disk_mode = mode; + case HIBERNATION_SHUTDOWN: + case HIBERNATION_REBOOT: + case HIBERNATION_TEST: + case HIBERNATION_TESTPROC: + hibernation_mode = mode; break; - default: - if (pm_ops && pm_ops->enter && - (mode == pm_ops->pm_disk_mode)) - pm_disk_mode = mode; + case HIBERNATION_PLATFORM: + if (hibernation_ops) + hibernation_mode = mode; else error = -EINVAL; } - } else { + } else error = -EINVAL; - } - pr_debug("PM: suspend-to-disk mode set to '%s'\n", - pm_disk_modes[mode]); + if (!error) + pr_debug("PM: suspend-to-disk mode set to '%s'\n", + hibernation_modes[mode]); mutex_unlock(&pm_mutex); return error ? 
error : n; } Index: linux-2.6.21-rc7-mm2/Documentation/power/userland-swsusp.txt =================================================================== --- linux-2.6.21-rc7-mm2.orig/Documentation/power/userland-swsusp.txt 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/Documentation/power/userland-swsusp.txt 2007-04-29 13:39:18.000000000 +0200 @@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t to resume the system from RAM if there's enough battery power or restore its state on the basis of the saved suspend image otherwise) -SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and - pmops->finish methods (the in-kernel swsusp knows these as the "platform - method") which are needed on many machines to (among others) speed up - the resume by letting the BIOS skip some steps or to let the system - recognise the correct state of the hardware after the resume (in - particular on many machines this ensures that unplugged AC - adapters get correctly detected and that kacpid does not run wild after - the resume). The last ioctl() argument can take one of the three - values, defined in kernel/power/power.h: +SNAPSHOT_PMOPS - enable the usage of the hibernation_ops->prepare, + hibernation_ops->enter and hibernation_ops->finish methods (the in-kernel + swsusp knows these as the "platform method") which are needed on many + machines to (among others) speed up the resume by letting the BIOS skip + some steps or to let the system recognise the correct state of the + hardware after the resume (in particular on many machines this ensures + that unplugged AC adapters get correctly detected and that kacpid does + not run wild after the resume). 
The last ioctl() argument can take one + of the three values, defined in kernel/power/power.h: PMOPS_PREPARE - make the kernel carry out the - pm_ops->prepare(PM_SUSPEND_DISK) operation + hibernation_ops->prepare() operation PMOPS_ENTER - make the kernel power off the system by calling - pm_ops->enter(PM_SUSPEND_DISK) + hibernation_ops->enter() PMOPS_FINISH - make the kernel carry out the - pm_ops->finish(PM_SUSPEND_DISK) operation + hibernation_ops->finish() operation + Note that the actual constants are misnamed because they surface + internal kernel implementation details that have changed. The device's read() operation can be used to transfer the snapshot image from the kernel. It has the following limitations: Index: linux-2.6.21-rc7-mm2/drivers/i2c/chips/tps65010.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/drivers/i2c/chips/tps65010.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/drivers/i2c/chips/tps65010.c 2007-04-29 13:39:18.000000000 +0200 @@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp * also needs to get error handling and probably * an #ifdef CONFIG_SOFTWARE_SUSPEND */ - pm_suspend(PM_SUSPEND_DISK); + hibernate(); #endif poll = 1; } Index: linux-2.6.21-rc7-mm2/kernel/sys.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/kernel/sys.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/kernel/sys.c 2007-04-29 13:39:18.000000000 +0200 @@ -942,7 +942,7 @@ asmlinkage long sys_reboot(int magic1, i #ifdef CONFIG_SOFTWARE_SUSPEND case LINUX_REBOOT_CMD_SW_SUSPEND: { - int ret = pm_suspend(PM_SUSPEND_DISK); + int ret = hibernate(); unlock_kernel(); return ret; } Index: linux-2.6.21-rc7-mm2/drivers/acpi/sleep/main.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/drivers/acpi/sleep/main.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/drivers/acpi/sleep/main.c 
2007-04-29 14:16:30.000000000 +0200 @@ -29,7 +29,6 @@ static u32 acpi_suspend_states[] = { [PM_SUSPEND_ON] = ACPI_STATE_S0, [PM_SUSPEND_STANDBY] = ACPI_STATE_S1, [PM_SUSPEND_MEM] = ACPI_STATE_S3, - [PM_SUSPEND_DISK] = ACPI_STATE_S4, [PM_SUSPEND_MAX] = ACPI_STATE_S5 }; @@ -94,14 +93,6 @@ static int acpi_pm_enter(suspend_state_t do_suspend_lowlevel(); break; - case PM_SUSPEND_DISK: - if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM) - status = acpi_enter_sleep_state(acpi_state); - break; - case PM_SUSPEND_MAX: - acpi_power_off(); - break; - default: return -EINVAL; } @@ -157,12 +148,13 @@ int acpi_suspend(u32 acpi_state) suspend_state_t states[] = { [1] = PM_SUSPEND_STANDBY, [3] = PM_SUSPEND_MEM, - [4] = PM_SUSPEND_DISK, [5] = PM_SUSPEND_MAX }; if (acpi_state < 6 && states[acpi_state]) return pm_suspend(states[acpi_state]); + if (acpi_state == 4) + return hibernate(); return -EINVAL; } @@ -189,6 +181,61 @@ static struct pm_ops acpi_pm_ops = { .finish = acpi_pm_finish, }; +#ifdef CONFIG_SOFTWARE_SUSPEND +static int acpi_hibernation_prepare(void) +{ + return acpi_sleep_prepare(ACPI_STATE_S4); +} + +static int acpi_hibernation_enter(void) +{ + acpi_status status = AE_OK; + unsigned long flags = 0; + int error; + + ACPI_FLUSH_CPU_CACHE(); + + /* Do arch specific saving of state. */ + error = acpi_save_state_mem(); + if (error) + return error; + + local_irq_save(flags); + acpi_enable_wakeup_device(ACPI_STATE_S4); + status = acpi_enter_sleep_state(ACPI_STATE_S4); + local_irq_restore(flags); + + /* + * Restore processor state + * We should only be here if we're coming back from hibernation and + * the memory image should have already been loaded from disk. + */ + acpi_restore_state_mem(); + + return ACPI_SUCCESS(status) ? 
0 : -EFAULT; +} + +static void acpi_hibernation_finish(void) +{ + acpi_leave_sleep_state(ACPI_STATE_S4); + acpi_disable_wakeup_device(ACPI_STATE_S4); + + /* reset firmware waking vector */ + acpi_set_firmware_waking_vector((acpi_physical_address) 0); + + if (init_8259A_after_S1) { + printk("Broken toshiba laptop -> kicking interrupts\n"); + init_8259A(0); + } +} + +static struct hibernation_ops acpi_hibernation_ops = { + .prepare = acpi_hibernation_prepare, + .enter = acpi_hibernation_enter, + .finish = acpi_hibernation_finish, +}; +#endif /* CONFIG_SOFTWARE_SUSPEND */ + /* * Toshiba fails to preserve interrupts over S1, reinitialization * of 8259 is needed after S1 resume. @@ -227,14 +274,18 @@ int __init acpi_sleep_init(void) sleep_states[i] = 1; printk(" S%d", i); } - if (i == ACPI_STATE_S4) { - if (sleep_states[i]) - acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM; - } } printk(")\n"); pm_set_ops(&acpi_pm_ops); + +#ifdef CONFIG_SOFTWARE_SUSPEND + if (sleep_states[ACPI_STATE_S4]) + hibernation_set_ops(&acpi_hibernation_ops); +#else + sleep_states[ACPI_STATE_S4] = 0; +#endif + return 0; } Index: linux-2.6.21-rc7-mm2/kernel/power/power.h =================================================================== --- linux-2.6.21-rc7-mm2.orig/kernel/power/power.h 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/kernel/power/power.h 2007-04-29 13:55:55.000000000 +0200 @@ -25,12 +25,7 @@ struct swsusp_info { */ #define SPARE_PAGES ((1024 * 1024) >> PAGE_SHIFT) -extern int pm_suspend_disk(void); -#else -static inline int pm_suspend_disk(void) -{ - return -EPERM; -} +extern struct hibernation_ops *hibernation_ops; #endif extern int pfn_is_nosave(unsigned long); Index: linux-2.6.21-rc7-mm2/drivers/acpi/sleep/proc.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/drivers/acpi/sleep/proc.c 2007-04-29 13:39:02.000000000 +0200 +++ linux-2.6.21-rc7-mm2/drivers/acpi/sleep/proc.c 2007-04-29 13:49:42.000000000 +0200 @@ -60,7 
+60,7 @@ acpi_system_write_sleep(struct file *fil state = simple_strtoul(str, NULL, 0); #ifdef CONFIG_SOFTWARE_SUSPEND if (state == 4) { - error = pm_suspend(PM_SUSPEND_DISK); + error = hibernate(); goto Done; } #endif Index: linux-2.6.21-rc7-mm2/kernel/power/user.c =================================================================== --- linux-2.6.21-rc7-mm2.orig/kernel/power/user.c 2007-04-29 13:43:34.000000000 +0200 +++ linux-2.6.21-rc7-mm2/kernel/power/user.c 2007-04-29 14:00:42.000000000 +0200 @@ -138,16 +138,16 @@ static inline int platform_prepare(void) { int error = 0; - if (pm_ops && pm_ops->prepare) - error = pm_ops->prepare(PM_SUSPEND_DISK); + if (hibernation_ops) + error = hibernation_ops->prepare(); return error; } static inline void platform_finish(void) { - if (pm_ops && pm_ops->finish) - pm_ops->finish(PM_SUSPEND_DISK); + if (hibernation_ops) + hibernation_ops->finish(); } static inline int snapshot_suspend(int platform_suspend) @@ -407,7 +407,7 @@ static int snapshot_ioctl(struct inode * switch (arg) { case PMOPS_PREPARE: - if (pm_ops && pm_ops->enter) { + if (hibernation_ops) { data->platform_suspend = 1; error = 0; } else { @@ -418,8 +418,7 @@ static int snapshot_ioctl(struct inode * case PMOPS_ENTER: if (data->platform_suspend) { kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); - error = pm_ops->enter(PM_SUSPEND_DISK); - error = 0; + error = hibernation_ops->enter(); } break; > > johannes > --- > Documentation/power/userland-swsusp.txt | 26 +++---- > drivers/acpi/sleep/main.c | 89 ++++++++++++++++++++---- > drivers/acpi/sleep/proc.c | 3 > drivers/i2c/chips/tps65010.c | 2 > include/linux/hibernate.h | 36 +++++++++ > include/linux/pm.h | 31 -------- > kernel/power/disk.c | 117 +++++++++++++++++++------------- > kernel/power/main.c | 47 +++++------- > kernel/power/power.h | 13 --- > kernel/power/user.c | 28 +------ > kernel/sys.c | 3 > 11 files changed, 231 insertions(+), 164 deletions(-) > > --- wireless-dev.orig/include/linux/pm.h 2007-04-26 
18:15:00.440691185 +0200 > +++ wireless-dev/include/linux/pm.h 2007-04-26 18:15:09.410691185 +0200 > @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t; > #define PM_SUSPEND_ON ((__force suspend_state_t) 0) > #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1) > #define PM_SUSPEND_MEM ((__force suspend_state_t) 3) > -#define PM_SUSPEND_DISK ((__force suspend_state_t) 4) > -#define PM_SUSPEND_MAX ((__force suspend_state_t) 5) > - > -typedef int __bitwise suspend_disk_method_t; > - > -/* invalid must be 0 so struct pm_ops initialisers can leave it out */ > -#define PM_DISK_INVALID ((__force suspend_disk_method_t) 0) > -#define PM_DISK_PLATFORM ((__force suspend_disk_method_t) 1) > -#define PM_DISK_SHUTDOWN ((__force suspend_disk_method_t) 2) > -#define PM_DISK_REBOOT ((__force suspend_disk_method_t) 3) > -#define PM_DISK_TEST ((__force suspend_disk_method_t) 4) > -#define PM_DISK_TESTPROC ((__force suspend_disk_method_t) 5) > -#define PM_DISK_MAX ((__force suspend_disk_method_t) 6) > +#define PM_SUSPEND_MAX ((__force suspend_state_t) 4) > > /** > * struct pm_ops - Callbacks for managing platform dependent suspend states. > * @valid: Callback to determine whether the given state can be entered. > - * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is > - * always valid and never passed to this call. If not assigned, > - * no suspend states are valid. > * Valid states are advertised in /sys/power/state but can still > * be rejected by prepare or enter if the conditions aren't right. > * There is a %pm_valid_only_mem function available that can be assigned > @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho > * > * @finish: Called when the system has left the given state and all devices > * are resumed. The return value is ignored. > - * > - * @pm_disk_mode: The generic code always allows one of the shutdown methods > - * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and > - * %PM_DISK_TESTPROC. 
If this variable is set, the mode it is set > - * to is allowed in addition to those modes and is also made default. > - * When this mode is sent selected, the @prepare call will be called > - * before suspending to disk (if present), the @enter call should be > - * present and will be called after all state has been saved and the > - * machine is ready to be powered off; the @finish callback is called > - * after state has been restored. All these calls are called with > - * %PM_SUSPEND_DISK as the state. > */ > struct pm_ops { > int (*valid)(suspend_state_t state); > int (*prepare)(suspend_state_t state); > int (*enter)(suspend_state_t state); > int (*finish)(suspend_state_t state); > - suspend_disk_method_t pm_disk_mode; > }; > > /** > @@ -276,8 +249,6 @@ extern void device_power_up(void); > extern void device_resume(void); > > #ifdef CONFIG_PM > -extern suspend_disk_method_t pm_disk_mode; > - > extern int device_suspend(pm_message_t state); > extern int device_prepare_suspend(pm_message_t state); > > --- wireless-dev.orig/kernel/power/main.c 2007-04-26 18:15:00.790691185 +0200 > +++ wireless-dev/kernel/power/main.c 2007-04-26 18:15:09.410691185 +0200 > @@ -21,6 +21,7 @@ > #include <linux/resume-trace.h> > #include <linux/freezer.h> > #include <linux/vmstat.h> > +#include <linux/hibernate.h> > > #include "power.h" > > @@ -30,7 +31,6 @@ > DEFINE_MUTEX(pm_mutex); > > struct pm_ops *pm_ops; > -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN; > > /** > * pm_set_ops - Set the global power method table. 
> @@ -41,10 +41,6 @@ void pm_set_ops(struct pm_ops * ops) > { > mutex_lock(&pm_mutex); > pm_ops = ops; > - if (ops && ops->pm_disk_mode != PM_DISK_INVALID) { > - pm_disk_mode = ops->pm_disk_mode; > - } else > - pm_disk_mode = PM_DISK_SHUTDOWN; > mutex_unlock(&pm_mutex); > } > > @@ -184,24 +180,12 @@ static void suspend_finish(suspend_state > static const char * const pm_states[PM_SUSPEND_MAX] = { > [PM_SUSPEND_STANDBY] = "standby", > [PM_SUSPEND_MEM] = "mem", > - [PM_SUSPEND_DISK] = "disk", > }; > > static inline int valid_state(suspend_state_t state) > { > - /* Suspend-to-disk does not really need low-level support. > - * It can work with shutdown/reboot if needed. If it isn't > - * configured, then it cannot be supported. > - */ > - if (state == PM_SUSPEND_DISK) > -#ifdef CONFIG_SOFTWARE_SUSPEND > - return 1; > -#else > - return 0; > -#endif > - > - /* all other states need lowlevel support and need to be > - * valid to the lowlevel implementation, no valid callback > + /* All states need lowlevel support and need to be valid > + * to the lowlevel implementation, no valid callback > * implies that none are valid. */ > if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state)) > return 0; > @@ -229,11 +213,6 @@ static int enter_state(suspend_state_t s > if (!mutex_trylock(&pm_mutex)) > return -EBUSY; > > - if (state == PM_SUSPEND_DISK) { > - error = pm_suspend_disk(); > - goto Unlock; > - } > - > pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]); > if ((error = suspend_prepare(state))) > goto Unlock; > @@ -251,7 +230,7 @@ static int enter_state(suspend_state_t s > > /** > * pm_suspend - Externally visible function for suspending system. > - * @state: Enumarted value of state to enter. > + * @state: Enumerated value of state to enter. > * > * Determine whether or not value is within range, get state > * structure, and enter (above). 
> @@ -283,13 +262,19 @@ decl_subsys(power,NULL,NULL); > static ssize_t state_show(struct subsystem * subsys, char * buf) > { > int i; > - char * s = buf; > + char *s = buf; > > for (i = 0; i < PM_SUSPEND_MAX; i++) { > if (pm_states[i] && valid_state(i)) > - s += sprintf(s,"%s ", pm_states[i]); > + s += sprintf(s, "%s ", pm_states[i]); > } > - s += sprintf(s,"\n"); > +#ifdef CONFIG_SOFTWARE_SUSPEND > + s += sprintf(s, "%s\n", "disk"); > +#else > + if (s != buf) > + /* convert the last space to a newline */ > + *(s-1) = "\n"; > +#endif > return (s - buf); > } > > @@ -304,6 +289,12 @@ static ssize_t state_store(struct subsys > p = memchr(buf, '\n', n); > len = p ? p - buf : n; > > + /* first check hibernate */ > + if (strncmp(buf, "disk", len)) { > + error = hibernate(); > + return error ? error : n; > + } > + > for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { > if (*s && !strncmp(buf, *s, len)) > break; > --- /dev/null 1970-01-01 00:00:00.000000000 +0000 > +++ wireless-dev/include/linux/hibernate.h 2007-04-26 18:21:38.130691185 +0200 > @@ -0,0 +1,36 @@ > +#ifndef __LINUX_HIBERNATE > +#define __LINUX_HIBERNATE > +/* > + * hibernate ('suspend to disk') functionality > + */ > + > +/** > + * struct hibernate_ops - hibernate platform support > + * > + * The methods in this structure allow a platform to override what > + * happens for shutting down the machine when going into hibernation. > + * > + * All three methods must be assigned. > + * > + * @prepare: prepare system for hibernation > + * @enter: shut down system after state has been saved to disk > + * @finish: finish/clean up after state has been reloaded > + */ > +struct hibernate_ops { > + int (*prepare)(void); > + int (*enter)(void); > + void (*finish)(void); > +}; > + > +/** > + * hibernate_set_ops - set the global hibernate operations > + * @ops: the hibernate operations to use from now on. 
> + */ > +void hibernate_set_ops(struct hibernate_ops *ops); > + > +/** > + * hibernate - hibernate the system > + */ > +int hibernate(void); > + > +#endif /* __LINUX_HIBERNATE */ > --- wireless-dev.orig/kernel/power/disk.c 2007-04-26 18:15:00.800691185 +0200 > +++ wireless-dev/kernel/power/disk.c 2007-04-26 18:15:09.420691185 +0200 > @@ -21,45 +21,72 @@ > #include <linux/console.h> > #include <linux/cpu.h> > #include <linux/freezer.h> > +#include <linux/hibernate.h> > > #include "power.h" > > > -static int noresume = 0; > +static int noresume; > char resume_file[256] = CONFIG_PM_STD_PARTITION; > dev_t swsusp_resume_device; > sector_t swsusp_resume_block; > > +static struct hibernate_ops *hibernate_ops; > +static int pm_disk_mode; > + > +enum { > + PM_DISK_INVALID, > + PM_DISK_PLATFORM, > + PM_DISK_TEST, > + PM_DISK_TESTPROC, > + PM_DISK_SHUTDOWN, > + PM_DISK_REBOOT, > + /* keep last */ > + __PM_DISK_AFTER_LAST > +}; > +#define PM_DISK_MAX (__PM_DISK_AFTER_LAST-1) > +#define PM_DISK_FIRST (PM_DISK_INVALID + 1) > + > +void hibernate_set_ops(struct hibernate_ops *ops) > +{ > + BUG_ON(!hibernate_ops->prepare); > + BUG_ON(!hibernate_ops->enter); > + BUG_ON(!hibernate_ops->finish); > + mutex_lock(&pm_mutex); > + hibernate_ops = ops; > + mutex_unlock(&pm_mutex); > +} > + > + > /** > - * platform_prepare - prepare the machine for hibernation using the > - * platform driver if so configured and return an error code if it fails > + * hibernate_platform_prepare - prepare the machine for hibernation using > + * the platform driver if so configured and return an error code if it > + * fails. 
> */ > > -static inline int platform_prepare(void) > +int hibernate_platform_prepare(void) > { > - int error = 0; > - > switch (pm_disk_mode) { > case PM_DISK_TEST: > case PM_DISK_TESTPROC: > case PM_DISK_SHUTDOWN: > case PM_DISK_REBOOT: > break; > - default: > - if (pm_ops && pm_ops->prepare) > - error = pm_ops->prepare(PM_SUSPEND_DISK); > + case PM_DISK_PLATFORM: > + if (hibernate_ops) > + return hibernate_ops->prepare(); > } > - return error; > + return 0; > } > > /** > - * power_down - Shut machine down for hibernate. > + * hibernate_power_down - Shut machine down for hibernate. > * > * Use the platform driver, if configured so; otherwise try > * to power off or reboot. > */ > > -static void power_down(void) > +static void hibernate_power_down(void) > { > switch (pm_disk_mode) { > case PM_DISK_TEST: > @@ -70,11 +97,10 @@ static void power_down(void) > case PM_DISK_REBOOT: > kernel_restart(NULL); > break; > - default: > - if (pm_ops && pm_ops->enter) { > + case PM_DISK_PLATFORM: > + if (hibernate_ops) { > kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); > - pm_ops->enter(PM_SUSPEND_DISK); > - break; > + hibernate_ops->enter(); > } > } > > @@ -85,7 +111,7 @@ static void power_down(void) > while(1); > } > > -static inline void platform_finish(void) > +void hibernate_platform_finish(void) > { > switch (pm_disk_mode) { > case PM_DISK_TEST: > @@ -93,9 +119,9 @@ static inline void platform_finish(void) > case PM_DISK_SHUTDOWN: > case PM_DISK_REBOOT: > break; > - default: > - if (pm_ops && pm_ops->finish) > - pm_ops->finish(PM_SUSPEND_DISK); > + case PM_DISK_PLATFORM: > + if (hibernate_ops) > + hibernate_ops->finish(); > } > } > > @@ -118,13 +144,13 @@ static int prepare_processes(void) > } > > /** > - * pm_suspend_disk - The granpappy of hibernation power management. > + * hibernate - The granpappy of hibernation power management. > * > * If not, then call swsusp to do its thing, then figure out how > * to power down the system. 
> */ > > -int pm_suspend_disk(void) > +int hibernate(void) > { > int error; > > @@ -147,7 +173,7 @@ int pm_suspend_disk(void) > if (error) > goto Finish; > > - error = platform_prepare(); > + error = hibernate_platform_prepare(); > if (error) > goto Finish; > > @@ -175,13 +201,13 @@ int pm_suspend_disk(void) > > if (in_suspend) { > enable_nonboot_cpus(); > - platform_finish(); > + hibernate_platform_finish(); > device_resume(); > resume_console(); > pr_debug("PM: writing image.\n"); > error = swsusp_write(); > if (!error) > - power_down(); > + hibernate_power_down(); > else { > swsusp_free(); > goto Finish; > @@ -194,7 +220,7 @@ int pm_suspend_disk(void) > Enable_cpus: > enable_nonboot_cpus(); > Resume_devices: > - platform_finish(); > + hibernate_platform_finish(); > device_resume(); > resume_console(); > Finish: > @@ -211,7 +237,7 @@ int pm_suspend_disk(void) > * Called as a late_initcall (so all devices are discovered and > * initialized), we call swsusp to see if we have a saved image or not. > * If so, we quiesce devices, the restore the saved image. We will > - * return above (in pm_suspend_disk() ) if everything goes well. > + * return above (in hibernate() ) if everything goes well. > * Otherwise, we fail gracefully and return to the normally > * scheduled program. > * > @@ -311,12 +337,13 @@ static const char * const pm_disk_modes[ > * > * Suspend-to-disk can be handled in several ways. We have a few options > * for putting the system to sleep - using the platform driver (e.g. ACPI > - * or other pm_ops), powering off the system or rebooting the system > - * (for testing) as well as the two test modes. > + * or other hibernate_ops), powering off the system or rebooting the > + * system (for testing) as well as the two test modes. > * > * The system can support 'platform', and that is known a priori (and > - * encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot' > - * as alternatives, as well as the test modes 'test' and 'testproc'. 
> + * encoded by the presence of hibernate_ops). However, the user may choose > + * 'shutdown' or 'reboot' as alternatives, as well as the test modes 'test' > + * and 'testproc'. > * > * show() will display what the mode is currently set to. > * store() will accept one of > @@ -328,7 +355,7 @@ static const char * const pm_disk_modes[ > * 'testproc' > * > * It will only change to 'platform' if the system > - * supports it (as determined from pm_ops->pm_disk_mode). > + * supports it (as determined by having hibernate_ops). > */ > > static ssize_t disk_show(struct subsystem * subsys, char * buf) > @@ -336,7 +363,7 @@ static ssize_t disk_show(struct subsyste > int i; > char *start = buf; > > - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { > + for (i = PM_DISK_FIRST; i <= PM_DISK_MAX; i++) { > if (!pm_disk_modes[i]) > continue; > switch (i) { > @@ -345,9 +372,8 @@ static ssize_t disk_show(struct subsyste > case PM_DISK_TEST: > case PM_DISK_TESTPROC: > break; > - default: > - if (pm_ops && pm_ops->enter && > - (i == pm_ops->pm_disk_mode)) > + case PM_DISK_PLATFORM: > + if (hibernate_ops) > break; > /* not a valid mode, continue with loop */ > continue; > @@ -370,19 +396,19 @@ static ssize_t disk_store(struct subsyst > int i; > int len; > char *p; > - suspend_disk_method_t mode = 0; > + int mode = PM_DISK_INVALID; > > p = memchr(buf, '\n', n); > len = p ? 
p - buf : n; > > mutex_lock(&pm_mutex); > - for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) { > + for (i = PM_DISK_FIRST; i < PM_DISK_MAX; i++) { > if (!strncmp(buf, pm_disk_modes[i], len)) { > mode = i; > break; > } > } > - if (mode) { > + if (mode != PM_DISK_INVALID) { > switch (mode) { > case PM_DISK_SHUTDOWN: > case PM_DISK_REBOOT: > @@ -390,19 +416,18 @@ static ssize_t disk_store(struct subsyst > case PM_DISK_TESTPROC: > pm_disk_mode = mode; > break; > - default: > - if (pm_ops && pm_ops->enter && > - (mode == pm_ops->pm_disk_mode)) > + case PM_DISK_PLATFORM: > + if (hibernate_ops) > pm_disk_mode = mode; > else > error = -EINVAL; > } > - } else { > + } else > error = -EINVAL; > - } > > - pr_debug("PM: suspend-to-disk mode set to '%s'\n", > - pm_disk_modes[mode]); > + if (!error) > + pr_debug("PM: suspend-to-disk mode set to '%s'\n", > + pm_disk_modes[mode]); > mutex_unlock(&pm_mutex); > return error ? error : n; > } > --- wireless-dev.orig/kernel/power/user.c 2007-04-26 18:15:01.130691185 +0200 > +++ wireless-dev/kernel/power/user.c 2007-04-26 18:15:09.420691185 +0200 > @@ -128,22 +128,6 @@ static ssize_t snapshot_write(struct fil > return res; > } > > -static inline int platform_prepare(void) > -{ > - int error = 0; > - > - if (pm_ops && pm_ops->prepare) > - error = pm_ops->prepare(PM_SUSPEND_DISK); > - > - return error; > -} > - > -static inline void platform_finish(void) > -{ > - if (pm_ops && pm_ops->finish) > - pm_ops->finish(PM_SUSPEND_DISK); > -} > - > static inline int snapshot_suspend(int platform_suspend) > { > int error; > @@ -155,7 +139,7 @@ static inline int snapshot_suspend(int p > goto Finish; > > if (platform_suspend) { > - error = platform_prepare(); > + error = hibernate_platform_prepare(); > if (error) > goto Finish; > } > @@ -172,7 +156,7 @@ static inline int snapshot_suspend(int p > enable_nonboot_cpus(); > Resume_devices: > if (platform_suspend) > - platform_finish(); > + hibernate_platform_finish(); > > device_resume(); > 
resume_console(); > @@ -188,7 +172,7 @@ static inline int snapshot_restore(int p > mutex_lock(&pm_mutex); > pm_prepare_console(); > if (platform_suspend) { > - error = platform_prepare(); > + error = hibernate_platform_prepare(); > if (error) > goto Finish; > } > @@ -204,7 +188,7 @@ static inline int snapshot_restore(int p > enable_nonboot_cpus(); > Resume_devices: > if (platform_suspend) > - platform_finish(); > + hibernate_platform_finish(); > > device_resume(); > resume_console(); > @@ -406,13 +390,15 @@ static int snapshot_ioctl(struct inode * > case PMOPS_ENTER: > if (data->platform_suspend) { > kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK); > - error = pm_ops->enter(PM_SUSPEND_DISK); > + error = hibernate_ops->enter(); > + /* how can this possibly do the right thing? */ > error = 0; > } > break; > > case PMOPS_FINISH: > if (data->platform_suspend) > + /* and why doesn't this invoke anything??? */ > error = 0; > > break; > --- wireless-dev.orig/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:02.120691185 +0200 > +++ wireless-dev/Documentation/power/userland-swsusp.txt 2007-04-26 18:15:09.440691185 +0200 > @@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t > to resume the system from RAM if there's enough battery power or restore > its state on the basis of the saved suspend image otherwise) > > -SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and > - pmops->finish methods (the in-kernel swsusp knows these as the "platform > - method") which are needed on many machines to (among others) speed up > - the resume by letting the BIOS skip some steps or to let the system > - recognise the correct state of the hardware after the resume (in > - particular on many machines this ensures that unplugged AC > - adapters get correctly detected and that kacpid does not run wild after > - the resume). 
The last ioctl() argument can take one of the three > - values, defined in kernel/power/power.h: > +SNAPSHOT_PMOPS - enable the usage of the hibernate_ops->prepare, > + hibernate_ops->enter and hibernate_ops->finish methods (the in-kernel > + swsusp knows these as the "platform method") which are needed on many > + machines to (among others) speed up the resume by letting the BIOS skip > + some steps or to let the system recognise the correct state of the > + hardware after the resume (in particular on many machines this ensures > + that unplugged AC adapters get correctly detected and that kacpid does > + not run wild after the resume). The last ioctl() argument can take one > + of the three values, defined in kernel/power/power.h: > PMOPS_PREPARE - make the kernel carry out the > - pm_ops->prepare(PM_SUSPEND_DISK) operation > + hibernate_ops->prepare() operation > PMOPS_ENTER - make the kernel power off the system by calling > - pm_ops->enter(PM_SUSPEND_DISK) > + hibernate_ops->enter() > PMOPS_FINISH - make the kernel carry out the > - pm_ops->finish(PM_SUSPEND_DISK) operation > + hibernate_ops->finish() operation > + Note that the actual constants are misnamed because they surface > + internal kernel implementation details that have changed. > > The device's read() operation can be used to transfer the snapshot image from > the kernel. 
It has the following limitations: > --- wireless-dev.orig/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:02.150691185 +0200 > +++ wireless-dev/drivers/i2c/chips/tps65010.c 2007-04-26 18:15:09.440691185 +0200 > @@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp > * also needs to get error handling and probably > * an #ifdef CONFIG_SOFTWARE_SUSPEND > */ > - pm_suspend(PM_SUSPEND_DISK); > + hibernate(); > #endif > poll = 1; > } > --- wireless-dev.orig/kernel/sys.c 2007-04-26 18:15:01.310691185 +0200 > +++ wireless-dev/kernel/sys.c 2007-04-26 18:15:09.450691185 +0200 > @@ -25,6 +25,7 @@ > #include <linux/security.h> > #include <linux/dcookies.h> > #include <linux/suspend.h> > +#include <linux/hibernate.h> > #include <linux/tty.h> > #include <linux/signal.h> > #include <linux/cn_proc.h> > @@ -881,7 +882,7 @@ asmlinkage long sys_reboot(int magic1, i > #ifdef CONFIG_SOFTWARE_SUSPEND > case LINUX_REBOOT_CMD_SW_SUSPEND: > { > - int ret = pm_suspend(PM_SUSPEND_DISK); > + int ret = hibernate(); > unlock_kernel(); > return ret; > } > --- wireless-dev.orig/drivers/acpi/sleep/main.c 2007-04-26 18:15:02.290691185 +0200 > +++ wireless-dev/drivers/acpi/sleep/main.c 2007-04-26 18:15:09.630691185 +0200 > @@ -15,6 +15,7 @@ > #include <linux/dmi.h> > #include <linux/device.h> > #include <linux/suspend.h> > +#include <linux/hibernate.h> > #include <acpi/acpi_bus.h> > #include <acpi/acpi_drivers.h> > #include "sleep.h" > @@ -29,7 +30,6 @@ static u32 acpi_suspend_states[] = { > [PM_SUSPEND_ON] = ACPI_STATE_S0, > [PM_SUSPEND_STANDBY] = ACPI_STATE_S1, > [PM_SUSPEND_MEM] = ACPI_STATE_S3, > - [PM_SUSPEND_DISK] = ACPI_STATE_S4, > [PM_SUSPEND_MAX] = ACPI_STATE_S5 > }; > > @@ -94,14 +94,6 @@ static int acpi_pm_enter(suspend_state_t > do_suspend_lowlevel(); > break; > > - case PM_SUSPEND_DISK: > - if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM) > - status = acpi_enter_sleep_state(acpi_state); > - break; > - case PM_SUSPEND_MAX: > - acpi_power_off(); > - break; > - > default: > return 
-EINVAL; > } > @@ -157,12 +149,13 @@ int acpi_suspend(u32 acpi_state) > suspend_state_t states[] = { > [1] = PM_SUSPEND_STANDBY, > [3] = PM_SUSPEND_MEM, > - [4] = PM_SUSPEND_DISK, > [5] = PM_SUSPEND_MAX > }; > > if (acpi_state < 6 && states[acpi_state]) > return pm_suspend(states[acpi_state]); > + if (acpi_state == 4) > + return hibernate(); > return -EINVAL; > } > > @@ -189,6 +182,71 @@ static struct pm_ops acpi_pm_ops = { > .finish = acpi_pm_finish, > }; > > +#ifdef CONFIG_SOFTWARE_SUSPEND > +static int acpi_hib_prepare(void) > +{ > + return acpi_sleep_prepare(ACPI_STATE_S4); > +} > + > +static int acpi_hib_enter(void) > +{ > + acpi_status status = AE_OK; > + unsigned long flags = 0; > + u32 acpi_state = acpi_suspend_states[pm_state]; > + > + ACPI_FLUSH_CPU_CACHE(); > + > + /* Do arch specific saving of state. */ > + int error = acpi_save_state_mem(); > + if (error) > + return error; > + > + local_irq_save(flags); > + acpi_enable_wakeup_device(acpi_state); > + status = acpi_enter_sleep_state(acpi_state); > + > + /* ACPI 3.0 specs (P62) says that it's the responsabilty > + * of the OSPM to clear the status bit [ implying that the > + * POWER_BUTTON event should not reach userspace ] > + */ > + if (ACPI_SUCCESS(status) && (acpi_state == ACPI_STATE_S3)) > + acpi_clear_event(ACPI_EVENT_POWER_BUTTON); > + > + local_irq_restore(flags); > + printk(KERN_DEBUG "Back to C!\n"); > + > + /* restore processor state > + * We should only be here if we're coming back from STR or STD. > + * And, in the case of the latter, the memory image should have already > + * been loaded from disk. > + */ > + acpi_restore_state_mem(); > + > + return ACPI_SUCCESS(status) ? 
0 : -EFAULT; > +} > + > +static void acpi_hib_finish(void) > +{ > + acpi_leave_sleep_state(ACPI_STATE_S4); > + acpi_disable_wakeup_device(ACPI_STATE_S4); > + > + /* reset firmware waking vector */ > + acpi_set_firmware_waking_vector((acpi_physical_address) 0); > + > + if (init_8259A_after_S1) { > + printk("Broken toshiba laptop -> kicking interrupts\n"); > + init_8259A(0); > + } > + return 0; > +} > + > +static struct hibernate_ops acpi_hib_ops = { > + .prepare = acpi_hib_prepare, > + .enter = acpi_hib_enter, > + .finish = acpi_hib_finish, > +}; > +#endif /* CONFIG_SOFTWARE_SUSPEND */ > + > /* > * Toshiba fails to preserve interrupts over S1, reinitialization > * of 8259 is needed after S1 resume. > @@ -227,13 +285,16 @@ int __init acpi_sleep_init(void) > sleep_states[i] = 1; > printk(" S%d", i); > } > - if (i == ACPI_STATE_S4) { > - if (sleep_states[i]) > - acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM; > - } > } > printk(")\n"); > > +#ifdef CONFIG_SOFTWARE_SUSPEND > + if (sleep_states[ACPI_STATE_S4]) > + hibernate_set_ops(&acpi_hib_ops); > +#else > + sleep_states[ACPI_STATE_S4] = 0; > +#endif > + > pm_set_ops(&acpi_pm_ops); > return 0; > } > --- wireless-dev.orig/kernel/power/power.h 2007-04-26 18:15:01.240691185 +0200 > +++ wireless-dev/kernel/power/power.h 2007-04-26 18:15:09.630691185 +0200 > @@ -13,16 +13,6 @@ struct swsusp_info { > > > > -#ifdef CONFIG_SOFTWARE_SUSPEND > -extern int pm_suspend_disk(void); > - > -#else > -static inline int pm_suspend_disk(void) > -{ > - return -EPERM; > -} > -#endif > - > extern struct mutex pm_mutex; > > #define power_attr(_name) \ > @@ -179,3 +169,6 @@ extern int suspend_enter(suspend_state_t > struct timeval; > extern void swsusp_show_speed(struct timeval *, struct timeval *, > unsigned int, char *); > + > +extern int hibernate_platform_prepare(void); > +extern void hibernate_platform_finish(void); > --- wireless-dev.orig/drivers/acpi/sleep/proc.c 2007-04-26 18:15:02.720691185 +0200 > +++ 
wireless-dev/drivers/acpi/sleep/proc.c 2007-04-26 18:15:09.630691185 +0200 > @@ -1,6 +1,7 @@ > #include <linux/proc_fs.h> > #include <linux/seq_file.h> > #include <linux/suspend.h> > +#include <linux/hibernate.h> > #include <linux/bcd.h> > #include <asm/uaccess.h> > > @@ -60,7 +61,7 @@ acpi_system_write_sleep(struct file *fil > state = simple_strtoul(str, NULL, 0); > #ifdef CONFIG_SOFTWARE_SUSPEND > if (state == 4) { > - error = pm_suspend(PM_SUSPEND_DISK); > + error = hibernate(); > goto Done; > } > #endif > > -- Rafael J. Wysocki, Ph.D. Institute of Theoretical Physics Faculty of Physics of Warsaw University ul. Hoza 69, 00-681 Warsaw [tel: +48 22 55 32 263] [mob: +48 60 50 53 693] ---------------------------- One should not increase, beyond what is necessary, the number of entities required to explain anything. -- William of Ockham
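The mode lookup done by disk_store() in the quoted patch can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: mode_names, the MODE_* constants and parse_mode() are hypothetical names, not kernel symbols. Unlike the strncmp()-only comparison in the patch, the sketch also compares lengths and skips empty table slots, so that "test" cannot be confused with "testproc" and an empty write matches nothing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the PM_DISK_* constants in the patch. */
enum {
	MODE_INVALID,	/* 0, so "nothing matched" is easy to test for */
	MODE_PLATFORM,
	MODE_SHUTDOWN,
	MODE_REBOOT,
	MODE_TEST,
	MODE_TESTPROC,
	/* keep last */
	__MODE_AFTER_LAST
};
#define MODE_FIRST	(MODE_INVALID + 1)
#define MODE_MAX	(__MODE_AFTER_LAST - 1)

const char *const mode_names[] = {
	[MODE_PLATFORM]	= "platform",
	[MODE_SHUTDOWN]	= "shutdown",
	[MODE_REBOOT]	= "reboot",
	[MODE_TEST]	= "test",
	[MODE_TESTPROC]	= "testproc",
};

/*
 * Map the text written to a sysfs-like file to a mode number.
 * A trailing newline, if present, ends the name.
 * Returns MODE_INVALID when no mode name matches exactly.
 */
int parse_mode(const char *buf, size_t n)
{
	const char *p = memchr(buf, '\n', n);
	size_t len = p ? (size_t)(p - buf) : n;
	int i;

	/* Inclusive upper bound, so the last mode is not skipped. */
	for (i = MODE_FIRST; i <= MODE_MAX; i++) {
		const char *name = mode_names[i];

		if (name && strlen(name) == len && !strncmp(buf, name, len))
			return i;
	}
	return MODE_INVALID;
}
```

A caller would then reject MODE_INVALID with -EINVAL and accept MODE_PLATFORM only when a platform driver has registered its operations, as the patch does.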
* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-29 12:48 ` [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) R. J. Wysocki
@ 2007-04-29 12:53 ` Rafael J. Wysocki
  2007-04-30 8:29 ` Johannes Berg
  1 sibling, 0 replies; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-04-29 12:53 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Sunday, 29 April 2007 14:48, R. J. Wysocki wrote:
> [Trimmed the CC list to a reasonable minimum]
>
> On Thursday, 26 April 2007 18:31, Johannes Berg wrote:
> > On Thu, 2007-04-26 at 13:30 +0200, Pavel Machek wrote:
> >
> > > > From looking at pm_ops which I was recently working with a lot, it seems
> > > > that it was designed by somebody who was reading the ACPI documentation
> > > > and was otherwise pretty clueless, even at that level std tries to look
> > > > like suspend. IMHO that is one of the first things that should be ripped
> > > > out, no pm_ops for STD, it's a pain to work with.
> > >
> > > That code goes back to Patrick, AFAICT. (And yes, ACPI S3 and ACPI S4
> > > low-level enter is pretty similar).
> > >
> > > Patches would be welcome
> >
> > That was easier than I thought. This applies on top of a patch that
> > makes kernel/power/user.c optional since I had no idea how to fix it,
> > problems I see:
> >  * it surfaces kernel implementation details about pm_ops and thus makes
> >    the whole thing very fragile
> >  * it has yet another interface (yuck) to determine whether to reboot,
> >    shut down etc, doesn't use /sys/power/disk
> >  * I generally had no idea wtf it is doing in some places
> >
> > Anyway, this patch is only compile tested, it
> >  * introduces include/linux/hibernate.h with hibernate_ops and
> >    a new hibernate() function to hibernate the system
> >  * rips apart a lot of the suspend code and puts it back together using
> >    the hibernate_ops
> >  * switches ACPI to hibernate_ops (the only user of pm_ops.pm_disk_mode)
> >  * might apply/compile against -mm, I have all my and some of Rafael's
> >    suspend/hibernate work in my tree.
> >  * breaks user suspend as I noted above
> >  * is incomplete, somewhere pm_suspend_disk() is still defined iirc
>
> OK, I reworked it a bit.
>
> Main changes:
>
> - IMHO 'hibernation_ops' sounds better than 'hibernate_ops', for example, so
>   now the new names start with 'hibernation_' (or 'HIBERNATION_')
>
> - Moved the hibernation-related definitions to include/linux/suspend.h, since
>   some hibernation-specific definitions are already there. We can introduce
>   hibernation.h in a separate patch (it'll have to #include suspend.h IMO).
>
> - Changed the names starting from 'pm_disk_' (or 'PM_DISK_').
>
> - Cleaned up the new ACPI code (it didn't compile and included some things
>   unrelated to hibernation). I'm still not sure about acpi_hibernation_finish()
>   (is the code after acpi_disable_wakeup_device() really needed?)
>
> - Made kernel/power/user.c compile (and hopefully work too)

Forgot to say that hibernation_ops is needed, IMO, because ACPI can be modular.

Greetings,
Rafael
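The registration scheme discussed above (a table of three callbacks that the platform driver hands to the core, which then calls them only in 'platform' mode) can be sketched in user space as below. This is a rough illustration, not the patch itself. The struct layout follows the prepare/enter/finish callbacks named in the thread, but the int return value of hibernation_set_ops(), the platform_mode parameter and the demo_* callbacks are assumptions made for the example. The kernel version also takes pm_mutex, which is left out here.

```c
#include <stddef.h>
#include <stdio.h>

/* User-space stand-in for the proposed struct hibernation_ops. */
struct hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

static struct hibernation_ops *hibernation_ops;

/*
 * Assumption: returns 0 or -1 so that a caller can observe a rejected
 * table. NULL unregisters; a non-NULL table must provide all callbacks.
 */
int hibernation_set_ops(struct hibernation_ops *ops)
{
	if (ops && !(ops->prepare && ops->enter && ops->finish)) {
		fprintf(stderr, "incomplete hibernation operations ignored\n");
		return -1;
	}
	hibernation_ops = ops;
	return 0;
}

/* Call into the platform driver only if 'platform' mode is selected. */
int platform_prepare(int platform_mode)
{
	return (platform_mode && hibernation_ops) ?
		hibernation_ops->prepare() : 0;
}

void platform_finish(int platform_mode)
{
	if (platform_mode && hibernation_ops)
		hibernation_ops->finish();
}

/* Hypothetical platform driver callbacks that only count calls. */
static int demo_calls;

int demo_prepare(void) { demo_calls++; return 0; }
int demo_enter(void) { demo_calls++; return 0; }
void demo_finish(void) { demo_calls++; }

struct hibernation_ops demo_ops = {
	.prepare = demo_prepare,
	.enter = demo_enter,
	.finish = demo_finish,
};
```

Because an incomplete table is refused at registration time, the helpers only have to test the hibernation_ops pointer and need no per-callback NULL checks. A modular platform driver could pass NULL on unload to unregister itself.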
* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-29 12:48 ` [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) R. J. Wysocki
  2007-04-29 12:53 ` Rafael J. Wysocki
@ 2007-04-30 8:29 ` Johannes Berg
  2007-04-30 14:51 ` Rafael J. Wysocki
  1 sibling, 1 reply; 713+ messages in thread
From: Johannes Berg @ 2007-04-30 8:29 UTC (permalink / raw)
  To: R. J. Wysocki; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek

On Sun, 2007-04-29 at 14:48 +0200, R. J. Wysocki wrote:

> + status = acpi_enter_sleep_state(ACPI_STATE_S4);
> + local_irq_restore(flags);
> +
> + /*
> +  * Restore processor state
> +  * We should only be here if we're coming back from hibernation and
> +  * the memory image should have already been loaded from disk.

That comment doesn't seem right. This is in ->enter so afaict the image
hasn't been loaded yet at this point. I don't know if you just moved
code but if you did then I don't think it was correct before.

> +  */
> + acpi_restore_state_mem();

Maybe that needs to be in ->finish then? Or somewhere in the deeper arch
code?

Other than that it looks good to me on a cursory look. I'll give it a
try on my G5 on Wednesday or Thursday.

johannes
* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-30 8:29 ` Johannes Berg
@ 2007-04-30 14:51 ` Rafael J. Wysocki
  2007-04-30 14:59 ` Johannes Berg
  0 siblings, 1 reply; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-04-30 14:51 UTC (permalink / raw)
  To: Johannes Berg, Pavel Machek; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham

On Monday, 30 April 2007 10:29, Johannes Berg wrote:
> On Sun, 2007-04-29 at 14:48 +0200, R. J. Wysocki wrote:
>
> > + status = acpi_enter_sleep_state(ACPI_STATE_S4);
> > + local_irq_restore(flags);
> > +
> > + /*
> > +  * Restore processor state
> > +  * We should only be here if we're coming back from hibernation and
> > +  * the memory image should have already been loaded from disk.
>
> That comment doesn't seem right. This is in ->enter so afaict the image
> hasn't been loaded yet at this point. I don't know if you just moved
> code but if you did then I don't think it was correct before.

It was in your patch, so I kept it, but I don't think it's correct too.

Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
only needed by s2ram, so we can safely remove them from the hibernation code
path. Pavel, is that correct?

> > +  */
> > + acpi_restore_state_mem();
>
> Maybe that needs to be in ->finish then? Or somewhere in the deeper arch
> code?
>
> Other than that it looks good to me on a cursory look. I'll give it a
> try on my G5 on Wednesday or Thursday.

I think I'll have an improved version till then. :-)

Greetings,
Rafael
* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-04-30 14:51 ` Rafael J. Wysocki
@ 2007-04-30 14:59 ` Johannes Berg
  2007-05-01 14:05 ` Rafael J. Wysocki
  0 siblings, 1 reply; 713+ messages in thread
From: Johannes Berg @ 2007-04-30 14:59 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek

On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:

> > That comment doesn't seem right. This is in ->enter so afaict the image
> > hasn't been loaded yet at this point. I don't know if you just moved
> > code but if you did then I don't think it was correct before.
>
> It was in your patch, so I kept it, but I don't think it's correct too.

If it was in my patch then it must be there in the original code, iirc I
just shuffled it a bit :)

> Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> only needed by s2ram, so we can safely remove them from the hibernation code
> path. Pavel, is that correct?

This I don't know. They seemed to be done on hibernate too.

johannes
* Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-04-30 14:59 ` Johannes Berg @ 2007-05-01 14:05 ` Rafael J. Wysocki 2007-05-01 22:02 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-01 14:05 UTC (permalink / raw) To: Johannes Berg; +Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Pavel Machek On Monday, 30 April 2007 16:59, Johannes Berg wrote: > On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote: > > > > That comment doesn't seem right. This is in ->enter so afaict the image > > > hasn't been loaded yet at this point. I don't know if you just moved > > > code but if you did then I don't think it was correct before. > > > > It was in your patch, so I kept it, but I don't think it's correct too. > > If it was in my patch then it must be there in the original code, iirc I > just shuffled it a bit :) > > > Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are > > only needed by s2ram, so we can safely remove them from the hibernation code > > path. Pavel, is that correct? > > This I don't know. They seemed to be done on hibernate too. The previous version of the patch was missing the changes in suspend.h. Apart from this I've cleaned up some changes in disk.c and main.c to make the sysfs interface work again and dropped some ACPI code that I think was not necessary. Patch appended (tested on x86_64, but not extensively), comments welcome. 
:-) Greetings, Rafael --- This patch: * removes the definitions related to hibernation (aka suspend to disk) from include/linux/pm.h * introduces struct hibernation_ops and a new function to hibernate the system called hibernate(), defined in include/linux/suspend.h * separates suspend code in kernel/power/main.c from hibernation-related code in kernel/power/disk.c and kernel/power/user.c (with the help of hibernation_ops) * switches ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops --- Documentation/power/userland-swsusp.txt | 26 ++-- drivers/acpi/sleep/main.c | 67 +++++++++-- drivers/acpi/sleep/proc.c | 2 drivers/i2c/chips/tps65010.c | 2 include/linux/pm.h | 31 ----- include/linux/suspend.h | 32 +++++ kernel/power/disk.c | 186 +++++++++++++++++--------------- kernel/power/main.c | 42 ++----- kernel/power/power.h | 7 - kernel/power/user.c | 13 +- kernel/sys.c | 2 11 files changed, 225 insertions(+), 185 deletions(-) Index: linux-2.6.21/include/linux/pm.h =================================================================== --- linux-2.6.21.orig/include/linux/pm.h 2007-05-01 13:35:33.000000000 +0200 +++ linux-2.6.21/include/linux/pm.h 2007-05-01 13:35:33.000000000 +0200 @@ -107,26 +107,11 @@ typedef int __bitwise suspend_state_t; #define PM_SUSPEND_ON ((__force suspend_state_t) 0) #define PM_SUSPEND_STANDBY ((__force suspend_state_t) 1) #define PM_SUSPEND_MEM ((__force suspend_state_t) 3) -#define PM_SUSPEND_DISK ((__force suspend_state_t) 4) -#define PM_SUSPEND_MAX ((__force suspend_state_t) 5) - -typedef int __bitwise suspend_disk_method_t; - -/* invalid must be 0 so struct pm_ops initialisers can leave it out */ -#define PM_DISK_INVALID ((__force suspend_disk_method_t) 0) -#define PM_DISK_PLATFORM ((__force suspend_disk_method_t) 1) -#define PM_DISK_SHUTDOWN ((__force suspend_disk_method_t) 2) -#define PM_DISK_REBOOT ((__force suspend_disk_method_t) 3) -#define PM_DISK_TEST ((__force suspend_disk_method_t) 4) -#define PM_DISK_TESTPROC ((__force 
suspend_disk_method_t) 5) -#define PM_DISK_MAX ((__force suspend_disk_method_t) 6) +#define PM_SUSPEND_MAX ((__force suspend_state_t) 4) /** * struct pm_ops - Callbacks for managing platform dependent suspend states. * @valid: Callback to determine whether the given state can be entered. - * If %CONFIG_SOFTWARE_SUSPEND is set then %PM_SUSPEND_DISK is - * always valid and never passed to this call. If not assigned, - * no suspend states are valid. * Valid states are advertised in /sys/power/state but can still * be rejected by prepare or enter if the conditions aren't right. * There is a %pm_valid_only_mem function available that can be assigned @@ -140,24 +125,12 @@ typedef int __bitwise suspend_disk_metho * * @finish: Called when the system has left the given state and all devices * are resumed. The return value is ignored. - * - * @pm_disk_mode: The generic code always allows one of the shutdown methods - * %PM_DISK_SHUTDOWN, %PM_DISK_REBOOT, %PM_DISK_TEST and - * %PM_DISK_TESTPROC. If this variable is set, the mode it is set - * to is allowed in addition to those modes and is also made default. - * When this mode is sent selected, the @prepare call will be called - * before suspending to disk (if present), the @enter call should be - * present and will be called after all state has been saved and the - * machine is ready to be powered off; the @finish callback is called - * after state has been restored. All these calls are called with - * %PM_SUSPEND_DISK as the state. 
*/ struct pm_ops { int (*valid)(suspend_state_t state); int (*prepare)(suspend_state_t state); int (*enter)(suspend_state_t state); int (*finish)(suspend_state_t state); - suspend_disk_method_t pm_disk_mode; }; /** @@ -276,8 +249,6 @@ extern void device_power_up(void); extern void device_resume(void); #ifdef CONFIG_PM -extern suspend_disk_method_t pm_disk_mode; - extern int device_suspend(pm_message_t state); extern int device_prepare_suspend(pm_message_t state); Index: linux-2.6.21/kernel/power/main.c =================================================================== --- linux-2.6.21.orig/kernel/power/main.c 2007-05-01 13:35:33.000000000 +0200 +++ linux-2.6.21/kernel/power/main.c 2007-05-01 15:14:00.000000000 +0200 @@ -30,7 +30,6 @@ DEFINE_MUTEX(pm_mutex); struct pm_ops *pm_ops; -suspend_disk_method_t pm_disk_mode = PM_DISK_SHUTDOWN; /** * pm_set_ops - Set the global power method table. @@ -41,10 +40,6 @@ void pm_set_ops(struct pm_ops * ops) { mutex_lock(&pm_mutex); pm_ops = ops; - if (ops && ops->pm_disk_mode != PM_DISK_INVALID) { - pm_disk_mode = ops->pm_disk_mode; - } else - pm_disk_mode = PM_DISK_SHUTDOWN; mutex_unlock(&pm_mutex); } @@ -184,24 +179,12 @@ static void suspend_finish(suspend_state static const char * const pm_states[PM_SUSPEND_MAX] = { [PM_SUSPEND_STANDBY] = "standby", [PM_SUSPEND_MEM] = "mem", - [PM_SUSPEND_DISK] = "disk", }; static inline int valid_state(suspend_state_t state) { - /* Suspend-to-disk does not really need low-level support. - * It can work with shutdown/reboot if needed. If it isn't - * configured, then it cannot be supported. - */ - if (state == PM_SUSPEND_DISK) -#ifdef CONFIG_SOFTWARE_SUSPEND - return 1; -#else - return 0; -#endif - - /* all other states need lowlevel support and need to be - * valid to the lowlevel implementation, no valid callback + /* All states need lowlevel support and need to be valid + * to the lowlevel implementation, no valid callback * implies that none are valid. 
*/ if (!pm_ops || !pm_ops->valid || !pm_ops->valid(state)) return 0; @@ -229,11 +212,6 @@ static int enter_state(suspend_state_t s if (!mutex_trylock(&pm_mutex)) return -EBUSY; - if (state == PM_SUSPEND_DISK) { - error = pm_suspend_disk(); - goto Unlock; - } - pr_debug("PM: Preparing system for %s sleep\n", pm_states[state]); if ((error = suspend_prepare(state))) goto Unlock; @@ -251,7 +229,7 @@ static int enter_state(suspend_state_t s /** * pm_suspend - Externally visible function for suspending system. - * @state: Enumarted value of state to enter. + * @state: Enumerated value of state to enter. * * Determine whether or not value is within range, get state * structure, and enter (above). @@ -289,7 +267,13 @@ static ssize_t state_show(struct subsyst if (pm_states[i] && valid_state(i)) s += sprintf(s,"%s ", pm_states[i]); } - s += sprintf(s,"\n"); +#ifdef CONFIG_SOFTWARE_SUSPEND + s += sprintf(s, "%s\n", "disk"); +#else + if (s != buf) + /* convert the last space to a newline */ + *(s-1) = '\n'; +#endif return (s - buf); } @@ -304,6 +288,12 @@ static ssize_t state_store(struct subsys p = memchr(buf, '\n', n); len = p ? p - buf : n; + /* First, check if we are requested to hibernate */ + if (!strncmp(buf, "disk", len)) { + error = hibernate(); + return error ? 
error : n; + } + for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { if (*s && !strncmp(buf, *s, len)) break; Index: linux-2.6.21/kernel/power/disk.c =================================================================== --- linux-2.6.21.orig/kernel/power/disk.c 2007-05-01 13:35:33.000000000 +0200 +++ linux-2.6.21/kernel/power/disk.c 2007-05-01 15:14:13.000000000 +0200 @@ -30,30 +30,60 @@ char resume_file[256] = CONFIG_PM_STD_PA dev_t swsusp_resume_device; sector_t swsusp_resume_block; +static int hibernation_mode; + +enum { + HIBERNATION_INVALID, + HIBERNATION_PLATFORM, + HIBERNATION_TEST, + HIBERNATION_TESTPROC, + HIBERNATION_SHUTDOWN, + HIBERNATION_REBOOT, + /* keep last */ + __HIBERNATION_AFTER_LAST +}; +#define HIBERNATION_MAX (__HIBERNATION_AFTER_LAST-1) +#define HIBERNATION_FIRST (HIBERNATION_INVALID + 1) + +struct hibernation_ops *hibernation_ops; + +void hibernation_set_ops(struct hibernation_ops *ops) +{ + if (ops && !(ops->prepare && ops->enter && ops->finish)) { + printk(KERN_ERR "Wrong definition of hibernation operations! " + "Using defaults\n"); + return; + } + mutex_lock(&pm_mutex); + hibernation_ops = ops; + mutex_unlock(&pm_mutex); +} + + /** * platform_prepare - prepare the machine for hibernation using the * platform driver if so configured and return an error code if it fails */ -static inline int platform_prepare(void) +static int platform_prepare(void) { - int error = 0; + return (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops) ? 
+		hibernation_ops->prepare() : 0;
+}
 
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->prepare)
-			error = pm_ops->prepare(PM_SUSPEND_DISK);
-	}
-	return error;
+/**
+ *	platform_finish - switch the machine to the normal mode of operation
+ *	using the platform driver (must be called after platform_prepare())
+ */
+
+static void platform_finish(void)
+{
+	if (hibernation_mode == HIBERNATION_PLATFORM && hibernation_ops)
+		hibernation_ops->finish();
 }
 
 /**
- *	power_down - Shut machine down for hibernate.
+ *	power_down - Shut the machine down for hibernation.
  *
  *	Use the platform driver, if configured so; otherwise try
  *	to power off or reboot.
@@ -61,20 +91,20 @@ static inline int platform_prepare(void)
 
 static void power_down(void)
 {
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
+	switch (hibernation_mode) {
+	case HIBERNATION_TEST:
+	case HIBERNATION_TESTPROC:
 		break;
-	case PM_DISK_SHUTDOWN:
+	case HIBERNATION_SHUTDOWN:
 		kernel_power_off();
 		break;
-	case PM_DISK_REBOOT:
+	case HIBERNATION_REBOOT:
 		kernel_restart(NULL);
 		break;
-	default:
-		if (pm_ops && pm_ops->enter) {
+	case HIBERNATION_PLATFORM:
+		if (hibernation_ops) {
 			kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-			pm_ops->enter(PM_SUSPEND_DISK);
+			hibernation_ops->enter();
 			break;
 		}
 	}
@@ -87,20 +117,6 @@ static void power_down(void)
 	while(1);
 }
 
-static inline void platform_finish(void)
-{
-	switch (pm_disk_mode) {
-	case PM_DISK_TEST:
-	case PM_DISK_TESTPROC:
-	case PM_DISK_SHUTDOWN:
-	case PM_DISK_REBOOT:
-		break;
-	default:
-		if (pm_ops && pm_ops->finish)
-			pm_ops->finish(PM_SUSPEND_DISK);
-	}
-}
-
 static void unprepare_processes(void)
 {
 	thaw_processes();
@@ -120,13 +136,10 @@ static int prepare_processes(void)
 }
 
 /**
- *	pm_suspend_disk - The granpappy of hibernation power management.
- *
- *	If not, then call swsusp to do its thing, then figure out how
- *	to power down the system.
+ *	hibernate - The granpappy of the built-in hibernation management
  */
 
-int pm_suspend_disk(void)
+int hibernate(void)
 {
 	int error;
 
@@ -143,7 +156,8 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Finish;
 
-	if (pm_disk_mode == PM_DISK_TESTPROC) {
+	mutex_lock(&pm_mutex);
+	if (hibernation_mode == HIBERNATION_TESTPROC) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Thaw;
@@ -168,7 +182,7 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Enable_cpus;
 
-	if (pm_disk_mode == PM_DISK_TEST) {
+	if (hibernation_mode == HIBERNATION_TEST) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);
 		goto Enable_cpus;
@@ -205,6 +219,7 @@ int pm_suspend_disk(void)
 	device_resume();
 	resume_console();
  Thaw:
+	mutex_unlock(&pm_mutex);
 	unprepare_processes();
  Finish:
 	free_basic_memory_bitmaps();
@@ -220,7 +235,7 @@ int pm_suspend_disk(void)
  *	Called as a late_initcall (so all devices are discovered and
  *	initialized), we call swsusp to see if we have a saved image or not.
  *	If so, we quiesce devices, the restore the saved image. We will
- *	return above (in pm_suspend_disk() ) if everything goes well.
+ *	return above (in hibernate() ) if everything goes well.
  *	Otherwise, we fail gracefully and return to the normally
  *	scheduled program.
  *
@@ -315,25 +330,26 @@ static int software_resume(void)
 late_initcall(software_resume);
 
 
-static const char * const pm_disk_modes[] = {
-	[PM_DISK_PLATFORM]	= "platform",
-	[PM_DISK_SHUTDOWN]	= "shutdown",
-	[PM_DISK_REBOOT]	= "reboot",
-	[PM_DISK_TEST]		= "test",
-	[PM_DISK_TESTPROC]	= "testproc",
+static const char * const hibernation_modes[] = {
+	[HIBERNATION_PLATFORM]	= "platform",
+	[HIBERNATION_SHUTDOWN]	= "shutdown",
+	[HIBERNATION_REBOOT]	= "reboot",
+	[HIBERNATION_TEST]	= "test",
+	[HIBERNATION_TESTPROC]	= "testproc",
 };
 
 /**
- *	disk - Control suspend-to-disk mode
+ *	disk - Control hibernation mode
  *
  *	Suspend-to-disk can be handled in several ways. We have a few options
  *	for putting the system to sleep - using the platform driver (e.g. ACPI
- *	or other pm_ops), powering off the system or rebooting the system
- *	(for testing) as well as the two test modes.
+ *	or other hibernation_ops), powering off the system or rebooting the
+ *	system (for testing) as well as the two test modes.
  *
  *	The system can support 'platform', and that is known a priori (and
- *	encoded in pm_ops). However, the user may choose 'shutdown' or 'reboot'
- *	as alternatives, as well as the test modes 'test' and 'testproc'.
+ *	encoded by the presence of hibernation_ops). However, the user may
+ *	choose 'shutdown' or 'reboot' as alternatives, as well as one of the
+ *	test modes, 'test' or 'testproc'.
  *
  *	show() will display what the mode is currently set to.
  *	store() will accept one of
@@ -345,7 +361,7 @@ static const char * const pm_disk_modes[
  *	'testproc'
  *
  *	It will only change to 'platform' if the system
- *	supports it (as determined from pm_ops->pm_disk_mode).
+ *	supports it (as determined by having hibernation_ops).
  */
 
 static ssize_t disk_show(struct subsystem * subsys, char * buf)
@@ -353,28 +369,25 @@ static ssize_t disk_show(struct subsyste
 	int i;
 	char *start = buf;
 
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!pm_disk_modes[i])
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!hibernation_modes[i])
 			continue;
 		switch (i) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (i == pm_ops->pm_disk_mode))
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
 				break;
 			/* not a valid mode, continue with loop */
 			continue;
 		}
-		if (i == pm_disk_mode)
-			buf += sprintf(buf, "[%s]", pm_disk_modes[i]);
+		if (i == hibernation_mode)
+			buf += sprintf(buf, "[%s] ", hibernation_modes[i]);
 		else
-			buf += sprintf(buf, "%s", pm_disk_modes[i]);
-		if (i+1 != PM_DISK_MAX)
-			buf += sprintf(buf, " ");
+			buf += sprintf(buf, "%s ", hibernation_modes[i]);
 	}
 	buf += sprintf(buf, "\n");
 	return buf-start;
@@ -387,39 +400,38 @@ static ssize_t disk_store(struct subsyst
 	int i;
 	int len;
 	char *p;
-	suspend_disk_method_t mode = 0;
+	int mode = HIBERNATION_INVALID;
 
 	p = memchr(buf, '\n', n);
 	len = p ? p - buf : n;
 
 	mutex_lock(&pm_mutex);
-	for (i = PM_DISK_PLATFORM; i < PM_DISK_MAX; i++) {
-		if (!strncmp(buf, pm_disk_modes[i], len)) {
+	for (i = HIBERNATION_FIRST; i <= HIBERNATION_MAX; i++) {
+		if (!strncmp(buf, hibernation_modes[i], len)) {
 			mode = i;
 			break;
 		}
 	}
-	if (mode) {
+	if (mode != HIBERNATION_INVALID) {
 		switch (mode) {
-		case PM_DISK_SHUTDOWN:
-		case PM_DISK_REBOOT:
-		case PM_DISK_TEST:
-		case PM_DISK_TESTPROC:
-			pm_disk_mode = mode;
+		case HIBERNATION_SHUTDOWN:
+		case HIBERNATION_REBOOT:
+		case HIBERNATION_TEST:
+		case HIBERNATION_TESTPROC:
+			hibernation_mode = mode;
 			break;
-		default:
-			if (pm_ops && pm_ops->enter &&
-			    (mode == pm_ops->pm_disk_mode))
-				pm_disk_mode = mode;
+		case HIBERNATION_PLATFORM:
+			if (hibernation_ops)
+				hibernation_mode = mode;
 			else
 				error = -EINVAL;
 		}
-	} else {
+	} else
 		error = -EINVAL;
-	}
 
-	pr_debug("PM: suspend-to-disk mode set to '%s'\n",
-		 pm_disk_modes[mode]);
+	if (!error)
+		pr_debug("PM: suspend-to-disk mode set to '%s'\n",
+			 hibernation_modes[mode]);
 	mutex_unlock(&pm_mutex);
 	return error ? error : n;
 }
Index: linux-2.6.21/Documentation/power/userland-swsusp.txt
===================================================================
--- linux-2.6.21.orig/Documentation/power/userland-swsusp.txt	2007-05-01 13:35:25.000000000 +0200
+++ linux-2.6.21/Documentation/power/userland-swsusp.txt	2007-05-01 13:35:33.000000000 +0200
@@ -93,21 +93,23 @@ SNAPSHOT_S2RAM - suspend to RAM; using t
 	to resume the system from RAM if there's enough battery power or restore
 	its state on the basis of the saved suspend image otherwise)
 
-SNAPSHOT_PMOPS - enable the usage of the pmops->prepare, pmops->enter and
-	pmops->finish methods (the in-kernel swsusp knows these as the "platform
-	method") which are needed on many machines to (among others) speed up
-	the resume by letting the BIOS skip some steps or to let the system
-	recognise the correct state of the hardware after the resume (in
-	particular on many machines this ensures that unplugged AC
-	adapters get correctly detected and that kacpid does not run wild after
-	the resume).  The last ioctl() argument can take one of the three
-	values, defined in kernel/power/power.h:
+SNAPSHOT_PMOPS - enable the usage of the hibernation_ops->prepare,
+	hibernation_ops->enter and hibernation_ops->finish methods (the in-kernel
+	swsusp knows these as the "platform method") which are needed on many
+	machines to (among others) speed up the resume by letting the BIOS skip
+	some steps or to let the system recognise the correct state of the
+	hardware after the resume (in particular on many machines this ensures
+	that unplugged AC adapters get correctly detected and that kacpid does
+	not run wild after the resume).  The last ioctl() argument can take one
+	of the three values, defined in kernel/power/power.h:
 	PMOPS_PREPARE - make the kernel carry out the
-		pm_ops->prepare(PM_SUSPEND_DISK) operation
+		hibernation_ops->prepare() operation
 	PMOPS_ENTER - make the kernel power off the system by calling
-		pm_ops->enter(PM_SUSPEND_DISK)
+		hibernation_ops->enter()
 	PMOPS_FINISH - make the kernel carry out the
-		pm_ops->finish(PM_SUSPEND_DISK) operation
+		hibernation_ops->finish() operation
+	Note that the actual constants are misnamed because they surface
+	internal kernel implementation details that have changed.
 
 The device's read() operation can be used to transfer the snapshot image from
 the kernel.  It has the following limitations:
Index: linux-2.6.21/drivers/i2c/chips/tps65010.c
===================================================================
--- linux-2.6.21.orig/drivers/i2c/chips/tps65010.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/i2c/chips/tps65010.c	2007-05-01 13:35:33.000000000 +0200
@@ -354,7 +354,7 @@ static void tps65010_interrupt(struct tp
 			 * also needs to get error handling and probably
 			 * an #ifdef CONFIG_SOFTWARE_SUSPEND
 			 */
-			pm_suspend(PM_SUSPEND_DISK);
+			hibernate();
 #endif
 			poll = 1;
 		}
Index: linux-2.6.21/kernel/sys.c
===================================================================
--- linux-2.6.21.orig/kernel/sys.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/sys.c	2007-05-01 13:35:33.000000000 +0200
@@ -881,7 +881,7 @@ asmlinkage long sys_reboot(int magic1, i
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		{
-			int ret = pm_suspend(PM_SUSPEND_DISK);
+			int ret = hibernate();
 			unlock_kernel();
 			return ret;
 		}
Index: linux-2.6.21/drivers/acpi/sleep/main.c
===================================================================
--- linux-2.6.21.orig/drivers/acpi/sleep/main.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/acpi/sleep/main.c	2007-05-01 14:20:45.000000000 +0200
@@ -29,7 +29,6 @@ static u32 acpi_suspend_states[] = {
 	[PM_SUSPEND_ON] = ACPI_STATE_S0,
 	[PM_SUSPEND_STANDBY] = ACPI_STATE_S1,
 	[PM_SUSPEND_MEM] = ACPI_STATE_S3,
-	[PM_SUSPEND_DISK] = ACPI_STATE_S4,
 	[PM_SUSPEND_MAX] = ACPI_STATE_S5
 };
 
@@ -94,14 +93,6 @@ static int acpi_pm_enter(suspend_state_t
 		do_suspend_lowlevel();
 		break;
 
-	case PM_SUSPEND_DISK:
-		if (acpi_pm_ops.pm_disk_mode == PM_DISK_PLATFORM)
-			status = acpi_enter_sleep_state(acpi_state);
-		break;
-	case PM_SUSPEND_MAX:
-		acpi_power_off();
-		break;
-
 	default:
 		return -EINVAL;
 	}
@@ -157,12 +148,13 @@ int acpi_suspend(u32 acpi_state)
 	suspend_state_t states[] = {
 		[1] = PM_SUSPEND_STANDBY,
 		[3] = PM_SUSPEND_MEM,
-		[4] = PM_SUSPEND_DISK,
 		[5] = PM_SUSPEND_MAX
 	};
 
 	if (acpi_state < 6 && states[acpi_state])
 		return pm_suspend(states[acpi_state]);
+	if (acpi_state == 4)
+		return hibernate();
 	return -EINVAL;
 }
 
@@ -189,6 +181,49 @@ static struct pm_ops acpi_pm_ops = {
 	.finish = acpi_pm_finish,
 };
 
+#ifdef CONFIG_SOFTWARE_SUSPEND
+static int acpi_hibernation_prepare(void)
+{
+	return acpi_sleep_prepare(ACPI_STATE_S4);
+}
+
+static int acpi_hibernation_enter(void)
+{
+	acpi_status status = AE_OK;
+	unsigned long flags = 0;
+
+	ACPI_FLUSH_CPU_CACHE();
+
+	local_irq_save(flags);
+	acpi_enable_wakeup_device(ACPI_STATE_S4);
+	/* This shouldn't return.  If it returns, we have a problem */
+	status = acpi_enter_sleep_state(ACPI_STATE_S4);
+	local_irq_restore(flags);
+
+	return ACPI_SUCCESS(status) ? 0 : -EFAULT;
+}
+
+static void acpi_hibernation_finish(void)
+{
+	acpi_leave_sleep_state(ACPI_STATE_S4);
+	acpi_disable_wakeup_device(ACPI_STATE_S4);
+
+	/* reset firmware waking vector */
+	acpi_set_firmware_waking_vector((acpi_physical_address) 0);
+
+	if (init_8259A_after_S1) {
+		printk("Broken toshiba laptop -> kicking interrupts\n");
+		init_8259A(0);
+	}
+}
+
+static struct hibernation_ops acpi_hibernation_ops = {
+	.prepare = acpi_hibernation_prepare,
+	.enter = acpi_hibernation_enter,
+	.finish = acpi_hibernation_finish,
+};
+#endif				/* CONFIG_SOFTWARE_SUSPEND */
+
 /*
  * Toshiba fails to preserve interrupts over S1, reinitialization
  * of 8259 is needed after S1 resume.
@@ -227,14 +262,18 @@ int __init acpi_sleep_init(void)
 			sleep_states[i] = 1;
 			printk(" S%d", i);
 		}
-		if (i == ACPI_STATE_S4) {
-			if (sleep_states[i])
-				acpi_pm_ops.pm_disk_mode = PM_DISK_PLATFORM;
-		}
 	}
 	printk(")\n");
 
 	pm_set_ops(&acpi_pm_ops);
+
+#ifdef CONFIG_SOFTWARE_SUSPEND
+	if (sleep_states[ACPI_STATE_S4])
+		hibernation_set_ops(&acpi_hibernation_ops);
+#else
+	sleep_states[ACPI_STATE_S4] = 0;
+#endif
+
 	return 0;
 }
 
Index: linux-2.6.21/kernel/power/power.h
===================================================================
--- linux-2.6.21.orig/kernel/power/power.h	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/power.h	2007-05-01 13:35:33.000000000 +0200
@@ -25,12 +25,7 @@ struct swsusp_info {
  */
 #define SPARE_PAGES	((1024 * 1024) >> PAGE_SHIFT)
 
-extern int pm_suspend_disk(void);
-#else
-static inline int pm_suspend_disk(void)
-{
-	return -EPERM;
-}
+extern struct hibernation_ops *hibernation_ops;
 #endif
 
 extern struct mutex pm_mutex;
Index: linux-2.6.21/drivers/acpi/sleep/proc.c
===================================================================
--- linux-2.6.21.orig/drivers/acpi/sleep/proc.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/drivers/acpi/sleep/proc.c	2007-05-01 13:35:33.000000000 +0200
@@ -60,7 +60,7 @@ acpi_system_write_sleep(struct file *fil
 	state = simple_strtoul(str, NULL, 0);
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	if (state == 4) {
-		error = pm_suspend(PM_SUSPEND_DISK);
+		error = hibernate();
 		goto Done;
 	}
 #endif
Index: linux-2.6.21/kernel/power/user.c
===================================================================
--- linux-2.6.21.orig/kernel/power/user.c	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/kernel/power/user.c	2007-05-01 13:35:33.000000000 +0200
@@ -130,16 +130,16 @@ static inline int platform_prepare(void)
 {
 	int error = 0;
 
-	if (pm_ops && pm_ops->prepare)
-		error = pm_ops->prepare(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		error = hibernation_ops->prepare();
 
 	return error;
 }
 
 static inline void platform_finish(void)
 {
-	if (pm_ops && pm_ops->finish)
-		pm_ops->finish(PM_SUSPEND_DISK);
+	if (hibernation_ops)
+		hibernation_ops->finish();
 }
 
 static inline int snapshot_suspend(int platform_suspend)
@@ -384,7 +384,7 @@ static int snapshot_ioctl(struct inode *
 		switch (arg) {
 
 		case PMOPS_PREPARE:
-			if (pm_ops && pm_ops->enter) {
+			if (hibernation_ops) {
 				data->platform_suspend = 1;
 				error = 0;
 			} else {
@@ -395,8 +395,7 @@ static int snapshot_ioctl(struct inode *
 		case PMOPS_ENTER:
 			if (data->platform_suspend) {
 				kernel_shutdown_prepare(SYSTEM_SUSPEND_DISK);
-				error = pm_ops->enter(PM_SUSPEND_DISK);
-				error = 0;
+				error = hibernation_ops->enter();
 			}
 			break;
 
Index: linux-2.6.21/include/linux/suspend.h
===================================================================
--- linux-2.6.21.orig/include/linux/suspend.h	2007-05-01 13:35:33.000000000 +0200
+++ linux-2.6.21/include/linux/suspend.h	2007-05-01 13:35:33.000000000 +0200
@@ -32,6 +32,24 @@ static inline int pm_prepare_console(voi
 static inline void pm_restore_console(void) {}
 #endif
 
+/**
+ * struct hibernation_ops - hibernation platform support
+ *
+ * The methods in this structure allow a platform to override the default
+ * mechanism of shutting down the machine during a hibernation transition.
+ *
+ * All three methods must be assigned.
+ *
+ * @prepare: prepare system for hibernation
+ * @enter: shut down system after state has been saved to disk
+ * @finish: finish/clean up after state has been reloaded
+ */
+struct hibernation_ops {
+	int (*prepare)(void);
+	int (*enter)(void);
+	void (*finish)(void);
+};
+
 #if defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND)
 /* kernel/power/snapshot.c */
 extern void __init register_nosave_region(unsigned long, unsigned long);
@@ -39,11 +57,25 @@ extern int swsusp_page_is_forbidden(stru
 extern void swsusp_set_page_free(struct page *);
 extern void swsusp_unset_page_free(struct page *);
 extern unsigned long get_safe_page(gfp_t gfp_mask);
+
+/**
+ * hibernation_set_ops - set the global hibernate operations
+ * @ops: the hibernation operations to use in subsequent hibernation transitions
+ */
+void hibernation_set_ops(struct hibernation_ops *ops);
+
+/**
+ * hibernate - hibernate the system
+ */
+extern int hibernate(void);
 #else
 static inline void register_nosave_region(unsigned long b, unsigned long e) {}
 static inline int swsusp_page_is_forbidden(struct page *p) { return 0; }
 static inline void swsusp_set_page_free(struct page *p) {}
 static inline void swsusp_unset_page_free(struct page *p) {}
+
+static inline void hibernation_set_ops(struct hibernation_ops *ops) {}
+extern inline int hibernate(void) { return -ENOSYS; }
 #endif /* defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND) */
 
 void save_processor_state(void);

^ permalink raw reply	[flat|nested] 713+ messages in thread
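As an illustration of how a platform might plug into the struct hibernation_ops added by the patch above, here is a minimal userspace sketch. The structure layout and the unconditional calls through the pointer follow the patch; everything else is an assumption made for the example: the body of hibernation_set_ops() is not shown in this part of the thread, so the "all three methods must be assigned" check and its return value are hypothetical, and the mock_* callbacks and broken_ops are invented stand-ins for a real platform driver.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the structure the patch adds to include/linux/suspend.h. */
struct hibernation_ops {
	int (*prepare)(void);
	int (*enter)(void);
	void (*finish)(void);
};

/* Stand-in for the global pointer declared in kernel/power/power.h. */
struct hibernation_ops *hibernation_ops;

/*
 * Assumption: refuse a partially filled structure, because the kerneldoc
 * says all three methods must be assigned. The int return value is
 * hypothetical; the prototype in the patch returns void.
 */
int hibernation_set_ops(struct hibernation_ops *ops)
{
	if (ops && !(ops->prepare && ops->enter && ops->finish))
		return -1;
	hibernation_ops = ops;
	return 0;
}

/* Same shape as platform_prepare() in kernel/power/user.c after the patch. */
int platform_prepare(void)
{
	int error = 0;

	if (hibernation_ops)
		error = hibernation_ops->prepare();

	return error;
}

/* Hypothetical platform callbacks that only count how often they run. */
int calls;
static int mock_prepare(void) { calls++; return 0; }
static int mock_enter(void)   { calls++; return 0; }
static void mock_finish(void) { calls++; }

struct hibernation_ops mock_ops = {
	.prepare = mock_prepare,
	.enter   = mock_enter,
	.finish  = mock_finish,
};

/* Incomplete on purpose: .enter and .finish are left NULL. */
struct hibernation_ops broken_ops = { .prepare = mock_prepare };
```

Because callers such as platform_prepare() only test the hibernation_ops pointer and not the individual methods, validating the structure once at registration time is what keeps the later unchecked calls safe.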
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-01 14:05 ` Rafael J. Wysocki
@ 2007-05-01 22:02   ` Rafael J. Wysocki
  2007-05-02  5:13     ` Alexey Starikovskiy
  2007-05-02  8:21     ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  0 siblings, 2 replies; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-05-01 22:02 UTC (permalink / raw)
  To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek

On Tuesday, 1 May 2007 16:05, Rafael J. Wysocki wrote:
> On Monday, 30 April 2007 16:59, Johannes Berg wrote:
> > On Mon, 2007-04-30 at 16:51 +0200, Rafael J. Wysocki wrote:
> >
> > > > That comment doesn't seem right. This is in ->enter so afaict the image
> > > > hasn't been loaded yet at this point. I don't know if you just moved
> > > > code but if you did then I don't think it was correct before.
> > >
> > > It was in your patch, so I kept it, but I don't think it's correct either.
> >
> > If it was in my patch then it must be there in the original code, iirc I
> > just shuffled it a bit :)
> >
> > > Moreover, it seems that acpi_save_state_mem() and acpi_restore_state_mem() are
> > > only needed by s2ram, so we can safely remove them from the hibernation code
> > > path.  Pavel, is that correct?
> >
> > This I don't know. They seemed to be done on hibernate too.
>
> The previous version of the patch was missing the changes in suspend.h.
>
> Apart from this I've cleaned up some changes in disk.c and main.c to make
> the sysfs interface work again and dropped some ACPI code that I think was
> not necessary.
>
> Patch appended (tested on x86_64, but not extensively), comments welcome. :-)

Well, having looked at the ACPI spec, I think that what we're trying to do
with this patch is actually wrong.

Instead, we should rip out all of the invocations of pm_ops->whatever() from
the hibernation code paths (with the exceptions below) and, *if* the platform
method is to be used, call pm_ops to make the system go down in the following
way:
1) call pm_ops->prepare(PM_SUSPEND_DISK)
2) suspend devices (ie. call device_suspend() etc.)
3) call pm_ops->enter(PM_SUSPEND_DISK)
and if that *fails* (ie. pm_ops->enter() returns):
4) call pm_ops->finish(PM_SUSPEND_DISK)
5) halt the system

Formally, after restoring the image, *if* the platform method was used (ie. the
above was executed as the last hibernation step), we should call
pm_ops->finish(PM_SUSPEND_DISK) before resuming devices, but since we get
control from the "old kernel" rather than from the BIOS, this doesn't seem to
be the right thing to do.

I'll try to create a patch along these lines and see if it breaks anything on
my boxes.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 713+ messages in thread
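The five-step power-down sequence proposed in the message above can be sketched in plain C as follows. This is only a userspace model under stated assumptions, not kernel code: struct pm_ops is reduced to the three callbacks named in the message, the mock_* functions, device_suspend_stub() and halt_stub() are hypothetical stand-ins that merely record the order of calls, and the handling of a device_suspend() failure is a guess, since the message does not cover that case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PM_SUSPEND_DISK = 4 };	/* numeric value is illustrative only */

/* Reduced mock of struct pm_ops: only the callbacks the message mentions. */
struct pm_ops {
	int (*prepare)(int state);
	int (*enter)(int state);
	int (*finish)(int state);
};

/* One character per executed step, so the ordering can be inspected. */
char trace[16];
static size_t trace_len;

static void mark(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
}

/* What the mocked enter() reports when it (unexpectedly) returns. */
int enter_result = -5;

static int mock_prepare(int state) { (void)state; mark('P'); return 0; }
static int mock_enter(int state)   { (void)state; mark('E'); return enter_result; }
static int mock_finish(int state)  { (void)state; mark('F'); return 0; }
static int device_suspend_stub(void) { mark('S'); return 0; }
static void halt_stub(void) { mark('H'); }

struct pm_ops mock_pm_ops = {
	.prepare = mock_prepare,
	.enter   = mock_enter,
	.finish  = mock_finish,
};

int platform_power_down(struct pm_ops *ops)
{
	int error;

	error = ops->prepare(PM_SUSPEND_DISK);		/* step 1 */
	if (error)
		return error;

	error = device_suspend_stub();			/* step 2 */
	if (error) {
		/* Not covered by the message; assumption: undo prepare(). */
		ops->finish(PM_SUSPEND_DISK);
		return error;
	}

	/* Step 3: on real hardware a successful enter() never returns. */
	error = ops->enter(PM_SUSPEND_DISK);

	/* Getting here means enter() came back, which counts as failure. */
	ops->finish(PM_SUSPEND_DISK);			/* step 4 */
	halt_stub();					/* step 5 */
	return error ? error : -1;
}
```

In this model the mocked enter() always returns, so every run exercises the failure path (steps 4 and 5) that the message describes.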
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-01 22:02 ` Rafael J. Wysocki
@ 2007-05-02  5:13   ` Alexey Starikovskiy
  2007-05-02 13:42     ` Rafael J. Wysocki
  2007-05-02  8:21   ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg
  1 sibling, 1 reply; 713+ messages in thread
From: Alexey Starikovskiy @ 2007-05-02 5:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Rafael,

On resume, ACPI expects
the boot kernel to do pm_prepare() and
the resumed kernel to do pm_finish() before device_resume().

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02  5:13 ` Alexey Starikovskiy
@ 2007-05-02 13:42   ` Rafael J. Wysocki
  2007-05-02 14:11     ` Alexey Starikovskiy
  0 siblings, 1 reply; 713+ messages in thread
From: Rafael J. Wysocki @ 2007-05-02 13:42 UTC (permalink / raw)
  To: Alexey Starikovskiy
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Hi,

On Wednesday, 2 May 2007 07:13, Alexey Starikovskiy wrote:
> Rafael,
>
> On resume ACPI expects
> boot kernel do pm_prepare().
> resumed kernel do pm_finish() before device_resume().

Well, let's analyse what pm_prepare() actually does.  If my understanding of
that code and of the ACPI spec [1] is correct, it does the following:
(1) Sets the firmware waking vector (doesn't matter for hibernation)
(2) Prepares the wake-up devices for a state transition, by calling their _PSW
    methods ("to enable wake" according to the spec)
(3) Disables the GPEs that cannot wake up the system
(4) Runs the _PTS and _GTS methods
(5) Runs the _SST method
(6) Disables all GPEs

Now, as far as I can see, there are a couple of problems with that, regardless
of what it's used for:

a) The spec (in Section 7.2) says that (2) should be done *after* the _PTS
   method is called.

b) The spec (Section 7.3.2) says:
   "This [_PTS] method is called after OSPM has notified native device drivers
   of the sleep state transition and before the OSPM has had a chance to fully
   prepare the system for a sleep state transition."
   We don't seem to be doing this.  Moreover, Section 15.1.6 of the spec says
   that "OSPM places all device drivers into their respective Dx state"
   *before* _PTS is executed, so it doesn't look like _PTS should be executed
   before device_suspend().

c) According to the spec (Section 15.1.6) "OSPM saves any other processor’s
   context (other than the local processor) to memory" *after* executing _PTS,
   but *before* _GTS is executed, whereas we do this after _GTS is executed.
   Moreover, the waking vector should be written into FACS after the "other
   processor’s context" has been saved, but *before* _GTS is executed.

d) The spec (Section 7.3.3) says literally this:
   "_GTS allows ACPI system firmware to perform any required system specific
   functions prior to entering a system sleep state. OSPM will set the sleep
   enable (SLP_EN) bit in the PM1 control register immediately following the
   execution of the _GTS control method without performing any other physical
   I/O or allowing any interrupt servicing."
   However, in our code _GTS is executed *waaay* before setting the SLP_EN bit
   in PM1, which only happens in acpi_enter_sleep_state() called from
   acpi_pm_enter(), *after* we've executed device_suspend() with IRQs enabled
   and, in the hibernation case, called device_resume() and saved the image
   (oh, dear).

e) It implicitly follows from d) that _SST should be executed before _GTS and
   after we run device_suspend().

f) I'm not sure if the disabling of all GPEs before device_suspend() is
   actually a good idea.

Next, we can consider acpi_pm_finish().  Again, if my understanding of that
code and of the ACPI spec [1] is correct, it does the following:
(7) Sets SLP_EN and SLP_TYPE to state S0
(8) Executes the _SST method (Waking)
(9) Executes the _BFS (Back From Sleep) method
(10) Executes the _WAK method
(11) Enables the runtime GPEs
(12) Enables the power button
(13) Executes the _SST method (Working)
(14) Disables the wake-up capability of devices
(15) Resets the firmware waking vector (doesn't matter for hibernation)

Now, there seems to be nothing wrong with that *if* it's executed while
resuming from RAM, for example, but it doesn't seem suitable for the way we
use it in the resume-during-hibernation code path.

Consider a hibernation (aka suspend to disk) transition (ie. an operation in
which we snapshot the system memory, save the image and shut the system down).
Currently, we call acpi_pm_prepare(PM_SUSPEND_DISK) and run device_suspend(),
which seems to be against the ACPI spec in many ways.  The spec, as I
understand it, indicates that we should run device_suspend() first and then
execute the _PTS method.  We shouldn't, however, execute either _GTS or _SST
just yet.

Next, we suspend sysdevs etc., and create the memory snapshot.  We want
to be able to save it, so we call acpi_pm_finish(), which causes _BFS and _WAK
to be executed *after* _GTS, which is clearly against the spec (might this be
the reason why (7) is sometimes necessary?).  Moreover, calling _BFS at this
stage makes no sense, IMHO, since there hasn't been any transition (the system
has not slept).  What I think we should do at this point is to execute _WAK
only, which means "power transition aborted" to the firmware, and continue
with device_resume().

Next, we save the image and now we'd like to put the system to "sleep", so we
use acpi_pm_enter(PM_SUSPEND_DISK), but we shouldn't do that, since the power
transition has been aborted by _WAK in acpi_pm_finish()!  Thus we should start
the transition again, run device_suspend(), execute _PTS, do (2) and (3), save
the "other processor's context" etc., execute _SST(S4), execute _GTS and set
SLP_EN in PM1 etc.

When we restore the system state from a hibernation image, the "boot kernel" is
first started.  It loads the image into memory, calls
device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with
the "resumed kernel".  It doesn't call acpi_pm_prepare(), which I think is
right, because it doesn't want to start any power transition, not even a
fake one.  Now, the "resumed kernel" takes control, resumes sysdevs and calls
acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS
should be executed in that case (the ACPI spec seems to assume that the
hibernation image will be loaded into memory by a boot loader).

Concluding, it seems to me that the "restore" code path is correct, but the
"hibernate" code path is not and should be reworked.  Also, it seems that
acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case
as well (_PTS should be executed after device_suspend() and _GTS should
be executed in acpi_pm_enter(), right before the transition is completed).

Greetings,
Rafael

[1] Advanced Configuration and Power Interface Specification, Revision 3.0,
September 2, 2004

^ permalink raw reply	[flat|nested] 713+ messages in thread
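Two of the ordering constraints discussed in the message above (device_suspend() before _PTS, and SLP_EN immediately after _GTS) can be expressed as a small trace checker. This is a toy model only: the event names, the check_sleep_trace() function and the three example traces are invented for illustration, and the rules encode the reading of the ACPI spec given in the message rather than the spec text itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event names for the steps mentioned in the message. */
enum acpi_event {
	EV_DEVICE_SUSPEND,
	EV_DEVICE_RESUME,
	EV_PTS,
	EV_SAVE_CPUS,
	EV_SST,
	EV_GTS,
	EV_SAVE_IMAGE,
	EV_SLP_EN,
};

#define TRACE_LEN(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Returns 0 if the trace respects both rules,
 * 1 if _PTS runs while devices are not suspended (points a/b),
 * 2 if _GTS is not immediately followed by SLP_EN (point d).
 */
int check_sleep_trace(const enum acpi_event *t, size_t n)
{
	int devices_suspended = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (t[i]) {
		case EV_DEVICE_SUSPEND:
			devices_suspended = 1;
			break;
		case EV_DEVICE_RESUME:
			devices_suspended = 0;
			break;
		case EV_PTS:
			if (!devices_suspended)
				return 1;
			break;
		case EV_GTS:
			if (i + 1 >= n || t[i + 1] != EV_SLP_EN)
				return 2;
			break;
		default:
			break;
		}
	}
	return 0;
}

/* Roughly the current hibernate path as the message describes it. */
const enum acpi_event current_trace[] = {
	EV_PTS, EV_GTS, EV_SST, EV_DEVICE_SUSPEND,
	EV_DEVICE_RESUME, EV_SAVE_IMAGE, EV_SLP_EN,
};

/* Devices suspended first, but _GTS still far away from SLP_EN. */
const enum acpi_event gts_early_trace[] = {
	EV_DEVICE_SUSPEND, EV_PTS, EV_GTS, EV_SAVE_IMAGE, EV_SLP_EN,
};

/* The final transition as the message proposes it. */
const enum acpi_event proposed_trace[] = {
	EV_DEVICE_SUSPEND, EV_PTS, EV_SAVE_CPUS, EV_SST, EV_GTS, EV_SLP_EN,
};
```

Under these two rules the current ordering fails on the very first event, while the proposed final transition passes.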
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy))
  2007-05-02 13:42 ` Rafael J. Wysocki
@ 2007-05-02 14:11   ` Alexey Starikovskiy
  2007-05-02 19:26     ` ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) Rafael J. Wysocki
  2007-05-02 19:26     ` Rafael J. Wysocki
  0 siblings, 2 replies; 713+ messages in thread
From: Alexey Starikovskiy @ 2007-05-02 14:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

Rafael,

> Concluding, it seems to me that the "restore" code path is correct, but the
> "hibernate" code path is not and should be reworked.  Also, it seems that
> acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case
> as well (_PTS should be executed after device_suspend() and _GTS should
> be executed in acpi_pm_enter(), right before the transition is completed).

The current implementation is not fully up to spec, so I agree that we may
try to get it closer.

> When we restore the system state from a hibernation image, the "boot kernel" is
> first started.  It loads the image into memory, calls
> device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with
> the "resumed kernel".  It doesn't call acpi_pm_prepare(), which I think is
> right, because it doesn't want to start any power transition, not even a
> fake one.  Now, the "resumed kernel" takes control, resumes sysdevs and calls

Currently the call to prepare() is needed to stop ACPI devices from sending
GPEs to ACPI drivers.
If you remove it, Acer laptops will resume without the ACPI interrupt at all
(with all the problems that follow from that).

> acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS
> should be executed in that case (the ACPI spec seems to assume that the
> hibernation image will be loaded into memory by a boot loader).

> Next, we suspend sysdevs etc., and create the memory snapshot.  We want
> to be able to save it, so we call acpi_pm_finish(), which causes _BFS and _WAK
> to be executed *after* _GTS, which is clearly against the spec (might this be the
> reason why (7) is sometimes necessary?).  Moreover, calling _BFS at this stage
> makes no sense, IMHO, since there hasn't been any transition (the system has
> not slept).  What I think we should do at this point is to execute _WAK only,
> which means "power transition aborted" to the firmware, and continue with
> device_resume().

But I don't get your idea about executing _finish() between _prepare()
and _enter()...
_finish() is executed only if _prepare() fails, so we are rolling back,
or it is executed after we have loaded the image and transferred execution
to it, so again -- we are going from the _prepare() state to the running
state...

Regards,
Alex.

^ permalink raw reply	[flat|nested] 713+ messages in thread
* ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-02 14:11 ` Alexey Starikovskiy @ 2007-05-02 19:26 ` Rafael J. Wysocki 2007-05-02 19:26 ` Rafael J. Wysocki 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-02 19:26 UTC (permalink / raw) To: Alexey Starikovskiy Cc: Nigel Cunningham, ACPI Devel Maling List, Pekka Enberg, Pavel Machek, Johannes Berg, linux-pm Hi, [Added linux-acpi to the CC list, should be there from the start] On Wednesday, 2 May 2007 16:11, Alexey Starikovskiy wrote: > Rafael, > > > Concluding, it seems to me that the "restore" code path is correct, but the > > "hibernate" code path is not and should be reworked. Also, it seems that > > acpi_pm_prepare() and acpi_pm_enter() should be fixed for the s2ram case > > either (_PTS should be executed after device_suspend() and _GTS should > > be executed in acpi_pm_enter(), right before the transition is completed). > > Current implementation is not fully up-to spec, so we may try to get > it closer to, I agree. Okay. Since we're trying to separate the hibernation code from the suspend code anyway, we can use the opportunity to introduce some new callbacks for the hibernation and/or redefine the existing ones. The spec suggests that we need the following callbacks: (1) prepare() - called after device_suspend(), execute _PTS and disable GPEs (2) cancel() - called at any time after prepare() if there's an error, execute _WAK and enable run-time GPEs (3) enter() - called after the image has been saved, execute _GTS and do what's currently done in pm_enter() (4) finish() - called after the image has been restored, do what's currently done in pm_finish() [At least, the execution of _GTS in pm_prepare() seems to be dangerous at first sight.] We also might need a callback that will be run before device_suspend() to invoke some ACPI-related magic needed at that point, but I have no idea what it would have to do. 
:-) > > When we restore the system state from a hibernation image, the "boot kernel" is > > first started. It loads the image into memory, calls > > device_suspend(PMSG_PRETHAW), suspends sysdevs etc., and replaces itself with > > the "resumed kernel". It doesn't call acpi_pm_prepare(), which I think is > > right, because it doesn't want to start any power transition, not even a > > fake one. Now, the "resumed kernel" takes control, resumes sysdevs and calls > Currently call to prepare() is needed to stop ACPI devices to send > GPEs to ACPI drivers. Does it mean that we need to call pm_prepare() (or an equivalent function) before device_suspend()? If that's the case, then which part of pm_prepare() is essential here? > If you remove it, Acer laptops will resume without ACPI interrupt at > all (with all problems from it). A recent discussion on the LKML led to the conclusion that for the hibernation we shouldn't use .suspend() callbacks before snapshotting the system memory. Instead, we should use some other callbacks to quiesce devices, create the snapshot, reactivate devices, save the image and carry out the actual power transition after that. Would something like this be viable from the ACPI point of view? > > acpi_pm_finish(), which seems to be about OK, except that I'm not sure if _BFS > > should be executed in that case (the ACPI spec seems to assume that the > > hibernation image will be loaded into memory by a boot loader). > > > Next, we suspend sysdevs etc., and create the memory snapshot. We want > > to be able to save it, so we call acpi_pm_finish(), which causes _BFS and _WAK > > to be executed *after* _GTS, which is clearly against the spec (might this be the > > reason why (7) is sometimes necessary?). Moreover, calling _BFS at this stage > > makes no sense, IMHO, since there hasn't been any transition (the system has > > not slept).
What I think we should do at this point is to execute _WAK only, > > which means "power transition aborted" to the firmware, and continue with > > device_resume(). > > But I don't get your idea about executing _finish() between _prepare() > and _enter()... > _finish is executed only if _prepare() fails, so we are rolling back, Well, no. Please have a look at the code in kernel/power/disk.c. Should we remove it from the nonerror code paths? > or it is executed after we loaded the image and transferred execution > to it, so again -- we are going from _prepare() state to running > state... Currently that's not the case. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
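The four callbacks proposed in the message above can be pictured as a small callback table plus the order in which the hibernation core would invoke them. What follows is only a user-space sketch under stated assumptions: the structure name hib_platform_ops_sketch, the stub functions and the trace buffer are hypothetical and do not exist in the kernel, and the step of reactivating devices so that the image can be written is deliberately left out because the message does not say how it would fit in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback table, named after the list in the message above. */
struct hib_platform_ops_sketch {
	int  (*prepare)(void); /* after device_suspend(): _PTS, disable GPEs */
	void (*cancel)(void);  /* error after prepare(): _WAK, enable run-time GPEs */
	int  (*enter)(void);   /* after the image has been saved: _GTS, power down */
	void (*finish)(void);  /* after the image has been restored */
};

static char trace[160];

/* Append one step name to the trace, separated by single spaces. */
static void note(const char *step)
{
	size_t used = strlen(trace);

	snprintf(trace + used, sizeof(trace) - used, "%s%s", used ? " " : "", step);
}

/* Stubs that only record that they were called (no real ACPI work). */
static int  stub_prepare(void) { note("prepare"); return 0; }
static void stub_cancel(void)  { note("cancel"); }
static int  stub_enter(void)   { note("enter"); return 0; }
static void stub_finish(void)  { note("finish"); }

static const struct hib_platform_ops_sketch stub_ops = {
	.prepare = stub_prepare, .cancel = stub_cancel,
	.enter = stub_enter, .finish = stub_finish,
};

/* Hibernate path: save_fails != 0 simulates an error after prepare(). */
int hibernate_sketch(const struct hib_platform_ops_sketch *ops, int save_fails)
{
	trace[0] = '\0';
	note("device_suspend");
	if (ops->prepare() != 0) {
		note("device_resume");
		return -1;
	}
	note("snapshot");
	if (save_fails) {
		ops->cancel();          /* roll back instead of calling finish() */
		note("device_resume");
		return -1;
	}
	note("save_image");
	return ops->enter();
}

/* Restore path as seen by the resumed kernel. */
void restore_sketch(const struct hib_platform_ops_sketch *ops)
{
	trace[0] = '\0';
	note("image_restored");
	ops->finish();
	note("device_resume");
}
```

Calling hibernate_sketch(&stub_ops, 0) and then reading trace shows the non-error ordering; passing 1 as the second argument shows cancel() taking the place that acpi_pm_finish() occupies in the code being criticized.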
* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-02 19:26 ` Rafael J. Wysocki @ 2007-05-03 22:48 ` Pavel Machek 2007-05-03 22:48 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-03 22:48 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, ACPI Devel Maling List, Pekka Enberg, Johannes Berg, linux-pm Hi! Crazy idea... could we kill hibernate_ops-like struct, and just create a device for ACPI, using its suspend()/resume()/whatever callbacks to do the ACPI magic? > Okay. Since we're trying to separate the hibernation code from the > suspend code anyway, we can use the opportunity to introduce some new > callbacks for the hibernation and/or redefine the existing ones. > > The spec suggests that we need the following callbacks: > > (1) prepare() - called after device_suspend(), execute _PTS and > disable GPEs sysdev .suspend() method would do the trick. > (2) cancel() - called at any time after prepare() if there's an error, execute > _WAK and enable run-time GPEs sysdev .resume() should do the trick. > (3) enter() - called after the image has been saved, execute _GTS and do what's > currently done in pm_enter() This one is tricky. It is essentially powerdown_but_enter_S4_instead. I guess we can live with if()... as we need to special-case reboot in the same place. > (4) finish() - called after the image has been restored, do what's currently > done in pm_finish() platform (?) device .resume() method should work. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-03 22:48 ` Pavel Machek @ 2007-05-03 23:14 ` Rafael J. Wysocki 2007-05-03 23:14 ` Rafael J. Wysocki ` (2 subsequent siblings) 3 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-03 23:14 UTC (permalink / raw) To: Pavel Machek Cc: Nigel Cunningham, ACPI Devel Maling List, Pekka Enberg, Johannes Berg, linux-pm Hi, On Friday, 4 May 2007 00:48, Pavel Machek wrote: > Hi! > > Crazy idea... could we kill hibernate_ops-like struct, and just create > a device for ACPI, using its suspend()/resume()/whatever callbacks to > do the ACPI magic? Hmm, I didn't think about that. It seems to be viable at first sight. Still, I think we can first separate hibernation_ops from pm_ops, figure out what they should be and then try to replace them with a cleaner solution. > > Okay. Since we're trying to separate the hibernation code from the > > suspend code anyway, we can use the opportunity to introduce some new > > callbacks for the hibernation and/or redefine the existing ones. > > > > The spec suggests that we need the following callbacks: In fact, I should have added (0) start() - called before device_suspend(), execute _TTS(S4) and I'm not sure if the GPEs should be disabled here or in prepare() In principle this could be done as a device's .resume() call, but that would have to be the very first device registered (can we do that?). > > (1) prepare() - called after device_suspend(), execute _PTS and > > disable GPEs > > sysdev .suspend() method would do the trick. Yes. > > (2) cancel() - called at any time after prepare() if there's an error, execute > > _WAK and enable run-time GPEs > > sysdev .resume() should do the trick. But .resume() would be called unconditionally, so there should be a way of figuring out what to do - looks complicated. 
> > (3) enter() - called after the image has been saved, execute _GTS and do what's > > currently done in pm_enter() > > This one is tricky. It is essentially > powerdown_but_enter_S4_instead. I guess we can live with if()... as we > need to special-case reboot in the same place. Yes. > > (4) finish() - called after the image has been restored, do what's currently > > done in pm_finish() > > platform (?) device .resume() method should work. Hmm, perhaps. And we need one more (in fact this one should be called finish() and the previous one wake() or something like that): (5) finish() - called after device_resume(), but only after the image has been restored or in case of a hibernation error, execute _TTS(S0). It looks like this also should enable the GPEs (or the previous one; that's the information I'm looking for). Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
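The objection above, that a sysdev .resume() is called unconditionally and so must somehow work out whether it is acting as cancel() or as the post-restore step, can be illustrated with a small state flag. This is a hypothetical user-space sketch, not kernel code: the enum names and the three functions are invented for illustration, and how a restored kernel would actually learn that it is running from a restored image is outside the sketch.

```c
#include <assert.h>

/* Hypothetical phases the ACPI pseudo-device would have to track. */
enum hib_phase {
	HIB_IDLE,      /* no transition in progress */
	HIB_PREPARED,  /* .suspend() ran (_PTS executed), nothing restored */
	HIB_RESTORED   /* running again from a restored image */
};

/* What the unconditional .resume() decides to do. */
enum hib_action { ACT_NONE, ACT_CANCEL, ACT_FINISH };

static enum hib_phase phase = HIB_IDLE;

/* Stands in for the device's .suspend() method. */
void acpi_dev_suspend_sketch(void)
{
	phase = HIB_PREPARED;
}

/* Stands in for whatever tells the device that the image was restored. */
void mark_image_restored_sketch(void)
{
	phase = HIB_RESTORED;
}

/* Stands in for the device's .resume() method, called on every path. */
enum hib_action acpi_dev_resume_sketch(void)
{
	enum hib_action act = ACT_NONE;

	switch (phase) {
	case HIB_PREPARED:
		act = ACT_CANCEL;  /* would execute _WAK, enable run-time GPEs */
		break;
	case HIB_RESTORED:
		act = ACT_FINISH;  /* would do what pm_finish() does today */
		break;
	case HIB_IDLE:
		break;
	}
	phase = HIB_IDLE;
	return act;
}
```

The extra state is the "looks complicated" part: a dedicated cancel() callback needs no such flag, because the caller already knows which path it is on.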
* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-03 22:48 ` Pavel Machek 2007-05-03 23:14 ` Rafael J. Wysocki 2007-05-03 23:14 ` Rafael J. Wysocki @ 2007-05-04 10:54 ` Johannes Berg 2007-05-04 12:08 ` Pavel Machek 2007-05-04 12:08 ` Pavel Machek 2007-05-04 10:54 ` Johannes Berg 3 siblings, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-05-04 10:54 UTC (permalink / raw) To: Pavel Machek Cc: Rafael J. Wysocki, Alexey Starikovskiy, linux-pm, Pekka Enberg, Nigel Cunningham, ACPI Devel Maling List [-- Attachment #1: Type: text/plain, Size: 405 bytes --] On Fri, 2007-05-04 at 00:48 +0200, Pavel Machek wrote: > Crazy idea... could we kill hibernate_ops-like struct, and just create > a device for ACPI, using its suspend()/resume()/whatever callbacks to > do the ACPI magic? Doesn't that have the ordering problem again? You must ensure that this sysdev is suspended as the last one, and that's currently impossible if ACPI is modular. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-04 10:54 ` Johannes Berg @ 2007-05-04 12:08 ` Pavel Machek 2007-05-04 12:08 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-04 12:08 UTC (permalink / raw) To: Johannes Berg Cc: Nigel Cunningham, ACPI Devel Maling List, Pekka Enberg, linux-pm Hi! > > Crazy idea... could we kill hibernate_ops-like struct, and just create > > a device for ACPI, using its suspend()/resume()/whatever callbacks to > > do the ACPI magic? > > Doesn't that have the ordering problem again? You must ensure that this > sysdev is suspended as the last one, and that's currently impossible if > ACPI is modular. I do not think acpi has these kinds of ordering requirements... (And I do not see what it has to do with module or not). -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: ACPI code in platform mode hibernation code paths (was: Re: [PATCH] swsusp: do not use pm_ops) 2007-05-04 12:08 ` Pavel Machek @ 2007-05-04 12:29 ` Rafael J. Wysocki 2007-05-04 12:29 ` Rafael J. Wysocki 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 12:29 UTC (permalink / raw) To: Pavel Machek Cc: Johannes Berg, Alexey Starikovskiy, linux-pm, Pekka Enberg, Nigel Cunningham, ACPI Devel Maling List Hi, On Friday, 4 May 2007 14:08, Pavel Machek wrote: > Hi! > > > > Crazy idea... could we kill hibernate_ops-like struct, and just create > > > a device for ACPI, using its suspend()/resume()/whatever callbacks to > > > do the ACPI magic? > > > > Doesn't that have the ordering problem again? You must ensure that this > > sysdev is suspended as the last one, and that's currently impossible if > > ACPI is modular. > > I do not think acpi has these kinds of ordering requirements... (And I > do not see what it has to do with module or not). Theoretically, ACPI has some ordering requirements. For example, according to the spec, the _PTS system-control method should be executed *after* devices are placed in the appropriate Dx states, which (theoretically) requires us to execute it after device_suspend() (we don't do this in practice, but I think we should). There are some more ordering assumptions like this in the spec and I think we should at least try to follow them or, if that breaks things, document why we don't. That's why I think we should try to do what's needed using hibernation_ops (perhaps we'll need to add a couple of callbacks to hibernation_ops) and then try to replace hibernation_ops with another mechanism allowing us to do the same. We first need to determine which operations have to be carried out at what points so that things don't break. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-01 22:02 ` Rafael J. Wysocki 2007-05-02 5:13 ` Alexey Starikovskiy @ 2007-05-02 8:21 ` Johannes Berg 2007-05-02 9:02 ` Rafael J. Wysocki 2007-05-02 9:16 ` Pavel Machek 1 sibling, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-05-02 8:21 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek [-- Attachment #1.1: Type: text/plain, Size: 884 bytes --] On Wed, 2007-05-02 at 00:02 +0200, Rafael J. Wysocki wrote: > Well, having a look on the ACPI spec I'm thinking that what we're trying to do > with this patch is actually wrong. No idea :) > Instead, we should rip off all of the invocations of pm_ops->whatever() from > the hibernation code paths (with the below exceptions) and *if* the platform > method is to be used, call pm_ops to make the system go down, in the following > way: > 1) call pm_ops->prepare(PM_SUSPEND_DISK) > 2) suspend devices (ie. call device_suspend() etc.) > 3) call pm_ops->enter(PM_SUSPEND_DISK) > and if that *fails* (ie. pm_ops->enter() returns): > 4) call pm_ops->finish(PM_SUSPEND_DISK) > 5) halt the system Can we still split that off to another method so we don't use pm_ops? No matter how we invoke hibernation_ops or in what order, imho we shouldn't use pm_ops. johannes [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-02 8:21 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg @ 2007-05-02 9:02 ` Rafael J. Wysocki 2007-05-02 9:16 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-02 9:02 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Wednesday, 2 May 2007 10:21, Johannes Berg wrote: > On Wed, 2007-05-02 at 00:02 +0200, Rafael J. Wysocki wrote: > > > Well, having a look on the ACPI spec I'm thinking that what we're trying to do > > with this patch is actually wrong. > > No idea :) > > > Instead, we should rip off all of the invocations of pm_ops->whatever() from > > the hibernation code paths (with the below exceptions) and *if* the platform > > method is to be used, call pm_ops to make the system go down, in the following > > way: > > 1) call pm_ops->prepare(PM_SUSPEND_DISK) > > 2) suspend devices (ie. call device_suspend() etc.) > > 3) call pm_ops->enter(PM_SUSPEND_DISK) > > and if that *fails* (ie. pm_ops->enter() returns): > > 4) call pm_ops->finish(PM_SUSPEND_DISK) > > 5) halt the system > > Can we still split that off to another method so we don't use pm_ops? No > matter how we invoke hibernation_ops or in what order, imho we shouldn't > use pm_ops. OK, I think we can go ahead with the patch if nobody objects. It's been tested to some extent and seems to work. More testing will be appreciated. Later on we can do what I said above using hibernation_ops instead of pm_ops, if it turns out to really make sense. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
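The five-step power-down sequence quoted above differs from the callback proposal elsewhere in the thread in that prepare() runs before device_suspend() here. A minimal user-space sketch of the quoted ordering follows; the function names and the step log are hypothetical, and the real pm_ops callbacks are represented only by strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static char steps[160];

/* Append one step name to the log, separated by single spaces. */
static void step(const char *name)
{
	size_t used = strlen(steps);

	snprintf(steps + used, sizeof(steps) - used, "%s%s", used ? " " : "", name);
}

/*
 * enter_returns != 0 simulates pm_ops->enter(PM_SUSPEND_DISK) coming back,
 * which the quoted text treats as a failed transition.
 */
void platform_power_down_sketch(int enter_returns)
{
	steps[0] = '\0';
	step("prepare");         /* 1) pm_ops->prepare(PM_SUSPEND_DISK) */
	step("device_suspend");  /* 2) suspend devices */
	step("enter");           /* 3) pm_ops->enter(PM_SUSPEND_DISK) */
	if (!enter_returns)
		return;          /* machine is down, nothing else runs */
	step("finish");          /* 4) pm_ops->finish(PM_SUSPEND_DISK) */
	step("halt");            /* 5) halt the system */
}
```

Steps 4 and 5 only ever appear in the log when enter() returns, which is the point of the quoted proposal: finish() becomes part of the failure path instead of the normal one.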
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-02 8:21 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg 2007-05-02 9:02 ` Rafael J. Wysocki @ 2007-05-02 9:16 ` Pavel Machek 2007-05-02 9:25 ` Johannes Berg 2007-05-02 13:43 ` Rafael J. Wysocki 1 sibling, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-02 9:16 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham Hi! > > Well, having a look on the ACPI spec I'm thinking that what we're trying to do > > with this patch is actually wrong. > > No idea :) > > > Instead, we should rip off all of the invocations of pm_ops->whatever() from > > the hibernation code paths (with the below exceptions) and *if* the platform > > method is to be used, call pm_ops to make the system go down, in the following > > way: > > 1) call pm_ops->prepare(PM_SUSPEND_DISK) > > 2) suspend devices (ie. call device_suspend() etc.) > > 3) call pm_ops->enter(PM_SUSPEND_DISK) > > and if that *fails* (ie. pm_ops->enter() returns): > > 4) call pm_ops->finish(PM_SUSPEND_DISK) > > 5) halt the system > > Can we still split that off to another method so we don't use pm_ops? No > matter how we invoke hibernation_ops or in what order, imho we shouldn't > use pm_ops. Well... the powerdown during hibernation... does not have _anything_ to do with snapshot/restore. It is really a very deep sleep; similar to soft powerdown, but not quite. So this usage of pm_ops seems ok. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-02 9:16 ` Pavel Machek @ 2007-05-02 9:25 ` Johannes Berg 2007-05-03 14:00 ` Alan Stern 2007-05-02 13:43 ` Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-02 9:25 UTC (permalink / raw) To: Pavel Machek; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham [-- Attachment #1.1: Type: text/plain, Size: 657 bytes --] On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote: > Well... the powerdown during hibernation... does not have _anything_ > to do with snapshot/restore. It is really a very deep sleep; similar > to soft powerdown, but not quite. It's also horribly confusing to intermingle hibernation and suspend into one operation struct when there's only a single user for it anyway. Just look at what all the arm platforms had there, trying to veto suspend to disk through pm_ops etc. I don't technically disagree with you, but from a point of how to understand this whole thing I'd rather have hibernate and suspend be totally orthogonal. johannes [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-02 9:25 ` Johannes Berg @ 2007-05-03 14:00 ` Alan Stern 2007-05-03 17:17 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Alan Stern @ 2007-05-03 14:00 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Wed, 2 May 2007, Johannes Berg wrote: > On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote: > > > Well... the powerdown during hibernation... does not have _anything_ > > to do with snapshot/restore. It is really a very deep sleep; similar > > to soft powerdown, but not quite. Is this really a good idea? For that matter, what are the differences among the various sorts of poweroff? Which devices remain minimally powered for wakeup purposes? Anything else? In fact, shouldn't the poweroff at the end of a hibernate be exactly the same as a normal non-hibernate poweroff? Aren't drivers required to assume (during the processing after the snapshot has been restored) that power could have been lost and devices might need to be completely reinitialized? We are letting ourselves in for problems if we say that when the snapshot is restored, devices may or may not need to be reinitialized. Drivers might not be able to tell which, so they would have to reinitialize regardless, losing any advantage. Even worse, the device may _appear_ not to need reinitialization because the firmware (BIOS) has already initialized it but left it in a state that's useless for the kernel's purposes. (That's part of the reason why PRETHAW was added.) If the only remaining difference between poweroff for hibernate and normal poweroff is which wakeup devices will function, then it seems pointless. Why shouldn't the same devices work for wakeup from hibernate and wakeup from normal poweroff? Or have I misunderstood something and is this all nonsense? 
Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 14:00 ` Alan Stern @ 2007-05-03 17:17 ` Rafael J. Wysocki 2007-05-03 18:33 ` Alan Stern 2007-05-03 20:33 ` David Brownell 2007-05-03 22:18 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek 2 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-03 17:17 UTC (permalink / raw) To: Alan Stern Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thursday, 3 May 2007 16:00, Alan Stern wrote: > On Wed, 2 May 2007, Johannes Berg wrote: > > > On Wed, 2007-05-02 at 11:16 +0200, Pavel Machek wrote: > > > > > Well... the powerdown during hibernation... does not have _anything_ > > > to do with snapshot/restore. It is really a very deep sleep; similar > > > to soft powerdown, but not quite. > > Is this really a good idea? > > For that matter, what are the differences among the various sorts of > poweroff? > > Which devices remain minimally powered for wakeup purposes? > > Anything else? > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > same as a normal non-hibernate poweroff? Not quite (see (*) below). > Aren't drivers required to assume (during the processing after the snapshot > has been restored) that power could have been lost and devices might need to > be completely reinitialized? > > We are letting ourselves in for problems if we say that when the snapshot > is restored, devices may or may not need to be reinitialized. Agreed. > Drivers might not be able to tell which, so they would have to reinitialize > regardless, losing any advantage. Even worse, the device may _appear_ not > to need reinitialization because the firmware (BIOS) has already > initialized it but left it in a state that's useless for the kernel's > purposes. (That's part of the reason why PRETHAW was added.) Yes. 
> If the only remaining difference between poweroff for hibernate and normal > poweroff is which wakeup devices will function, then it seems pointless. No, this is not the only difference (*). > Why shouldn't the same devices work for wakeup from hibernate and wakeup > from normal poweroff? > > Or have I misunderstood something and is this all nonsense? The problem, generally speaking, is that we have to prepare devices for waking up the system. On an ACPI system this is done, among other things, by executing the devices' _PSW control methods after the system-level _PTS method has run. For this purpose the devices must be in (low) power states from which the wake is possible, so in particular they must not be powered off. Later, by making the platform enter the suspend-to-disk (ACPI S4) state we prevent it from powering off the wake-up devices, among other things. That's why I'm thinking that it might be a good idea to do a suspend-before-poweroff, but it doesn't mean that device drivers would be allowed to make any assumptions regarding the state of the device after the resume. IMO, if this is a resume from disk, devices should be initialized from scratch. (*) Another issue is that, for example, on my notebook the status of the AC power supply (and sometimes of the battery too) is not reported correctly by the platform after the resume if the suspend-to-disk (ACPI S4) state has not been entered during the hibernation. I don't understand why this happens, but I'm going to find out. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 17:17 ` Rafael J. Wysocki @ 2007-05-03 18:33 ` Alan Stern 2007-05-03 19:47 ` Rafael J. Wysocki 2007-05-03 20:33 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) David Brownell 0 siblings, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-03 18:33 UTC (permalink / raw) To: Rafael J. Wysocki Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > > same as a normal non-hibernate poweroff? > > Not quite (see (*) below). > The problem, generally speaking, is that we have to prepare devices for waking > up the system. On an ACPI system this is done, among other things, by > executing the devices' _PSW control methods after the system-level _PTS method > has run. For this purpose the devices must be in (low) power states from which > the wake is possible, so in particular they must not be powered off. Later, by > making the platform enter the suspend-to-disk (ACPI S4) state we prevent it > from powering off the wake-up devices, among other things. > > That's why I'm thinking that it might be a good idea to do a > suspend-before-poweroff, but it doesn't mean that device drivers would be > allowed to make any assumptions regarding the state of the device after the > resume. IMO, if this is a resume from disk, devices should be initialized from > scratch. I generally agree with your last sentence, but with one reservation (see below). As for the rest, you missed my point. Granted that all these special activities are required on ACPI systems in order to support proper operation of wakeup devices -- Why shouldn't these same steps also be followed during a normal poweroff? 
There really are two orthogonal issues here: (1) Is this a "hibernate" poweroff (as opposed to a "normal" poweroff)? (2) Should some devices remain minimally powered and be capable of waking up the system? I don't see any necessary relation between the answers to (1) and (2). In particular, I don't see why a Yes answer to (1) should imply a Yes answer to (2). This suggests that the poweroff methods be completely independent of hibernation_ops (or whatever you are now calling it). Perhaps there should be a separate sysfs attribute controlling whether or not wakeup is enabled. If it is then poweroff should go through all the ACPI (or the platform's equivalent) hoops, otherwise everything should just be turned off completely. Regardless of whether the poweroff is part of a hibernate sequence. > (*) Another issue is that, for example, on my notebook the status of the AC > power supply (and sometimes of the battery too) is not reported correctly by > the platform after the resume if the suspend-to-disk (ACPI S4) state has not > been entered during the hibernation. I don't understand why this happens, but > I'm going to find out. Hopefully it's not directly related to the matter under discussion. :-) There remains one issue associated with always reinitializing devices during resume from hibernation. In the one area I know a lot about (USB) this actually does matter, at least a little. The USB specs include the notion of a "power session", which is essentially an uninterrupted continuous connection between the host and the device. As long as a power session exists, the host is guaranteed that the device has not been unplugged or replaced with a different device. On most systems, hibernation breaks power sessions. When the system wakes back up it sees a bunch of USB devices connected, but it is not allowed (by the spec!) to assume that these are the same devices as were attached before. In fact, some of them might not be. 
Mostly this doesn't make any difference, but for mass-storage it does. Memory mappings and filesystem mounts will be disrupted when the underlying logical device goes away, even if the same physical device is still attached to the same port. This has caused significant headaches for USB users in the past. On the other hand, some systems are designed cleverly enough to maintain power sessions across hibernation. Not many -- the only ones I've heard about were all PPC Macs. The USB drivers have always tried to keep power sessions intact across hibernation whenever the hardware and firmware would permit, but of course reinitializing the USB controller would destroy them. There are a couple of reasons for not worrying about this very much. First, as mentioned before this issue exists on only a small number of systems. Second, I have submitted to Greg KH a couple of patches to maintain persistence of USB devices even when the power sessions are lost (they're still in his queue so you can't try them out yet). This feature violates the USB spec and it is potentially dangerous -- users could easily lose data for example by changing the card in a USB card reader while the system is hibernating -- so it is a non-default Kconfig option. Nevertheless, it does solve the problem. In the end, this is a long-winded way of saying that always reinitializing devices while resuming from a hibernation is probably the best overall approach, even if it may not be optimal in a few cases. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 18:33 ` Alan Stern @ 2007-05-03 19:47 ` Rafael J. Wysocki 2007-05-03 19:59 ` Alan Stern 2007-05-03 20:33 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-03 19:47 UTC (permalink / raw) To: Alan Stern Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thursday, 3 May 2007 20:33, Alan Stern wrote: > On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > > > same as a normal non-hibernate poweroff? > > > > Not quite (see (*) below). > > > The problem, generally speaking, is that we have to prepare devices for waking > > up the system. On an ACPI system this is done, among other things, by > > executing the devices' _PSW control methods after the system-level _PTS method > > has run. For this purpose the devices must be in (low) power states from which > > the wake is possible, so in particular they must not be powered off. Later, by > > making the platform enter the suspend-to-disk (ACPI S4) state we prevent it > > from powering off the wake-up devices, among other things. The last sentence in the above paragraph is not actually true, sorry for the confusion. > > That's why I'm thinking that it might be a good idea to do a > > suspend-before-poweroff, but it doesn't mean that device drivers would be > > allowed to make any assumptions regarding the state of the device after the > > resume. IMO, if this is a resume from disk, devices should be initialized from > > scratch. > > I generally agree with your last sentence, but with one reservation (see > below). > > As for the rest, you missed my point.
Granted that all these special > activities are required on ACPI systems in order to support proper > operation of wakeup devices -- Why shouldn't these same steps also be > followed during a normal poweroff? > > There really are two orthogonal issues here: > > (1) Is this a "hibernate" poweroff (as opposed to a "normal" > poweroff)? > > (2) Should some devices remain minimally powered and be capable > of waking up the system? > > I don't see any necessary relation between the answers to (1) and (2). In > particular, I don't see why a Yes answer to (1) should imply a Yes answer > to (2). > > This suggests that the poweroff methods be completely independent of > hibernation_ops (or whatever you are now calling it). Perhaps there > should be a separate sysfs attribute controlling whether or not wakeup is > enabled. If it is then poweroff should go through all the ACPI (or the > platform's equivalent) hoops, otherwise everything should just be turned > off completely. Regardless of whether the poweroff is part of a > hibernate sequence. Well, after reviewing the code once again I see that we already do it this way on ACPI systems, since the 'normal' power off is done by entering the ACPI S5 state. Moreover, there shouldn't be any difference between ACPI S4 and 'power off' with respect to the wake-up devices, so you're absolutely right. It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4) to finish the hibernation in order to avoid problems like (*) and for this purpose we need to use hibernation_ops earlier during the hibernation. > > (*) Another issue is that, for example, on my notebook the status of the AC > > power supply (and sometimes of the battery too) is not reported correctly by > > the platform after the resume if the suspend-to-disk (ACPI S4) state has not > > been entered during the hibernation. I don't understand why this happens, but > > I'm going to find out. > > Hopefully it's not directly related to the matter under discussion. 
:-) > > > There remains one issue associated with always reinitializing devices > during resume from hibernation. In the one area I know a lot about (USB) > this actually does matter, at least a little. > > The USB specs include the notion of a "power session", which is > essentially an uninterrupted continuous connection between the host and > the device. As long as a power session exists, the host is guaranteed > that the device has not been unplugged or replaced with a different > device. > > On most systems, hibernation breaks power sessions. When the system wakes > back up it sees a bunch of USB devices connected, but it is not allowed > (by the spec!) to assume that these are the same devices as were attached > before. In fact, some of them might not be. > > Mostly this doesn't make any difference, but for mass-storage it does. > Memory mappings and filesystem mounts will be disrupted when the > underlying logical device goes away, even if the same physical device is > still attached to the same port. This has caused significant headaches > for USB users in the past. > > On the other hand, some systems are designed cleverly enough to maintain > power sessions across hibernation. Not many -- the only ones I've heard > about were all PPC Macs. The USB drivers have always tried to keep power > sessions intact across hibernation whenever the hardware and firmware > would permit, but of course reinitializing the USB controller would > destroy them. That seems to be one of the really rare cases in which a device driver can actually make sure that the device is in a certain state after the hibernation on the basis of information provided by the device itself, so it doesn't need to make any assumptions. In such cases it might be possible not to reinitialize the device, but that would have to be handled with much care. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 19:47 ` Rafael J. Wysocki @ 2007-05-03 19:59 ` Alan Stern 2007-05-03 20:21 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-03 19:59 UTC (permalink / raw) To: Rafael J. Wysocki Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > This suggests that the poweroff methods be completely independent of > > hibernation_ops (or whatever you are now calling it). Perhaps there > > should be a separate sysfs attribute controlling whether or not wakeup is > > enabled. If it is then poweroff should go through all the ACPI (or the > > platform's equivalent) hoops, otherwise everything should just be turned > > off completely. Regardless of whether the poweroff is part of a > > hibernate sequence. > > Well, after reviewing the code once again I see that we already do it this > way on ACPI systems, since the 'normal' power off is done by entering the > ACPI S5 state. Moreover, there shouldn't be any difference between > ACPI S4 and 'power off' with respect to the wake-up devices, so you're > absolutely right. > > It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4) > to finish the hibernation in order to avoid problems like (*) and for this purpose > we need to use hibernation_ops earlier during the hibernation. But why shouldn't a "normal" poweroff enter ACPI S4? And why shouldn't a "hibernate" poweroff enter ACPI S5? The choice of which state to enter is independent of the reason for shutting down, right? In other words, the choice for whether or not to call acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not you're hibernating. So it shouldn't affect the usage of hibernation_ops at all. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 19:59 ` Alan Stern @ 2007-05-03 20:21 ` Rafael J. Wysocki 2007-05-04 14:40 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-03 20:21 UTC (permalink / raw) To: Alan Stern Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thursday, 3 May 2007 21:59, Alan Stern wrote: > On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > > > This suggests that the poweroff methods be completely independent of > > > hibernation_ops (or whatever you are now calling it). Perhaps there > > > should be a separate sysfs attribute controlling whether or not wakeup is > > > enabled. If it is then poweroff should go through all the ACPI (or the > > > platform's equivalent) hoops, otherwise everything should just be turned > > > off completely. Regardless of whether the poweroff is part of a > > > hibernate sequence. > > > > Well, after reviewing the code once again I see that we already do it this > > way on ACPI systems, since the 'normal' power off is done by entering the > > ACPI S5 state. Moreover, there shouldn't be any difference between > > ACPI S4 and 'power off' with respect to the wake-up devices, so you're > > absolutely right. > > > > It seems, though, that we need to do acpi_enter_sleep_state(ACPI_STATE_S4) > > to finish the hibernation in order to avoid problems like (*) and for this purpose > > we need to use hibernation_ops earlier during the hibernation. > > But why shouldn't a "normal" poweroff enter ACPI S4? And why shouldn't a > "hibernate" poweroff enter ACPI S5? The choice of which state to enter is > independent of the reason for shutting down, right? Well, not exactly. > In other words, the choice for whether or not to call > acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not > you're hibernating. 
So it shouldn't affect the usage of hibernation_ops > at all. This works the other way around, I think. :-) Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4) as a 'power off method' so that they work correctly after the 'return' from hibernation. If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might not work on them (this is an experimental observation, I don't know what exactly the reason of it is). Now, since I have such a box, I need to do the acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power off method). *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used without previous preparations, which are made with the help of hibernation_ops. IOW, all hibernation_ops, including the ->enter() method that actually calls acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one (complicated) 'platform' power off method. It doesn't make sense to use the (other) hibernation_ops without the ->enter() method. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
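[ Editorial sketch, not part of the original message. Rafael's argument is that the 'platform' power off method is a bundle of callbacks whose ->enter() step only makes sense after the preparation steps, while the 'shutdown' method (S5) needs no preparation. The userspace model below illustrates just that dependency. Every name in it (fake_platform, poweroff_method, fake_prepare, fake_enter_s4, fake_enter_s5, do_poweroff) is hypothetical, and the -1 return from an unprepared S4 entry is an assumption made for illustration; it is not kernel code and does not describe what real firmware does in that case. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for platform/firmware state; not a kernel structure. */
struct fake_platform {
    bool prepared;      /* set once the preparation step has run */
    int entered_state;  /* 0 = still running, 4 = S4, 5 = S5 */
};

/* Models the preparation that must precede entering S4. */
int fake_prepare(struct fake_platform *p)
{
    p->prepared = true;
    return 0;
}

/* Assumption for this sketch: entering S4 without preparation is refused. */
int fake_enter_s4(struct fake_platform *p)
{
    if (!p->prepared)
        return -1;
    p->entered_state = 4;
    return 0;
}

/* The 'shutdown' method: S5 needs no preparation in this model. */
int fake_enter_s5(struct fake_platform *p)
{
    p->entered_state = 5;
    return 0;
}

/* A power off method is a bundle: optional prepare, mandatory enter. */
struct poweroff_method {
    int (*prepare)(struct fake_platform *);
    int (*enter)(struct fake_platform *);
};

const struct poweroff_method platform_method = { fake_prepare, fake_enter_s4 };
const struct poweroff_method shutdown_method = { NULL, fake_enter_s5 };

/* Runs the whole method; prepare and enter are never used separately. */
int do_poweroff(const struct poweroff_method *m, struct fake_platform *p)
{
    int err;

    assert(m->enter != NULL);
    if (m->prepare) {
        err = m->prepare(p);
        if (err)
            return err;
    }
    return m->enter(p);
}
```

Calling do_poweroff(&platform_method, &p) runs both pieces in order, whereas calling fake_enter_s4() directly on a fresh fake_platform is rejected, which is the sense in which the callbacks are "different pieces of one method".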
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 20:21 ` Rafael J. Wysocki @ 2007-05-04 14:40 ` Alan Stern 2007-05-04 20:20 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Alan Stern @ 2007-05-04 14:40 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg Rafael, David, and Pavel: You all misunderstood the point I was trying to make. On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > But why shouldn't a "normal" poweroff enter ACPI S4? And why shouldn't a > > "hibernate" poweroff enter ACPI S5? The choice of which state to enter is > > independent of the reason for shutting down, right? > > Well, not exactly. > > > In other words, the choice for whether or not to call > > acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not > > you're hibernating. So it shouldn't affect the usage of hibernation_ops > > at all. > > This works the other way around, I think. :-) > > Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4) > as a 'power off method' so that they work correctly after the 'return' from hibernation. > If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might > not work on them (this is an experimental observation, I don't know what > exactly the reason of it is). > > Now, since I have such a box, I need to do the > acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off > method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power > off method). *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used > without previous preparations, which are made with the help of hibernation_ops. 
> > IOW, all hibernation_ops, including the ->enter() method that actually calls > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one > (complicated) 'platform' power off method. It doesn't make sense to use the > (other) hibernation_ops without the ->enter() method. Let's look at the big picture. Entering hibernation basically involves these steps: 1. Freeze tasks 2. Quiesce devices and drivers 3. Create snapshot 4. Reactivate devices and drivers 5. Save snapshot to disk 6. Prepare devices for wakeup 7. Power down (ACPI S4 on systems which support it) Leaving hibernation involves a similar sequence which I won't discuss. Notice that steps 1-5 above are _completely_ independent of all issues concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be carried out for hibernation to work, no matter how the system ends up getting shut down. On the other hand, steps 6 and 7 aren't really needed for hibernation. You _could_ shut the system off completely (ACPI S5). Automatic wakeup wouldn't work, but the next time the user turned the computer on manually it would still resume from hibernation. Conversely, steps 6 and 7 can make sense even in situations where you don't want to hibernate. For example, you might want a normal shutdown in which the operating system does a full restart when the firmware is signalled by a wakeup device. So there should be separate data structures associated with 1-5 and 6-7. Maybe the one associated with 6-7 is what you are calling hibernation_ops; if so then fine. But I still think that it should be usable for situations where you are not entering hibernation, and it should be possible to enter hibernation without using it. The system administrator should be able to choose which of S4 or S5 gets used for _any_ poweroff, regardless of whether it's to start hibernating. The ACPI spec might refer to S4 as "hibernation" (does it?
-- I'm too lazy to check and see), but that doesn't mean we have to use the terms synonymously. Does this make sense, or am I missing something very basic? Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
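[ Editorial sketch, not part of the original message. Alan's proposal treats three choices as independent: whether a snapshot is taken (steps 1-5), whether wakeup devices are armed (step 6), and which ACPI state ends the sequence (step 7). The userspace model below only builds the ordered step list from those three switches. All identifiers (shutdown_plan, plan_steps, the STEP_* and POWEROFF_* constants) are hypothetical and exist nowhere in the kernel, and the model ignores Rafael's observation that some machines misbehave after resume unless S4 is used. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step identifiers mirroring the numbered list above. */
enum step {
    STEP_FREEZE_TASKS,        /* 1 */
    STEP_QUIESCE_DEVICES,     /* 2 */
    STEP_CREATE_SNAPSHOT,     /* 3 */
    STEP_REACTIVATE_DEVICES,  /* 4 */
    STEP_SAVE_SNAPSHOT,       /* 5 */
    STEP_PREPARE_WAKEUP,      /* 6 */
    STEP_ENTER_S4,            /* 7, platform sleep state */
    STEP_ENTER_S5             /* 7, plain soft-off */
};

enum poweroff_target { POWEROFF_S4, POWEROFF_S5 };

/* Three independent switches, as argued in the message. */
struct shutdown_plan {
    bool hibernate;               /* run steps 1-5? */
    bool wakeup_enabled;          /* run step 6? */
    enum poweroff_target target;  /* which state step 7 enters */
};

#define MAX_STEPS 8

/* Writes the ordered steps into out[] and returns how many were written. */
size_t plan_steps(const struct shutdown_plan *plan, enum step out[MAX_STEPS])
{
    size_t n = 0;

    if (plan->hibernate) {
        out[n++] = STEP_FREEZE_TASKS;
        out[n++] = STEP_QUIESCE_DEVICES;
        out[n++] = STEP_CREATE_SNAPSHOT;
        out[n++] = STEP_REACTIVATE_DEVICES;
        out[n++] = STEP_SAVE_SNAPSHOT;
    }
    if (plan->wakeup_enabled)
        out[n++] = STEP_PREPARE_WAKEUP;
    out[n++] = (plan->target == POWEROFF_S4) ? STEP_ENTER_S4 : STEP_ENTER_S5;

    assert(n <= MAX_STEPS);
    return n;
}
```

With this split, "hibernate and then go to S5 with no wakeup" and "ordinary shutdown into S4 with wakeup armed" are both expressible, which are the two combinations the message argues should be allowed.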
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:40 ` Alan Stern @ 2007-05-04 20:20 ` Rafael J. Wysocki 2007-05-04 20:21 ` Johannes Berg 2007-05-04 20:58 ` Pavel Machek 2007-05-04 21:40 ` David Brownell 2 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 20:20 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Friday, 4 May 2007 16:40, Alan Stern wrote: > Rafael, David, and Pavel: > > You all misunderstood the point I was trying to make. > > On Thu, 3 May 2007, Rafael J. Wysocki wrote: > > > > But why shouldn't a "normal" poweroff enter ACPI S4? And why shouldn't a > > > "hibernate" poweroff enter ACPI S5? The choice of which state to enter is > > > independent of the reason for shutting down, right? > > > > Well, not exactly. > > > > > In other words, the choice for whether or not to call > > > acpi_enter_sleep_state(ACPI_STATE_S4) shouldn't depend on whether or not > > > you're hibernating. So it shouldn't affect the usage of hibernation_ops > > > at all. > > > > This works the other way around, I think. :-) > > > > Granted, some boxes require us to call acpi_enter_sleep_state(ACPI_STATE_S4) > > as a 'power off method' so that they work correctly after the 'return' from hibernation. > > If we do acpi_enter_sleep_state(ACPI_STATE_S5) instead, some things might > > not work on them (this is an experimental observation, I don't know what > > exactly the reason of it is). > > > > Now, since I have such a box, I need to do the > > acpi_enter_sleep_state(ACPI_STATE_S4) thing (IOW, use the 'platform' power off > > method) and not acpi_enter_sleep_state(ACPI_STATE_S5) (the 'shutdown' power > > off method). *However*, acpi_enter_sleep_state(ACPI_STATE_S4) cannot be used > > without previous preparations, which are made with the help of hibernation_ops. 
> > > > IOW, all hibernation_ops, including the ->enter() method that actually calls > > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one > > (complicated) 'platform' power off method. It doesn't make sense to use the > > (other) hibernation_ops without the ->enter() method. > > Let's look at the big picture. > > Entering hibernation basically involves these steps: > > 1. Freeze tasks > > 2. Quiesce devices and drivers > > 3. Create snapshot > > 4. Reactivate devices and drivers > > 5. Save snapshot to disk > > 6. Prepare devices for wakeup > > 7. Power down (ACPI S4 on systems which support it) > > Leaving hibernation involves a similar sequence which I won't discuss. > > Notice that steps 1-5 above are _completely_ independent of all issues > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > carried out for hibernation to work, no matter how the system ends up > getting shut down. > > On the other hand, steps 6 and 7 aren't really needed for hibernation. > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > wouldn't work, but the next time the user turned the computer on manually > it would still resume from hibernation. That's correct, with the exception that the user may find the system not fully functional after the resume in that case. > Conversely, steps 6 and 7 can make sense even in situations where you > don't want to hibernate. For example, you might want a normal shutdown in > which the operating system does a full restart when the firmware is > signalled by a wakeup device. > > So there should be separate data structures associated with 1-5 and 6-7. > Maybe the one associated with 6-7 is what you are calling hibernation_ops; > if so then fine. But I still think that it should be usable for > situations where you are not entering hibernation, and it should be > possible to enter hibernation without using it.
The system administrator > should be able to choose which of S4 or S5 gets used for _any_ poweroff, > regardless of whether it's to start hibernating. Yes, this should be doable. > The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy > to check and see), Not directly. The word "hibernation" is never used in the ACPI specification (as of ACPI 2.0). > but that doesn't mean we have to use the terms synonymously. Agreed. > Does this make sense, or am I missing something very basic? Hmm, I think it makes sense. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:20 ` Rafael J. Wysocki @ 2007-05-04 20:21 ` Johannes Berg 2007-05-04 20:55 ` Pavel Machek 2007-05-04 21:06 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Rafael J. Wysocki 0 siblings, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-05-04 20:21 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list [-- Attachment #1.1: Type: text/plain, Size: 762 bytes --] On Fri, 2007-05-04 at 22:20 +0200, Rafael J. Wysocki wrote: > > On the other hand, steps 6 and 7 aren't really needed for hibernation. > > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > > wouldn't work, but the next time the user turned the computer on manually > > it would still resume from hibernation. > > That's correct, with the exception that the user may find the system not fully > functional after the resume in that case. Why is that anyway? Is it just a matter of the acpi code getting confused about the acpi bios state? How can the acpi bios possibly be screwed up after what it must see as a fresh boot? Does the acpi code poke it in ways it's not supposed to be poked after a fresh boot? johannes [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:21 ` Johannes Berg @ 2007-05-04 20:55 ` Pavel Machek 2007-05-04 21:08 ` Johannes Berg 2007-05-04 21:06 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-04 20:55 UTC (permalink / raw) To: Johannes Berg; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list Hi! > > > On the other hand, steps 6 and 7 aren't really needed for hibernation. > > > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > > > wouldn't work, but the next time the user turned the computer on manually > > > it would still resume from hibernation. > > > > That's correct, with the exception that the user may find the system not fully > > functional after the resume in that case. > > Why is that anyway? Is it just a matter of the acpi code getting > confused about the acpi bios state? How can the acpi bios possibly be > screwed up after what it must see as a fresh boot? Does the acpi code > poke it in ways it's not supposed to be poked after a fresh boot? No, ACPI BIOS does not see a fresh boot. ACPI BIOS communicates with hw, too. Suppose it generates random number, stores it in memory and tells it to the keyboard controller during bootup (more specifically during ACPI enable phase). Now, it periodically checks if number in memory is same as the number known by keyboard controller. If you suspend/resume without telling acpi, it will find out, because numbers will not match. (And now, ACPI is probably not crazy enough to store random numbers -- but it could -- but for example "I had AC power, now I do not, and I did not see an interrupt telling me it went away" can be counted as confusing for ACPI).
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
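[ Editorial sketch, not part of the original message. Pavel's thought experiment can be modelled in a few lines: the firmware keeps one copy of a token in RAM and one in a controller, a fresh boot rewrites both, and restoring a hibernation image rewrites only the RAM copy. Everything here (struct machine, firmware_boot, restore_image, firmware_check, the token values) is hypothetical and only mirrors the example in the message, which itself says real ACPI firmware probably does not do exactly this. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical machine: firmware state is split between RAM and hardware. */
struct machine {
    unsigned ram_token;         /* copy kept in memory, captured by the image */
    unsigned controller_token;  /* copy kept in the keyboard controller */
};

/* Models the ACPI enable phase of a boot: both copies get the same token. */
void firmware_boot(struct machine *m, unsigned token)
{
    assert(m != NULL);
    m->ram_token = token;
    m->controller_token = token;
}

/* Restoring a hibernation image overwrites RAM, not the controller. */
void restore_image(struct machine *m, unsigned saved_ram_token)
{
    assert(m != NULL);
    m->ram_token = saved_ram_token;
}

/* The periodic consistency check described in the message. */
bool firmware_check(const struct machine *m)
{
    return m->ram_token == m->controller_token;
}
```

Booting with one token, saving the RAM copy, booting again with a different token and then calling restore_image() with the saved value leaves firmware_check() returning false: the hardware has seen a fresh boot, but the restored memory has not.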
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:55 ` Pavel Machek @ 2007-05-04 21:08 ` Johannes Berg 2007-05-04 21:15 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-04 21:08 UTC (permalink / raw) To: Pavel Machek; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list On Fri, 2007-05-04 at 22:55 +0200, Pavel Machek wrote: > > Why is that anyway? Is it just a matter of the acpi code getting > > confused about the acpi bios state? How can the acpi bios possibly be > > screwed up after what it must see as a fresh boot? Does the acpi code > > poke it in ways it's not supposed to be poked after a fresh boot? > > No, ACPI BIOS does not see a fresh boot. Sure. It just booted the machine so it must see it as a fresh boot. > ACPI BIOS communicates with hw, too. Suppose it generates random > number, stores it in memory and tells it to the keyboard controller > during bootup (more specifically during ACPI enable phase). > > Now, it periodically checks if number in memory is same as the number > known by keyboard controller. > > If you suspend/resume without telling acpi, it will find out, because > numbers will not match. > > (And now, ACPI is probably not crazy enough to store random numbers -- > but it could -- but for example "I had AC power, now I do not, and I > did not see an interrupt telling me it went away" can be counted as > confusing for ACPI). I don't follow. * you have AC power. * you save system state and shut down (S5) * you boot up again on battery power * you restore system state * ... vs. * you have AC power * you shut down * you boot up again on battery power * ... where's the difference to the ACPI bios? Oh, I see, it stores it somewhere in the memory that you've stored/restored? Well, that's your bug then, don't touch it.
johannes
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:08 ` Johannes Berg @ 2007-05-04 21:15 ` Pavel Machek 2007-05-04 21:53 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-04 21:15 UTC (permalink / raw) To: Johannes Berg; +Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list Hi! > > ACPI BIOS communicates with hw, too. Suppose it generates random > > number, stores it in memory and tells it to the keyboard conroller > > during bootup (more specifically during ACPI enable phase). > > > > Now, it periodically checks if number in memory is same as the number > > known by keyboard controller. > > > > If you suspend/resume without telling acpi, it will find out, because > > numbers will not match. > > > > (And now, ACPI is probably not crazy enough to store random numbers -- > > but it could -- but for example "I had AC power, now I do not, and I > > did not see a interrupt telling me it went away" can be counted as > > confusing for ACPI). > > I don't follow. > > * you have AC power. > * you save system state and shut down (S5) > * you boot up again on battery power > * you restore system state > * ... > > vs. > > * you have AC power > * you shut down > * you boot up again on battery power > * ... > > where's the difference to the ACPI bios? Oh, I see, it stores it > somewhere in the memory that you've stored/restored? Well, that's your > bug then, don't touch it. Not sure... yes, it stores parts somewhere in memory. Plus, it may have some parts related to the communications with operating system (*)... I guess we need to save those, and parts related to hw state... where your suggestion makes sense. (*) and yes, there probably are such parts. If we set backlight to 20%, we'll be confused if it is 100% after resume... we probably could handle those one-by-one... 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:15 ` Pavel Machek @ 2007-05-04 21:53 ` Rafael J. Wysocki 2007-05-04 21:53 ` Johannes Berg 2007-05-05 15:52 ` Alan Stern 0 siblings, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 21:53 UTC (permalink / raw) To: Pavel Machek Cc: Nigel Cunningham, Pekka Enberg, Johannes Berg, Linux-pm mailing list On Friday, 4 May 2007 23:15, Pavel Machek wrote: > Hi! > > > > ACPI BIOS communicates with hw, too. Suppose it generates random > > > number, stores it in memory and tells it to the keyboard conroller > > > during bootup (more specifically during ACPI enable phase). > > > > > > Now, it periodically checks if number in memory is same as the number > > > known by keyboard controller. > > > > > > If you suspend/resume without telling acpi, it will find out, because > > > numbers will not match. > > > > > > (And now, ACPI is probably not crazy enough to store random numbers -- > > > but it could -- but for example "I had AC power, now I do not, and I > > > did not see a interrupt telling me it went away" can be counted as > > > confusing for ACPI). > > > > I don't follow. > > > > * you have AC power. > > * you save system state and shut down (S5) > > * you boot up again on battery power > > * you restore system state > > * ... > > > > vs. > > > > * you have AC power > > * you shut down > > * you boot up again on battery power > > * ... > > > > where's the difference to the ACPI bios? Oh, I see, it stores it > > somewhere in the memory that you've stored/restored? Well, that's your > > bug then, don't touch it. > > Not sure... yes, it stores parts somewhere in memory. These are reserved regions. On the majority of systems we handle them correctly. > Plus, it may have some parts related to the communications with operating > system (*)... I guess we need to save those, and parts related to hw > state... 
where your suggestion makes sense. If they are accessible to us, then we can, but what if they aren't (eg. the state information is stored in the embedded controller, can only be read with the help of some AML invocations and cannot be changed from the OS level)? > (*) and yes, there probably are such parts. If we set backlight to > 20%, we'll be confused if it is 100% after resume... we probably could > handle those one-by-one... *If* we reinitialize devices *and* ACPI from scratch after restoring the image, we'll discard the old value (20%) and read the new value (100%) from the BIOS. The problems occur, IMO, because we try to be smart and use the BIOS after the resume as though we'd resumed from a real suspend (eg. s2ram). Which is natural, if we use the same set of .resume() callbacks for both cases. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:53 ` Rafael J. Wysocki @ 2007-05-04 21:53 ` Johannes Berg 2007-05-04 22:25 ` Rafael J. Wysocki 2007-05-05 15:52 ` Alan Stern 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-04 21:53 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list On Fri, 2007-05-04 at 23:53 +0200, Rafael J. Wysocki wrote: > > Plus, it may have some parts related to the communications with operating > > system (*)... I guess we need to save those, and parts related to hw > > state... where your suggestion makes sense. > > If they are accessible to us, then we can, but what if they aren't (eg. the > state information is stored in the embedded controller, can only be read with > the help of some AML invocations and cannot be changed from the OS level)? Well, in that case you also haven't overwritten/changed them during restore so there's no room for mismatches and confusion. johannes
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:53 ` Johannes Berg @ 2007-05-04 22:25 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 22:25 UTC (permalink / raw) To: Johannes Berg Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list On Friday, 4 May 2007 23:53, Johannes Berg wrote: > On Fri, 2007-05-04 at 23:53 +0200, Rafael J. Wysocki wrote: > > > Plus, it may have some parts related to the communications with operating > > > system (*)... I guess we need to save those, and parts related to hw > > > state... where your suggestion makes sense. > > > > If they are accessible to us, then we can, but what if they aren't (eg. the > > state information is stored in the embedded controller, can only be read with > > the help of some AML invocations and cannot be changed from the OS level)? > > Well, in that case you also haven't overwritten/changed them during > restore so there's no room for mismatches and confusion. Not if we went for S5 to finish the hibernation and then we try to be smart and rely on the BIOS-provided information/functionality *as though* we had passed through S4. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:53 ` Rafael J. Wysocki 2007-05-04 21:53 ` Johannes Berg @ 2007-05-05 15:52 ` Alan Stern 2007-05-07 1:16 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 15:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Fri, 4 May 2007, Rafael J. Wysocki wrote: > > Plus, it may have some parts related to the communications with operating > > system (*)... I guess we need to save those, and parts related to hw > > state... where your suggestion makes sense. > > If they are accessible to us, then we can, but what if they aren't (eg. the > state information is stored in the embedded controller, can only be read with > the help of some AML invocations and cannot be changed from the OS level)? > > > (*) and yes, there probably are such parts. If we set backlight to > > 20%, we'll be confused if it is 100% after resume... we probably could > > handle those one-by-one... > > *If* we reinitialize devices *and* ACPI from scratch after restoring the image, > we'll discard the old value (20%) and read the new value (100%) from the BIOS. > The problems occur, IMO, because we try to be smart and use the BIOS > after the resume as though we'd resumed from a real suspend (eg. s2ram). > > Which is natural, if we use the same set of .resume() callbacks for both cases. Agreed, these all sound like problems in the ACPI driver's implementation of suspend and resume. Problems that are caused (at least in part) by the fact that the PM core doesn't tell the driver whether it's doing suspend-to-RAM vs. hibernation. Once that is straightened out, everything else should become much simpler. Alan Stern
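To make the point about a single set of .resume() callbacks concrete, here is a minimal C sketch of a resume hook that is told which kind of wakeup it is handling. It is not the kernel's driver model: `enum wake_kind`, `struct backlight_dev` and `backlight_resume()` are hypothetical names, and the firmware value is passed in as a plain argument rather than read through ACPI. The backlight numbers (20% cached, 100% after a fresh boot) are the ones used in the discussion above.

```c
/* Hypothetical sketch only: these types are not part of the kernel's
 * driver model; they just illustrate passing the wakeup kind down. */

enum wake_kind {
    WAKE_FROM_RAM,    /* real suspend (e.g. s2ram): firmware state was kept */
    WAKE_FROM_IMAGE   /* hibernation image restored after a fresh boot      */
};

struct backlight_dev {
    int cached_percent;   /* value remembered by the driver (in the image) */
};

/*
 * Resume hook that knows which transition it is completing.
 * firmware_percent stands in for whatever the BIOS reports right now.
 * Returns the brightness the driver believes in after resume.
 */
int backlight_resume(struct backlight_dev *dev, enum wake_kind kind,
                     int firmware_percent)
{
    if (kind == WAKE_FROM_IMAGE) {
        /* Fresh boot underneath us: discard the cached value and
         * adopt what the firmware reports. */
        dev->cached_percent = firmware_percent;
    }
    /* For WAKE_FROM_RAM the cached value is assumed to still be valid. */
    return dev->cached_percent;
}
```

With a cached value of 20 and a firmware value of 100, the `WAKE_FROM_IMAGE` path ends up at 100 (reinitialize from scratch), while a callback that cannot tell the two cases apart would have to guess.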
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-05 15:52 ` Alan Stern @ 2007-05-07 1:16 ` David Brownell 2007-05-07 21:00 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 1:16 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Saturday 05 May 2007, Alan Stern wrote: > Agreed, these all sound like problems in the ACPI driver's implementation > of suspend and resume. Problems that are caused (at least in part) by the > fact that the PM core doesn't tell the driver whether it's doing > suspend-to-RAM vs. hibernation. Once that is straighened out, everything > else should become much simpler. I'm not sure I agree with that diagnosis, but for the record: updating drivers/pci/pci-acpi.c so that it can implement the platform_pci_choose_state() hook requires ACPI to export that information. So for now I have drivers/acpi/sleep/main.c exporting s_state = acpi_get_target_sleep_state(); so that ACPI-aware code can know to call "_S3D" instead of the "_S1D" or "_S4D" methods (and "_S3W" etc). Of course the $SUBJECT patch will finish borking that for S4. :( - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
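As an illustration of why ACPI-aware code needs the target sleep state, the following C sketch builds the name of the object to evaluate ("_S3D", "_S4W", ...) from that state. It is an assumption-laden sketch, not the code from the patch under discussion: `sleep_method_name()` is a hypothetical helper, and it only covers the S1 to S4 names mentioned above.

```c
/*
 * Hypothetical helper: build the 4-character ACPI object name
 * "_S<n>D" or "_S<n>W" for a target sleep state n in 1..4.
 *
 * kind is 'D' or 'W'. out must have room for 5 bytes (name + NUL).
 * Returns 0 on success, -1 if the arguments are out of range.
 */
int sleep_method_name(int s_state, char kind, char out[5])
{
    if (s_state < 1 || s_state > 4)
        return -1;
    if (kind != 'D' && kind != 'W')
        return -1;

    out[0] = '_';
    out[1] = 'S';
    out[2] = (char)('0' + s_state);  /* single digit, checked above */
    out[3] = kind;
    out[4] = '\0';
    return 0;
}
```

A caller that has obtained the target state (for example via something like the `acpi_get_target_sleep_state()` export described above) could pass it straight in; without that information it cannot know whether "_S3D" or "_S4D" is the right object.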
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 1:16 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell @ 2007-05-07 21:00 ` Rafael J. Wysocki 2007-05-07 21:45 ` David Brownell 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-07 21:00 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Monday, 7 May 2007 03:16, David Brownell wrote: > On Saturday 05 May 2007, Alan Stern wrote: > > > Agreed, these all sound like problems in the ACPI driver's implementation > > of suspend and resume. Problems that are caused (at least in part) by the > > fact that the PM core doesn't tell the driver whether it's doing > > suspend-to-RAM vs. hibernation. Once that is straighened out, everything > > else should become much simpler. > > I'm not sure I agree with that diagnosis, but for the record: > updating drivers/pci/pci-acpi.c so that it can implement the > platform_pci_choose_state() hook requires ACPI to export that > information. > > So for now I have drivers/acpi/sleep/main.c exporting > > s_state = acpi_get_target_sleep_state(); > > so that ACPI-aware code can know to call "_S3D" instead of > the "_S1D" or "_S4D" methods (and "_S3W" etc). Of course > the $SUBJECT patch will finish borking that for S4. :( Why exactly? Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 21:00 ` Rafael J. Wysocki @ 2007-05-07 21:45 ` David Brownell 2007-05-07 22:16 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 21:45 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Monday 07 May 2007, Rafael J. Wysocki wrote: > On Monday, 7 May 2007 03:16, David Brownell wrote: > > So for now I have drivers/acpi/sleep/main.c exporting > > > > s_state = acpi_get_target_sleep_state(); > > > > so that ACPI-aware code can know to call "_S3D" instead of > > the "_S1D" or "_S4D" methods (and "_S3W" etc). Of course > > the $SUBJECT patch will finish borking that for S4. :( > > Why exactly? Because it adds new code paths ... currently pm_ops methods record the target state. Fixable later. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 21:45 ` David Brownell @ 2007-05-07 22:16 ` Rafael J. Wysocki 2007-05-09 19:23 ` David Brownell 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-07 22:16 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Monday, 7 May 2007 23:45, David Brownell wrote: > On Monday 07 May 2007, Rafael J. Wysocki wrote: > > On Monday, 7 May 2007 03:16, David Brownell wrote: > > > > So for now I have drivers/acpi/sleep/main.c exporting > > > > > > s_state = acpi_get_target_sleep_state(); > > > > > > so that ACPI-aware code can know to call "_S3D" instead of > > > the "_S1D" or "_S4D" methods (and "_S3W" etc). Of course > > > the $SUBJECT patch will finish borking that for S4. :( > > > > Why exactly? > > Because it adds new code paths ... currently pm_ops methods > record the target state. Fixable later. Hmm, I think hibernation_ops do the equivalent of what pm_ops did for ACPI_STATE_S4 and the target state is still recorded (in acpi_enter_sleep_state_prep()). Isn't that correct? ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 22:16 ` Rafael J. Wysocki @ 2007-05-09 19:23 ` David Brownell 0 siblings, 0 replies; 713+ messages in thread From: David Brownell @ 2007-05-09 19:23 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg, Linux-pm mailing list On Monday 07 May 2007, Rafael J. Wysocki wrote: > On Monday, 7 May 2007 23:45, David Brownell wrote: > > On Monday 07 May 2007, Rafael J. Wysocki wrote: > > > On Monday, 7 May 2007 03:16, David Brownell wrote: > > > > > > So for now I have drivers/acpi/sleep/main.c exporting > > > > > > > > s_state = acpi_get_target_sleep_state(); > > > > > > > > so that ACPI-aware code can know to call "_S3D" instead of > > > > the "_S1D" or "_S4D" methods (and "_S3W" etc). Of course > > > > the $SUBJECT patch will finish borking that for S4. :( > > > > > > Why exactly? > > > > Because it adds new code paths ... currently pm_ops methods > > record the target state. Fixable later. > > Hmm, I think hibernation_ops do the equivalent of what pm_ops did for > ACPI_STATE_S4 and the target state is still recorded (in > acpi_enter_sleep_state_prep()). Isn't that correct? I didn't use that method, because of information hiding. See the patch I just posted. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:21 ` Johannes Berg 2007-05-04 20:55 ` Pavel Machek @ 2007-05-04 21:06 ` Rafael J. Wysocki 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 21:06 UTC (permalink / raw) To: Johannes Berg Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list On Friday, 4 May 2007 22:21, Johannes Berg wrote: > On Fri, 2007-05-04 at 22:20 +0200, Rafael J. Wysocki wrote: > > > > On the other hand, steps 6 and 7 aren't really needed for hibernation. > > > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > > > wouldn't work, but the next time the user turned the computer on manually > > > it would still resume from hibernation. > > > > That's correct, with the exception that the user may find the system not fully > > functional after the resume in that case. > > Why is that anyway? Is it just a matter of the acpi code getting > confused about the acpi bios state? Yes, I think so. > How can the acpi bios possibly be screwed up after what it must see as a > fresh boot? Does the acpi code poke it in ways it's not supposed to be poked > after a fresh boot? Sort of. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:40 ` Alan Stern 2007-05-04 20:20 ` Rafael J. Wysocki @ 2007-05-04 20:58 ` Pavel Machek 2007-05-04 21:24 ` Rafael J. Wysocki 2007-05-04 21:40 ` David Brownell 2 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-04 20:58 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg Hi! > You all misunderstood the point I was trying to make. > > > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one > > (complicated) 'platform' power off method. It doesn't make sense to use the > > (other) hibernation_ops without the ->enter() method. > > Let's look at the big picture. > > Entering hibernation basically involves these steps: > > 1. Freeze tasks > > 2. Quiesce devices and drivers > > 3. Create snapshot > > 4. Reactivate devices and drivers > > 5. Save snapshot to disk > > 6. Prepare devices for wakeup > > 7. Power down (ACPI S4 on systems which support it) > > Leaving hibernation involves a similar sequence which I won't discuss. > > Notice that steps 1-5 above are _completely_ independent of all issues > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to > be No, they are not. You probably should tell ACPI at step 2 that you are suspending, and you definitely need to tell ACPI that you have resumed (so it can re-scan AC adapters, for example). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:58 ` Pavel Machek @ 2007-05-04 21:24 ` Rafael J. Wysocki 2007-05-05 16:19 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 21:24 UTC (permalink / raw) To: Pavel Machek Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg Hi, On Friday, 4 May 2007 22:58, Pavel Machek wrote: > Hi! > > > You all misunderstood the point I was trying to make. > > > > > > acpi_enter_sleep_state(ACPI_STATE_S4), are just different pieces of one > > > (complicated) 'platform' power off method. It doesn't make sense to use the > > > (other) hibernation_ops without the ->enter() method. > > > > Let's look at the big picture. > > > > Entering hibernation basically involves these steps: > > > > 1. Freeze tasks > > > > 2. Quiesce devices and drivers > > > > 3. Create snapshot > > > > 4. Reactivate devices and drivers > > > > 5. Save snapshot to disk > > > > 6. Prepare devices for wakeup > > > > 7. Power down (ACPI S4 on systems which support it) > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to > > be > > No, they are not. You probably should tell ACPI at step 2 that you are > suspending, You can, but even if you don't, the BIOS shouldn't have problems. What might have problems is our ACPI code during the resume, if it cannot get appropriate information from the BIOS. > and you definitely need to tell ACPI that you have resumed > (so it can re-scan AC adapters, for example). Yes, but that can be done in two different ways: 1) "We have restored the hibernation image, but the BIOS state corresponds to a fresh reboot, so please initialize everything from scratch." 
2) "We have restored the hibernation image and the ACPI S4 was used for powering off (hint: you may try not to initialize everything from scratch)." Of course, in the case 2) we are responsible for ensuring that the contents of the hibernation image are consistent with the information preserved by the BIOS. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:24 ` Rafael J. Wysocki @ 2007-05-05 16:19 ` Alan Stern 2007-05-05 17:46 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 16:19 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Fri, 4 May 2007, Rafael J. Wysocki wrote: > > > Entering hibernation basically involves these steps: > > > > > > 1. Freeze tasks > > > > > > 2. Quiesce devices and drivers > > > > > > 3. Create snapshot > > > > > > 4. Reactivate devices and drivers > > > > > > 5. Save snapshot to disk > > > > > > 6. Prepare devices for wakeup > > > > > > 7. Power down (ACPI S4 on systems which support it) > > > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to > > > be > > > > No, they are not. You probably should tell ACPI at step 2 that you are > > suspending, At step 2 you don't _know_ that you are suspending! Step 5 might fail. You should tell ACPI during step 6 or 7. > You can, but even if you don't, the BIOS shouldn't have problems. What might > have problems is our ACPI code during the resume, if it cannot get appropriate > information from the BIOS. > > > and you definitely need to tell ACPI that you have resumed > > (so it can re-scan AC adapters, for example). > > Yes, but that can be done in two different ways: > > 1) "We have restored the hibernation image, but the BIOS state corresponds to > a fresh reboot, so please initialize everything from scratch." > > 2) "We have restored the hibernation image and the ACPI S4 was used for > powering off (hint: you may try not to initialize everything from scratch)." 
> > Of course, in the case 2) we are responsible for ensuring that the contents of > the hibernation image are consistent with the information preserved by the > BIOS. Keep in mind also that before you can do either 1) or 2), the boot kernel has already communicated with the BIOS, possibly changing some of the ACPI state. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
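The ordering argument (step 5 can fail, so the platform should only be told at step 6 or 7) can be sketched as a small C model of the seven steps. Everything here is hypothetical: `struct hib_log`, `record()` and `hibernate_sketch()` are made-up names, each step is reduced to a log entry, and the `save_ok` flag simply simulates whether writing the snapshot to disk succeeds.

```c
#include <stdbool.h>

#define MAX_STEPS 8

/* Hypothetical trace of what the sequence did. */
struct hib_log {
    int  steps[MAX_STEPS];    /* step numbers in the order they ran   */
    int  count;               /* how many steps ran                   */
    bool platform_notified;   /* was the platform (ACPI) told at all? */
};

static void record(struct hib_log *log, int step)
{
    if (log->count < MAX_STEPS)
        log->steps[log->count++] = step;
}

/*
 * Model of the seven-step sequence quoted above.
 * Returns 0 if the machine would power down, -1 if saving failed and
 * the system simply keeps running.
 */
int hibernate_sketch(bool save_ok, struct hib_log *log)
{
    log->count = 0;
    log->platform_notified = false;

    record(log, 1);   /* 1. freeze tasks                  */
    record(log, 2);   /* 2. quiesce devices and drivers   */
    record(log, 3);   /* 3. create snapshot               */
    record(log, 4);   /* 4. reactivate devices            */
    record(log, 5);   /* 5. save snapshot to disk         */
    if (!save_ok)
        return -1;    /* platform was never told, nothing to undo there */

    log->platform_notified = true;
    record(log, 6);   /* 6. prepare devices for wakeup    */
    record(log, 7);   /* 7. power down                    */
    return 0;
}
```

When `save_ok` is false the function returns after five steps with `platform_notified` still false, which is the situation a notification at step 2 would get wrong.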
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 16:19 ` Alan Stern @ 2007-05-05 17:46 ` Rafael J. Wysocki 2007-05-05 21:42 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 17:46 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Saturday, 5 May 2007 18:19, Alan Stern wrote: > On Fri, 4 May 2007, Rafael J. Wysocki wrote: > > > > > Entering hibernation basically involves these steps: > > > > > > > > 1. Freeze tasks > > > > > > > > 2. Quiesce devices and drivers > > > > > > > > 3. Create snapshot > > > > > > > > 4. Reactivate devices and drivers > > > > > > > > 5. Save snapshot to disk > > > > > > > > 6. Prepare devices for wakeup > > > > > > > > 7. Power down (ACPI S4 on systems which support it) > > > > > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to > > > > be > > > > > > No, they are not. You probably should tell ACPI at step 2 that you are > > > suspending, > > At step 2 you don't _know_ that you are suspending! Step 5 might fail. > You should tell ACPI during step 6 or 7. > > > You can, but even if you don't, the BIOS shouldn't have problems. What might > > have problems is our ACPI code during the resume, if it cannot get appropriate > > information from the BIOS. > > > > > and you definitely need to tell ACPI that you have resumed > > > (so it can re-scan AC adapters, for example). > > > > Yes, but that can be done in two different ways: > > > > 1) "We have restored the hibernation image, but the BIOS state corresponds to > > a fresh reboot, so please initialize everything from scratch." 
> > > > 2) "We have restored the hibernation image and the ACPI S4 was used for > > powering off (hint: you may try not to initialize everything from scratch)." > > > > Of course, in the case 2) we are responsible for ensuring that the contents of > > the hibernation image are consistent with the information preserved by the > > BIOS. > > Keep in mind also that before you can do either 1) or 2), the boot kernel > has already communicated with the BIOS, possibly changing some of the ACPI > state. That's correct, but it follows from the ACPI spec that there is a way for the boot kernel to distinguish 'normal' boot from 'S4 resume' boot. If this mechanism is used and the boot kernel states that it's doing a 'S4 resume', it will be able to leave ACPI alone and restore the hibernation image. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 17:46 ` Rafael J. Wysocki @ 2007-05-05 21:42 ` Alan Stern 2007-05-05 22:14 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 21:42 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > > Yes, but that can be done in two different ways: > > > > > > 1) "We have restored the hibernation image, but the BIOS state corresponds to > > > a fresh reboot, so please initialize everything from scratch." > > > > > > 2) "We have restored the hibernation image and the ACPI S4 was used for > > > powering off (hint: you may try not to initialize everything from scratch)." > > > > > > Of course, in the case 2) we are responsible for ensuring that the contents of > > > the hibernation image are consistent with the information preserved by the > > > BIOS. > > > > Keep in mind also that before you can do either 1) or 2), the boot kernel > > has already communicated with the BIOS, possibly changing some of the ACPI > > state. > > That's correct, but it follows from the ACPI spec that there is a way for the > boot kernel to distinguish 'normal' boot from 'S4 resume' boot. If this > mechanism is used and the boot kernel states that it's doing a 'S4 resume', > it will be able to leave ACPI alone and restore the hibernation image. Okay, good. That means part of the resume-from-hibernation handling must be included in the standard startup code of the ACPI driver, because it runs in the boot kernel rather than the restored kernel. Does it work that way now? You'd think it must... The restored kernel could do either 1) or 2), I don't see that it matters much which. 
1) might be safer, because it's possible that external power was turned off at some point during the hibernation (and no battery power was available). Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 21:42 ` Alan Stern @ 2007-05-05 22:14 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 22:14 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Saturday, 5 May 2007 23:42, Alan Stern wrote: > On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > > > > Yes, but that can be done in two different ways: > > > > > > > > 1) "We have restored the hibernation image, but the BIOS state corresponds to > > > > a fresh reboot, so please initialize everything from scratch." > > > > > > > > 2) "We have restored the hibernation image and the ACPI S4 was used for > > > > powering off (hint: you may try not to initialize everything from scratch)." > > > > > > > > Of course, in the case 2) we are responsible for ensuring that the contents of > > > > the hibernation image are consistent with the information preserved by the > > > > BIOS. > > > > > > Keep in mind also that before you can do either 1) or 2), the boot kernel > > > has already communicated with the BIOS, possibly changing some of the ACPI > > > state. > > > > That's correct, but it follows from the ACPI spec that there is a way for the > > boot kernel to distinguish 'normal' boot from 'S4 resume' boot. If this > > mechanism is used and the boot kernel states that it's doing a 'S4 resume', > > it will be able to leave ACPI alone and restore the hibernation image. > > Okay, good. That means part of the resume-from-hibernation handling must > be included in the standard startup code of the ACPI driver, because it > runs in the boot kernel rather than the restored kernel. Does it work > that way now? You'd think it must... Well, I'm not sure, but I don't think so. 
It looks like the ACPI code that we use in the hibernation/suspend code paths is not in good shape in general. IOW, we may want to implement that in the future, but I'd rather get 1) working reliably for everyone first. > The restored kernel could do either 1) or 2), I don't see that it matters > much which. 1) might be safer, because it's possible that external power > was turned off at some point during the hibernation (and no battery power > was available). I think that the 'ACPI S4' handling adds quite a lot of complexity to the picture and should be added on top of a working infrastructure, as an extension. Currently, we don't handle hibernation in accordance with the ACPI spec anyway. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
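The two restore options discussed above, 1) reinitialize everything from scratch and 2) rely on state preserved across ACPI S4, can be sketched as a small decision function. This is only an illustration under stated assumptions: boot_kind, restore_policy, resume_info and choose_restore_policy() are hypothetical names, not existing kernel interfaces, and the s4_state_trusted flag stands in for whatever check would establish that the image is consistent with the information preserved by the BIOS.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical; none of them exist in the kernel. */
enum boot_kind { BOOT_NORMAL, BOOT_S4_RESUME };

enum restore_policy {
    RESTORE_REINIT_ALL,          /* option 1: treat BIOS state as a fresh boot */
    RESTORE_KEEP_PLATFORM_STATE  /* option 2: rely on state preserved by S4   */
};

struct resume_info {
    enum boot_kind boot;   /* what the boot kernel detected at startup          */
    bool image_used_s4;    /* the image records that S4 was used to power off   */
    bool s4_state_trusted; /* assumed check: image matches preserved BIOS state */
};

/*
 * Option 2 is chosen only when every precondition holds; anything else
 * falls back to option 1, the conservative path.
 */
enum restore_policy choose_restore_policy(const struct resume_info *info)
{
    if (info->boot == BOOT_S4_RESUME &&
        info->image_used_s4 &&
        info->s4_state_trusted)
        return RESTORE_KEEP_PLATFORM_STATE;

    return RESTORE_REINIT_ALL;
}
```

Defaulting to option 1 matches the point made above that it is the safer choice when, for example, external power was lost during hibernation.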
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:40 ` Alan Stern 2007-05-04 20:20 ` Rafael J. Wysocki 2007-05-04 20:58 ` Pavel Machek @ 2007-05-04 21:40 ` David Brownell 2007-05-04 22:19 ` Rafael J. Wysocki 2007-05-05 16:08 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern 2 siblings, 2 replies; 713+ messages in thread From: David Brownell @ 2007-05-04 21:40 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Friday 04 May 2007, Alan Stern wrote: > Rafael, David, and Pavel: > > You all misunderstood the point I was trying to make. > > ... > > Let's look at the big picture. > > Entering hibernation basically involves these steps: > > 1. Freeze tasks > 2. Quiesce devices and drivers > 3. Create snapshot > 4. Reactivate devices and drivers > 5. Save snapshot to disk > 6. Prepare devices for wakeup > 7. Power down (ACPI S4 on systems which support it) > > Leaving hibernation involves a similar sequence which I won't discuss. > > Notice that steps 1-5 above are _completely_ independent of all issues > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > carried out for hibernation to work, no matter how the system ends up > getting shut down. Not exactly. Step 2 is supposed to be aware of the target state's capabilities, including what's wakeup-capable. ACPI uses target device states to choose which _SxD methods to execute, etc. (Or it should ... though come to think of it, I don't think I ever saw a hook whereby PCI could trigger that.) > On the other hand, steps 6 and 7 aren't really needed for hibernation. > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > wouldn't work, but the next time the user turned the computer on manually > it would still resume from hibernation. 
I believe I did comment on your point that step 7 could use S5. However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C) that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state. (Then fuzzes the issue in 2.4, but those bits are less relevant here; 2.2 also mentions G3 = "Mechanical OFF", which is the only state in which machine disassembly/reassembly is expected to be safe.) ACPI is allowed to distinguish between S4 and S5 in more ways than just the power usage. It'd be fair for the AML to store state in something that retains power, and rely on that. It'd be better not to do things that are allowed to confuse ACPI. > Conversely, steps 6 and 7 can make sense even in situations where you > don't want to hibernate. For example, you might want a normal shutdown in > which the operating system does a full restart when the firmware is > signalled by a wakeup device. Wakeup devices in S4 are expected to be a superset of those in S5, and system documentation often covers that. Yeah, I know, "who bothers to RTFM". Still, the point is that these systems are now documented to work in a particular way, and there really ought to be a good reason to invalidate user training and documentation. > So there should be separate data structures associated with 1-5 and 6-7. > Maybe the one associated with 6-7 is what you are calling hibernation_ops; > if so then fine. But I still think that it should be usable for > situations where you are not entering hibernation, and we should be > possible to enter hibernation without using it. The system administrator > should be able to choose which of S4 or S5 gets used for _any_ poweroff, > regardless of whether it's to start hibernating. But ... why? What value would users see from that? We do have /sys/power/disk today, but that's only for hibernation. (And it's a bit confusing, too.) A "Soft OFF" should be S5 to conform to specs and documentation. > The ACPI spec might refer to S4 as "hibernation" (does it?
-- I'm too lazy > to check and see), but that doesn't mean we have to use the terms > synonymously. It talks about S4 as a "sleeping" state, like S1, S2, and S3. Or, about S4 as a "Non-Volatile sleep" state. I think it also assumes more intelligence on resume-from-S4 than Linux has just now, which may partly explain why it takes so long for swsusp to finish its thing. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:40 ` David Brownell @ 2007-05-04 22:19 ` Rafael J. Wysocki 2007-05-07 1:05 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell 2007-05-05 16:08 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 22:19 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Friday, 4 May 2007 23:40, David Brownell wrote: > On Friday 04 May 2007, Alan Stern wrote: > > Rafael, David, and Pavel: > > > > You all misunderstood the point I was trying to make. > > > > ... > > > > Let's look at the big picture. > > > > Entering hibernation basically involves these steps: > > > > 1. Freeze tasks > > 2. Quiesce devices and drivers > > 3. Create snapshot > > 4. Reactivate devices and drivers > > 5. Save snapshot to disk > > 6. Prepare devices for wakeup > > 7. Power down (ACPI S4 on systems which support it) > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > > carried out for hibernation to work, no matter how the system ends up > > getting shut down. > > Not exactly. Step 2 is supposed to be aware of the target state's > capabilities, including what's wakeup-capable. ACPI uses target > device states to choose which _SxD methods to execute, etc. (Or it > should ... though come to think of it, I don't think I ever saw a > hook whereby PCI could trigger that.) Still, step 4 effectively undoes at least some things we did in 2. At least the GPEs should be enabled for normal operation so that we can save the image. 
> > On the other hand, steps 6 and 7 aren't really needed for hibernation. > > You _could_ shut the system off completely (ACPI S5). Automatic wakeup > > wouldn't work, but the next time the user turned the computer on manually > > it would still resume from hibernation. > > I believe I did comment on your point that step 7 could use S5. > > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C) > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state. (Then > fuzzes the issue in 2.4, but those bits are less relevant here; > 2.2 also mentions G3 = "Mechanical OFF", which is the only state > in which machine disassembly/reassembly is expected to be safe. But then there's the nice picture in 9.3.3 (OS loading) that shows how OSPM (that would be us) can verify that the hardware configuration hasn't changed. In fact we don't do this, because we always go to the "Load OS Images" block and load the hibernation image from this newly loaded OS (aka the boot kernel). Thus our resume is always different from the "ACPI wake up from S4". > ACPI is allowed to distinguish between S4 and S5 in more ways > than just the power usage. It'd be fair for the AML to store > state in something that retains power, and rely on that. It'd > be better not to do things that are allowed to confuse ACPI. As far as I understand the specification, OSPM (ie. we) can always discard the fact that the system has entered S4 and reinitialize everything from scratch. > > Conversely, steps 6 and 7 can make sense even in situations where you > > don't want to hibernate. For example, you might want a normal shutdown in > > which the operating system does a full restart when the firmware is > > signalled by a wakeup device. > > Wakeup devices in S4 are expected to be a superset of those in S5, > and system documentation often covers that. Yeah, I know, "who > bothers to RTFM". 
Still, the point is that these systems are now > documented to work in a particular way, and there really ought to > be a good reason to invalidate user training and documentation. That's a very important point, IMO. > > So there should be separate data structures associated with 1-5 and 6-7. > > Maybe the one associated with 6-7 is what you are calling hibernation_ops; > > if so then fine. But I still think that it should be usable for > > situations where you are not entering hibernation, and we should be > > possible to enter hibernation without using it. The system administrator > > should be able to choose which of S4 or S5 gets used for _any_ poweroff, > > regardless of whether it's to start hibernating. > > But ... why? What value would users see from that? > > We do have /sys/power/disk today, but that's only for > hibernation. (And it's a bit confusing, too.) > > A "Soft OFF" should be S5 to conform to specs and > documentation. > > > > The ACPI spec might refer to S4 as "hibernation" (does it? -- I'm too lazy > > to check and see), but that doesn't mean we have to use the terms > > synonymously. > > It talks S4 as a "sleeping" state, like S1, S2, and S3. > Or, about S4 as a "Non-Volatle sleep" state > > I think it also assumes more intelligence on resume-from-S4 > than Linux has just now, which may partly explain why it > takes so long for swsusp to finish its thing. Well, please look at the picture in 9.3.3 and compare it to what we're doing. ;-) Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-04 22:19 ` Rafael J. Wysocki @ 2007-05-07 1:05 ` David Brownell 0 siblings, 0 replies; 713+ messages in thread From: David Brownell @ 2007-05-07 1:05 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Friday 04 May 2007, Rafael J. Wysocki wrote: > On Friday, 4 May 2007 23:40, David Brownell wrote: > > On Friday 04 May 2007, Alan Stern wrote: > > > 1. Freeze tasks > > > 2. Quiesce devices and drivers > > > 3. Create snapshot > > > 4. Reactivate devices and drivers > > > 5. Save snapshot to disk > > > 6. Prepare devices for wakeup > > > 7. Power down (ACPI S4 on systems which support it) > > > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > > > carried out for hibernation to work, no matter how the system ends up > > > getting shut down. > > > > Not exactly. Step 2 is supposed to be aware of the target state's > > capabilities, including what's wakeup-capable. ACPI uses target > > device states to choose which _SxD methods to execute, etc. (Or it > > should ... though come to think of it, I don't think I ever saw a > > hook whereby PCI could trigger that.) The hook is there, but it's not yet implemented ... patch in the works. Whoever implemented pci_choose_state() botched it up. > Still, step 4 effectively undoes at least some things we did in 2. At least > the GPEs should be enabled for normal operation so that we can save the image. And for that matter, wakeup shouldn't be limited to wake-from-sleep; runtime device PM should be able to use it. ACPI doesn't use GPEs very well at all, except maybe runtime GPEs. Step 6 needs to know the same info, so it can enable the GPEs that work from S4. 
> But then there's the nice picture in 9.3.3 (OS loading) that shows how OSPM > (that would be us) can verify that the hardware configuration hasn't changed. > > In fact we don't do this, because we always go to the "Load OS Images" block > and load the hibernation image from this newly loaded OS (aka the boot kernel). > > Thus our resume is always different from the "ACPI wake up from S4". Right ... "slower" being one consequence. > > ACPI is allowed to distinguish between S4 and S5 in more ways > > than just the power usage. It'd be fair for the AML to store > > state in something that retains power, and rely on that. It'd > > be better not to do things that are allowed to confuse ACPI. > > As far as I understand the specification, OSPM (ie. we) can always discard > the fact that the system has entered S4 and reinitialize everything from > scratch. At the price of making some things needlessly misbehave. Devices that can wake from D3cold will detect state being trashed if you re-init, which is at least sub-optimal if not wrong. > > Still, the point is that these systems are now > > documented to work in a particular way, and there really ought to > > be a good reason to invalidate user training and documentation. > > That's a very important point, IMO. So I just re-quoted it. ;) > > A "Soft OFF" should be S5 to conform to specs and > > documentation. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:40 ` David Brownell 2007-05-04 22:19 ` Rafael J. Wysocki @ 2007-05-05 16:08 ` Alan Stern 2007-05-05 17:50 ` Rafael J. Wysocki 2007-05-07 1:31 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell 1 sibling, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-05 16:08 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Fri, 4 May 2007, David Brownell wrote: > > Entering hibernation basically involves these steps: > > > > 1. Freeze tasks > > 2. Quiesce devices and drivers > > 3. Create snapshot > > 4. Reactivate devices and drivers > > 5. Save snapshot to disk > > 6. Prepare devices for wakeup > > 7. Power down (ACPI S4 on systems which support it) > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > > carried out for hibernation to work, no matter how the system ends up > > getting shut down. > > Not exactly. Step 2 is supposed to be aware of the target state's > capabilities, including what's wakeup-capable. Not true. Step 2 is (or should be) divorced from power-level considerations. All it needs to do is quiesce things so that a consistent snapshot can be obtained; changing power levels would take time and ideally should be avoided. Furthermore, anything done in step 2 should be reversed in step 4. Did you mean to say that Step _6_ is supposed to be aware of the target state's capabilities? I'll agree to that. > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C) > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state. 
(Then > fuzzes the issue in 2.4, but those bits are less relevant here; > 2.2 also mentions G3 = "Mechanical OFF", which is the only state > in which machine disassembly/reassembly is expected to be safe. Sure. But entering hibernation need not involve putting the system into a "sleeping" state. Going into G3 should also work for hibernation. > ACPI is allowed to distinguish between S4 and S5 in more ways > than just the power usage. It'd be fair for the AML to store > state in something that retains power, and rely on that. It'd > be better not to do things that are allowed to confuse ACPI. None of that should matter for post-snapshot-restore processing. The boot kernel interacts with ACPI when the system wakes up; the restored kernel is handed an already-running BIOS, which it should do its best to reinitialize from the existing hardware state. > > possible to enter hibernation without using it. The system administrator > > should be able to choose which of S4 or S5 gets used for _any_ poweroff, > > regardless of whether it's to start hibernating. > > But ... why? What value would users see from that? > > We do have /sys/power/disk today, but that's only for > hibernation. (And it's a bit confusing, too.) Yes. I'm proposing that it be generalized. (And it should be renamed, too -- that's a separate issue.) I'm also pointing out that the policy choice decided by the contents of /sys/power/disk comes into play during steps 6-7 above, but not at all in steps 1-5. Hence any associated software structures should explicitly be connected only with steps 6 and 7. And since normal shutdown ought to have its own analog of steps 6 and 7, the same software structures should be used there. Hence naming them "hibernation_ops" isn't a good idea. > I think it also assumes more intelligence on resume-from-S4 > than Linux has just now, which may partly explain why it > takes so long for swsusp to finish its thing. 
And it may explain some of the strange behavior people sometimes observe when they try to hibernate twice in a row. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
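The separation argued for above, with steps 1-5 independent of the power-down method and one structure covering only steps 6-7, can be sketched as follows. This is a toy model, not kernel code: poweroff_ops, hibernate(), plain_shutdown() and the step names written to the log are all hypothetical, and an empty step 6 for S5 is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure covering only steps 6 and 7. */
struct poweroff_ops {
    const char *name;
    void (*prepare)(char *log, size_t len); /* step 6: prepare devices for wakeup */
    void (*enter)(char *log, size_t len);   /* step 7: power down                 */
};

/* Append "step;" to the log if it still fits (step + ';' + NUL). */
static void log_step(char *log, size_t len, const char *step)
{
    size_t used = strlen(log);

    if (used + strlen(step) + 2 <= len) {
        strcat(log, step);
        strcat(log, ";");
    }
}

static void s4_prepare(char *log, size_t len) { log_step(log, len, "arm-wakeup"); }
static void s4_enter(char *log, size_t len)   { log_step(log, len, "S4"); }
/* Assumption for this sketch: nothing is armed before entering S5. */
static void s5_prepare(char *log, size_t len) { (void)log; (void)len; }
static void s5_enter(char *log, size_t len)   { log_step(log, len, "S5"); }

const struct poweroff_ops poweroff_s4 = { "S4", s4_prepare, s4_enter };
const struct poweroff_ops poweroff_s5 = { "S5", s5_prepare, s5_enter };

/* Steps 1-5 never consult ops; only steps 6-7 do. */
void hibernate(const struct poweroff_ops *ops, char *log, size_t len)
{
    log_step(log, len, "freeze");     /* 1 */
    log_step(log, len, "quiesce");    /* 2 */
    log_step(log, len, "snapshot");   /* 3 */
    log_step(log, len, "reactivate"); /* 4 */
    log_step(log, len, "save");       /* 5 */
    ops->prepare(log, len);           /* 6 */
    ops->enter(log, len);             /* 7 */
}

/* A normal shutdown reuses the same ops, without steps 1-5. */
void plain_shutdown(const struct poweroff_ops *ops, char *log, size_t len)
{
    ops->prepare(log, len);
    ops->enter(log, len);
}
```

In this model the objection raised earlier in the thread, that step 2 may need to know the target state, would show up as an extra argument to the quiesce step; the sketch deliberately omits it to show the proposal in its pure form.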
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 16:08 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern @ 2007-05-05 17:50 ` Rafael J. Wysocki 2007-05-05 21:43 ` Alan Stern 2007-05-07 1:31 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 17:50 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Saturday, 5 May 2007 18:08, Alan Stern wrote: > On Fri, 4 May 2007, David Brownell wrote: > > > > Entering hibernation basically involves these steps: > > > > > > 1. Freeze tasks > > > 2. Quiesce devices and drivers > > > 3. Create snapshot > > > 4. Reactivate devices and drivers > > > 5. Save snapshot to disk > > > 6. Prepare devices for wakeup > > > 7. Power down (ACPI S4 on systems which support it) > > > > > > Leaving hibernation involves a similar sequence which I won't discuss. > > > > > > Notice that steps 1-5 above are _completely_ independent of all issues > > > concerning wakeup devices and S4 vs. S5 vs. whatever. They have to be > > > carried out for hibernation to work, no matter how the system ends up > > > getting shut down. > > > > Not exactly. Step 2 is supposed to be aware of the target state's > > capabilities, including what's wakeup-capable. > > Not true. Step 2 is (or should be) divorced from power-level > considerations. All it needs to do is quiesce things so that a consistent > snapshot can be obtained; changing power levels would take time and > ideally should be avoided. Furthermore, anything done in step 2 should be > reversed in step 4. > > Did you mean to say that Step _6_ is supposed to be aware of the target > state's capabilities? I'll agree to that. 
> > > > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C) > > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state. (Then > > fuzzes the issue in 2.4, but those bits are less relevant here; > > 2.2 also mentions G3 = "Mechanical OFF", which is the only state > > in which machine disassembly/reassembly is expected to be safe. > > Sure. But entering hibernation need not involve putting the system into a > "sleeping" state. Going into G3 should also work for hibernation. > > > ACPI is allowed to distinguish between S4 and S5 in more ways > > than just the power usage. It'd be fair for the AML to store > > state in something that retains power, and rely on that. It'd > > be better not to do things that are allowed to confuse ACPI. > > None of that should matter for post-snapshot-restore processing. The > boot kernel interacts with ACPI when the system wakes up; the restored > kernel is handed an already-running BIOS, which it should do its best to > reinitialize from the existing hardware state. > > > > > possible to enter hibernation without using it. The system administrator > > > should be able to choose which of S4 or S5 gets used for _any_ poweroff, > > > regardless of whether it's to start hibernating. > > > > But ... why? What value would users see from that? > > > > We do have /sys/power/disk today, but that's only for > > hibernation. (And it's a bit confusing, too.) > > Yes. I'm proposing that it be generalized. (And it should be renamed, > too -- that's a separate issue.) > > I'm also pointing out that the policy choice decided by the contents of > /sys/power/disk comes into play during steps 6-7 above, but not at all in > steps 1-5. Hence any associated software structures should explicitly be > connected only with steps 6 and 7. At present, this policy choice does affect the earlier steps too. > And since normal shutdown ought to have its own analog of steps 6 and 7, > the same software structures should be used there. 
Hence naming them > "hibernation_ops" isn't a good idea. > > > > I think it also assumes more intelligence on resume-from-S4 > > than Linux has just now, which may partly explain why it > > takes so long for swsusp to finish its thing. > > And it may explain some of the strange behavior people sometimes observe > when they try to hibernate twice in a row. Yes, this seems to be the case. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 17:50 ` Rafael J. Wysocki @ 2007-05-05 21:43 ` Alan Stern 2007-05-05 22:16 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 21:43 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > I'm also pointing out that the policy choice decided by the contents of > > /sys/power/disk comes into play during steps 6-7 above, but not at all in > > steps 1-5. Hence any associated software structures should explicitly be > > connected only with steps 6 and 7. > > At present, this policy choice does affect the earlier steps too. Isn't this then another aspect of hibernation needing to be fixed? Or is there some genuine reason I'm not aware of that the choice of shutdown method should affect those steps? Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 21:43 ` Alan Stern @ 2007-05-05 22:16 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 22:16 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Saturday, 5 May 2007 23:43, Alan Stern wrote: > On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > > > I'm also pointing out that the policy choice decided by the contents of > > > /sys/power/disk comes into play during steps 6-7 above, but not at all in > > > steps 1-5. Hence any associated software structures should explicitly be > > > connected only with steps 6 and 7. > > > > At present, this policy choice does affect the earlier steps too. > > Isn't this then another aspect of hibernation needing to be fixed? Or is > there some genuine reason I'm not aware of that the choice of shutdown > method should affect those steps? Well, I think it should be fixed, but I'm afraid that'll take a *lot* of time. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-05 16:08 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Alan Stern 2007-05-05 17:50 ` Rafael J. Wysocki @ 2007-05-07 1:31 ` David Brownell 2007-05-07 16:33 ` Alan Stern 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 1:31 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Saturday 05 May 2007, Alan Stern wrote: > On Fri, 4 May 2007, David Brownell wrote: > > Did you mean to say that Step _6_ is supposed to be aware of the target > state's capabilities? I'll agree to that. Yes ... but I don't see why it would be wrong for step 2 either. If the device can't wake from S5, it wouldn't set up with the assumption that was a possibility. > > However, the ACPI spec *does* say up front (2.2 in ACPI 2.0C) > > that S5 == G2 "Soft OFF" is not a "sleeping" (G1) state. (Then > > fuzzes the issue in 2.4, but those bits are less relevant here; > > 2.2 also mentions G3 = "Mechanical OFF", which is the only state > > in which machine disassembly/reassembly is expected to be safe. > > Sure. But entering hibernation need not involve putting the system into a > "sleeping" state. Going into G3 should also work for hibernation. For some definitions of "should"; that's where specs get fuzzy. Since disassembly is allowed in G3, if you swapped a disk that should prevent the system from resuming ... it should force a boot-from-scratch. But if you just swapped a power supply it would probably work OK. > I'm also pointing out that the policy choice decided by the contents of > /sys/power/disk comes into play during steps 6-7 above, but not at all in > steps 1-5. Hence any associated software structures should explicitly be > connected only with steps 6 and 7. The difference between S4 and S5 could matter to step 2 though. 
Perhaps it's not the most likely thing, but certainly avoiding the work to set up wake-from-S4 is reasonable when going to S5. > And since normal shutdown ought to have its own analog of steps 6 and 7, > the same software structures should be used there. Hence naming them > "hibernation_ops" isn't a good idea. That's something of a different stance. And it's untrue for step 6 too ... suspend() and shutdown() differ a lot. Maybe if I saw some details, that would make more sense to me. > > I think it also assumes more intelligence on resume-from-S4 > > than Linux has just now, which may partly explain why it > > takes so long for swsusp to finish its thing. > > And it may explain some of the strange behavior people sometimes observe > when they try to hibernate twice in a row. There's all kinds of bizarreness there. I kind of get the feeling the ACPI folk were so deluged by IRQ and other resource setup issues (the "C" in ACPI) that the power management bits (the "P") didn't get that much attention. As pointed out very recently by Rafael. :) Plus there's the issue that while this thread has touched a lot on ACPI issues and models, Linux must not assume ACPI. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 1:31 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) David Brownell @ 2007-05-07 16:33 ` Alan Stern 2007-05-07 20:49 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-07 16:33 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Linux-pm mailing list, Johannes Berg On Sun, 6 May 2007, David Brownell wrote: > On Saturday 05 May 2007, Alan Stern wrote: > > On Fri, 4 May 2007, David Brownell wrote: > > > > Did you mean to say that Step _6_ is supposed to be aware of the target > > state's capabilities? I'll agree to that. > > Yes ... but I don't see why it would be wrong for step 2 either. The principle of information hiding: If step 2 doesn't _need_ to know the final target state (which it shouldn't!) then we ought not to tell it. > If the device can't wake from S5, it wouldn't set up with the > assumption that was a possibility. But step 2 doesn't set up devices' wakeup functions. It merely quiesces them so the snapshot can be made safely. Then step 4 reactivates the devices, and step 6 takes care of setting up the devices for the final sleep state. > > Sure. But entering hibernation need not involve putting the system into a > > "sleeping" state. Going into G3 should also work for hibernation. > > For some definitions of "should"; that's where specs get fuzzy. > > Since disassembly is allowed in G3, if you swapped a disk that > should prevent the system from resuming ... it should force a > boot-from-scratch. But if you just swapped a power supply it > would probably work OK. Yep. The problem isn't so much in the specs; it's that no one has ever (so far as I know) given a precise definition of what Linux's "hibernate" is supposed to do. Is it supposed to be safe to disassemble a hibernating computer? Is remote wakeup necessarily supported? I've never seen answers to these questions. 
> > I'm also pointing out that the policy choice decided by the contents of > > /sys/power/disk comes into play during steps 6-7 above, but not at all in > > steps 1-5. Hence any associated software structures should explicitly be > > connected only with steps 6 and 7. > > The difference between S4 and S5 could matter to step 2 though. > Perhaps it's not the most likely thing, but certainly avoiding > the work to setup wake-from-S4 is reasonable when going to S5. I don't understand. Step 2 doesn't do the work to set up wake-from-S4; step 6 does. So why should the knowledge of S4 vs. S5 matter to step 2? > > And since normal shutdown ought to have its own analog of steps 6 and 7, > > the same software structures should be used there. Hence naming them > > "hibernation_ops" isn't a good idea. > > That's something of a different stance. And it's untrue for > step 6 too ... suspend() and shutdown() differ a lot. Maybe > if I saw some details, that would make more sense to me. It is true that for G3 type shutdown, step 6 can be empty. We don't need to do anything to the devices or drivers, we just turn off all the power. Still, the empty set _is_ a set. :-) Here's another way to express my ideas: We want to support at least two different kinds of powered-down states: (A) Remote wakeup may be enabled on some devices, there can be a certain power drain on the batteries or power line, it may not be safe to disassemble the machine, etc. (B) Remote wakeup is completely disabled, there is no power drain at all, it is safe to disassemble the machine provided you don't switch components like disks, etc. (With (B) it should always be _physically_ safe to switch disks and other components. Whether it is _logically_ safe depends on what happens the next time you start the machine: Will you try to restore a saved memory image or not? 
This isn't directly related to the nature of the powered-down state except for the obvious fact that you can't restore an image if no image has been saved.) I don't see any reason why (A) and (B) shouldn't both be allowed for hibernate, as in fact they are now by way of /sys/power/disk. And I don't see any reason why they shouldn't both be allowed for normal non-hibernate shutdowns as well. Furthermore, the choice of whether to use (A) or (B) shouldn't matter during steps 1-5 of the hibernate sequence. It should matter during steps 6-7 and during normal shutdown (which doesn't have steps 1-5 since it doesn't save a memory image). > Plus there's the issue that while this thread has touched a lot > on ACPI issues and models, Linux must not assume ACPI. Yes indeed. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 16:33 ` Alan Stern @ 2007-05-07 20:49 ` Pavel Machek 2007-05-07 21:38 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-07 20:49 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg Hi! > It is true that for G3 type shutdown, step 6 can be empty. We don't need > to do anything to the devices or drivers, we just turn off all the power. > Still, the empty set _is_ a set. :-) > > Here's another way to express my ideas: We want to support at least two > different kinds of powered-down states: > > (A) Remote wakeup may be enabled on some devices, there can be > a certain power drain on the batteries or power line, it may > not be safe to disassemble the machine, etc. > > (B) Remote wakeup is completely disabled, there is no power > drain at all, it is safe to disassemble the machine provided > you don't switch components like disks, etc. > ... > > I don't see any reason why (A) and (B) shouldn't both be allowed for > hibernate, as in fact they are now by way of /sys/power/disk. And I don't > see any reason why they shouldn't both be allowed for normal non-hibernate > shutdowns as well. No, sorry, that does not work. Software can't select (A) vs. (B). Only user can, by physically switching real power switch, or by unplugging the machine. And yes, there's documentation about expectations of swsusp, in Doc*/power/swsusp.txt. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 20:49 ` Pavel Machek @ 2007-05-07 21:38 ` Alan Stern 2007-05-08 0:30 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-07 21:38 UTC (permalink / raw) To: Pavel Machek Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg On Mon, 7 May 2007, Pavel Machek wrote: > > (A) Remote wakeup may be enabled on some devices, there can be > > a certain power drain on the batteries or power line, it may > > not be safe to disassemble the machine, etc. > > > > (B) Remote wakeup is completely disabled, there is no power > > drain at all, it is safe to disassemble the machine provided > > you don't switch components like disks, etc. > > > ... > > > > I don't see any reason why (A) and (B) shouldn't both be allowed for > > hibernate, as in fact they are now by way of /sys/power/disk. And I don't > > see any reason why they shouldn't both be allowed for normal non-hibernate > > shutdowns as well. > > No, sorry, that does not work. Software can't select (A) vs. (B). Only > user can, by physically switching real power switch, or by unplugging > the machine. Okay. Then what exactly is the difference between the kind of poweroff we do during hibernate (say with "platform" in /sys/power/disk) and the kind of poweroff we do during a normal system shutdown? > And yes, there's documentation about expectations of swsusp, in > Doc*/power/swsusp.txt. It says this near the start: * If you change * your hardware while system is suspended... well, it was not good idea; * but it will probably only crash. with similar warnings elsewhere. This appears to refer to confusion in the kernel after the image is restored; it doesn't seem to mean that you could damage equipment or electrocute yourself. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...) 2007-05-07 21:38 ` Alan Stern @ 2007-05-08 0:30 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-08 0:30 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Linux-pm mailing list, Johannes Berg Hi! > It says this near the start: > > * If you change > * your hardware while system is suspended... well, it was not good idea; > * but it will probably only crash. > > with similar warnings elsewhere. > > This appears to refer to confusion in the kernel after the image is > restored; it doesn't seem to mean that you could damage equipment or > electrocute yourself. As for electrocution, see the product manual :-). Basically, you have to physically unplug the PC from AC power before you open it. shutdown -h now is _not_ enough. For notebooks, remove the battery, too. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 18:33 ` Alan Stern 2007-05-03 19:47 ` Rafael J. Wysocki @ 2007-05-03 20:33 ` David Brownell 1 sibling, 0 replies; 713+ messages in thread From: David Brownell @ 2007-05-03 20:33 UTC (permalink / raw) To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg On Thursday 03 May 2007, Alan Stern wrote: > First, as mentioned before this issue exists on only a small number of > systems. Second, I have submitted to Greg KH a couple of patches to > maintain persistence of USB devices even when the power sessions are lost > (they're still in his queue so you can't try them out yet). This feature > violates the USB spec and it is potentially dangerous -- users could > easily lose data ... which is why I don't like having it as any kind of option. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 14:00 ` Alan Stern 2007-05-03 17:17 ` Rafael J. Wysocki @ 2007-05-03 20:33 ` David Brownell 2007-05-03 20:51 ` Rafael J. Wysocki 2007-05-04 14:51 ` Alan Stern 2007-05-03 22:18 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek 2 siblings, 2 replies; 713+ messages in thread From: David Brownell @ 2007-05-03 20:33 UTC (permalink / raw) To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek On Thursday 03 May 2007, Alan Stern wrote: > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > same as a normal non-hibernate poweroff? No. One of the differences between ACPI S4 (hibernate) and S5 (poweroff) states is for example how wakeup behaves. Look for example at /proc/acpi/wakeup and see how many devices are listed as "can wake from S5" vs from S4 ... most systems support some S4 events, not so for S5. Another is that S4 can consume more power. (Although I believe I noticed a regression there in recent kernels ... previously I was able to trigger wakeup from hibernation using the RTC, but not with 2.6.21 patches.) Non-ACPI systems can make the same natural distinctions. > We are letting ourselves in for problems if we say that when the snapshot > is restored, devices may or may not need to be reinitialized. We have those problems already. Of course, most of the time S4/hibernate involves device re-init, while S3/STR doesn't. > Drivers > might not be able to tell which, so they would have to reinitialize > regardless, losing any advantage. For those specific devices. Of course, not many drivers are power-aware enough to notice. Most will re-init. On PCs the exceptions are USB and, maybe, network drivers. 
Drivers for embedded platforms more often leverage the "retention" states which don't require complete re-init, since those systems generally don't "hibernate". > Even worse, the device may _appear_ not > to need reinitialization because the firmware (BIOS) has already > initialized it but left it in a state that's useless for the kernel's > purposes. (That's part of the reason why PRETHAW was added.) That's *ALL* of the reason for PRETHAW. I asked the guy who did it. ;) > If the only remaining difference between poweroff for hibernate and normal > poweroff is which wakeup devices will function, then it seems pointless. There's the additional power usage involved in enabling additional wakeup sources, plus the additional system components that are expected (possibly unreasonably!) to work. > Why shouldn't the same devices work for wakeup from hibernate and wakeup > from normal poweroff? You're suggesting Linux not use the S5 state, essentially. So the question is really "why should Linux use S5 (and similar states on non-ACPI systems), instead of disregarding the ACPI spec?" The short answer: having a "true OFF" state is valuable, if for no other reason than to cope with buggy "partial-ON" states like S4. Also, it's not clear that disregarding ACPI's guidance here would be a good thing. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
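A small sketch of the S4-versus-S5 wakeup distinction described above, assuming each wake-capable device advertises the deepest sleep state from which it can still wake the system (which is how ACPI describes wake capability). The device names and numbers in sample_devs are made up for illustration and are not taken from any real /proc/acpi/wakeup listing; the struct and function names are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one wake-capable device. */
struct wake_dev {
	const char *name;
	int deepest_wake_state;	/* e.g. 4 means "can wake from S4 or shallower" */
};

/* Invented sample data, for illustration only. */
const struct wake_dev sample_devs[] = {
	{ "PWRB", 5 },
	{ "LAN0", 4 },
	{ "RTC0", 4 },
	{ "USB0", 3 },
};
const size_t sample_dev_count = sizeof(sample_devs) / sizeof(sample_devs[0]);

/* A device can wake the system from sleep_state if that state is no deeper
 * than the deepest state it supports. */
int can_wake_from(const struct wake_dev *dev, int sleep_state)
{
	assert(dev != NULL);
	return sleep_state <= dev->deepest_wake_state;
}

/* Count how many devices remain usable as wakeup sources in sleep_state. */
size_t count_wake_sources(const struct wake_dev *devs, size_t n, int sleep_state)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (can_wake_from(&devs[i], sleep_state))
			count++;
	return count;
}
```

On the invented table, three devices can wake the system from S4 but only one from S5, which is the shape of the difference being pointed out: entering S5 rather than S4 silently drops wakeup sources.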
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 20:33 ` David Brownell @ 2007-05-03 20:51 ` Rafael J. Wysocki 2007-05-04 14:51 ` Alan Stern 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-03 20:51 UTC (permalink / raw) To: David Brownell Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thursday, 3 May 2007 22:33, David Brownell wrote: > On Thursday 03 May 2007, Alan Stern wrote: > > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > > same as a normal non-hibernate poweroff? > > No. One of the differences between ACPI S4 (hibernate) > and S5 (poweroff) states is for example how wakeup behaves. > Look for example at /proc/acpi/wakeup and see how many > devices are listed as "can wake from S5" vs from S4 ... > most systems support some S4 events, not so for S5. > > Another is that S4 can consume more power. > > (Although I believe I noticed a regression there in recent > kernels ... previously I was able to trigger wakeup from > hibernation using the RTC, but not with 2.6.21 patches.) May I ask you to test a patch (appended)? Rafael --- From: Rafael J. Wysocki <rjw@sisk.pl> In the platform mode of hibernation swsusp calls (indirectly) the function acpi_pm_finish() in the nonerror resume-during-hibernation code paths, which is wrong, because this function effectively aborts the power transition and disables the wake-up capability of devices. Fix it. Remove references to the platform functions from the snapshot restore code path in kernel/power/user.c , since they should not be there. Signed-off-by: Rafael J. 
Wysocki <rjw@sisk.pl>
---
 kernel/power/disk.c |    1 -
 kernel/power/user.c |   15 +++------------
 2 files changed, 3 insertions(+), 13 deletions(-)

Index: linux-2.6.21/kernel/power/disk.c
===================================================================
--- linux-2.6.21.orig/kernel/power/disk.c	2007-05-03 12:24:05.000000000 +0200
+++ linux-2.6.21/kernel/power/disk.c	2007-05-03 14:42:26.000000000 +0200
@@ -195,7 +195,6 @@ int hibernate(void)
 
 	if (in_suspend) {
 		enable_nonboot_cpus();
-		platform_finish();
 		device_resume();
 		resume_console();
 		pr_debug("PM: writing image.\n");
Index: linux-2.6.21/kernel/power/user.c
===================================================================
--- linux-2.6.21.orig/kernel/power/user.c	2007-05-03 12:22:57.000000000 +0200
+++ linux-2.6.21/kernel/power/user.c	2007-05-03 14:40:49.000000000 +0200
@@ -169,7 +169,7 @@ static inline int snapshot_suspend(int p
 	}
 	enable_nonboot_cpus();
  Resume_devices:
-	if (platform_suspend)
+	if (platform_suspend && (!in_suspend || error))
 		platform_finish();
 
 	device_resume();
@@ -179,17 +179,12 @@ static inline int snapshot_suspend(int p
 	return error;
 }
 
-static inline int snapshot_restore(int platform_suspend)
+static inline int snapshot_restore(void)
 {
 	int error;
 
 	mutex_lock(&pm_mutex);
 	pm_prepare_console();
-	if (platform_suspend) {
-		error = platform_prepare();
-		if (error)
-			goto Finish;
-	}
 	suspend_console();
 	error = device_suspend(PMSG_PRETHAW);
 	if (error)
@@ -201,12 +196,8 @@ static inline int snapshot_restore(int p
 
 	enable_nonboot_cpus();
  Resume_devices:
-	if (platform_suspend)
-		platform_finish();
-
 	device_resume();
 	resume_console();
- Finish:
 	pm_restore_console();
 	mutex_unlock(&pm_mutex);
 	return error;
@@ -272,7 +263,7 @@ static int snapshot_ioctl(struct inode *
 			error = -EPERM;
 			break;
 		}
-		error = snapshot_restore(data->platform_suspend);
+		error = snapshot_restore();
 		break;
 
 	case SNAPSHOT_FREE:

^ permalink raw reply [flat|nested] 713+ messages in thread
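The condition the patch introduces in snapshot_suspend() can be checked in isolation. The sketch below lifts it into a standalone predicate; the function names are hypothetical, and it assumes (this is a reading of the swsusp code, not something the patch description states) that in_suspend is nonzero when the snapshot has just been created and zero once the image has been restored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Same expression as the patched "if" at Resume_devices: in
 * snapshot_suspend(). The function name itself is hypothetical.
 */
bool should_call_platform_finish(bool platform_suspend, bool in_suspend, bool error)
{
	return platform_suspend && (!in_suspend || error);
}

/* Walks the cases that matter for the fix. */
void sketch_check_truth_table(void)
{
	/* Snapshot just created, no error: about to power down, so the
	 * platform transition must not be aborted. */
	assert(!should_call_platform_finish(true, true, false));

	/* Snapshot creation failed: undo the platform preparation. */
	assert(should_call_platform_finish(true, true, true));

	/* Back from a restored image: finish the platform transition. */
	assert(should_call_platform_finish(true, false, false));

	/* Platform mode not in use: never call it. */
	assert(!should_call_platform_finish(false, true, true));
	assert(!should_call_platform_finish(false, false, false));
}
```

The only case that changes relative to the old "if (platform_suspend)" is the first one, which matches the patch description: platform_finish() was being called on the non-error path right before power-down, disabling the wakeup capability of devices.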
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 20:33 ` David Brownell 2007-05-03 20:51 ` Rafael J. Wysocki @ 2007-05-04 14:51 ` Alan Stern 2007-05-04 14:56 ` Johannes Berg 2007-05-04 22:00 ` David Brownell 1 sibling, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-04 14:51 UTC (permalink / raw) To: David Brownell Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Thu, 3 May 2007, David Brownell wrote: > On Thursday 03 May 2007, Alan Stern wrote: > > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > > same as a normal non-hibernate poweroff? > > No. One of the differences between ACPI S4 (hibernate) > and S5 (poweroff) states is for example how wakeup behaves. > Look for example at /proc/acpi/wakeup and see how many > devices are listed as "can wake from S5" vs from S4 ... > most systems support some S4 events, not so for S5. > > Another is that S4 can consume more power. You are describing the difference between ACPI S4 and S5, but I was talking about the difference between "normal" poweroff and "hibernate" poweroff. There doesn't seem to be any reason why we must always have hibernate = S4 and normal = S5. > Non-ACPI systems can make the same natural distinctions. On such systems there seems to be even less reason for those equalities (or rather, their analogs). > > We are letting ourselves in for problems if we say that when the snapshot > > is restored, devices may or may not need to be reinitialized. > > We have those problems already. Exactly because we are waffling on this issue. If we settled the matter once and for all (devices must ALWAYS be reinitialized after the snapshot is restored) then we wouldn't have those problems. (We might have other problems though...) 
> > Even worse, the device may _appear_ not > > to need reinitialization because the firmware (BIOS) has already > > initialized it but left it in a state that's useless for the kernel's > > purposes. (That's part of the reason why PRETHAW was added.) > > That's *ALL* of the reason for PRETHAW. I asked the > guy who did it. ;) Well, be fair. If your resume methods had some way to know whether or not a snapshot had just been restored then you wouldn't have needed to add PRETHAW. So another part of the reason is that restore() methods don't take a pm_message_t argument. > > Why shouldn't the same devices work for wakeup from hibernate and wakeup > > from normal poweroff? > > You're suggesting Linux not use the S5 state, essentially. No, I'm suggesting that the user should be able to control whether Linux uses S4 vs. S5 at poweroff time. If the user selected always to use S4 then wakeup devices would function in both hibernation and normal shutdown. If the user selected always to use S5 then wakeup devices would not function in either hibernation or normal shutdown. > So the question is really "why should Linux use S5 (and similar > states on non-ACPI systems), instead of disregarding the ACPI > spec?" > > The short answer: having a "true OFF" state is valuable, if > for no other reason than to cope with buggy "partial-ON" states > like S4. Also, it's not clear that disregarding ACPI's guidance > here would be a good thing. Which part of ACPI's so-called guidance are you referring to? Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:51 ` Alan Stern @ 2007-05-04 14:56 ` Johannes Berg 2007-05-04 20:27 ` Rafael J. Wysocki 2007-05-04 22:00 ` David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-04 14:56 UTC (permalink / raw) To: Alan Stern; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Fri, 2007-05-04 at 10:51 -0400, Alan Stern wrote: > Exactly because we are waffling on this issue. If we settled the matter > once and for all (devices must ALWAYS be reinitialized after the snapshot > is restored) then we wouldn't have those problems. (We might have other > problems though...) From what I've understood so far, ACPI is very unhappy on some machines if you go to S5 after hibernation. I still don't understand why, if the ACPI code would properly re-initialise itself (treat ACPI as a device and apply your "devices must ALWAYS be reinitialized after the snapshot is restored") then this shouldn't be possible to happen. And at that point I agree that the issue becomes completely orthogonal. (btw, it's always possible right now to go to S5 instead of S4 when doing hibernation simply by changing /sys/power/disk to "shutdown") johannes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:56 ` Johannes Berg @ 2007-05-04 20:27 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 20:27 UTC (permalink / raw) To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek On Friday, 4 May 2007 16:56, Johannes Berg wrote: > On Fri, 2007-05-04 at 10:51 -0400, Alan Stern wrote: > > > Exactly because we are waffling on this issue. If we settled the matter > > once and for all (devices must ALWAYS be reinitialized after the snapshot > > is restored) then we wouldn't have those problems. (We might have other > > problems though...) > > From what I've understood so far, ACPI is very unhappy on some machines > if you go to S5 after hibernation. I still don't understand why, if the > ACPI code would properly re-initialise itself (treat ACPI as a device > and apply your "devices must ALWAYS be reinitialized after the snapshot > is restored") then this shouldn't be possible to happen. I agree, and that's why I suspect that the ACPI driver's .resume() routines make some, well, ACPIish assumptions about the resume from hibernation, which is the source of the problem. If we separate the hibernation code from the suspend (s2ram, standby) code completely, this issue will have to be resolved somehow. > And at that point I agree that the issue becomes completely orthogonal. > > (btw, it's always possible right now to go to S5 instead of S4 when > doing hibernation simply by changing /sys/power/disk to "shutdown") That's correct. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:51 ` Alan Stern 2007-05-04 14:56 ` Johannes Berg @ 2007-05-04 22:00 ` David Brownell 2007-05-05 15:49 ` Alan Stern 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-04 22:00 UTC (permalink / raw) To: Alan Stern Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Friday 04 May 2007, Alan Stern wrote: > On Thu, 3 May 2007, David Brownell wrote: > > > On Thursday 03 May 2007, Alan Stern wrote: > > > > > In fact, shouldn't the poweroff at the end of a hibernate be exactly the > > > same as a normal non-hibernate poweroff? > > > > No. One of the differences between ACPI S4 (hibernate) > > and S5 (poweroff) states is for example how wakeup behaves. > > Look for example at /proc/acpi/wakeup and see how many > > devices are listed as "can wake from S5" vs from S4 ... > > most systems support some S4 events, not so for S5. > > > > Another is that S4 can consume more power. > > You are describing the difference between ACPI S4 and S5, but I was > talking about the difference between "normal" poweroff and "hibernate" > poweroff. There doesn't seem to be any reason why we must always have > > hibernate = S4 and normal = S5. What the ACPI spec describes for the "Non-Volatile Sleep" is that either S4 or S5 could match "hibernate" ... but for a software-controlled "poweroff", only S5 is appropriate. That's a reason. Another: pretty much all end-user docs on this stuff match what ACPI says. Lacking compelling reasons to violate specs (like them being clearly broken), I avoid breaking them. > > Non-ACPI systems can make the same natural distinctions. > > On such systems there seems to be even less reason for those equalities > (or rather, their analogs). This is one of those "less is more" things, right? :) People doing embedded designs _like_ their flexibility. 
It's common to have multiple power levels. If you mean that they _could_ give up that flexibility and only use one of those state analogues, yes they could ... but if you mean they'd see that as a Good Thing, I doubt it. > > > > We are letting ourselves in for problems if we say that when the snapshot > > > is restored, devices may or may not need to be reinitialized. > > > > We have those problems already. > > Exactly because we are waffling on this issue. If we settled the matter > once and for all (devices must ALWAYS be reinitialized after the snapshot > is restored) then we wouldn't have those problems. (We might have other > problems though...) We *WOULD* have problems. I guess I don't see why you want to throw away all the work the hardware (and/or software) designers did to ensure that some key devices use a "retention" mode in their S4-analogue state. Me, I always thought that leveraging those retention states was a great way to shrink wakeup times and get more functionality. > > > Even worse, the device may _appear_ not > > > to need reinitialization because the firmware (BIOS) has already > > > initialized it but left it in a state that's useless for the kernel's > > > purposes. (That's part of the reason why PRETHAW was added.) > > > > That's *ALL* of the reason for PRETHAW. I asked the > > guy who did it. ;) > > Well, be fair. If your resume methods had some way to know whether or not > a snapshot had just been restored then you wouldn't have needed to add > PRETHAW. So another part of the reason is that restore() methods don't > take a pm_message_t argument. Well, to be fair he says he didn't even consider such an intrusive change. The entire *reason* was to address that particular issue. Implementation tradeoffs are separate. > > > Why shouldn't the same devices work for wakeup from hibernate and wakeup > > > from normal poweroff? > > > > You're suggesting Linux not use the S5 state, essentially. 
> > No, I'm suggesting that the user should be able to control whether Linux > uses S4 vs. S5 at poweroff time. If the user selected always to use S4 > then wakeup devices would function in both hibernation and normal > shutdown. If the user selected always to use S5 then wakeup devices would > not function in either hibernation or normal shutdown. That's a different suggestion, yes. I'm not sure I see any benefit of that flexibility for "soft off" states though, especially if it made "off" consume more power. > > So the question is really "why should Linux use S5 (and similar > > states on non-ACPI systems), instead of disregarding the ACPI > > spec?" > > > > The short answer: having a "true OFF" state is valuable, if > > for no other reason than to cope with buggy "partial-ON" states > > like S4. Also, it's not clear that disregarding ACPI's guidance > > here would be a good thing. > > Which part of ACPI's so-called guidance are you referring to? Section 2.2 of the spec I looked at, which defines how non-volatile sleep relates to S4 and S5 states, and to the G3 "Mechanical OFF" which could also be entered from either of those by flick'o'switch. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 22:00 ` David Brownell @ 2007-05-05 15:49 ` Alan Stern 2007-05-07 1:10 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 15:49 UTC (permalink / raw) To: David Brownell Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Fri, 4 May 2007, David Brownell wrote: > > You are describing the difference between ACPI S4 and S5, but I was > > talking about the difference between "normal" poweroff and "hibernate" > > poweroff. There doesn't seem to be any reason why we must always have > > > > hibernate = S4 and normal = S5. > > What the ACPI spec describes for the "Non-Volatile Sleep" is > that either S4 or S5 could match "hibernate" ... but for > a software-controlled "poweroff", only S5 is appropriate. > > That's a reason. Another: pretty much all end-user docs > on this stuff match what ACPI says. > > Lacking compelling reasons to violate specs (like them > being clearly broken), I avoid breaking them. Again you misunderstand. I concede that either S4 or S5 is appropriate for "Non-Volatile Sleep" whereas only S5 is appropriate for software-controlled "poweroff". But who says that hibernate has to use "Non-Volatile Sleep" and normal shutdown has to use software-controlled "poweroff"? Why shouldn't the user be able to do it the other way 'round? > > > Non-ACPI systems can make the same natural distinctions. > > > > On such systems there seems to be even less reason for those equalities > > (or rather, their analogs). > > This is one of those "less is more" things, right? :) > > People doing embedded designs _like_ their flexibility. > > It's common to have multiple power levels. If you mean > that they _could_ give up that flexibility and only use > one of those state analogues, yes they could ... 
but if > you mean they'd see that as a Good Thing, I doubt it. No, no! That's not what I mean. I'm proposing that we offer the user _more_ flexibility by giving a choice of power levels. The user should be able to choose whether the system uses "Non-Volatile Sleep" vs. software-controlled "poweroff"; the choice shouldn't be dictated by whether or not the system is entering hibernation. > I guess I don't see why you want to throw away all the > work the hardware (and/or software) designers did to > ensure that some key devices use a "retention" mode > in their S4-analogue state. > > Me, I always thought that leveraging those retention > states was a great way to shrink wakeup times and get > more functionality. I can't imagine why you think I proposed anything along those lines. > > > You're suggesting Linux not use the S5 state, essentially. > > > > No, I'm suggesting that the user should be able to control whether Linux > > uses S4 vs. S5 at poweroff time. If the user selected always to use S4 > > then wakeup devices would function in both hibernation and normal > > shutdown. If the user selected always to use S5 then wakeup devices would > > not function in either hibernation or normal shutdown. > > That's a different suggestion, yes. I'm not sure I see any > benefit of that flexibility for "soft off" states though, > especially if it made "off" consume more power. The benefit is that it allows more devices to function as wakeup sources, right? Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-05 15:49 ` Alan Stern @ 2007-05-07 1:10 ` David Brownell 2007-05-07 18:46 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 1:10 UTC (permalink / raw) To: Alan Stern Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Saturday 05 May 2007, Alan Stern wrote: > But who says that hibernate has to use "Non-Volatile Sleep" and normal > shutdown has to use software-controlled "poweroff"? Why shouldn't the > user be able to do it the other way 'round? Well, the definition of NVS matches hibernation, and the definition of soft-off matches poweroff. > > > No, I'm suggesting that the user should be able to control whether Linux > > > uses S4 vs. S5 at poweroff time. If the user selected always to use S4 > > > then wakeup devices would function in both hibernation and normal > > > shutdown. If the user selected always to use S5 then wakeup devices would > > > not function in either hibernation or normal shutdown. > > > > That's a different suggestion, yes. I'm not sure I see any > > benefit of that flexibility for "soft off" states though, > > especially if it made "off" consume more power. > > The benefit is that it allows more devices to function as wakeup sources, > right? With downsides of "more power consumed during 'off' states" and "invalidating documentation, training, and expectations". This is a case where the fact that something could technically be done doesn't recommend it to me. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 1:10 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) David Brownell @ 2007-05-07 18:46 ` Alan Stern 2007-05-07 21:29 ` Rafael J. Wysocki 2007-05-07 21:43 ` David Brownell 0 siblings, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-07 18:46 UTC (permalink / raw) To: David Brownell Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Sun, 6 May 2007, David Brownell wrote: > On Saturday 05 May 2007, Alan Stern wrote: > > > But who says that hibernate has to use "Non-Volatile Sleep" and normal > > shutdown has to use software-controlled "poweroff"? Why shouldn't the > > user be able to do it the other way 'round? > > Well, the definition of NVS matches hibernation, and > the definition of soft-off matches poweroff. Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec. Here's the story in a nutshell: G3 = "mechanical off" = no wakeup devices are enabled, safe to disassemble G2/S5 = "soft off" = wakeup may be enabled, not safe to disassemble S4 = "non-volatile sleep" = hibernation, memory image is saved S5 = "soft off" = almost the same as S4 except there is no memory image The spec does not explicitly associate S4 with either G2 or G3, and in fact it contains language suggesting very strongly that the system could be in either one. The spec also uses the same name for G2 and for S5, no doubt leading to extra levels of confusion. So there's no question that S4 = NVS = hibernation. But hibernation can involve either G2 or G3. And there's no question (in my mind at least) that normal shutdown should be able to involve either G2/S5 or G3. So although the spec doesn't put things quite this way, we could say: hibernation = S4 = G2/S4 or G3/S4, shutdown = S5 = G2/S5 or G3/S5. Thus the choice between S4 vs. S5 is made at the very start, and steps 1-5 are executed only for S4. The choice between G2 vs. G3 can be (and should be!) 
deferred until steps 6-7. > > > That's a different suggestion, yes. I'm not sure I see any > > > benefit of that flexibility for "soft off" states though, > > > especially if it made "off" consume more power. > > > > The benefit is that it allows more devices to function as wakeup sources, > > right? > > With downsides of "more power consumed during 'off' states" > and "invalidating documentation, training, and expectations". Okay, let's clear up the confusion. The additional flexibility I'm suggesting for "soft off" = G2 states is that we should allow both G2/S4 and G2/S5. They would consume the same amount of power since they are both G2 states; the difference is that G2/S4 involves saving and restoring a memory image and G2/S5 does not. This does not invalidate any documentation or training so far as I know. And as for expectations... That's a little harder. What people _expect_ of Linux and what Linux actually _does_ don't always jibe well, owing to lack of sufficient documentation -- typical of Open Source projects. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 18:46 ` Alan Stern @ 2007-05-07 21:29 ` Rafael J. Wysocki 2007-05-07 22:22 ` Alan Stern 2007-05-07 21:43 ` David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-07 21:29 UTC (permalink / raw) To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg On Monday, 7 May 2007 20:46, Alan Stern wrote: > On Sun, 6 May 2007, David Brownell wrote: > > > On Saturday 05 May 2007, Alan Stern wrote: > > > > > But who says that hibernate has to use "Non-Volatile Sleep" and normal > > > shutdown has to use software-controlled "poweroff"? Why shouldn't the > > > user be able to do it the other way 'round? > > > > Well, the definition of NVS matches hibernation, and > > the definition of soft-off matches poweroff. > > Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec. Here's the story > in a nutshell: > > G3 = "mechanical off" = no wakeup devices are enabled, > safe to disassemble > G2/S5 = "soft off" = wakeup may be enabled, not safe to > disassemble > S4 = "non-volatile sleep" = hibernation, memory image is saved > S5 = "soft off" = almost the same as S4 except there is no > memory image > > The spec does not explicitly associate S4 with either G2 or G3, and in > fact it contains language suggesting very strongly that the system could > be in either one. The spec also uses the same name for G2 and for S5, no > doubt leading to extra levels of confusion. Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1. Moreover, it's reiterated several times in different places that S5 Soft off = G2. > So there's no question that S4 = NVS = hibernation. But hibernation > can involve either G2 or G3. Not according to ACPI. > And there's no question (in my mind at least) that normal shutdown should > be able to involve either G2/S5 or G3. 
So although the spec doesn't put > things quite this way, we could say: > > hibernation = S4 = G2/S4 or G3/S4, > > shutdown = S5 = G2/S5 or G3/S5. > > Thus the choice between S4 vs. S5 is made at the very start, and steps 1-5 > are executed only for S4. The choice between G2 vs. G3 can be (and should > be!) deferred until steps 6-7. The problem is that ACPI insists on treating S4 as a sleeping state. Still, I agree that what we do in steps 1 - 5 should be independent of whether or not we're going to enter S4. Devices should not be suspended before creating the image, because the system is not going to enter any power state *at that time*. There seems to be no reason whatsoever for putting devices in low power states for creating the hibernation image. > > > > That's a different suggestion, yes. I'm not sure I see any > > > > benefit of that flexibility for "soft off" states though, > > > > especially if it made "off" consume more power. > > > > > > The benefit is that it allows more devices to function as wakeup sources, > > > right? > > > > With downsides of "more power consumed during 'off' states" > > and "invalidating documentation, training, and expectations". > > Okay, let's clear up the confusion. The additional flexibility I'm > suggesting for "soft off" = G2 states is that we should allow both G2/S4 > and G2/S5. They would consume the same amount of power since they are > both G2 states; the difference is that G2/S4 involves saving and restoring > a memory image and G2/S5 does not. There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to avoid confusion. That's why I said that what we want to call 'hibernation' is and will probably always be different from an ACPI transition to S4 (at least until we make a bootloader capable of reading suspend images and ACPI-aware). Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 21:29 ` Rafael J. Wysocki @ 2007-05-07 22:22 ` Alan Stern 2007-05-07 22:47 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-07 22:22 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Mon, 7 May 2007, Rafael J. Wysocki wrote: > > G3 = "mechanical off" = no wakeup devices are enabled, > > safe to disassemble > > G2/S5 = "soft off" = wakeup may be enabled, not safe to > > disassemble > > S4 = "non-volatile sleep" = hibernation, memory image is saved > > S5 = "soft off" = almost the same as S4 except there is no > > memory image > > > > The spec does not explicitly associate S4 with either G2 or G3, and in > > fact it contains language suggesting very strongly that the system could > > be in either one. The spec also uses the same name for G2 and for S5, no > > doubt leading to extra levels of confusion. > > Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1. > Moreover, it's reiterated several times in different places that > S5 Soft off = G2. More confusion in the spec... It describes two different kinds of S4 states! I was talking about "S4 Non-Volatile Sleep", defined on p.20 just above Table 2-1. The text says this: The machine will then enter the S4 state. When the system leaves the Soft Off or Mechanical Off state,... That's a pretty clear indication that S4-NVS involves G2 or G3. You're talking about "S4 Sleeping State", defined on p.22, section 2.4. Evidently these two "S4" states are quite different. > The problem is that ACPI insists on treating S4 as a sleeping state. Section 2.4 is rather confusing. What I gather is that S4 and S5 are essentially the same except for the presence or absence of a stored memory snapshot. And yet S4 counts as a sleeping state while S5 doesn't. What's the explanation for that? 
> Still, I agree that what we do in steps 1 - 5 should be independent of > whether or not we're going to enter S4. Devices should not be > suspended before creating the image, because the system is not going to > enter any power state *at that time*. There seems to be no reason whatsoever > for putting devices in low power states for creating the hibernation image. Agreed. > There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to > avoid confusion. Except for the text on p.20. > That's why I said that what we want to call 'hibernation' is and will probably > always be different from an ACPI transition to S4 (at least until we make a > bootloader capable of reading suspend images and ACPI-aware). In what sense is the boot kernel different from a "bootloader"? It certainly is capable of reading suspend images and is ACPI-aware. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 22:22 ` Alan Stern @ 2007-05-07 22:47 ` Rafael J. Wysocki 2007-05-08 14:56 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-07 22:47 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Tuesday, 8 May 2007 00:22, Alan Stern wrote: > On Mon, 7 May 2007, Rafael J. Wysocki wrote: > > > > G3 = "mechanical off" = no wakeup devices are enabled, > > > safe to disassemble > > > G2/S5 = "soft off" = wakeup may be enabled, not safe to > > > disassemble > > > S4 = "non-volatile sleep" = hibernation, memory image is saved > > > S5 = "soft off" = almost the same as S4 except there is no > > > memory image > > > > > > The spec does not explicitly associate S4 with either G2 or G3, and in > > > fact it contains language suggesting very strongly that the system could > > > be in either one. The spec also uses the same name for G2 and for S5, no > > > doubt leading to extra levels of confusion. > > > > Well, it's quite clearly stated in 4.5 and in 15 that S4 belongs to G1. > > Moreover, it's reiterated several times in different places that > > S5 Soft off = G2. > > More confusion in the spec... It describes two different kinds of S4 > states! > > I was talking about "S4 Non-Volatile Sleep", defined on p.20 just above > Table 2-1. The text says this: > > The machine will then enter the S4 state. When the system > leaves the Soft Off or Mechanical Off state,... > > That's a pretty clear indication that S4-NVS involves G2 or G3. > > You're talking about "S4 Sleeping State", defined on p.22, section 2.4. > Evidently these two "S4" states are quite different. > > > The problem is that ACPI insists on treating S4 as a sleeping state. > > Section 2.4 is rather confusing. What I gather is that S4 and S5 are > essentially the same except for the presence or absence of a stored > memory snapshot. 
And yet S4 counts as a sleeping state while S5 doesn't. > What's the explanation for that? As far as I understand it, for S4 the platform provides a means for verifying if the hardware wasn't changed too much while the system was "sleeping" (via the NVS memory region). > > Still, I agree that what we do in steps 1 - 5 should be independent of > > whether or not we're going to enter S4. Devices should not be > > suspended before creating the image, because the system is not going to > > enter any power state *at that time*. There seems to be no reason whatsoever > > for putting devices in low power states for creating the hibernation image. > > Agreed. > > > > There's nothing like G2/S4 in ACPI and we shouldn't refer to such a notion to > > avoid confusion. > > Except for the text on p.20. Yes, this is very confusing. I think what they wanted to say there is that the image restore could in principle happen when the system is started after being in a "power off" state. In that case, however, it wouldn't be known if it's safe to restore the image and continue, because the hardware might have changed. For this reason, a special "sleeping" state is needed such that when leaving it, the PM software can detect any (substantial) hardware changes before even loading the entire image. > > That's why I said that what we want to call 'hibernation' is and will probably > > always be different from an ACPI transition to S4 (at least until we make a > > bootloader capable of reading suspend images and ACPI-aware). > > In what sense is the boot kernel different from a "bootloader"? It > certainly is capable of reading suspend images and is ACPI-aware. The boot loader uses the BIOS to read from disks and it can avoid initializing ACPI. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 22:47 ` Rafael J. Wysocki @ 2007-05-08 14:56 ` Alan Stern 2007-05-08 19:59 ` Rafael J. Wysocki ` (2 more replies) 0 siblings, 3 replies; 713+ messages in thread From: Alan Stern @ 2007-05-08 14:56 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Tue, 8 May 2007, Rafael J. Wysocki wrote: > As far as I understand it, for S4 the platform provides a means for verifying > if the hardware wasn't changed too much while the system was "sleeping" (via > the NVS memory region). Rereading p.20, it appears to go the other way: The system checks for hardware changes when booting from Soft Off. Or perhaps it always checks. I guess there aren't supposed to be any hardware changes while in S4, since then it's not safe to disassemble the machine. Sounds a lot like USB's power sessions... > Yes, this is very confusing. I think what they wanted to say there is that the > image restore could in principle happen when the system is started after being > in a "power off" state. In that case, however, it wouldn't be known if it's > safe to restore the image and continue, because the hardware might have > changed. For this reason, a special "sleeping" state is needed such that when > leaving it, the PM software can detect any (substantial) hardware changes > before even loading the entire image. And apparently the bootloader is not expected to restore the memory image if the hardware has changed too much. So here's the current state of my understanding of ACPI: S4 is the lowest-power Sleep state. RAM is not powered, the OS has stored a non-volatile memory image somewhere, and some ACPI state is maintained. S5 is misnamed, in that it isn't really a Sleep state at all -- it's an Off state. In fact, it is the state the computer enters when you first plug it in (or insert the battery). 
If the OS stores a memory image and then switches to S5, at reboot the bootloader will probably try to restore it. (That's what p.20 says.) And if the user unplugs the computer (removes the battery) while it is in S4, then upon replugging the computer will enter S5. Thus, when waking from either S4 or S5 the bootloader will try to restore an image if one can be found (and if the hardware hasn't changed too much and if the user doesn't abort the restore). I've never encountered any documentation saying that you shouldn't unplug the computer while it's in hibernation. It doesn't look like you would lose much by doing so, except that perhaps not as many wakeup devices are functional in S5 as in S4. Now as for how all this relates to Linux: What we do for hibernation is not an exact match for either S4 or S5. It may be closest to S4, but we don't use a bootloader. Instead the boot kernel does some sort of ACPI reset and restores the memory image all by itself. Whatever ACPI state information may be saved in the image is not accessible to the boot kernel. Conversely, the information about whether we booted from S4 or from S5 is lost when the image overwrites the boot kernel. As a result, hibernation is capable of using either S4 or S5 -- as it must be, since the user could always unplug the computer while it's in S4 -- although perhaps when using S4 it manages to confuse ACPI somewhat through not matching the spec's expectations. What do the differences between S4 and S5 amount to? As far as I can tell, they look like this: ACPI expects there to be a memory image in S4. In S5 there may or may not be an image. ACPI expects that when resuming from S4, the kernel will continue using some preserved ACPI state. It expects that when starting from S5, the kernel will need to reinitialize pretty much all the ACPI state. S4 involves a larger power consumption and may allow for more wakeup devices than S5. And how do these relate to Linux? 
In fact, ACPI has no way of knowing whether or not there is an image. The kernel is perfectly free to do whatever it wants. The boot kernel can't make much use of the state preserved by ACPI because it doesn't have access to the image kernel's records. It needs to reinitialize ACPI no matter what. Consequently the restored kernel cannot use any preserved ACPI state, since this state gets wiped out by the boot kernel. Information about hardware changes might be available to the boot kernel, which could in principle then decide not to restore the image. It's not clear that this would be a good idea. In any case, ACPI is limited to knowledge about devices on the motherboard -- it knows nothing about hotplugged devices, which makes the information less useful. Hibernation allows the user to choose whether to go to S4 or S5 by means of /sys/power/disk. Therefore the user gets to decide how the power-consumption vs. wakeup-functionality tradeoff should be made. In short, the boot kernel should do whatever it needs to in order to make ACPI happy. This might involve telling ACPI that it has successfully resumed from S4, even though the boot kernel is unaware of system state at the start of hibernation. In fact, the boot kernel has to take care of all this before it even knows whether a valid image exists in the swap partition. Putting this together, it says that there should be no impediment to doing a fresh boot from S4; i.e., not restoring a memory image but simply letting the boot kernel continue on with a normal startup. The corollary is that there should be no impediment to entering S4 during a normal shutdown. From the user's point of view, the differences between S4 and S5 amount to just these: power consumption and availability of wakeup devices. (Perhaps also the presence of a blinking LED -- but in my experience the blinking LED indicates STR, not hibernation.) In the end, this is nothing more than the usual tradeoff between power usage and functionality.
We give the user a chance to decide how this tradeoff should go when entering hibernation. Why not also give the user a chance to decide the tradeoff during normal shutdown? Yes, it violates the spec in the sense that we would be entering S4 without saving a memory image. But we _already_ violate the spec by not using a bootloader to restore the image. I don't see this as being any worse. Finally, what about non-ACPI systems? Basically this boils down to two choices: Should a memory image be stored? How much power/wakeup-functionality should the system consume/provide while it is down? The first choice is decided by the user, by either entering hibernation or shutting down. Why shouldn't the second also be decided by the user? Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
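The two independent choices laid out in the message above (whether a memory image is stored, and which platform state the machine enters) can be sketched as a small Python model. This is only an illustration of the argument, not kernel code: `PlatformState`, `PowerDownPlan`, and `all_plans` are hypothetical names, and the "expectation" check encodes nothing more than the message's own reading of the spec (an image is expected in S4, while S5 "may or may not" have one).

```python
from dataclasses import dataclass
from enum import Enum


class PlatformState(Enum):
    # Descriptions follow the message: S4 draws more power and may allow
    # more wakeup devices than S5.
    S4 = "S4"
    S5 = "S5"


@dataclass(frozen=True)
class PowerDownPlan:
    save_image: bool      # choice 1: hibernate (True) or plain shutdown (False)
    state: PlatformState  # choice 2: power-consumption vs. wakeup tradeoff

    def label(self) -> str:
        kind = "hibernation" if self.save_image else "shutdown"
        return f"{kind} via {self.state.value}"

    def matches_acpi_expectation(self) -> bool:
        # Assumption taken from the message: ACPI expects an image in S4;
        # in S5 an image may or may not be present.
        return self.save_image or self.state is PlatformState.S5


def all_plans():
    """Enumerate the four combinations the message argues should be allowed."""
    return [PowerDownPlan(img, st) for img in (True, False) for st in PlatformState]


if __name__ == "__main__":
    for plan in all_plans():
        print(plan.label(), "| matches spec expectation:", plan.matches_acpi_expectation())
```

Under this model only "shutdown via S4" departs from what the spec expects, which is the case the message proposes to allow anyway.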
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-08 14:56 ` Alan Stern @ 2007-05-08 19:59 ` Rafael J. Wysocki 2007-05-08 21:26 ` Alan Stern 2007-05-09 8:17 ` Pavel Machek 2007-05-09 19:35 ` David Brownell 2 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-08 19:59 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Tuesday, 8 May 2007 16:56, Alan Stern wrote: > On Tue, 8 May 2007, Rafael J. Wysocki wrote: > > > As far as I understand it, for S4 the platform provides a means for verifying > > if the hardware wasn't changed too much while the system was "sleeping" (via > > the NVS memory region). > > Rereading p.20, it appears to go the other way: The system checks for > hardware changes when booting from Soft Off. Nope. That's clarified later on. Please read Section 15, "Waking and Sleeping" (it's short ;-)), in particular 15.3.3. > Or perhaps it always checks. I guess there aren't supposed to be any > hardware changes while in S4, since then it's not safe to disassemble the > machine. That's correct, and that's why the hardware signature in FACS is needed for S4 (according to the spec), while it's not needed for the wake up from "Soft Off" (S5). > Sounds a lot like USB's power sessions... Well, not exactly that. The hardware signature in FACS only covers some "essential" hardware (I'm not sure what that is, probably depends on the platform design). > > Yes, this is very confusing. I think what they wanted to say there is that the > > image restore could in principle happen when the system is started after being > > in a "power off" state. In that case, however, it wouldn't be known if it's > > safe to restore the image and continue, because the hardware might have > > changed. 
For this reason, a special "sleeping" state is needed such that when > > leaving it, the PM software can detect any (substantial) hardware changes > > before even loading the entire image. > > And apparently the bootloader is not expected to restore the memory image > if the hardware has changed too much. Yes. > So here's the current state of my understanding of ACPI: > > S4 is the lowest-power Sleep state. RAM is not powered, the OS > has stored a non-volatile memory image somewhere, and some ACPI > state is maintained. That's correct, AFAICS. > S5 is misnamed, in that it isn't really a Sleep state at all -- > it's an Off state. In fact, it is the state the computer enters > when you first plug it in (or insert the battery). Yes. > If the OS stores a memory image and then switches to S5, at reboot the > bootloader will probably try to restore it. (That's what p.20 says.) That may happen. The bootloader will probably check whether there's an image and, if it's there, it will try to compare the hardware signature in the image with the one in FACS. If the test is passed, it will attempt to restore the image (this is illustrated in the picture in 15.3.3, BTW). > And if the user unplugs the computer (removes the battery) while it is in > S4, then upon replugging the computer will enter S5. Thus, when waking > from either S4 or S5 the bootloader will try to restore an image if one > can be found (and if the hardware hasn't changed too much and if the user > doesn't abort the restore). That's correct. > I've never encountered any documentation saying that you shouldn't unplug > the computer while it's in hibernation. It doesn't look like you would > lose much by doing so, except that perhaps not as many wakeup devices are > functional in S5 as in S4. > > Now as for how all this relates to Linux: > > What we do for hibernation is not an exact match for either S4 or S5. It > may be closest to S4, but we don't use a bootloader.
Instead the boot > kernel does some sort of ACPI reset and restores the memory image all by > itself. Whatever ACPI state information may be saved in the image is not > accessible to the boot kernel. In principle, it could be, but we don't use it in the boot kernel. > Conversely, the information about whether we booted from S4 or from S5 > is lost when the image overwrites the boot kernel. Yes. > As a result, hibernation is capable of using either S4 or S5 -- as it must > be, since the user could always unplug the computer while it's in S4 -- > although perhaps when using S4 it manages to confuse ACPI somewhat through > not matching the spec's expectations. > > What do the differences between S4 and S5 amount to? As far as I can > tell, they look like this: > > ACPI expects there to be a memory image in S4. In S5 there > may or may not be an image. > > ACPI expects that when resuming from S4, the kernel will > continue using some preserved ACPI state. It expects that > when starting from S5, the kernel will need to reinitialize > pretty much all the ACPI state. > > S4 involves a larger power consumption and may allow for > more wakeup devices than S5. > > And how do these relate to Linux? > > In fact, ACPI has no way of knowing whether or not there is an > image. The kernel is perfectly free to do whatever it wants. > > The boot kernel can't make much use of the state preserved by > ACPI because it doesn't have access to the image kernel's > records. It needs to reinitialize ACPI no matter what. To be precise, it usually needs to initialize ACPI to read the image (drivers use ACPI to some extent). In principle we could make it behave as though ACPI were not compiled in and read the image while being in that state. Then, it could use the ACPI state information contained in the image (it would have to be pointed to by the image header, but that's easy). 
> Consequently the restored kernel cannot use any preserved ACPI > state, since this state gets wiped out by the boot kernel. > Information about hardware changes might be available to the > boot kernel, which could in principle then decide not to restore > the image. It's not clear that this would be a good idea. In > any case, ACPI is limited to knowledge about devices on the > motherboard -- it knows nothing about hotplugged devices, which > makes the information less useful. > > Hibernation allows the user to choose whether to go to S4 or S5 > by means of /sys/power/disk. Therefore the user gets to decide > how the power-consumption vs. wakeup-functionality tradeoff > should be made. > > In short, the boot kernel should do whatever it needs to in order to make > ACPI happy. This might involve telling ACPI that it has successfully > resumed from S4, even though the boot kernel is unaware of system state at > the start of hibernation. In fact, the boot kernel has to take care of > all this before it even knows whether a valid image exists in the swap > partition. > > Putting this together, it says that there should be no impediment to doing > a fresh boot from S4; i.e., not restoring a memory image but simply > letting the boot kernel continue on with a normal startup. The corollary > is that there should be no impediment to entering S4 during a normal > shutdown. > > From the user's point of view, the differences between S4 and S5 amount to > just these: power consumption and availability of wakeup devices. > (Perhaps also the presence of a blinking LED -- but in my experience the > blinking LED indicates STR, not hibernation.) In the end, this is nothing > more than the usual tradeoff between power usage and functionality. > > We give the user a chance to decide how this tradeoff should go when > entering hibernation. Why not also give the user a chance to decide the > tradeoff during normal shutdown? 
> > Yes, it violates the spec in the sense that we would be entering S4 > without saving a memory image. But we _already_ violate the spec by not > using a bootloader to restore the image. I don't see this as being any > worse. > > > Finally, what about non-ACPI systems? Basically this boils down to two > choices: > > Should a memory image be stored? > > How much power/wakeup-functionality should the system > consume/provide while it is down? > > The first choice is decided by the user, by either entering hibernation or > shutting down. Why shouldn't the second also be decided by the user? I generally agree. Moreover, it doesn't seem to be necessary to assume that the image should be created and saved *after* we've put devices into low power states and prepared ACPI for the power transition. I think it's equally possible to create and save the image *before* the power transition is initiated. Greetings, Rafael > > Alan Stern > > > -- If you don't have the time to read, you don't have the time or the tools to write. - Stephen King ^ permalink raw reply [flat|nested] 713+ messages in thread
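The restore decision discussed in the message above (look for an image, compare its hardware signature with the one in FACS, restore only if they match and the user does not abort) can be sketched as follows. This is a minimal Python illustration of the flow as the thread describes it, not an implementation of any real bootloader or of the ACPI spec; the function name and its parameters are hypothetical, and signatures are modeled as plain integers.

```python
from typing import Optional


def should_restore_image(
    image_signature: Optional[int],
    facs_signature: int,
    user_aborted: bool = False,
) -> bool:
    """Decide whether to restore a saved memory image at boot.

    image_signature: hardware signature stored with the image, or None if
                     no image was found (hypothetical representation).
    facs_signature:  hardware signature currently reported in FACS.
    user_aborted:    True if the user cancelled the restore.
    """
    if image_signature is None:
        # No image: continue with a normal startup.
        return False
    if image_signature != facs_signature:
        # Hardware changed "too much": restoring is not considered safe.
        return False
    return not user_aborted


if __name__ == "__main__":
    print(should_restore_image(0x1234, 0x1234))  # matching signatures
    print(should_restore_image(0x1234, 0x9999))  # hardware changed
    print(should_restore_image(None, 0x1234))    # no image present
```

Note that nothing in this check depends on whether the machine was in S4 or S5 beforehand, which is the point raised in the thread about the user unplugging the computer while it is in S4.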
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-08 19:59 ` Rafael J. Wysocki @ 2007-05-08 21:26 ` Alan Stern 0 siblings, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-05-08 21:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Tue, 8 May 2007, Rafael J. Wysocki wrote: > On Tuesday, 8 May 2007 16:56, Alan Stern wrote: > > On Tue, 8 May 2007, Rafael J. Wysocki wrote: > > > > > As far as I understand it, for S4 the platform provides a means for verifying > > > if the hardware wasn't changed too much while the system was "sleeping" (via > > > the NVS memory region). > > > > Rereading p.20, it appears to go the other way: The system checks for > > hardware changes when booting from Soft Off. > > Nope. That's clarified later on. Please read Section 15, "Waking and > Sleeping" (it's short ;-)), in particular 15.3.3. You're right. It says specifically that when booting from an S4 state, the bootloader compares the signature in the NVS image with the hardware signature in the BIOS's FACS table. (Although Figure 15-5 makes no mention of different pathways for S4 and S5.) Does the Linux boot kernel actually do the comparison? Chapter 15 doesn't seem to take into account the possibility that the computer might be unplugged after entering S4. It talks about the next wakeup being a wake from S4 -- although the actions of the BIOS are supposed to be the same when waking from S4 or booting from S5. In either case the BIOS runs the POST and initializes the ACPI tables. Only the actions of the bootloader are different. So how is the bootloader supposed to know whether it is booting from S4 or S5? Does it just assume that the presence of a valid NVS image indicates an S4 boot, even though it may really be booting from S5? > > Sounds a lot like USB's power sessions... > > Well, not exactly that.
The hardware signature in FACS only covers some > "essential" hardware (I'm not sure what that is, probably depends on the > platform design). 15.1.4.1 says: A change in hardware configuration is defined to be any change in the platform hardware that would cause the platform to fail when trying to restore the S4 context; this hardware is normally limited to boot devices. For example, changing the graphics adapter or hard disk controller while in the S4 state should cause the hardware signature to change. On the other hand, removing or adding a PC Card device from a PC Card slot should not cause the hardware signature to change. Take it for what it's worth. > > The boot kernel can't make much use of the state preserved by > > ACPI because it doesn't have access to the image kernel's > > records. It needs to reinitialize ACPI no matter what. > > To be precise, it usually needs to initialize ACPI to read the image (drivers > use ACPI to some extent). In principle we could make it behave as though > ACPI were not compiled in and read the image while being in that state. > Then, it could use the ACPI state information contained in the image > (it would have to be pointed to by the image header, but that's easy). In other words, make the boot kernel act as a bootloader. Isn't this likely to cause problems? There must be plenty of systems that won't work properly without ACPI. Certainly there are reported cases of IRQ routing being wrong (and also cases where it is wrong only when ACPI _is_ in use). > I generally agree. > > Moreover, it doesn't seem to be necessary to assume that the image should > be created and saved *after* we've put devices into low power states and > prepared ACPI for the power transition. I think it's equally possible to > create and save the image *before* the power transition is initiated. Possible and desirable, both. Okay, so the two of us are in agreement. I don't know about anyone else, though... 
:-) Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-08 14:56 ` Alan Stern 2007-05-08 19:59 ` Rafael J. Wysocki @ 2007-05-09 8:17 ` Pavel Machek 2007-05-09 15:21 ` Alan Stern 2007-05-09 19:35 ` David Brownell 2 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-09 8:17 UTC (permalink / raw) To: Alan Stern; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg Hi! > We give the user a chance to decide how this tradeoff should go when > entering hibernation. Why not also give the user a chance to decide the > tradeoff during normal shutdown? > > Yes, it violates the spec in the sense that we would be entering S4 > without saving a memory image. I think you already replied to yourself :-). There are more reasons, like us getting useless code paths to debug. So far you have demonstrated that S4-on-shutdown is probably possible, and while violating specs, it should probably work. What do you expect now? Me jumping with joy and implementing S4-on-shutdown because it should be possible? Now... if you feel very strongly about S4-on-shutdown, you may try to create a patch. If it is not-too-ugly, and if it is really good for something, we may merge it. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-09 8:17 ` Pavel Machek @ 2007-05-09 15:21 ` Alan Stern 0 siblings, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-05-09 15:21 UTC (permalink / raw) To: Pavel Machek; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg On Wed, 9 May 2007, Pavel Machek wrote: > Hi! > > > We give the user a chance to decide how this tradeoff should go when > > entering hibernation. Why not also give the user a chance to decide the > > tradeoff during normal shutdown? > > > > Yes, it violates the spec in the sense that we would be entering S4 > > without saving a memory image. > > I think you already replied to yourself :-). Yes -- but going to S5 during hibernation (which is what "echo shutdown >/sys/power/disk" does, right?) also violates the spec. So I don't feel too guilty about this. > There are more reasons, like us getting useless code paths to > debug. So far you have demonstrated that S4-on-shutdown is probably > possible, and while violating specs, it should probably work. > > What do you expect now? Me jumping with joy and implementing > S4-on-shutdown because it should be possible? Actually all I wanted was someone to look over my reasoning and check that it was correct. You and Rafael have now done so, thank you. And when I first began contributing to this thread, the main purpose was to point out that hibernation_ops (or anything else related to the shutdown method) should not be involved in the steps responsible for creating and storing the snapshot image. > Now... if you feel very strongly about S4-on-shutdown, you may try to > create a patch. If it is not-too-ugly, and if it is really good for > something, we may merge it. At some time I might just do that... Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
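The message above refers to selecting the hibernation method with "echo shutdown >/sys/power/disk". As a rough illustration, the contents of that file could be interpreted with a helper like the one below. This is a sketch under stated assumptions: it assumes the file either lists the available modes separated by whitespace with the active one in square brackets, or holds a single bare mode name; the exact format may differ between kernel versions, and `parse_disk_modes` is a hypothetical helper, not an existing API. The function works on a string, so it never touches the real sysfs file.

```python
def parse_disk_modes(text):
    """Return (active_mode, all_modes) for the contents of /sys/power/disk.

    Assumed formats (not verified against any particular kernel version):
      "[platform] shutdown reboot"  -> active mode shown in brackets
      "shutdown"                    -> a single bare mode name
    """
    modes = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        modes.append(token)
    if active is None and len(modes) == 1:
        # Only one mode printed and nothing bracketed: treat it as active.
        active = modes[0]
    return active, modes


if __name__ == "__main__":
    print(parse_disk_modes("[platform] shutdown reboot\n"))
    print(parse_disk_modes("shutdown\n"))
```

In the thread's terms, a "shutdown" selection corresponds to hibernating via S5, while the platform method corresponds to S4; that mapping is taken from the discussion, not from the helper.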
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-08 14:56 ` Alan Stern 2007-05-08 19:59 ` Rafael J. Wysocki 2007-05-09 8:17 ` Pavel Machek @ 2007-05-09 19:35 ` David Brownell 2007-05-09 20:04 ` Alan Stern 2 siblings, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-09 19:35 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Tuesday 08 May 2007, Alan Stern wrote: > So here's the current state of my understanding of ACPI: > > S4 is the lowest-power Sleep state. RAM is not powered, the OS > has stored a non-volatile memory image somewhere, and some ACPI > state is maintained. > > S5 is misnamed, in that it isn't really a Sleep state at all -- > it's an Off state. It's called "Soft Off" ... :) The reason it resembles a sleep state is that various events other than power switches are allowed to wake systems in S5. RTC alarms and keyboard events come to mind as common examples. Agreed that the distinction between S4 and S5 seems too much in the category of "because we said so!" than because of real technical differences (beyond presence/absence of a non-volatile image, and a few additional wakeup event sources). > In fact, it is the state the computer enters > when you first plug it in (or insert the battery). No; again, you're missing the entire point of G3 "mechanical off". When you first plug it in, it's going to be in G3. Then you turn on the power switch. Then you press the "on/off" button. From then on you can use only the "on/off" button, but the system is vampiric ... when off/dead, it can choose to come alive, and is always sucking power/blood at a low level. But the "large red switch" option is available to put the system into G3 ... driving a bloody stake through its heart, so it can't re-activate itself at midnight, and preventing constant power drain. 
> From the user's point of view, the differences between S4 and S5 amount to > just these: power consumption and availability of wakeup devices. And the fact that in S4 there's always a resumable OS image. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-09 19:35 ` David Brownell @ 2007-05-09 20:04 ` Alan Stern 2007-05-09 20:21 ` David Brownell 2007-05-09 21:07 ` Pavel Machek 0 siblings, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-09 20:04 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Wed, 9 May 2007, David Brownell wrote: > > In fact, it is the state the computer enters > > when you first plug it in (or insert the battery). > > No; again, you're missing the entire point of G3 "mechanical off". > > When you first plug it in, it's going to be in G3. Then you turn > on the power switch. Then you press the "on/off" button. > > From then on you can use only the "on/off" button, but the system > is vampiric ... when off/dead, it can choose to come alive, and is > always sucking power/blood at a low level. > > But the "large red switch" option is available to put the system > into G3 ... driving a bloody stake through its heart, so it can't > re-activate itself at midnight, and preventing constant power drain. Sorry. What I meant to say was that S5 is the state the computer enters when you first plug it in and turn on the power switch -- before you press the on/off button. > > From the user's point of view, the differences between S4 and S5 amount to > > just these: power consumption and availability of wakeup devices. > > And the fact that in S4 there's always a resumable OS image. Are you sure? What happens if the OSPM writes a defective, non-resumable OS image and then goes into S4? What happens if the OS writes a resumable OS image and goes into S4, and then the user unplugs the computer, plugs it back in, and turns the power switch on? At that point the system must be in S5 (by definition), but there's still a resumable image. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-09 20:04 ` Alan Stern @ 2007-05-09 20:21 ` David Brownell 2007-05-10 15:17 ` Alan Stern 2007-05-09 21:07 ` Pavel Machek 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-09 20:21 UTC (permalink / raw) To: Alan Stern Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg > > > From the user's point of view, the differences between S4 and S5 amount to > > > just these: power consumption and availability of wakeup devices. > > > > And the fact that in S4 there's always a resumable OS image. > > Are you sure? What happens if the OSPM writes a defective, non-resumable > OS image and then goes into S4? The ACPI spec omits all such error transitions. As well as a fair number of non-error ones ... like how to enter G3. > What happens if the OS writes a resumable OS image and goes into S4, and > then the user unplugs the computer, plugs it back in, and turns the power > switch on? At that point the system must be in S5 (by definition), but > there's still a resumable image. As allowed by the chapter 2 text I pointed out earlier. S4 *always* has such an image. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-09 20:21 ` David Brownell @ 2007-05-10 15:17 ` Alan Stern 0 siblings, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-05-10 15:17 UTC (permalink / raw) To: David Brownell Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, linux-pm, Johannes Berg On Wed, 9 May 2007, David Brownell wrote: > > > > From the user's point of view, the differences between S4 and S5 amount to > > > > just these: power consumption and availability of wakeup devices. > > > > > > And the fact that in S4 there's always a resumable OS image. > > What happens if the OS writes a resumable OS image and goes into S4, and > > then the user unplugs the computer, plugs it back in, and turns the power > > switch on? At that point the system must be in S5 (by definition), but > > there's still a resumable image. > > As allowed by the chapter 2 text I pointed out earlier. > S4 *always* has such an image. So the correct statement is that S4 always has a resumable OS image and S5 may have a resumable image. From a user's point of view that doesn't sound like much of a difference, especially since the image can be successfully restored from either state. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-09 20:04 ` Alan Stern 2007-05-09 20:21 ` David Brownell @ 2007-05-09 21:07 ` Pavel Machek 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-05-09 21:07 UTC (permalink / raw) To: Alan Stern; +Cc: Nigel Cunningham, Pekka Enberg, linux-pm, Johannes Berg Hi! > > > In fact, it is the state the computer enters > > > when you first plug it in (or insert the battery). > > > > No; again, you're missing the entire point of G3 "mechanical off". > > > > When you first plug it in, it's going to be in G3. Then you turn > > on the power switch. Then you press the "on/off" button. > > > > From then on you can use only the "on/off" button, but the system > > is vampiric ... when off/dead, it can choose to come alive, and is > > always sucking power/blood at a low level. > > > > But the "large red switch" option is available to put the system > > into G3 ... driving a bloody stake through its heart, so it can't > > re-activate itself at midnight, and preventing constant power drain. > > Sorry. What I meant to say was that S5 is the state the computer enters > when you first plug it in and turn on the power switch -- before you press > the on/off button. Actually... some machines just power on when you first plug them in, and some other have it configurable in BIOS. For server, you want it to power up after power fail. For home desktop, you definitely want it to stay powered off after power fail. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 18:46 ` Alan Stern 2007-05-07 21:29 ` Rafael J. Wysocki @ 2007-05-07 21:43 ` David Brownell 2007-05-07 22:41 ` Alan Stern 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 21:43 UTC (permalink / raw) To: Alan Stern Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Monday 07 May 2007, Alan Stern wrote: > On Sun, 6 May 2007, David Brownell wrote: > > > On Saturday 05 May 2007, Alan Stern wrote: > > > > > But who says that hibernate has to use "Non-Volatile Sleep" and normal > > > shutdown has to use software-controlled "poweroff"? Why shouldn't the > > > user be able to do it the other way 'round? > > > > Well, the definition of NVS matches hibernation, and > > the definition of soft-off matches poweroff. > > Okay, I read sections 2.2 and 2.4 of the ACPI 3.0 spec. Here's the story > in a nutshell: > > G3 = "mechanical off" = no wakeup devices are enabled, > safe to disassemble > G2/S5 = "soft off" = wakeup may be enabled, not safe to > disassemble > S4 = "non-volatile sleep" = hibernation, memory image is saved > S5 = "soft off" = almost the same as S4 except there is no > memory image This summary suggests there are two S5 states, which I believe is incorrect. G2 is just another name for S5. See Fig 3-1; the ACPI 2.0 spec has the same figure. Also, section 2.2 highlights that after S5 the OS restarts, which it doesn't do from S4 (table 2-1) ... although when it describes S4/NVS it fuzzes that issue by saying the key issue is whether an NVS state file is found and used, not the level of power available. > The spec does not explicitly associate S4 with either G2 or G3, and in > fact it contains language suggesting very strongly that the system could > be in either one. The spec also uses the same name for G2 and for S5, no > doubt leading to extra levels of confusion. Figure 3-1 seemed quite explicit to me ... 
S4 is one of the G1 states, S5 is the only G2 state, and G3 is a different beast. Text elsewhere agrees with that. What's confusing is how it describes NVS/hibernate. It's very explicitly a G1 state. But leaving G2 or G3 can also trigger a resume-from-NVS ... according to the text in 2.2 but not the state diagrams, which don't show entering G3 even cleanly, much less uncleanly (like a neighborhood power failure). Bleech. I think the implication is that going to either G2 or G3 "off" states discards something that a G1 state preserves. But I'd have to search more deeply to see if that's clearly defined. It's suggestive that there are no "_S5D" or "_S5W" methods; such wake events would evidently be managed by BIOS not OSPM. > So there's no question that S4 = NVS = hibernation. But hibernation > can involve either G2 or G3. I suspect there's a reason this part of ACPI is so vague; it may relate to the desire to allow direct BIOS handling of the NVS state. > And there's no question (in my mind at least) that normal shutdown should > be able to involve either G2/S5 or G3. G2/S5, yes ... that can be entered under software control. But by definition, not G3 since it requires a mechanical/manual power switch update. ("Mechanical OFF", or in the spec's example "movement of a large red switch".) > So although the spec doesn't put > things quite this way, we could say: > > hibernation = S4 = G2/S4 or G3/S4, > > shutdown = S5 = G2/S5 or G3/S5. No, you're missing the key "mechanical" red-switch-ish step in G3. G3 *can't* be entered under software control. By definition. It's there for among other things regulatory reasons ... the only power consumed in G3 is from the on-board RTC battery. > > > > That's a different suggestion, yes. I'm not sure I see any > > > > benefit of that flexibility for "soft off" states though, > > > > especially if it made "off" consume more power. > > > > > > The benefit is that it allows more devices to function as wakeup sources, > > > right? 
> > > > With downsides of "more power consumed during 'off' states" > > and "invalidating documentation, training, and expectations". > > Okay, let's clear up the confusion. The additional flexibility I'm > suggesting for "soft off" = G2 states is that we should allow both G2/S4 > and G2/S5. They would consume the same amount of power since they are > both G2 states; the difference is that G2/S4 involves saving and restoring > a memory image and G2/S5 does not. There is no G2/S4 state; it's G1/S4 or G2/S5. And S5 does not involve an NVS file, or it'd be S4. The ACPI spec is sadly vague in those areas, however. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ...)) 2007-05-07 21:43 ` David Brownell @ 2007-05-07 22:41 ` Alan Stern 0 siblings, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-05-07 22:41 UTC (permalink / raw) To: David Brownell Cc: Linux-pm mailing list, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Mon, 7 May 2007, David Brownell wrote: > This summary suggests there are two S5 states, which I believe > is incorrect. G2 is just another name for S5. See Fig 3-1; > the ACPI 2.0 spec has the same figure. > > Also, section 2.2 highlights that after S5 the OS restarts, > which it doesn't do from S4 (table 2-1) ... although when it > describes S4/NVS it fuzzes that issue by saying the key issue > is whether an NVS state file is found and used, not the level > of power available. It also says that the NVS state file is found and used when the system leaves the Soft Off (G2) or Mechanical Off (G3) state. How did it enter either of those states in the first place if S4-NVS is a Sleeping (G1) state? I imagine that business about the OS not restarting from S4-NVS is intended to mean the OS continues from the restored image rather than starting over completely fresh. > Figure 3-1 seemed quite explicit to me ... S4 is one of the G1 > states, S5 is the only G2 state, and G3 is is a different beast. > Text elsewhere agrees with that. Yes, okay. > What's confusing is how it describes NVS/hibernate. It's very > explicitly a G1 state. But leaving G2 or G3 can also trigger > a resume-from-NVS ... according to the text in 2.2 but not the > state diagrams, which don't show entering G3 even cleanly, much > less uncleanly (like a neighborhood power failure). Bleech. You can understand my confusion... > I think the implication is that going to either G2 or G3 "off" > states discards something that a G1 state preserves. But I'd > have to search more deeply to see if that's clearly defined. Or what it is that gets discarded. 
Especially since 2.4 lists only one difference between S5 and S4: whether or not there is a saved image. > I suspect there's a reason this part of ACPI is so vague; > it may relate to the desire to allow direct BIOS handling > of the NVS state. Could be. I wish the spec was more upfront about its vagueness, explaining what has been left out and why instead of just skipping over some things and contradicting itself. > G2/S5, yes ... that can be entered under software control. > > But by definition, not G3 since it requires a mechanical/manual > power switch update. ("Mechanical OFF", or in the spec's example > "movement of a large red switch".) Okay, I understand that now. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
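The reading of the ACPI states that the exchange above converges on can be written down as a small lookup table. This is only a sketch of the thread's interpretation (S4 belongs to G1 and is defined by its saved image, S5 is the single G2 state, G3 cannot be entered by software); the type and field names are invented for illustration and do not come from the ACPI specification or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Global states as named in the discussion (hypothetical enum). */
enum global_state { G1_SLEEPING, G2_SOFT_OFF, G3_MECHANICAL_OFF };

struct power_state {
    const char *name;
    enum global_state global;     /* which global state it belongs to */
    bool software_enterable;      /* can the OS enter it on its own? */
    bool defined_by_saved_image;  /* is a saved memory image part of the definition? */
};

/* The thread's reading, not a normative table. */
static const struct power_state power_states[] = {
    { "S4", G1_SLEEPING,       true,  true  },
    { "S5", G2_SOFT_OFF,       true,  false },
    { "G3", G3_MECHANICAL_OFF, false, false },
};

/* Look up a state by name; returns NULL if the name is unknown. */
static const struct power_state *find_power_state(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(power_states) / sizeof(power_states[0]); i++)
        if (strcmp(power_states[i].name, name) == 0)
            return &power_states[i];
    return NULL;
}
```

Calling find_power_state("G3") and checking software_enterable reproduces the point made above that "mechanical off" needs a physical switch.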
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 14:00 ` Alan Stern 2007-05-03 17:17 ` Rafael J. Wysocki 2007-05-03 20:33 ` David Brownell @ 2007-05-03 22:18 ` Pavel Machek 2007-05-04 14:57 ` Alan Stern 2 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-05-03 22:18 UTC (permalink / raw) To: Alan Stern; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham Hi! > > > Well... the powerdown during hibernation... does not have _anything_ > > > to do with snapshot/restore. It is really a very deep sleep; similar > > > to soft powerdown, but not quite. > > Is this really a good idea? We have no other choice. ACPI spec says we should use S4. > For that matter, what are the differences among the various sorts of > poweroff? > > Which devices remain minimally powered for wakeup purposes? > > Anything else? Blinking moon LED. Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-03 22:18 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Pavel Machek @ 2007-05-04 14:57 ` Alan Stern 2007-05-04 20:50 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-04 14:57 UTC (permalink / raw) To: Pavel Machek; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham On Fri, 4 May 2007, Pavel Machek wrote: > Hi! > > > > > Well... the powerdown during hibernation... does not have _anything_ > > > > to do with snapshot/restore. It is really a very deep sleep; similar > > > > to soft powerdown, but not quite. > > > > Is this really a good idea? > > We have no other choice. ACPI spec says we should use S4. I haven't checked the spec, but I find it hard to believe. What could possibly be wrong with using S5? It works just fine for normal poweroff, with no wakeup devices enabled. Provided you don't enable the wakeup devices during hibernation, why not use S5? > Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS. We do normal powerdown whenever someone shuts off his computer without hibernating. I haven't noticed any ACPI BIOS confusion from that... Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 14:57 ` Alan Stern @ 2007-05-04 20:50 ` Rafael J. Wysocki 2007-05-04 20:49 ` Johannes Berg 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 20:50 UTC (permalink / raw) To: Alan Stern Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Friday, 4 May 2007 16:57, Alan Stern wrote: > On Fri, 4 May 2007, Pavel Machek wrote: > > > Hi! > > > > > > > Well... the powerdown during hibernation... does not have _anything_ > > > > > to do with snapshot/restore. It is really a very deep sleep; similar > > > > > to soft powerdown, but not quite. > > > > > > Is this really a good idea? > > > > We have no other choice. ACPI spec says we should use S4. > > I haven't checked the spec, but I find it hard to believe. What could > possibly be wrong with using S5? It works just fine for normal poweroff, > with no wakeup devices enabled. Provided you don't enable the wakeup > devices during hibernation, why not use S5? I think the problem is the "reinitialize from scratch after the resume" part. If we're waking up from the hibernation, device drivers should reinitialize their devices, but if we're waking up from a suspend (eg. s2ram), it would be wrong to reinitialize, for example, the ACPI subsystem from scratch. Now, the problem is that the drivers (including ACPI drivers) cannot tell whether the resume is from hibernation or from suspend so they try to do something "generic". This may lead to having the system not fully functional after the resume from hibernation if we don't tell the ACPI BIOS that we're hibernating (by entering the S4 state instead of S5). > > Unfortunately if we do normal powerdown, we'll confuse ACPI BIOS. > > We do normal powerdown whenever someone shuts off his computer without > hibernating. I haven't noticed any ACPI BIOS confusion from that... 
In fact, I think, the BIOS isn't confused, but it may preserve some state information that the OS can use later on. By entering S4 we tell the BIOS to tell the "next" kernel that we've hibernated and to preserve some configuration information for it. If this information is not present, our own ACPI drivers get confused during the resume. To prevent this from happening, we need a separate set of hibernation callbacks in device drivers. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:50 ` Rafael J. Wysocki @ 2007-05-04 20:49 ` Johannes Berg 2007-05-04 21:11 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-04 20:49 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Fri, 2007-05-04 at 22:50 +0200, Rafael J. Wysocki wrote: > To prevent this from happening, we need a separate set of hibernation callbacks > in device drivers. You *can* actually do that now with prethaw and all that afaict. But all the more argument for splitting up the callbacks as discussed previously. johannes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 20:49 ` Johannes Berg @ 2007-05-04 21:11 ` Rafael J. Wysocki 2007-05-04 21:23 ` Johannes Berg 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 21:11 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Friday, 4 May 2007 22:49, Johannes Berg wrote: > On Fri, 2007-05-04 at 22:50 +0200, Rafael J. Wysocki wrote: > > > To prevent this from happening, we need a separate set of hibernation callbacks > > in device drivers. > > You *can* actually do that now with prethaw and all that afaict. Actually, prethaw is to prevent drivers loaded before the image is restored from doing unreasonable things. It doesn't have any effect on the drivers' .resume() routines. Besides, if the drivers in question are compiled as modules and not loaded before the image is restored, prethaw doesn't have any effect on them and on their devices at all. ;-) > But all the more argument for splitting up the callbacks as discussed > previously. Yes. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:11 ` Rafael J. Wysocki @ 2007-05-04 21:23 ` Johannes Berg 2007-05-04 21:55 ` Rafael J. Wysocki 2007-05-05 16:15 ` Alan Stern 0 siblings, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-05-04 21:23 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > Actually, prethaw is to prevent drivers loaded before the image is restored > from doing unreasonable things. It doesn't have any effect on the drivers' > .resume() routines. Oh, but it can, you could have a flag in your driver saying "the next resume is after restore" and you set that flag in prethaw. johannes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:23 ` Johannes Berg @ 2007-05-04 21:55 ` Rafael J. Wysocki 2007-05-04 21:54 ` Johannes Berg 2007-05-04 22:12 ` David Brownell 2007-05-05 16:15 ` Alan Stern 1 sibling, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 21:55 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Friday, 4 May 2007 23:23, Johannes Berg wrote: > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > from doing unreasonable things. It doesn't have any effect on the drivers' > > .resume() routines. > > Oh, but it can, you could have a flag in your driver saying "the next > resume is after restore" and you set that flag in prethaw. No, you should have set that flag in .suspend(), really. :-) Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
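The per-driver flag discussed in this exchange can be sketched in user-space C. Following Rafael's remark, the flag is set in the suspend path rather than in prethaw. Everything here is a hypothetical model: the event names, the struct and both functions are made up for illustration and are not the kernel's driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical events: suspend-to-RAM versus freezing for a hibernation snapshot. */
enum demo_pm_event { DEMO_EVENT_SUSPEND, DEMO_EVENT_FREEZE };

struct demo_dev {
    bool resume_after_hibernation; /* remembered from the last suspend call */
    int full_reinit_count;         /* resumes that reinitialized the device */
    int light_resume_count;        /* resumes that only restored state */
};

/* Record which kind of sleep is starting, so resume can tell them apart. */
static void demo_suspend(struct demo_dev *dev, enum demo_pm_event ev)
{
    assert(dev != NULL);
    dev->resume_after_hibernation = (ev == DEMO_EVENT_FREEZE);
}

/* A single resume routine that branches on the remembered flag. */
static void demo_resume(struct demo_dev *dev)
{
    assert(dev != NULL);
    if (dev->resume_after_hibernation) {
        dev->full_reinit_count++;  /* hibernation: reinitialize from scratch */
        dev->resume_after_hibernation = false;
    } else {
        dev->light_resume_count++; /* suspend-to-RAM: lighter-weight path */
    }
}
```

The sketch also shows why the participants call this approach ugly: one resume routine has to carry state across the sleep and guess its context, which separate hibernation callbacks would make explicit.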
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:55 ` Rafael J. Wysocki @ 2007-05-04 21:54 ` Johannes Berg 2007-05-04 22:21 ` Rafael J. Wysocki 2007-05-04 22:12 ` David Brownell 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-05-04 21:54 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Fri, 2007-05-04 at 23:55 +0200, Rafael J. Wysocki wrote: > On Friday, 4 May 2007 23:23, Johannes Berg wrote: > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > > from doing unreasonable things. It doesn't have any effect on the drivers' > > > .resume() routines. > > > > Oh, but it can, you could have a flag in your driver saying "the next > > resume is after restore" and you set that flag in prethaw. > > No, you should have set that flag in .suspend(), really. :-) Yeah, whatever. You can fix the problem but it's ugly. Let's come up with a good way to do the 6 callbacks mentioned in some other thread earlier. johannes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:54 ` Johannes Berg @ 2007-05-04 22:21 ` Rafael J. Wysocki 2007-05-05 15:37 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 22:21 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Friday, 4 May 2007 23:54, Johannes Berg wrote: > On Fri, 2007-05-04 at 23:55 +0200, Rafael J. Wysocki wrote: > > On Friday, 4 May 2007 23:23, Johannes Berg wrote: > > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > > > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > > > from doing unreasonable things. It doesn't have any effect on the drivers' > > > > .resume() routines. > > > > > > Oh, but it can, you could have a flag in your driver saying "the next > > > resume is after restore" and you set that flag in prethaw. > > > > No, you should have set that flag in .suspend(), really. :-) > > Yeah, whatever. You can fix the problem but it's ugly. Let's come up > with a good way to do the 6 callbacks mentioned in some other thread > earlier. This is the plan, but we need to do some preparations. For example, I think, we should introduce some consistent terminology, so that we *always* know what we're talking about. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 22:21 ` Rafael J. Wysocki @ 2007-05-05 15:37 ` Alan Stern 2007-05-05 18:49 ` Rafael J. Wysocki 0 siblings, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 15:37 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > Yeah, whatever. You can fix the problem but it's ugly. Let's come up > > with a good way to do the 6 callbacks mentioned in some other thread > > earlier. > > This is the plan, but we need to do some preparations. > > For example, I think, we should introduce some consistent terminology, so that > we *always* know what we're talking about. A proposal: For suspend-to-RAM we already have suspend() and resume(). At the possible cost of introducing some confusion, I think it makes sense to keep those method names. For hibernation we need these: pre_snapshot() post_snapshot() pre_restore() post_restore() In addition we may want to have early/late variations on these (for use after interrupts have been disabled), which would lead to: pre_snapshot() pre_snapshot_late() post_snapshot_early() post_snapshot() pre_restore() pre_restore_late() post_restore_early() post_restore() Yes, it's a large list... But it seems to be necessary for providing all the information drivers will need. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 15:37 ` Alan Stern @ 2007-05-05 18:49 ` Rafael J. Wysocki 2007-05-05 21:44 ` Alan Stern 2007-05-07 8:51 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg 0 siblings, 2 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 18:49 UTC (permalink / raw) To: Alan Stern Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham On Saturday, 5 May 2007 17:37, Alan Stern wrote: > On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > > > Yeah, whatever. You can fix the problem but it's ugly. Let's come up > > > with a good way to do the 6 callbacks mentioned in some other thread > > > earlier. > > > > This is the plan, but we need to do some preparations. > > > > For example, I think, we should introduce some consistent terminology, so that > > we *always* know what we're talking about. > > A proposal: > > For suspend-to-RAM we already have suspend() and resume(). At the > possible cost of introducing some confusion, I think it makes sense to > keep those method names. I agree. > For hibernation we need these: > > pre_snapshot() > post_snapshot() > pre_restore() > post_restore() > > In addition we may want to have early/late variations on these (for use > after interrupts have been disabled), which would lead to: > > pre_snapshot() > pre_snapshot_late() > post_snapshot_early() > post_snapshot() > pre_restore() > pre_restore_late() > post_restore_early() > post_restore() > > Yes, it's a large list... But it seems to be necessary for providing all > the information drivers will need. I think we may need yet another callback, executed before pre_snapshot() and before we shrink memory during the hibernation, to be used by drivers that need a lot of additional memory in pre_snapshot(). 
Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 18:49 ` Rafael J. Wysocki @ 2007-05-05 21:44 ` Alan Stern 2007-05-05 22:36 ` Rafael J. Wysocki 2007-05-07 8:51 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) Johannes Berg 1 sibling, 1 reply; 713+ messages in thread From: Alan Stern @ 2007-05-05 21:44 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham On Sat, 5 May 2007, Rafael J. Wysocki wrote: > > In addition we may want to have early/late variations on these (for use > > after interrupts have been disabled), which would lead to: > > > > pre_snapshot() > > pre_snapshot_late() > > post_snapshot_early() > > post_snapshot() > > pre_restore() > > pre_restore_late() > > post_restore_early() > > post_restore() > > > > Yes, it's a large list... But it seems to be necessary for providing all > > the information drivers will need. > > I think we may need yet another callback, executed before pre_snapshot() > and before we shrink memory during the hibernation, to be used by drivers > that need a lot of additional memory in pre_snapshot(). pre_snapshot_early() Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
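The callback names proposed so far, including pre_snapshot_early(), could be grouped into one table of function pointers. The sketch below is only an illustration of that idea: the struct name, the callback signature, the demo device and the helper are assumptions, and nothing here is an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device type used only by this sketch. */
struct demo_device {
    int prepared;
    int restored;
};

typedef int (*hib_cb)(struct demo_device *dev);

/* One slot per callback named in the thread; unused slots stay NULL. */
struct hibernation_callbacks {
    hib_cb pre_snapshot_early;  /* before memory is shrunk */
    hib_cb pre_snapshot;
    hib_cb pre_snapshot_late;   /* interrupts disabled */
    hib_cb post_snapshot_early; /* interrupts disabled */
    hib_cb post_snapshot;
    hib_cb pre_restore;
    hib_cb pre_restore_late;    /* interrupts disabled */
    hib_cb post_restore_early;  /* interrupts disabled */
    hib_cb post_restore;
};

/* Call a callback if the driver provided one; a missing callback counts as success. */
static int call_optional(hib_cb cb, struct demo_device *dev)
{
    assert(dev != NULL);
    return cb ? cb(dev) : 0;
}

static int demo_pre_snapshot(struct demo_device *dev)
{
    dev->prepared++;
    return 0;
}

static int demo_post_restore(struct demo_device *dev)
{
    dev->restored++;
    return 0;
}

/* A driver that only cares about two of the nine hooks. */
static const struct hibernation_callbacks demo_callbacks = {
    .pre_snapshot = demo_pre_snapshot,
    .post_restore = demo_post_restore,
};
```

Treating an absent callback as success keeps the list long without forcing every driver to implement all nine entries, which speaks to the "it's a large list" concern.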
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 21:44 ` Alan Stern @ 2007-05-05 22:36 ` Rafael J. Wysocki 2007-05-06 22:01 ` Alan Stern 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-05 22:36 UTC (permalink / raw) To: Alan Stern Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Saturday, 5 May 2007 23:44, Alan Stern wrote:
> On Sat, 5 May 2007, Rafael J. Wysocki wrote:
>
> > > In addition we may want to have early/late variations on these (for use
> > > after interrupts have been disabled), which would lead to:
> > >
> > > pre_snapshot()
> > > pre_snapshot_late()
> > > post_snapshot_early()
> > > post_snapshot()
> > > pre_restore()
> > > pre_restore_late()
> > > post_restore_early()
> > > post_restore()
> > >
> > > Yes, it's a large list... But it seems to be necessary for providing all
> > > the information drivers will need.
> >
> > I think we may need yet another callback, executed before pre_snapshot()
> > and before we shrink memory during the hibernation, to be used by drivers
> > that need a lot of additional memory in pre_snapshot().
>
> pre_snapshot_early()

OK

So, I think the hibernation code ordering should be like this (let's forget
about ACPI for now):

1) tasks are frozen
2) pre_snapshot_early()
3) memory is freed for the snapshot image
4) pre_snapshot()
5) nonboot CPUs are offlined
6) IRQs are disabled
7) pre_snapshot_late()
8) sysdev_pre_snapshot()
9) snapshot image is created
10) sysdev_post_snapshot()
11) post_snapshot_early()
12) IRQs are enabled
13) nonboot CPUs are enabled
14) post_snapshot()
15) snapshot image is saved
16) device_shutdown()
17) system is powered off

Apart from this, we may need notifiers for subsystems that should do something
before the freezing and after the thawing of tasks (like FUSE etc.).

Also, if there's an error, we have to be able to thaw tasks after
post_snapshot() and continue running.

The restore code, IMO, should be like this (again, let's ignore ACPI for now):

1) boot kernel is started, initrd is loaded etc.
2) tasks are frozen
3) snapshot image is loaded
4) pre_restore()
5) nonboot CPUs are offlined
6) IRQs are disabled
7) pre_restore_late()
8) sysdev_pre_restore()
9) boot kernel is replaced with the 'hibernated' kernel
10) sysdev_post_restore()
11) post_restore_early()
12) IRQs are enabled
13) nonboot CPUs are enabled
14) post_restore()
15) tasks are thawed
16) system is running

and we may need a notifier for subsystems that should do something after
tasks have been thawed.

Greetings,
Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 22:36 ` Rafael J. Wysocki @ 2007-05-06 22:01 ` Alan Stern 2007-05-06 22:31 ` Rafael J. Wysocki 2007-05-07 1:37 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell 0 siblings, 2 replies; 713+ messages in thread From: Alan Stern @ 2007-05-06 22:01 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Sun, 6 May 2007, Rafael J. Wysocki wrote:
> > > I think we may need yet another callback, executed before pre_snapshot()
> > > and before we shrink memory during the hibernation, to be used by drivers
> > > that need a lot of additional memory in pre_snapshot().
> >
> > pre_snapshot_early()
>
> OK

I changed my mind -- pre_hibernate() seems like a better name.

There could be a matching post_hibernate(), if anyone finds it necessary. I
considered pre_freeze(), but that's not such a good choice since the
freezer can be used for other things in addition to hibernation.

> So, I think the hibernation code ordering should be like this (let's forget
> about ACPI for now):
>
> 1) tasks are frozen
> 2) pre_snapshot_early()

Or rather: 2) pre_hibernate()

> 3) memory is freed for the snapshot image
> 4) pre_snapshot()
> 5) nonboot CPUs are offlined
> 6) IRQs are disabled
> 7) pre_snapshot_late()
> 8) sysdev_pre_snapshot()
> 9) snapshot image is created
> 10) sysdev_post_snapshot()
> 11) post_snapshot_early()
> 12) IRQs are enabled
> 13) nonboot CPUs are enabled
> 14) post_snapshot()
> 15) snapshot image is saved
> 16) device_shutdown()
> 17) system is powered off
>
> Apart from this, we may need notifiers for subsystems that should do something
> before the freezing and after the thawing of tasks (like FUSE etc.).

Quite so.

> Also, if there's an error, we have to be able to thaw tasks after
> post_snapshot() and continue running.
>
> The restore code, IMO, should be like this (again, let's ignore ACPI for now):
>
> 1) boot kernel is started, initrd is loaded etc.
> 2) tasks are frozen
> 3) snapshot image is loaded
> 4) pre_restore()
> 5) nonboot CPUs are offlined
> 6) IRQs are disabled
> 7) pre_restore_late()
> 8) sysdev_pre_restore()
> 9) boot kernel is replaced with the 'hibernated' kernel
> 10) sysdev_post_restore()
> 11) post_restore_early()
> 12) IRQs are enabled
> 13) nonboot CPUs are enabled
> 14) post_restore()
> 15) tasks are thawed
> 16) system is running
>
> and we may need a notifier for subsystems that should do something after
> tasks have been thawed.

It sounds good to me. Now if only it were possible to get rid of those
pesky sysdevs...

Alan Stern

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-06 22:01 ` Alan Stern @ 2007-05-06 22:31 ` Rafael J. Wysocki 2007-05-07 1:37 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-06 22:31 UTC (permalink / raw) To: Alan Stern Cc: Johannes Berg, Pekka Enberg, linux-pm, Pavel Machek, Nigel Cunningham

On Monday, 7 May 2007 00:01, Alan Stern wrote:
> On Sun, 6 May 2007, Rafael J. Wysocki wrote:
>
> > > > I think we may need yet another callback, executed before pre_snapshot()
> > > > and before we shrink memory during the hibernation, to be used by drivers
> > > > that need a lot of additional memory in pre_snapshot().
> > >
> > > pre_snapshot_early()
> >
> > OK
>
> I changed my mind -- pre_hibernate() seems like a better name.

OK

> There could be a matching post_hibernate(), if anyone finds it necessary. I
> considered pre_freeze(), but that's not such a good choice since the
> freezer can be used for other things in addition to hibernation.

Agreed.

> > So, I think the hibernation code ordering should be like this (let's forget
> > about ACPI for now):
> >
> > 1) tasks are frozen
> > 2) pre_snapshot_early()
>
> Or rather: 2) pre_hibernate()

OK

> > 3) memory is freed for the snapshot image
> > 4) pre_snapshot()
> > 5) nonboot CPUs are offlined
> > 6) IRQs are disabled
> > 7) pre_snapshot_late()
> > 8) sysdev_pre_snapshot()
> > 9) snapshot image is created
> > 10) sysdev_post_snapshot()
> > 11) post_snapshot_early()
> > 12) IRQs are enabled
> > 13) nonboot CPUs are enabled
> > 14) post_snapshot()
> > 15) snapshot image is saved
> > 16) device_shutdown()
> > 17) system is powered off
> >
> > Apart from this, we may need notifiers for subsystems that should do something
> > before the freezing and after the thawing of tasks (like FUSE etc.).
>
> Quite so.
>
> > Also, if there's an error, we have to be able to thaw tasks after
> > post_snapshot() and continue running.
> >
> > The restore code, IMO, should be like this (again, let's ignore ACPI for now):
> >
> > 1) boot kernel is started, initrd is loaded etc.
> > 2) tasks are frozen
> > 3) snapshot image is loaded
> > 4) pre_restore()
> > 5) nonboot CPUs are offlined
> > 6) IRQs are disabled
> > 7) pre_restore_late()
> > 8) sysdev_pre_restore()
> > 9) boot kernel is replaced with the 'hibernated' kernel
> > 10) sysdev_post_restore()
> > 11) post_restore_early()
> > 12) IRQs are enabled
> > 13) nonboot CPUs are enabled
> > 14) post_restore()
> > 15) tasks are thawed
> > 16) system is running
> >
> > and we may need a notifier for subsystems that should do something after
> > tasks have been thawed.
>
> It sounds good to me. Now if only it were possible to get rid of those
> pesky sysdevs...

I think that will be possible over time.

Greetings,
Rafael

^ permalink raw reply [flat|nested] 713+ messages in thread
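A minimal sketch of the snapshot ordering agreed in the messages above, written as plain user-space C rather than kernel code. The step names follow the list (with pre_hibernate() in place of pre_snapshot_early()); the table and the helpers step_index() and runs_with_irqs_off() are hypothetical names introduced only for illustration. It shows that the *_late and *_early callbacks are the ones that fall inside the IRQs-disabled window.

```c
#include <stddef.h>
#include <string.h>

/* Proposed hibernation (snapshot) ordering, one entry per step.
 * Illustrative table only; these are not real kernel symbols. */
static const char *const snapshot_order[] = {
    "freeze_tasks",         /*  1 */
    "pre_hibernate",        /*  2 */
    "free_memory",          /*  3 */
    "pre_snapshot",         /*  4 */
    "offline_nonboot_cpus", /*  5 */
    "disable_irqs",         /*  6 */
    "pre_snapshot_late",    /*  7 */
    "sysdev_pre_snapshot",  /*  8 */
    "create_image",         /*  9 */
    "sysdev_post_snapshot", /* 10 */
    "post_snapshot_early",  /* 11 */
    "enable_irqs",          /* 12 */
    "online_nonboot_cpus",  /* 13 */
    "post_snapshot",        /* 14 */
    "save_image",           /* 15 */
    "device_shutdown",      /* 16 */
    "power_off",            /* 17 */
};

#define N_STEPS (sizeof(snapshot_order) / sizeof(snapshot_order[0]))

/* Return the zero-based position of a step, or -1 if it is unknown. */
int step_index(const char *name)
{
    for (size_t i = 0; i < N_STEPS; i++)
        if (strcmp(snapshot_order[i], name) == 0)
            return (int)i;
    return -1;
}

/* A step runs with interrupts off if it sits strictly between
 * "disable_irqs" and "enable_irqs" in the table. */
int runs_with_irqs_off(const char *name)
{
    int i = step_index(name);

    return i > step_index("disable_irqs") && i < step_index("enable_irqs");
}
```

For example, runs_with_irqs_off("pre_snapshot_late") is true while runs_with_irqs_off("pre_snapshot") is false, which is the distinction the early/late variants are meant to express.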
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) 2007-05-06 22:01 ` Alan Stern 2007-05-06 22:31 ` Rafael J. Wysocki @ 2007-05-07 1:37 ` David Brownell 2007-05-08 2:57 ` Greg KH 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-07 1:37 UTC (permalink / raw) To: linux-pm; +Cc: Nigel Cunningham, Pekka Enberg, Pavel Machek, Johannes Berg On Sunday 06 May 2007, Alan Stern wrote: > It sounds good to me. Now if only it were possible to get rid of those > pesky sysdevs... Other than lack of patches ... is there a reason?? I thought that sysdevs were no longer needed. - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) 2007-05-07 1:37 ` Re: [PATCH] swsusp: do not use pm_ops (was: Re: ..) David Brownell @ 2007-05-08 2:57 ` Greg KH 0 siblings, 0 replies; 713+ messages in thread From: Greg KH @ 2007-05-08 2:57 UTC (permalink / raw) To: David Brownell Cc: Pekka Enberg, linux-pm, Nigel Cunningham, Johannes Berg, Pavel Machek On Sun, May 06, 2007 at 06:37:36PM -0700, David Brownell wrote: > On Sunday 06 May 2007, Alan Stern wrote: > > It sounds good to me. Now if only it were possible to get rid of those > > pesky sysdevs... > > Other than lack of patches ... is there a reason?? > I thought that sysdevs were no longer needed. I would love to get rid of them, patches gladly accepted :) thanks, greg k-h ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-05 18:49 ` Rafael J. Wysocki 2007-05-05 21:44 ` Alan Stern @ 2007-05-07 8:51 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-05-07 8:51 UTC (permalink / raw) To: Rafael J. Wysocki; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Sat, 2007-05-05 at 20:49 +0200, Rafael J. Wysocki wrote: > I think we may need yet another callback, executed before pre_snapshot() > and before we shrink memory during the hibernation, to be used by drivers > that need a lot of additional memory in pre_snapshot(). I'm not sure we really need a callback here for that, your suspend memory allocation chain seemed good enough since most drivers won't actually be using it and it's not a hard requirement. Not that I care much. johannes ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:55 ` Rafael J. Wysocki 2007-05-04 21:54 ` Johannes Berg @ 2007-05-04 22:12 ` David Brownell 2007-05-04 22:31 ` Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: David Brownell @ 2007-05-04 22:12 UTC (permalink / raw) To: linux-pm; +Cc: Johannes Berg, Pekka Enberg, Nigel Cunningham, Pavel Machek On Friday 04 May 2007, Rafael J. Wysocki wrote: > On Friday, 4 May 2007 23:23, Johannes Berg wrote: > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > > from doing unreasonable things. It doesn't have any effect on the drivers' > > > .resume() routines. > > > > Oh, but it can, you could have a flag in your driver saying "the next > > resume is after restore" and you set that flag in prethaw. > > No, you should have set that flag in .suspend(), really. :-) That doesn't work very well. Not only does suspend() not know the target state, but you don't want to trash the controller state if you're getting resumed after some kind of fault in the suspend-to-disk path... I'm hoping that explains the smiley! - Dave ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 22:12 ` David Brownell @ 2007-05-04 22:31 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-04 22:31 UTC (permalink / raw) To: David Brownell Cc: linux-pm, Pekka Enberg, Johannes Berg, Pavel Machek, Nigel Cunningham On Saturday, 5 May 2007 00:12, David Brownell wrote: > On Friday 04 May 2007, Rafael J. Wysocki wrote: > > On Friday, 4 May 2007 23:23, Johannes Berg wrote: > > > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > > > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > > > from doing unreasonable things. It doesn't have any effect on the drivers' > > > > .resume() routines. > > > > > > Oh, but it can, you could have a flag in your driver saying "the next > > > resume is after restore" and you set that flag in prethaw. > > > > No, you should have set that flag in .suspend(), really. :-) > > That doesn't work very well. Not only does suspend() not > know the target state, but you don't want to trash the > controller state if you're getting resumed after some kind > of fault in the suspend-to-disk path... > > I'm hoping that explains the smiley! Yes, among other things (like that passing anything from prethaw to .resume() really doesn't work unless the data are stored in a device ;-)). Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-04 21:23 ` Johannes Berg 2007-05-04 21:55 ` Rafael J. Wysocki @ 2007-05-05 16:15 ` Alan Stern 1 sibling, 0 replies; 713+ messages in thread From: Alan Stern @ 2007-05-05 16:15 UTC (permalink / raw) To: Johannes Berg; +Cc: linux-pm, Pekka Enberg, Nigel Cunningham, Pavel Machek On Fri, 4 May 2007, Johannes Berg wrote: > On Fri, 2007-05-04 at 23:11 +0200, Rafael J. Wysocki wrote: > > > Actually, prethaw is to prevent drivers loaded before the image is restored > > from doing unreasonable things. It doesn't have any effect on the drivers' > > .resume() routines. > > Oh, but it can, you could have a flag in your driver saying "the next > resume is after restore" and you set that flag in prethaw. You're both wrong. PRETHAW is to prevent drivers present in the image from doing reasonable-but-wrong things (because they were misled by actions taken by the boot kernel or the BIOS before the image was restored). It gives the boot kernel's driver a chance to put the device in a state which won't be misleading. And while you could have a flag in your driver saying "the next resume is after restore", setting it during PRETHAW would accomplish nothing. PRETHAW occurs immediately before the image is restored, which means the flag would get overwritten by the contents of the image. Alan Stern ^ permalink raw reply [flat|nested] 713+ messages in thread
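The point about the flag being overwritten can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: struct drv_state and all function names here are hypothetical, and a memcpy() stands in for the boot kernel's memory being replaced by the hibernation image.

```c
#include <string.h>

/* Hypothetical driver state living in kernel memory. */
struct drv_state {
    int resume_after_restore; /* the flag discussed above */
    int hw_config;            /* stands in for real saved settings */
};

/* Snapshot time: the hibernating kernel's driver state goes into the image. */
struct drv_state take_snapshot(const struct drv_state *live)
{
    return *live;
}

/* Boot kernel, PRETHAW: the flag is set in the boot kernel's own memory. */
void prethaw_set_flag(struct drv_state *boot_state)
{
    boot_state->resume_after_restore = 1;
}

/* Restoring the image replaces the memory contents wholesale. */
void restore_image(struct drv_state *memory, const struct drv_state *image)
{
    memcpy(memory, image, sizeof(*memory));
}

/* Returns the flag value that the restored driver's resume path would see. */
int flag_seen_by_resume(void)
{
    struct drv_state hibernated = { .resume_after_restore = 0, .hw_config = 42 };
    struct drv_state image = take_snapshot(&hibernated);
    struct drv_state memory = { 0 }; /* boot kernel's copy of the driver state */

    prethaw_set_flag(&memory);      /* set immediately before the restore ... */
    restore_image(&memory, &image); /* ... and lost when the image lands */
    return memory.resume_after_restore;
}
```

In this model the flag set during PRETHAW never reaches the resumed driver, because the restored image carries the value the flag had at snapshot time.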
* Re: Re: [PATCH] swsusp: do not use pm_ops (was: Re: suspend2 merge (was: Re: CFS and suspend2: hang in atomic copy)) 2007-05-02 9:16 ` Pavel Machek 2007-05-02 9:25 ` Johannes Berg @ 2007-05-02 13:43 ` Rafael J. Wysocki 1 sibling, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-02 13:43 UTC (permalink / raw) To: Pavel Machek; +Cc: Johannes Berg, Pekka Enberg, linux-pm, Nigel Cunningham On Wednesday, 2 May 2007 11:16, Pavel Machek wrote: > Hi! > > > > Well, having a look on the ACPI spec I'm thinking that what we're trying to do > > > with this patch is actually wrong. > > > > No idea :) > > > > > Instead, we should rip off all of the invocations of pm_ops->whatever() from > > > the hibernation code paths (with the below exceptions) and *if* the platform > > > method is to be used, call pm_ops to make the system go down, in the following > > > way: > > > 1) call pm_ops->prepare(PM_SUSPEND_DISK) > > > 2) suspend devices (ie. call device_suspend() etc.) > > > 3) call pm_ops->enter(PM_SUSPEND_DISK) > > > and if that *fails* (ie. pm_ops->enter() returns): > > > 4) call pm_ops->finish(PM_SUSPEND_DISK) > > > 5) halt the system > > > > Can we still split that off to another method so we don't use pm_ops? No > > matter how we invoke hibernation_ops or in what order, imho we shouldn't > > use pm_ops. > > Well... the powerdown during hibernation... does not have _anything_ > to do with snapshot/restore. Agreed. > It is really a very deep sleep; similar to soft powerdown, but not quite. Yeah, not quite. For example, we may want to use some devices for waking up the system, but with the current code it's impossible, because pm_ops->finish() disables this capability of devices. I think we shouldn't confuse the quiescing of devices before the image creation with a power transition. This is not a power transition, since it's not completed by calling pm_ops->enter(). 
Instead, we kinda-sorta abort it with pm_ops->finish() which confuses the heck out of the ACPI firmware (please see my reply to Alexey in the same thread for a detailed analysis). > So this usage of pm_ops seems ok. To me, it doesn't. These are the main problems I see with it: 1) device_suspend() should be called before the _PTS method is executed (IMO it's correct not to execute _PTS at all if we don't want to do a real power transition) 2) The _GTS method shouldn't be executed in acpi_pm_prepare(), but instead should be executed in acpi_pm_enter(), right before the transition is completed 3) The _BFS method shouldn't be executed in the resume-during-hibernation code path 4) The wake-up capability of devices should be enabled before we execute pm_ops->enter() and shouldn't be enabled before the image creation (what for?). 5) The first part of 4) requires that the transition be started over after the image has been saved. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
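The platform power-down sequence quoted near the top of this message (prepare, suspend devices, enter, and finish plus halt only if enter returns) can be sketched in user-space C as follows. Everything here is an illustrative assumption: struct platform_ops, the demo_* callbacks and the trace buffer are hypothetical, and treating a failed prepare() as "go straight to halt" is a guess that the thread does not state.

```c
#include <string.h>

/* Hypothetical stand-in for the platform hooks discussed above. */
struct platform_ops {
    int  (*prepare)(void);
    int  (*enter)(void);  /* on real hardware this returns only on failure */
    void (*finish)(void);
};

static char trace[128];  /* comma-separated record of the steps taken */
static int enter_result; /* 0: power-down worked, nonzero: enter() came back */

static void note(const char *step)
{
    if (trace[0] != '\0')
        strncat(trace, ",", sizeof(trace) - strlen(trace) - 1);
    strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

static int  demo_prepare(void)    { note("prepare"); return 0; }
static int  demo_enter(void)      { note("enter"); return enter_result; }
static void demo_finish(void)     { note("finish"); }
static void suspend_devices(void) { note("device_suspend"); }
static void halt_system(void)     { note("halt"); }

/* Steps 1-5 of the quoted sequence; returns the trace of what ran. */
const char *power_down(const struct platform_ops *ops)
{
    trace[0] = '\0';
    if (ops->prepare() == 0) {
        suspend_devices();
        if (ops->enter() == 0)
            return trace; /* simulated success: the machine is now off */
        ops->finish();    /* enter() returned, so the transition failed */
    }
    halt_system();        /* fall back to a plain halt */
    return trace;
}

/* Convenience wrapper: run the sequence with enter() succeeding or failing. */
const char *demo_power_down(int enter_fails)
{
    static const struct platform_ops ops = {
        demo_prepare, demo_enter, demo_finish
    };

    enter_result = enter_fails ? -1 : 0;
    return power_down(&ops);
}
```

Note that finish() appears only on the failure path, which matches the objection raised above: calling it after a merely quiesced (not completed) transition is what confuses the firmware.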
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:18 ` Linus Torvalds 2007-04-25 22:27 ` Nigel Cunningham @ 2007-04-25 22:42 ` Pavel Machek 2007-04-25 22:58 ` Linus Torvalds 2007-04-25 22:43 ` Chuck Ebbert ` (2 subsequent siblings) 4 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 22:42 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > Not the same... but they are still related. "freeze" (for atomic > > snapshot) is actually subset of "suspend"... freeze needs DMAs off and > > saved state, and you need DMAs off and saved state for "suspend". > > THEY HAVE ABSOLUTELY NOTHING IN COMMON! > > Nobody in their right mind thinks that "disable DMA" and "suspend" are > similar operations. > > > So it is actually correct to do "suspend" when you want "freeze"; it > > is just slow. That's why they only differ in parameter these days. > > It is *not* correct to "suspend" when you want "freeze". Example? > I don't understand how you can even *claim* something like that. > > Here's a trivial example: > - SCSI disk > > Tell me, what does "suspend" do, and what does "freeze" (snapshot) do? Suspend syncs caches/spins down. Freeze does not do anything. That's okay, I keep claiming "freeze" is subset of "suspend". Can you name device where that is not true? Remember we do suspend(PMSG_FREEZE) atomic snapshot resume() write snapshot. So if we do spin the scsi disk down, nothing really bad happens, we'll just spin it up. (So scsi disk is not example I want. Spinning down scsi disk on freeze is slow and stupid, but it is not incorrect). Yes, if I'd known what I know now, drivers would have suspend/freeze/thaw/resume methods. We probably still can do that change. Unfortunately, it needs driver authors to understand 4 hooks (not 2) and do the right thing.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:42 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek @ 2007-04-25 22:58 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 22:58 UTC (permalink / raw) To: Pavel Machek Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > Suspend syncs caches/spins down. Freeze does not do anything. > > That's okay, I keep claiming "freeze" is subset of "suspend". Can you > name device where that is not true? Sure. Like just about any PCI device that doesn't do things on its own. A "freeze" does nothing at all, or perhaps shuts down the reader side (for something like a network controller). A "suspend" does "write D3 to the suspend register". Absolutely zero in common. > Remember we do > > suspend(PMSG_FREEZE) > atomic snapshot > resume() > write snapshot. AND THAT IS STUPID. It mixes up "suspend()" and creating a snapshot in ways that are totally idiotic. There is nothing in common! Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:18 ` Linus Torvalds 2007-04-25 22:27 ` Nigel Cunningham 2007-04-25 22:42 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Pavel Machek @ 2007-04-25 22:43 ` Chuck Ebbert 2007-04-25 23:00 ` Linus Torvalds 2007-04-25 22:49 ` Pavel Machek 2007-04-25 22:57 ` Alan Cox 4 siblings, 1 reply; 713+ messages in thread From: Chuck Ebbert @ 2007-04-25 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Linus Torvalds wrote: > Tell me, what does "suspend" do, and what does "freeze" (snapshot) do? > > And name *one* thing that have in common. > > I'll tell you: Nada. Zero. Zilch. Nothing. > > "Freeze" for a disk is a total no-op. There is no DMA, there is no > nothing. In contrast, "suspend" for a disk is a totally valid operation. > Freeze is a subset of suspend, isn't it? (It might be an empty subset in some cases.) ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:43 ` Chuck Ebbert @ 2007-04-25 23:00 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 23:00 UTC (permalink / raw) To: Chuck Ebbert Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Chuck Ebbert wrote: > > Freeze is a subset of suspend, isn't it? (It might be an empty subset > in some cases.) NO IT IS NOT! Yes, you are parroting Pavel, but he can say it a million times, and it's *still* not true. That's like saying "read() is a subset of write(), isn't it?" On many devices, they share some of the setup, like writing the same sector registers with the same values. Does that make them subsets of each other? Or does it mean that they *may* use some of the same common helper functions for some devices? Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:18 ` Linus Torvalds ` (2 preceding siblings ...) 2007-04-25 22:43 ` Chuck Ebbert @ 2007-04-25 22:49 ` Pavel Machek 2007-04-25 23:10 ` Linus Torvalds 2007-04-25 22:57 ` Alan Cox 4 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 22:49 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > Not the same... but they are still related. "freeze" (for atomic > > snapshot) is actually subset of "suspend"... freeze needs DMAs off and > > saved state, and you need DMAs off and saved state for "suspend". > > THEY HAVE ABSOLUTELY NOTHING IN COMMON! > > Nobody in their right mind thinks that "disable DMA" and "suspend" are > similar operations. > > > So it is actually correct to do "suspend" when you want "freeze"; it > > is just slow. That's why they only differ in parameter these days. > > It is *not* correct to "suspend" when you want "freeze". > > I don't understand how you can even *claim* something like that. BTW most problems are in thaw/resume functions. Both suspend and freeze are pretty simple, and they both need to save device state. In SCSI disk, it would be nice to save options set by sdparm. And both thaw and resume need to be able to restore the device from both "powered down" and "some state preserved". Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:49 ` Pavel Machek @ 2007-04-25 23:10 ` Linus Torvalds 2007-04-25 23:28 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 23:10 UTC (permalink / raw) To: Pavel Machek Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > > > I don't understand how you can even *claim* something like that. > > BTW most problems are in thaw/resume functions. And do you realize that the thaw/resume functions are totally different too? Or rather, they *would* be, if you allowed them to. For example, for "snapshot + thaw", the _sane_ thing is to actually make the snapshot just throw away all the DMA tables etc, and let the thawing just do a full initialization (as it did on boot). It basically needs to do that anyway, and it simplifies the whole thing (ie you don't even *want* to save things like the DMA command queues etc - the ones that will quite often be stepped on by the final "write snapshot to disk" stuff anyway). For suspend to ram, in contrast, since you *know* that nobody will be touching the hardware, and since the timings are very different anyway (you'd hope that you can resume in a second or two), you'd generally want to keep the DMA engine tables right where they are, and just literally suspend the PCI chip itself. See? Again, *nothing* in common. You think they have things in common just because your whole (incorrect) mindset has _forced_ them to have things in common, because your setup stupidly thinks that "resume" is the same as "thaw", the same way you think "freeze" is the same as "suspend". NEITHER is true. You've _made_ them true in your mind, but there's absolutely zero reason that they *should* be true. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:10 ` Linus Torvalds @ 2007-04-25 23:28 ` Pavel Machek 2007-04-25 23:57 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 23:28 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > > I don't understand how you can even *claim* something like that. > > > > BTW most problems are in thaw/resume functions. > > And do you realize that the thaw/resume functions are totally different > too? > > Or rather, they *would* be, if you allowed them to. > > For example, for "snapshot + thaw", the _sane_ thing is to actually make > the snapshot just throw away all the DMA tables etc, and let the thawing > just do a full initialization (as it did on boot). It basically needs to > do that anyway, and it simplifies the whole thing (ie you don't even > *want* to save things like the DMA command queues etc - the ones that will > quite often be stepped on by the final "write snapshot to disk" stuff > anyway). I'd prefer thaw to be similar to module insert, yes. > For suspend to ram, in contrast, since you *know* that nobody will be > touching the hardware, and since the timings are very different anyway > (you'd hope that you can resume in a second or two), you'd generally want > to keep the DMA engine tables right where they are, and just literally > suspend the PCI chip itself. I'd actually prefer resume to be similar to module insert, too... Do you think that resume is _that_ time critical? > You think they have things in common just because your whole (incorrect) > mindset has _forced_ them to have things in common, because your setup > stupidly thinks that "resume" is the same as "thaw", the same way you > think "freeze" is the same as "suspend". > > NEITHER is true.
You've _made_ them true in your mind, but there's > absolutely zero reason that they *should* be true. [I'd like you to drop me a line saying you understand current design and that it works -- even if it is very inelegant] Now, we can separate suspend/freeze and resume/thaw (with some common helpers). It will speed the code up by avoiding unnecessary operations. It also needs attention from driver writers (ouch). Do we want to do that? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:28 ` Pavel Machek @ 2007-04-25 23:57 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 23:57 UTC (permalink / raw) To: Pavel Machek Cc: Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > > For suspend to ram, in contrast, since you *know* that nobody will be > > touching the hardware, and since the timings are very different anyway > > (you'd hope that you can resume in a second or two), you'd generally want > > to keep the DMA engine tables right where they are, and just literally > > suspend the PCI chip itself. > > I'd actually prefer resume to be similar to module insert, too... Do > you think that resume is _that_ time critical? I think it probably depends on the device, and it should depend on the driver writer how he wants to do it. My _point_ is that there is absolutely zero reason to think that the two events are the same. We *know* that for snapshot+shutdown, we need to actually keep the DMA tables intact *over* the snapshot (because writing out the snapshot may _need_ them). But exactly because we keep them intact, a driver writer may sanely say "I didn't even bother shutting them down, so on thaw, I cannot trust them, so I'll just re-initialize them entirely". In contrast, over suspend-to-ram, it's entirely reasonable to just leave them in memory, and just keep them. There's no *reason* not to. And that's my whole point in this argument: the two paths are fundamentally totally different. You *claim* that "snapshot()" needs to stop DMA etc, but that's simply not true. 
So I claim: - for a lot of devices, it's actually a *lot* easier to just have snapshot not do anything at all, and re-initialize on thaw - for those same devices, for s2ram, since the tables are *safe* and don't get touched by anything else, it's probably easier to just let them be. See? The "it's easier to do X" is a _different_ X for the two cases. So the whole "suspend is a superset of freeze" is simply not true. > [I'd like you to drop me a line saying you understand current design > and that it works -- even if it is very inelegant] I _do_ understand the current design. I just think that it's totally and seriously broken. I've ranted against it before. I think it's stupid to play like you're "suspending" something just to save some state, especially since most users probably don't even *want* to suspend the state, and would quite happily re-initialize the chip instead. And I think it's horrible to have a dynamic flag to tell the difference between two or more state changes that the devices should statically be able to determine. _If_ some driver really does have the same routine, just use the same routine. There are no downsides to splitting them up. > Now, we can separate suspend/freeze and resume/thaw (with some common > helpers). It will speed the code up by avoiding unnecessary > operations. It also needs attention from driver writers (ouch). > > Do we want to do that? I'd personally certainly want to do that. But I want to split up the callers too. Right now we mix those a lot as well. I suspect that would automatically be fixed by just forcing them to separate out (since they now call different functions of the devices), but I'm not 100% sure. There might be other issues. Just as an example: one of the most painful things there is in the suspend sequence is that we shut off the console (because the console device will be suspended in hw, and it's thus not safe to use it over a suspend/resume sequence).
That should just go away entirely for "snapshot()", because there is *never* any excuse for actually turning off the console during a snapshot: even a network console should be entirely functional. Things like that - things that sound like small issues, but that really aren't. (Right now you can enable the "don't disable the console" config option, but since network drivers will actually shut down etc, it just means that you'll have oopses etc if you do, and you have netconsole enabled) Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
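The split argued for in the message above (separate entry points per state change, rather than one routine driven by a dynamic flag) can be sketched in C. This is only an illustration under stated assumptions, not kernel code: `struct toy_nic`, `struct toy_pm_ops`, `toy_pm_call()` and the `nic_*` functions are hypothetical names invented here. The toy driver leaves its DMA ring alone across suspend/resume, does nothing at all on snapshot, and rebuilds the ring from scratch on thaw.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8

/* Hypothetical device state: a DMA descriptor ring plus a power flag. */
struct toy_nic {
    int ring[RING_SIZE];
    int ring_head;
    int powered;
};

/* Hypothetical ops table: one callback per state change, no event flag. */
struct toy_pm_ops {
    int (*suspend)(struct toy_nic *);   /* s2ram: power down, keep tables  */
    int (*resume)(struct toy_nic *);    /* s2ram: power up, tables intact  */
    int (*snapshot)(struct toy_nic *);  /* NULL means "nothing to do"      */
    int (*thaw)(struct toy_nic *);      /* after snapshot: re-initialize   */
};

static int nic_suspend(struct toy_nic *n) { n->powered = 0; return 0; }
static int nic_resume(struct toy_nic *n)  { n->powered = 1; return 0; }

/* Thaw does not trust the old ring; it rebuilds it as at boot time. */
static int nic_thaw(struct toy_nic *n)
{
    for (int i = 0; i < RING_SIZE; i++)
        n->ring[i] = 0;
    n->ring_head = 0;
    n->powered = 1;
    return 0;
}

const struct toy_pm_ops toy_nic_pm_ops = {
    .suspend  = nic_suspend,
    .resume   = nic_resume,
    .snapshot = NULL,        /* snapshot is a no-op for this device */
    .thaw     = nic_thaw,
};

/* Call an optional callback; a missing callback counts as success. */
int toy_pm_call(int (*cb)(struct toy_nic *), struct toy_nic *n)
{
    assert(n != NULL);
    return cb ? cb(n) : 0;
}
```

A driver whose suspend and snapshot handling really are identical could simply point both members at the same function, which is the "just use the same routine" case from the message.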
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:18 ` Linus Torvalds ` (3 preceding siblings ...) 2007-04-25 22:49 ` Pavel Machek @ 2007-04-25 22:57 ` Alan Cox 2007-04-25 23:20 ` Linus Torvalds 4 siblings, 1 reply; 713+ messages in thread From: Alan Cox @ 2007-04-25 22:57 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven > Tell me, what does "suspend" do, and what does "freeze" (snapshot) do? > > And name *one* thing that have in common. Both of them have to ensure you can make a consistent snapshot. Doing that means you've got to be able to define a single "point" at which the snapshot is made and is internally self-consistent. That in both cases tends to mean you've got to ensure nothing occurs which pees on the image while you are making that snapshot (such as outstanding O_DIRECT I/O to user pages). Alan ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 22:57 ` Alan Cox @ 2007-04-25 23:20 ` Linus Torvalds 2007-04-25 23:52 ` Pavel Machek 2007-04-26 0:24 ` Alan Cox 0 siblings, 2 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-25 23:20 UTC (permalink / raw) To: Alan Cox Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, Alan Cox wrote: > > Both of them have to ensure you can make a consistent snapshot. Bzzt. Wrong again. Very much so. STR does not need to "ensure that you have a consistent snapshot". Why? Because there is no _room_ for inconsistency. There's nothing to be "inconsistent with", since any changes to memory (by things like DMA or other setup that happens while the suspend process is going on) is by _definition_ consistent with the resume image (because there is no separate image). > Doing that means you've got to be able to define a single "point" at > which the snapshot is made and is internally self-consistent. That in > both cases tends to mean you've got to ensure nothing occurs which pees > on the image while you are making that snapshot (such as outstanding > O_DIRECT I/O to user pages). Get off the drugs, Alan. There *is* no snapshot with suspend-to-ram. Which is the whole point I'm trying to make! A _lot_ of people are confused about this. With suspend-to-ram, you don't need to do a damn thing to the chip, except suspend it and resume it. There are _zero_ consistency issues. There is no need to freeze anything at any point. You can suspend each device totally independently of all other devices (taking into account things like bus topology, of course), and there is no "atomic" snapshot that needs to ever exist. That's TOTALLY DIFFERENT from "suspend to disk". 
In suspend to disk, you need a completely different kind of mindset, namely you need a single consistent image, where the image is consistent not only with memory, but with all the devices. For example, the whole myth that "freeze" needs to shut off DMA is a total and utter *myth*. It needs nothing of the sort at all. Rather than shut off DMA and try to make the hardware be wevy wevy quiet while it's hunting wabbits, it's a lot easier to just do nothing at all on "freeze", and just make sure that "thaw" will re-initialize the DMA tables entirely! All drivers have code to do that anyway, since that's what you need to do at boot. Notice? Totally different. Absolutely NOTHING in common. Not on a practical plane, and not even conceptually. The current (broken!) implementation has forced a totally idiotic model on things, where instead of snapshotting doing the sane and simple thing, it ends up doing extra work that is totally unnecessary, but *becomes* necessary just because it *also* shares the "resume" path (which should _not_ be the same either!) Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:20 ` Linus Torvalds @ 2007-04-25 23:52 ` Pavel Machek 2007-04-26 0:05 ` Linus Torvalds 2007-04-26 0:24 ` Alan Cox 1 sibling, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-25 23:52 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > Both of them have to ensure you can make a consistent snapshot. > > Bzzt. Wrong again. Very much so. > > STR does not need to "ensure that you have a consistent snapshot". > > Why? Because there is no _room_ for inconsistency. There's nothing to be > "inconsistent with", since any changes to memory (by things like DMA or > other setup that happens while the suspend process is going on) is by > _definition_ consistent with the resume image (because there is no > separate image). Do you propose to keep DMAs running while suspending-to-RAM? That sounds really unsafe; we are shutting down our PCI controllers at that time; doing that while DMAs are running sounds bad. > That's TOTALLY DIFFERENT from "suspend to disk". In suspend to disk, you > need a completely different kind of mindset, namely you need a single > consistent image, where the image is consistent not only with memory, but > with all the devices. > > For example, the whole myth that "freeze" needs to shut off DMA is a total > and utter *myth*. It needs nothing of the sort at all. Rather than shut > off DMA and try to make the hardware be wevy wevy quiet while it's hunting > wabbits, it's a lot easier to just do nothing at all on "freeze", No. Sorry, you are wrong here. Remember that during resume we run freeze() copy old data into memory thaw() . Now, if the old kernel left DMAs running, it could be overwriting the data we are copying in. It is not about DMA tables. 
While resuming, CPU needs to be alone, without interference from DMA engines (or other CPUs), because copying back old image means writing to memory that was not properly allocated. (Now, we could add one more hook, turn_off_dmas_for_copyback(), but that looks like way too many hooks to me. And I'm not comfortable with DMA engines running while I'm trying to copy image. They may be overwriting data I'm trying to copy...) Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:52 ` Pavel Machek @ 2007-04-26 0:05 ` Linus Torvalds 2007-04-26 0:14 ` Pavel Machek 2007-04-26 0:34 ` Linus Torvalds 0 siblings, 2 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 0:05 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > > > Why? Because there is no _room_ for inconsistency. There's nothing to be > > "inconsistent with", since any changes to memory (by things like DMA or > > other setup that happens while the suspend process is going on) is by > > _definition_ consistent with the resume image (because there is no > > separate image). > > Do you propose to keep DMAs running while suspending-to-RAM? What part of "suspend a chip" do you have trouble with? DMA obviously does *not* happen with a suspended device. There's no need to turn DMA even off - it just doesn't happen! > > For example, the whole myth that "freeze" needs to shut off DMA is a total > > and utter *myth*. It needs nothing of the sort at all. Rather than shut > > off DMA and try to make the hardware be wevy wevy quiet while it's hunting > > wabbits, it's a lot easier to just do nothing at all on "freeze", > > No. Sorry, you are wrong here. > > Remember that during resume we run > > freeze() > copy old data into memory > thaw() > > Now, if the old kernel left DMAs running, it could be overwriting > the data we are copying in. The *thaw* needs to happen with devices quiescent. But that sure doesn't have anything to do with the "snapshot()" path. In fact, you'll have rebooted the machine in between. So what does that have to do with "snapshotting"? Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:05 ` Linus Torvalds @ 2007-04-26 0:14 ` Pavel Machek 2007-04-25 23:51 ` David Lang 2007-04-26 0:38 ` Linus Torvalds 2007-04-26 0:34 ` Linus Torvalds 1 sibling, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 0:14 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > > Why? Because there is no _room_ for inconsistency. There's nothing to be > > > "inconsistent with", since any changes to memory (by things like DMA or > > > other setup that happens while the suspend process is going on) is by > > > _definition_ consistent with the resume image (because there is no > > > separate image). > > > > Do you propose to keep DMAs running while suspending-to-RAM? > > What part of "suspend a chip" do you have trouble with? > > DMA obviously does *not* happen with a suspended device. There's no need > to turn DMA even off - it just doesn't happen! Ok, I guess I'll have nightmares of DMA controllers doing DMAs from chips that are no longer there tonight. > > > For example, the whole myth that "freeze" needs to shut off DMA is a total > > > and utter *myth*. It needs nothing of the sort at all. Rather than shut > > > off DMA and try to make the hardware be wevy wevy quiet while it's hunting > > > wabbits, it's a lot easier to just do nothing at all on "freeze", > > > > No. Sorry, you are wrong here. > > > > Remember that during resume we run > > > > freeze() > > copy old data into memory > > thaw() > > > > Now, if the old kernel left DMAs running, it could be overwriting > > the data we are copying in. > > The *thaw* needs to happen with devices quiescent. > > But that sure doesn't have anything to do with the "snapshot()" path. In > fact, you'll have rebooted the machine in between. 
Only the fact that we are currently using same device call during snapshot() and during restore(). We obviously could do _5_ device calls (suspend/resume/freeze/quiesce_disable_dma/thaw) ...but that looks like too many calls to me. > So what does that have to do with "snapshotting"? I'm not comfortable with memory I'm copying changing under my hands because of some DMA. It just looks like asking for trouble. I _think_ we can get away with DMA running during snapshot, because driver may not assume anything about the DMA result before it got completion interrupt, but... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:14 ` Pavel Machek @ 2007-04-25 23:51 ` David Lang 2007-04-26 0:38 ` Linus Torvalds 1 sibling, 0 replies; 713+ messages in thread From: David Lang @ 2007-04-25 23:51 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: >>> Now, if the old kernel left DMAs running, it could be overwriting >>> the data we are copying in. >> >> The *thaw* needs to happen with devices quiescent. >> >> But that sure doesn't have anything to do with the "snapshot()" path. In >> fact, you'll have rebooted the machine in between. > > Only the fact that we are currently using same device call during > snapshot() and during restore(). We obviously could do _5_ device > calls > > (suspend/resume/freeze/quiesce_disable_dma/thaw) > > ...but that looks like too many calls to me. > >> So what does that have to do with "snapshotting"? > > I'm not comfortable with memory I'm copying changing under my hands > because of some DMA. It just looks like asking for trouble. I _think_ > we can get away with DMA running during snapshot, because driver may > not assume anything about the DMA result before it got completion > interrupt, but... The key is that with STR you don't need to copy the memory (it's staying where it is). For STD you need to copy the memory, and there you halt DMA because you need to make an atomic snapshot. David Lang ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:14 ` Pavel Machek 2007-04-25 23:51 ` David Lang @ 2007-04-26 0:38 ` Linus Torvalds 2007-04-26 2:04 ` H. Peter Anvin 1 sibling, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 0:38 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Pavel Machek wrote: > > Ok, I guess I'll have nightmares of DMA controllers doing DMAs from > chips that are no longer there tonight. Umm. Welcome to the 21st century: we don't do that "separate DMA controller" thing any more. All devices do their own DMA. > Only the fact that we are currently using same device call during > snapshot() and during restore(). We obviously could do _5_ device > calls > > (suspend/resume/freeze/quiesce_disable_dma/thaw) > > ...but that looks like too many calls to me. I'd much rather have five or even more functions that each do *one* obvious thing. Think like a device driver writer: would you prefer to just implement five functions that do something very specific that you know trivially how to do ("I know how to disable interrupts and DMA") or would you want to do some high-level operation that you don't even know why the caller asks you to suspend? What does "suspend()" even mean when the caller is just going to wake you up immediately again? Is it performance-critical? Should I tear down all my DMA's? I dunno! In other words, splitting things up actually makes things simpler. That's *doubly* true if you can then give each specific function some really clear goals. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:38 ` Linus Torvalds @ 2007-04-26 2:04 ` H. Peter Anvin 2007-04-26 2:32 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: H. Peter Anvin @ 2007-04-26 2:04 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Linus Torvalds wrote: > > On Thu, 26 Apr 2007, Pavel Machek wrote: >> Ok, I guess I'll have nightmares of DMA controllers doing DMAs from >> chips that are no longer there tonight. > > Umm. Welcome to the 21st century: we don't do that "separate DMA > controller" thing any more. All devices do their own DMA. > That was the 1990s. On a brand new server system: 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1) For better or worse, slave DMA seems to be making a comeback of sorts. Not to mention all kinds of embedded crap^Whardware with optimized DMA engines which look nothing like PCI at all. -hpa ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 2:04 ` H. Peter Anvin @ 2007-04-26 2:32 ` Linus Torvalds 2007-04-26 13:14 ` Alan Cox 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 2:32 UTC (permalink / raw) To: H. Peter Anvin Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Wed, 25 Apr 2007, H. Peter Anvin wrote: > > That was the 1990s. On a brand new server system: > > 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA > Engine (rev b1) > > For better or worse, slave DMA seems to be making a comeback of sorts. > Not to mention all kinds of embedded crap^Whardware with optimized DMA > engines which look nothing like PCI at all. Well, the solution to that tends to be to just leave them be, and hold them on until the very end - and just ignore them (and just make-believe that it's actually the device itself that does the DMA transfer). The PCI spec for controlling DMA is really pretty nasty. You can disable it in the PCI config word, of course, but that usually just messes up the device entirely. So in practice, the way to shut up DMA (regardless of whether it's an internal DMA engine or an external one) is that you just tell the device not to listen any more (for example, for a network controller, the way to make sure it doesn't do DMA is just to make sure that you're not sending any frames, but also that it's not listening to any either)! So whether it's internal to the device, or some "system DMA controller", the sequence for shutting down DMA always ends up being the same: - make sure the host itself doesn't generate any new traffic (eg shut down the send-queue). This is generally a higher-level thing anyway, ie not really a driver decision. 
- the driver needs to tell the hardware to stop listening (ie "stop scanning the command mailboxes" or "stop walking USB command structures" or "stop receiving data") - the driver then needs to wait for the controller to say "ok, I'm idle". because regardless of whether it's the system DMA controller or some on-chip DMA controller, you generally can *not* just say "stop transferring DMA data", because that will generally just lock the chip up or cause other major unhappiness. So I don't think an external DMA controller (like the i8237, ugh!) really _changes_ anything. Except for just the horrible pain of serializing access to them for programming etc horrible resource handling issues, of course (but that's not specific to suspend/resume). Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
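The three-step DMA shutdown sequence listed in the message above (stop new host traffic, tell the hardware to stop listening, then wait for it to report idle) can be sketched as follows. This is a toy model under stated assumptions, not a real driver: `struct toy_dev`, `toy_dev_poll_idle()` and `toy_dev_quiesce()` are hypothetical, and the "hardware" is simulated by letting one in-flight transfer complete per poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a DMA-capable device; every field here is hypothetical. */
struct toy_dev {
    bool tx_queue_open;  /* host may still submit new work             */
    bool hw_listening;   /* hardware still scans mailboxes / receives  */
    int  inflight;       /* transfers the hardware has already started */
};

/* Stand-in for reading a status register: each poll lets one
 * in-flight transfer complete, then reports whether we are idle. */
static bool toy_dev_poll_idle(struct toy_dev *d)
{
    if (d->inflight > 0)
        d->inflight--;
    return d->inflight == 0;
}

/* Returns 0 once the device reports idle, -1 if it never does
 * within max_polls polls. In-flight transfers are never aborted. */
int toy_dev_quiesce(struct toy_dev *d, int max_polls)
{
    assert(d != NULL);

    d->tx_queue_open = false;   /* 1. host generates no new traffic    */
    d->hw_listening = false;    /* 2. hardware stops accepting work    */

    for (int i = 0; i < max_polls; i++)   /* 3. wait for "ok, I'm idle" */
        if (toy_dev_poll_idle(d))
            return 0;

    return -1;
}
```

Note that the sketch only ever drains outstanding work; it has no "abort transfer now" step, matching the point that DMA generally cannot be stopped mid-flight without upsetting the chip.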
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 2:32 ` Linus Torvalds @ 2007-04-26 13:14 ` Alan Cox 2007-04-26 16:02 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Alan Cox @ 2007-04-26 13:14 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven > The PCI spec for controlling DMA is really pretty nasty. You can disable > it in the PCI config word, of course, but that usually just messes up the > device entirely. And some devices ignore it. Some of the older Cyrix stuff I have appears not to care how the master bit is set. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 13:14 ` Alan Cox @ 2007-04-26 16:02 ` Linus Torvalds 0 siblings, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 16:02 UTC (permalink / raw) To: Alan Cox Cc: H. Peter Anvin, Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Alan Cox wrote: > > > The PCI spec for controlling DMA is really pretty nasty. You can disable > > it in the PCI config word, of course, but that usually just messes up the > > device entirely. > > And some devices ignore it. Some of the older Cyrix stuff I have appears > not to care how the master bit is set. I'm not surprised. If the choice is between locking up the PCI bus by hanging the device in endless retries, or just ignoring the bit, I suspect "just ignore it" is actually the better choice. Of course, in a perfect world you'd happily honor it, raise a PCI error, and all is good, but in practice the internal state machine of most non-trivial hardware is simply so complicated that the "abort gracefully" simply isn't an option. The hw people have enough problems in getting things to work when everything is peachy and well, and a lot of companies end up releasing stuff with known errata for even the _normal_ cases, just because they expect software to work around them ("Doctor, doctor, it hurts when I do the documented access!" "You didn't read errata #317, did you? Don't do that, then!") Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:05 ` Linus Torvalds 2007-04-26 0:14 ` Pavel Machek @ 2007-04-26 0:34 ` Linus Torvalds 2007-04-26 20:12 ` Rafael J. Wysocki 1 sibling, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 0:34 UTC (permalink / raw) To: Pavel Machek Cc: Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven

On Wed, 25 Apr 2007, Linus Torvalds wrote:
>
> The *thaw* needs to happen with devices quiescent.

Btw, I sure as hell hope you didn't use "suspend()" for that. You're (again) much better off having a totally separate function that just freezes stuff.

So in the "snapshot+shutdown" path, you should have:

 - prepare_to_snapshot() - allocate memory, and possibly return errors

   We can skip this, if we just make the rule be that any devices that want to support snapshotting must always have the memory required for snapshotting pre-allocated. Most devices really do allocate memory for their state anyway, and the only real reason for the "prepare" stage here is because the final snapshot has to happen with interrupts off, obviously. So *if* we don't need to allocate any memory, and if we don't expect to want to accept some early error case, this is likely useless.

 - snapshot() - actually save device state that is consistent with the memory image at the time. Called with interrupts off, but the device has to be usable both before and afterwards!

And I would seriously suggest that "snapshot()" be documented to not rely on any DMA memory, exactly because the device has to be accessible both before and after (before - because we're running and allocating memory, and after - because we'll be writing things out). But see later:

For the "resume snapshot" path, I would suggest having

 - freeze(): quiesce the device. This literally just does the absolute minimum to make sure that the device doesn't do anything surprising (no interrupts, no DMA, no nothing). For many devices, it's a no-op, even if they can do DMA (eg most disk controllers will do DMA, but only as an actual result of a request, and upper layers will be quiescent anyway, so they do *not* need to disable DMA)

   NOTE! The "freeze()" gets called from the *old* kernel just _before_ a snapshot unpacking!!

 - restart_snapshot() - actually restart the snapshot (and usually this would involve re-setting the device, not so much trying to restore all the saved state. IOW, it's easier to just re-initialize the DMA command queues than to try to make them "atomic" in the snapshot).

   NOTE! This gets called by the *new* kernel _after_ the snapshot resume!

And if you *want* to, I can see that you might want to actually do a "unfreeze()" thing too, and make the actual snapshotting be:

    /* We may not even need this.. */
    for_each_device() {
        err = prepare_to_snapshot();
        if (err)
            return err;
    }

    /* This is the real work for snapshotting */
    cli();
    for_each_device()
        freeze(dev);
    for_each_device()
        snapshot(dev);
    .. snapshot current memory image ..
    for_each_device_depth_first()
        unfreeze(dev);
    sti();

and maybe it's worth it, but I would almost suggest that you just make the rule be that any DMA etc just *has* to be re-initialized by "restart_snapshot()", in which case it's not even necessary to freeze/unfreeze over the device, and "snapshot()" itself only needs to make sure any non-DMA data is safe.

But adding the freeze/unfreeze (which is a no-op for most hardware anyway) might make things easier to think about, so I would certainly not *object* to it, even if I suspect it's not necessary.

Anyway, the restore_snapshot() sequence should be:

    /* Old kernel.. Normal boot, load snapshot image */
    cli();
    for_each_device()
        freeze(dev);
    restore_snapshot_image();
    restore_regs_and_jump_to_image();
    /* noreturn */


    /* New kernel, gets called at the snapshot restore address
     * with interrupts off and devices frozen, and memory image
     * consistent with what it was at "snapshot()" time
     */
    for_each_dev_depth_first()
        restore_snapshot(dev);
    /* And if you want to, just to be "symmetric"

        for_each_dev_depth_first()
            unfreeze(dev)

       although I think you could just make "restore_snapshot()"
       implicitly unfreeze it too..
    */
    sti();
    /* We're up */

and notice how *different* this is from what happens for s2ram. There really isn't anything in common here. Exactly because s2ram simply doesn't _have_ any of the issues with atomic memory images. So s2ram is just

    for_each_dev()
        suspend(dev);
    cli();
    for_each_dev()
        late_suspend(dev);
    .. go to sleep ..
    for_each_dev_depth_first()
        early_resume(dev);
    sti();
    for_each_dev_depth_first()
        resume(dev);

and has none of the "freeze" issues at all.

Doesn't that seem a lot more straightforward? Yes, it's more functions, but each function is a lot more obvious. This follows the unix rule of "do one thing, and do that thing well", instead of trying to make one function do many very different things depending on what you actually want done..

Linus

^ permalink raw reply [flat|nested] 713+ messages in thread
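The difference between the snapshot sequence and the s2ram sequence described in the message above can be made concrete with a small trace-producing sketch. It is a toy under stated assumptions, not kernel code: the two devices "a" and "b", the helpers `note()`, `mark()`, `walk_forward()`, `walk_reverse()`, and the entry points `run_snapshot()` and `run_s2ram()` are all hypothetical. In particular, the message does not say what order the "depth first" iterator visits devices in; the sketch simply assumes it is the reverse of the plain iterator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NDEV 2

/* Hypothetical device list, in the order the plain iterator visits it. */
static const char *dev_name[NDEV] = { "a", "b" };
static char trace[512];

/* Append "step:device" to the trace. */
static void note(const char *step, int dev)
{
    char buf[64];

    assert(dev >= 0 && dev < NDEV);
    snprintf(buf, sizeof(buf), "%s%s:%s",
             trace[0] ? " " : "", step, dev_name[dev]);
    strncat(trace, buf, sizeof(trace) - strlen(trace) - 1);
}

/* Append a core (non-device) step to the trace. */
static void mark(const char *step)
{
    if (trace[0])
        strncat(trace, " ", sizeof(trace) - strlen(trace) - 1);
    strncat(trace, step, sizeof(trace) - strlen(trace) - 1);
}

static void walk_forward(const char *step)
{
    for (int i = 0; i < NDEV; i++)
        note(step, i);
}

/* ASSUMPTION: "depth first" is modelled as plain reverse order. */
static void walk_reverse(const char *step)
{
    for (int i = NDEV - 1; i >= 0; i--)
        note(step, i);
}

/* Snapshot path: freeze, save device state, copy memory, unfreeze. */
const char *run_snapshot(void)
{
    trace[0] = '\0';
    walk_forward("freeze");
    walk_forward("snapshot");
    mark("copy-image");
    walk_reverse("unfreeze");
    return trace;
}

/* s2ram path: no image, no freeze; just suspend and resume. */
const char *run_s2ram(void)
{
    trace[0] = '\0';
    walk_forward("suspend");
    walk_forward("late_suspend");
    mark("sleep");
    walk_reverse("early_resume");
    walk_reverse("resume");
    return trace;
}
```

Comparing the two traces shows that the paths share no callback names at all, which is the property the message argues the driver interface should reflect.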
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:34 ` Linus Torvalds @ 2007-04-26 20:12 ` Rafael J. Wysocki 0 siblings, 0 replies; 713+ messages in thread From: Rafael J. Wysocki @ 2007-04-26 20:12 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thursday, 26 April 2007 02:34, Linus Torvalds wrote: > > On Wed, 25 Apr 2007, Linus Torvalds wrote: > > > > The *thaw* needs to happen with devices quiescent. > > Btw, I sure as hell hope you didn't use "suspend()" for that. You're > (again) much better off having a totally separate function that just > freezes stuff. > > So in the "snapshot+shutdown" path, you should have: > > - prepare_to_snapshot() - allocate memory, and possibly return errors > > We can skip this, if we just make the rule be that any devices that > want to support snapshotting must always have the memory required for > snapshotting pre-allocated. Most devices really do allocate memory for > their state anyway, and the only real reason for the "prepare" stage > here is becasue the final snapshot has to happen with interrupts off, > obviously. So *if* we don't need to allocate any memory, and if we > don't expect to want to accept some early error case, this is likely > useless. I think we need this. Apparently, some device drivers need as much as 30 meg of RAM at later stages (I don't know why and what for). > - snapshot() - actually save device state that is consistent with the > memory image at the time. Called with interrupts off, but the device > has to be usable both before and afterwards! 
> > And I would seriously suggest that "snapshot()" be documented to not rely > on any DMA memory, exactly because the device has to be accessible both > before and after (before - because we're running and allocating memory, > and after - because we'll be writing thigns out). But see later: Please note that some drivers are compiled as modules and they may deal with uninitialized hardware (or worse, with the hardware initialized by the BIOS in a crazy way) after restart_snapshot(). It may be better for them to actually quiesce the devices here too to avoid problems after restart_snapshot() . > For the "resume snapshot" path, I would suggest having > > - freeze(): quiesce the device. This literally just does the absolute > minimum to make sure that the device doesn't do anything surprising (no > interrupts, no DMA, no nothing). For many devices, it's a no-op, even > if they can do DMA (eg most disk controllers will do DMA, but only as > an actual result of a request, and upper layers will be quiescent > anyway, so they do *not* need to disable DMA) > > NOTE! The "freeze()" gets called from the *old* kernel just _before_ a > snapshot unpacking!! Yes, and usually the majority of modules is not loaded at that time. > - restart_snapshot() - actually restart the snapshot (and usually this > would involve re-setting the device, not so much trying to restore all > the saved state. IOW, it's easier to just re-initialize the DMA command > queues than to try to make them "atomic" in the snapshot). > > NOTE! This gets called by the *new* kernel _after_ the snapshot resume! I think devices _should_ be resetted in restart_snapshot(), unless it's possible to check if they have already been initialized by the "old" kernel - but this information would have to be available from the device itself. > And if you *want* to, I can see that you might want to actually do a > "unfreeze()" thing too, and make the actual shapshotting be: What unfreeze() would be needed for in that case? 
> 	/* We may not even need this.. */
> 	for_each_device() {
> 		err = prepare_to_snapshot();
> 		if (err)
> 			return err;
> 	}

We need to free as much memory as we'll need for the image creation at this
point.

> 	/* This is the real work for snapshotting */
> 	cli();
> 	for_each_device()
> 		freeze(dev);

You've added freeze() here, but it's not on your list above?

> 	for_each_device()
> 		snapshot(dev);
> 	.. snapshot current memory image ..
> 	for_each_device_depth_first()
> 		unfreeze(dev);
> 	sti();
>
> and maybe it's worth it, but I would almost suggest that you just make the
> rule be that any DMA etc just *has* to be re-initialized by
> "restart_snapshot()", in which case it's not even necessary to
> freeze/unfreeze over the device, and "snapshot()" itself only needs to
> make sure any non-DMA data is safe.
>
> But adding the freeze/unfreeze (which is a no-op for most hardware anyway)
> might make things easier to think about, so I would certainly not *object*
> to it, even if I suspect it's not necessary.

I think it's not necessary.

> Anyway, the restore_snapshot() sequence should be:
>
> 	/* Old kernel.. Normal boot, load snapshot image */
> 	cli()
> 	for_each_device()
> 		freeze(dev);
> 	restore_snapshot_image();
> 	restore_regs_and_jump_to_image();
> 	/* noreturn */
>
>
> 	/* New kernel, gets called at the snapshot restore address
> 	 * with interrupts off and devices frozen, and memory image
> 	 * consistent with what it was at "snapshot()" time
> 	 */
> 	for_each_dev_depth_first()
> 		restore_snapshot(dev);
> 	/* And if you want to, just to be "symmetric"
>
> 		for_each_dev_depth_first()
> 			unfreeze(dev)
>
> 	   although I think you could just make "restore_snapshot()"
> 	   implicitly unfreeze it too..

Agreed.

> 	 */
> 	sti();
> 	/* We're up */
>
> and notice how *different* this is from what happens for s2ram. There
> really isn't anything in common here. Exactly because s2ram simply doesn't
> _have_ any of the issues with atomic memory images.

Agreed again.
Moreover, in the s2ram case there are no problems with device drivers compiled as modules. Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
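The call ordering discussed in the message above can be sketched as a small userspace simulation. This is only an illustration under stated assumptions: `struct sim_device`, the `trace` buffer, `make_snapshot()` and `resume_snapshot()` are hypothetical names that do not come from the thread or from the kernel, and the depth-first device ordering mentioned in the thread is not modelled. The callback names follow the proposal (`prepare_to_snapshot`, `snapshot`, `freeze`, `restart_snapshot`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TRACE_MAX 256

/* Records the order in which the simulated callbacks run. */
static char trace[TRACE_MAX];

static void trace_add(const char *step, const char *name)
{
    size_t used = strlen(trace);

    assert(used < sizeof(trace));
    snprintf(trace + used, sizeof(trace) - used, "%s(%s) ", step, name);
}

/* Hypothetical stand-in for a device; not a kernel structure. */
struct sim_device {
    const char *name;
    int prepare_fails;      /* simulate a failed memory allocation */
};

static int prepare_to_snapshot(struct sim_device *d)
{
    trace_add("prepare", d->name);
    return d->prepare_fails ? -1 : 0;
}

static void snapshot(struct sim_device *d)         { trace_add("snapshot", d->name); }
static void freeze(struct sim_device *d)           { trace_add("freeze", d->name); }
static void restart_snapshot(struct sim_device *d) { trace_add("restart", d->name); }

/* Snapshot path: every device prepares first, so errors surface early. */
static int make_snapshot(struct sim_device *devs, size_t n)
{
    size_t i;

    trace[0] = '\0';
    for (i = 0; i < n; i++) {
        int err = prepare_to_snapshot(&devs[i]);
        if (err)
            return err;     /* nothing has been snapshotted yet */
    }
    /* In the kernel, interrupts would be disabled from here on. */
    for (i = 0; i < n; i++)
        snapshot(&devs[i]);
    trace_add("save", "memory image");
    return 0;
}

/* Resume path: the old kernel freezes, the new kernel restarts devices. */
static void resume_snapshot(struct sim_device *devs, size_t n)
{
    size_t i;

    trace[0] = '\0';
    for (i = 0; i < n; i++)
        freeze(&devs[i]);           /* old kernel, before unpacking */
    trace_add("load", "memory image");
    for (i = 0; i < n; i++)
        restart_snapshot(&devs[i]); /* new kernel, after the jump */
}
```

Calling `make_snapshot()` and then `resume_snapshot()` on a small device array and inspecting `trace` shows the two sequences side by side; neither resembles a per-device suspend()/resume() pair, which is the point made in the thread.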
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 23:20 ` Linus Torvalds 2007-04-25 23:52 ` Pavel Machek @ 2007-04-26 0:24 ` Alan Cox 2007-04-26 1:10 ` Linus Torvalds 2007-04-26 7:08 ` Andy Grover 1 sibling, 2 replies; 713+ messages in thread From: Alan Cox @ 2007-04-26 0:24 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven > STR does not need to "ensure that you have a consistent snapshot". Linus I think someone's been spiking your guinness again... > Why? Because there is no _room_ for inconsistency. There's nothing to be > "inconsistent with", since any changes to memory (by things like DMA or > other setup that happens while the suspend process is going on) is by > _definition_ consistent with the resume image (because there is no > separate image). You bet there is. We need to know if data arrived or not, because there is no guarantee that the data retrieved if we inadvertently re-execute a command will be the same. The hardware state itself isn't the problem, it's the combination of hardware state and internal state which need to match in some cases. > off DMA and try to make the hardware be wevy wevy quiet while it's hunting > wabbits, it's a lot easier to just do nothing at all on "freeze", and just > make sure that "thaw" will re-initialize the DMA tables entirely! All Who cares about DMA mapping tables, those are easy to deal with, not even that bad with an IOMMU to restore. More problematic is the user's data because if we have a device where re-executing a command is not repeatable (eg O_DIRECT SCSI on a shared bus) then we risk being inconsistent in our S2RAM. If we re-run the command on resume having allowed it to prattle on while doing S2anything then we'll get the wrong answer.
Now there are lots of devices we don't care about as they don't have state in the form that causes problems - network cards, TV capture etc, but there are cases where it matters that every operation is either finished or not started and there is no doubt about them getting done during the S2RAM/S2DISK S2DISK/S2RAM both need to synchronize the state of a device so it can build a valid snapshot. That bit is a shared requirement just like you said didn't exist. Doesn't even need to involve turning DMA off, just getting a consistent state. The rest can be quite different. Mind you some laptops think S2RAM is just a transition state on the way to disk, leave them in ACPI S2RAM and the firmware will magically turn it into a save to disk and back to ram if the battery runs low or you leave it idle too long. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:24 ` Alan Cox @ 2007-04-26 1:10 ` Linus Torvalds 2007-04-26 14:04 ` Mark Lord 2007-04-26 7:08 ` Andy Grover 1 sibling, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 1:10 UTC (permalink / raw) To: Alan Cox Cc: Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Alan Cox wrote: > > You bet there is. We need to know if data arrived or not, because there > is no guarantee that the data retrieved if we inadvertently re-execute a > command will be the same. The hardware state itself isn't the problem, > its the combination of hardware state and internal state which need to > match in some cases. ... which is why "suspend()" suspends the hardware. Is that so hard to understand? Once the hardware is suspended, it's not doing anything. But STR doesn't have any need for atomicity guarantees _between_devices_. That's a really *fundamental* difference. The reason s2ram is *so* different from snapshot-to-disk is exactly the fact that s2ram can (and does) work on one device at a time. In contrast, snapshot-to-disk needs to snapshot all the devices *together*, since it has a separate disk image. See? Two *totally* different cases. They have *nothing* in common. Not the call sequence, not the logic, not *anything*. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 1:10 ` Linus Torvalds @ 2007-04-26 14:04 ` Mark Lord 2007-04-26 16:10 ` Linus Torvalds 0 siblings, 1 reply; 713+ messages in thread From: Mark Lord @ 2007-04-26 14:04 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Linus Torvalds wrote: > > See? Two *totally* different cases. They have *nothing* in common. Not the > call sequence, not the logic, not *anything*. Except that both methods cannot rely upon hot-pluggable devices still being present on resume/restore. It is exceptionally common to unplug all USB/firewire cables, mouse, keyboard, docking cables etc.. after a machine is in S2R state. Cheers ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 14:04 ` Mark Lord @ 2007-04-26 16:10 ` Linus Torvalds 2007-04-26 21:00 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 16:10 UTC (permalink / raw) To: Mark Lord Cc: Alan Cox, Pavel Machek, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, 26 Apr 2007, Mark Lord wrote: > Linus Torvalds wrote: > > > > See? Two *totally* different cases. They have *nothing* in common. Not the > > call sequence, not the logic, not *anything*. > > Except that both methods cannot rely upon hot-pluggable devices > still being present on resume/restore. It is exceptionally common > to unplug all USB/firewire cables, mouse, keyboard, docking cables etc.. > after a machine is in S2R state. Right, and that has nothing to do with suspend/resume. You'd better be able to handle unexpected hotplugs _regardless_. For example, it's quite common that people just "remove" the pcmcia/cardbus card while the driver is active. And in fact, when that happens, it's also quite common that the hardware raises the irq for that (active) driver (in fact, it's more than common: since the "card removal" interrupt for the Cardbus controller is generally always the same as the "card interrupt" interrupt for the low-level card driver, you can pretty much *guarantee* that you get that interrupt). So the end result is that the interrupt handler and all normal IO routines for a hotpluggable piece of hardware basically _have_ to be able to gracefully handle the "oops, the hw simply isn't there any more" case! The resume code isn't any different at all. It should run perfectly normally, but for hotpluggable devices, it has to follow all the same rules: handle the "oops, the hw is gone" case gracefully.
No different, and it's totally unrelated to suspend/resume: it's a *generic* issue. In fact, suspend/resume is better off than a lot of the other code is, simply because it's easier to test that case and know you hit that particular sequence! It's much harder to verify that the "send packet" case is safe, because how are you going to know to remove the card at the right point to trigger it? Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
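The "oops, the hw is gone" rule described in the message above can be illustrated with a minimal userspace sketch. Everything here is hypothetical: `struct sim_card`, `read_status()`, `card_interrupt()` and the `IRQ_*` result names are invented for the example and are not kernel APIs. The sketch also assumes that a read from a removed card returns all ones, which is the usual behaviour on PCI-style buses but is not stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: a status read from a removed card returns all ones. */
#define STATUS_GONE 0xFFFFFFFFu
#define STATUS_RX   0x1u          /* hypothetical "packet received" bit */

/* Hypothetical simulated card; not a kernel structure. */
struct sim_card {
    bool present;
    uint32_t status;
    int packets;
};

/* Hypothetical result codes, deliberately not the kernel's irqreturn_t. */
enum irq_result { IRQ_NOT_MINE, IRQ_DONE, IRQ_DEVICE_GONE };

static uint32_t read_status(const struct sim_card *c)
{
    return c->present ? c->status : STATUS_GONE;
}

/* Handler that copes with the card vanishing while the driver is active. */
static enum irq_result card_interrupt(struct sim_card *c)
{
    uint32_t st;

    assert(c != NULL);
    st = read_status(c);
    if (st == STATUS_GONE)
        return IRQ_DEVICE_GONE;   /* stop touching the hardware */
    if (!(st & STATUS_RX))
        return IRQ_NOT_MINE;      /* shared line, someone else's event */

    c->packets++;
    c->status &= ~STATUS_RX;      /* acknowledge the event */
    return IRQ_DONE;
}
```

A resume routine for a hotpluggable device could apply the same check before reprogramming the hardware, so that a card pulled during suspend is treated exactly like a card pulled at run time.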
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 16:10 ` Linus Torvalds @ 2007-04-26 21:00 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 21:00 UTC (permalink / raw) To: Linus Torvalds Cc: Mark Lord, Alan Cox, Kenneth Crudup, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven Hi! > > > See? Two *totally* different cases. They have *nothing* in common. Not the > > > call sequence, not the logic, not *anything*. > > > > Except that both methods cannot rely upon hot-pluggable devices > > still being present on resume/restore. It is exceptionally common > > to unplug all USB/firewire cables, mouse, keyboard, docking cables etc.. > > after a machine is in S2R state. > > Right, and that has nothing to do with suspend/resume. You'd better be > able to handle unexpected hotplugs _regardless_. Actually, with suspend/resume it is quite easy to cheat, and just "unplug" the hardware on suspend, then "plug it back" on resume. That works very well for devices like keyboards and mice (where you can't tell if you are talking to the same hw, anyway). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 0:24 ` Alan Cox 2007-04-26 1:10 ` Linus Torvalds @ 2007-04-26 7:08 ` Andy Grover 1 sibling, 0 replies; 713+ messages in thread From: Andy Grover @ 2007-04-26 7:08 UTC (permalink / raw) To: linux-kernel; +Cc: suspend2-devel Alan Cox wrote: > Mind you some laptops think S2RAM is just a transition state on the way > to disk, leave them in ACPI S2RAM and the firmware will magically turn it > into a save to disk and back to ram if the battery runs low or you leave > it idle too long. The OS does this (or at least it's supposed to). STR with battery low, it comes back on fully via a battery wake event, and then STD (aka snapshot/poweroff). The ACPI state machine always goes through S0, FWIW. OK, now back to the "should we have 2 function pointers or 4" debate... -- Andy ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 20:08 ` Linus Torvalds 2007-04-25 20:27 ` Pavel Machek @ 2007-04-26 0:41 ` Thomas Orgis 1 sibling, 0 replies; 713+ messages in thread From: Thomas Orgis @ 2007-04-26 0:41 UTC (permalink / raw) To: Linus Torvalds Cc: Kenneth Crudup, Nick Piggin, suspend2-devel, Mike Galbraith, linux-kernel, Con Kolivas, Andrew Morton, Pavel Machek, Thomas Gleixner, Ingo Molnar, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 4979 bytes --] Sort of my 2-many-cents story on why I need "snapshot/restore"... Am Wed, 25 Apr 2007 13:08:09 -0700 (PDT) schrieb Linus Torvalds <torvalds@linux-foundation.org>: > > > On Wed, 25 Apr 2007, Kenneth Crudup wrote: > > > > Any working suspend-to-disk method takes care of that for me. (I'm > > really not sure why Linus hates S2D so much, though. Back in the day > > there was a lot more BIOS support, but that's been years now.) > > The really sad part is that APM actually did this better.. This really triggers a nerve in me. My laptops (always used models from some years ago, even) didn't necessarily get easier with respect to power management (suspend) over time. My first laptop (Siemens Scenic Mobile 710, 200Mhz Pentium, maxed to 192MB RAM) worked just fine with APM, be it s2ram or s2disk. Everything handled by the BIOS. Admittedly, S2disk was quite slow as it stored all ram and didn't write to the disk as fast as possible, but it worked. S2ram was also a viable option because I was even able to easily swap batteries because the thing had two bays to put batteries in. The next one was a Toshiba Portege 7020 CT (366MHz Pentium2 with dynamic clock, 192MB), supporting both APM and ACPI. Installing Linux was not that easy, I think I remember that APM in kernel froze the box (early 2.6 kernel), while ACPI needed some headache to set up (compiling a fixed DSDT into the kernel, for example)... 
I needed experimental toshiba_acpi to get functions and the acpi_pm_timer to get something like continuous system clock (special cpu throttling has funny effects). Well, I got it together after some time. Used suspend2 for "snapshot/restore" and actually was able to use ACPI S3 with the glitch of having to unload/load psmouse driver ... until I realized that it only resumed in about 80% of cases (BIOS ....). So suspend2 was a badly needed "hack" around the hardware/BIOS to get some sane workflow. I remember dealing with swsusp / pmdisk before... but I really ended up with suspend2 as the thing that works (and I wouldn't have bothered finding this patch if the in-kernel stuff worked for me). Of course this was a long time ago and recently I have seen that in-kernel swsusp works ok, just this unresponsiveness after "restore" due to missing page cache... Now I have an IBM ThinkPad X31 (600-1.4GHz Pentium M, 512MB). ACPI. SpeedStep. The machine generally works fine, hardware config via ACPI seems to be fine. But doing S3/STR? Well... this machine has the odd idea that turning the system off but the screen backlight back on after a second is a good idea. Of course just now S3 worked fine... you cannot even depend on the malfunction -- could have something to do with changing bootup video from LCD to VGA output for some other reason recently. Hm. Perhaps it even may work (after tricking the BIOS!?). But I doubt I'll suddenly develop trust in that. I _had_ trust in APM STR and STD. I am quite confident in suspend2 being able to correctly resume (restore) after a successful suspend (snapshot/restore). And then, STR doesn't help me on the road when I need to exchange the battery (I'd need this special extra battery to put under the ThinkPad for that). Another thing is that the old Siemens has a nice auxilliary monochrome LCD that shows the charge status of the batteries in 5 levels, so you have some means to predict the time you have in STR. 
The ThinkPad has a green LED for "battery level OK" and a red one for "battery level low". Well, but the Linux kernel won't change that... Perhaps at some time ACPI implementations in BIOS get to something reliable (hm, should I get a PowerBook instead?) and can be a good partner for Linux, which has been struggling for many years now to get into the post-APM era. I remember reading desktop PC test reports in the c't magazine in the last years; S3 usually did _not_ work, with Windows, even. Well, there must be a reason Microsoft chose to implement the "hibernate" (it _is_ in software, right?). The APM->ACPI transition made me use the software STD (snapshot/restore...;-) and I think I will stay with it for the foreseeable future, if only because I can do fancy things like image encryption. ACPI S3 / STR is a nice addition when it works, for the smaller pauses (changing a train at the station, leaving office for half an hour...), but I consider STD really to be the more important feature that enables me to _never_ close my applications unless I want to do a kernel update. I really must say that some sort of STD is a total must for a laptop for me. On the other hand I once had a Psion 5MX, which basically was on STR all the (non-working) time -- and enabled well over 20h of working time on two AAs. When laptops enter that range of battery life, I guess I could make do with just doing STR and won't have to worry about changing batteries without AC connection;-) Alrighty then, Thomas. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-25 7:23 ` Pavel Machek ` (3 preceding siblings ...) 2007-04-25 19:43 ` Kenneth Crudup @ 2007-05-26 17:37 ` Martin Steigerwald 2007-05-26 20:35 ` Rafael J. Wysocki 4 siblings, 1 reply; 713+ messages in thread From: Martin Steigerwald @ 2007-05-26 17:37 UTC (permalink / raw) To: suspend2-devel Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 4902 bytes --] Am Mittwoch 25 April 2007 schrieb Pavel Machek: > Hi! > > > This is why there's a lot to be said for > > > > echo mem > /sys/power/state > > > > and being able to follow the path through _one_ object (the kernel) > > over trying to figure out the interaction between many different > > parts with different versions. > > The 'promise' is 'if you can get echo disk > /sys/power/state working, > uswsusp will work. too'. IOW it should be ok to debug the in-kernel > parts, only. Hello Nigel, Pavel, Rafael and everyone else who is involved, I would like to ask what come out of the suspend2 merge discussion. Nigel just told that suspend2 likely won't be merged anytime soon and thats its business as usual: --------------------------------------------------------------------- It's pretty much business as usual. Linus doesn't want another implementation merged, and he wants the three of us (Pavel, Rafael and myself) to agree on a way forward. He also believes that we're approaching things from the wrong direction at the moment. Funnily enough, this is the one area on which we do all agree. --------------------------------------------------------------------- http://lists.suspend2.net/lurker/message/20070510.021641.fe306add.en.html Has there been any further discussion and preferably agreement on the way to go forward? 
Although you Linus, as I read from different mails only use suspend to RAM, there are many users out there who use suspend to disk daily. I used in kernel software suspend initially and it worked quite nice with starting from 2.6.10 or 2.6.11 where suspend2 didn't work for me before 2.6.14 with the hibernate script. But from then on suspend2 worked better than in kernel software suspend for me and colleagues on: - ThinkPad T23 - ThinkPad T42 - Possibly some other ThinkPads - as well various Dell workstations we have at work It was faster and more reliable, yielding uptimes up to 40 days on my workstation recently (with 2.6.17.7 still). And even that uptime was only ended by booting a newly build kernel (2.6.21 with sws 2.2.9.13). For me in the role of a user actually this is a really satisfying solution! I tried userspace software suspend from time to time but then just was fed up with it, cause I could not get it to work within any sensible amount of time - even with some bog standard Debian kernel, I think it was some 2.6.18 one. Maybe I am dumb, but so be it, it should not be that complicated to get it to work. Recently I didn't even bother to try anymore. Well and I read in the suspend2 merge discussion that even in kernel suspend does not work reliably anymore. As long I cannot be convinced that the vanilla kernel contains a suspend to disk solution that works as good as suspend2 I will patch suspend2 into all of the desktop kernels I build. Thats quite bad IMHO for exactly the same reason than having drivers maintained out of the kernel. For the same reason I think swap prefetch should go in as soon as possible. It will never have the adoption and care taking of an in-kernel-tree solution. I am convinced that a working suspend to RAM just is not enough - well it wasn't working correctly last time I checked. But I even don't bother about suspend to RAM anymore. 
I can wait those additional seconds for suspend to disk and it allows me to drive my laptops without batteries most of the time and have workstations switched off completely so that they do not consume standby power. So please, pretty please consider working together on a reliable, fast, stable, easy to use and configurable in-kernel-tree snapshot solution! Actually I as a user I couldn't care less about the implementation details, but as someone who is interested in kernel technologies I like it to be a clean and well designed solution, too. ;-) Maybe when the Linux Foundation organizes a meeting for Nigel, Pavel, Rafael and other kernel developers interested in creating such a solution it will help. To me it seems such a concentrated meeting in a good atmosphere could be more effective than endless mailing list discussions not leading to a clear result. When its not easy for the involved people to work together maybe a casual bystander who understands enough of kernel details should moderate the meeting and help finding an agreement. It would just be such a pity to miss the chance to have a nicely working snapshot solution in the Linux kernel, that may even be interesting for virtualization (you could store a backup of a machine state permanently or even more of them - if not already available through other technologies like well suspend2 with filewriter for example). Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-05-26 17:37 ` Martin Steigerwald @ 2007-05-26 20:35 ` Rafael J. Wysocki 2007-05-26 22:23 ` Martin Steigerwald 0 siblings, 1 reply; 713+ messages in thread From: Rafael J. Wysocki @ 2007-05-26 20:35 UTC (permalink / raw) To: Martin Steigerwald Cc: suspend2-devel, Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar, Andrew Morton, Arjan van de Ven, Pekka J Enberg On Saturday, 26 May 2007 19:37, Martin Steigerwald wrote: > Am Mittwoch 25 April 2007 schrieb Pavel Machek: > > Hi! > > > > > This is why there's a lot to be said for > > > > > > echo mem > /sys/power/state > > > > > > and being able to follow the path through _one_ object (the kernel) > > > over trying to figure out the interaction between many different > > > parts with different versions. > > > > The 'promise' is 'if you can get echo disk > /sys/power/state working, > > uswsusp will work. too'. IOW it should be ok to debug the in-kernel > > parts, only. > > Hello Nigel, Pavel, Rafael and everyone else who is involved, > > I would like to ask what come out of the suspend2 merge discussion. Nigel > just told that suspend2 likely won't be merged anytime soon and thats its > business as usual: > > --------------------------------------------------------------------- > It's pretty much business as usual. Linus doesn't want another > implementation merged, and he wants the three of us (Pavel, Rafael and > myself) to agree on a way forward. He also believes that we're > approaching things from the wrong direction at the moment. Funnily > enough, this is the one area on which we do all agree. > --------------------------------------------------------------------- > http://lists.suspend2.net/lurker/message/20070510.021641.fe306add.en.html > > > Has there been any further discussion and preferably agreement on the way > to go forward? 
The outcome was, more-or-less, that we'll work on merging suspend2 or at least some parts of it. However, in the meantime there have been some discussions implying that we have some important problems with suspend/hibernation that suspend2 doesn't solve and that IMHO are more urgent than the merging of suspend2 right now. So, as far as I'm concerned, the plan is to fix the more urgent problems first and to work on merging suspend2 as far as there's time to do this. The problem is there are only a few people working on it and there's a lot to do, so I can only ask you to be patient. ;-) Greetings, Rafael ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-05-26 20:35 ` Rafael J. Wysocki @ 2007-05-26 22:23 ` Martin Steigerwald 0 siblings, 0 replies; 713+ messages in thread From: Martin Steigerwald @ 2007-05-26 22:23 UTC (permalink / raw) To: suspend2-devel Cc: Rafael J. Wysocki, Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, Ingo Molnar, Andrew Morton, Arjan van de Ven, Pekka J Enberg [-- Attachment #1: Type: text/plain, Size: 1277 bytes --] On Saturday, 26 May 2007, Rafael J. Wysocki wrote: Hi Rafael! > The outcome was, more-or-less, that we'll work on merging suspend2 or > at least some parts of it. > > However, in the meantime there have been some discussions implying that > we have some important problems with suspend/hibernation that suspend2 > doesn't solve and that IMHO are more urgent than the merging of > suspend2 right now. > > So, as far as I'm concerned, the plan is to fix the more urgent > problems first and to work on merging suspend2 as far as there's time > to do this. > > The problem is there are only a few people working on it and there's a > lot to do, so I can only ask you to be patient. ;-) That's fine with me - I understand that. I just thought that there had been no outcome at all. I will try to be patient as long as I have not dug into kernel hacking deeply enough to be able to help with that - so far I have done no more than merge two conflicting patches to compile my own kernels and forward-port a patch for a sundance network card. I can help with testing once there is something testable, though. Regards, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-24 21:24 ` Pavel Machek 2007-04-24 23:41 ` Linus Torvalds @ 2007-04-26 10:17 ` Johannes Berg 2007-04-26 10:30 ` Pavel Machek 2007-04-26 11:35 ` Christoph Hellwig 1 sibling, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 10:17 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven [-- Attachment #1: Type: text/plain, Size: 290 bytes --] On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote: > I believe uswsusp user/kernel separation is clean enough. Kernel > provides "snapshot image" and "resume image". (Thanks go to Rafael for > very clean interface). The interface isn't even 64/32-bit compatible... johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:17 ` Johannes Berg @ 2007-04-26 10:30 ` Pavel Machek 2007-04-26 10:40 ` Pavel Machek 2007-04-26 11:04 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg 2007-04-26 11:35 ` Christoph Hellwig 1 sibling, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 10:30 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu 2007-04-26 12:17:12, Johannes Berg wrote: > On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote: > > > I believe uswsusp user/kernel separation is clean enough. Kernel > > provides "snapshot image" and "resume image". (Thanks go to Rafael for > > very clean interface). > > The interface isn't even 64/32-bit compatible... Which parts? read/write on /dev/snapshot looks ok. ioctl(SNAPSHOT_FREEZE, UNFREEZE, ATOMIC_RESTORE, FREE, FREE_SWAP_PAGE, SNAPSHOT_S2RAM, is okay, because it does not pass any data. ioctl(ATOMIC_SNAPSHOT, returns 0/1 through pointer. Should be ok. (Maybe we should do if (!error) error = put_user(in_suspend, (u32 __user *)arg); ...instead, to make it very explicit? ioctl(SET_IMAGE_SIZE, is okay, because it just uses arg directly. ioctl(PMOPS, is okay, because it just uses arg directly... and it is in range 0-3 or something. ioctl(AVAIL_SWAP, ...hmm, is this the one you are complaining about? It returns loff_t through a pointer. Maybe there's another interface that can return available swap, and we should use that, instead? ioctl(GET_SWAP_PAGE, returns sector_t through a pointer. NOt sure if that's good idea, either. ioctl(SET_SWAP_FILE, does old_decode_dev(arg). Is that ok? ioctl(SET_SWAP_AREA, shares struct resume_swap_area between user and kernel. I guess that's bad..? 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:30 ` Pavel Machek @ 2007-04-26 10:40 ` Pavel Machek 2007-04-26 11:11 ` Johannes Berg 2007-04-26 13:45 ` Johannes Berg 2007-04-26 11:04 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg 1 sibling, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 10:40 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Hi! > > The interface isn't even 64/32-bit compatible... > > Which parts? > > ioctl(AVAIL_SWAP, > ...hmm, is this the one you are complaining about? It returns > loff_t through a pointer. Maybe there's another interface > that can return available swap, and we should use that, > instead? loff_t is 64bit on i386, so I do not see immediate problem here, but maybe we should just explicitely pass u64? > ioctl(GET_SWAP_PAGE, > returns sector_t through a pointer. NOt sure if that's good > idea, either. Ok, that's very bad idea, because sector_t can be 32-bit or 64-bit, depending on CONFIG_LBD. We need to use u64 here. > ioctl(SET_SWAP_FILE, > does old_decode_dev(arg). Is that ok? > > ioctl(SET_SWAP_AREA, > shares struct resume_swap_area between user and kernel. I > guess that's bad..? struct resume_swap_area { loff_t offset; u_int32_t dev; } __attribute__((packed)); ...I guess we should change loff_t -> u64 and problem is solved? Old_decode_dev takes u16. That sucks for majors/minors > 256, but fortunately those are not common. Does this seem to help? 
								Pavel

diff --git a/kernel/power/power.h b/kernel/power/power.h
index eb461b8..dc13af5 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -114,7 +114,7 @@ extern int snapshot_image_loaded(struct
  * SNAPSHOT_SET_SWAP_AREA ioctl
  */
 struct resume_swap_area {
-	loff_t offset;
+	u_int64_t offset;
 	u_int32_t dev;
 } __attribute__((packed));
 
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 558e18e..d0730c1 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -215,8 +215,7 @@ static int snapshot_ioctl(struct inode *
 {
 	int error = 0;
 	struct snapshot_data *data;
-	loff_t avail;
-	sector_t offset;
+	u64 avail, offset;
 
 	if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC)
 		return -ENOTTY;
@@ -286,7 +285,7 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_AVAIL_SWAP:
 		avail = count_swap_pages(data->swap, 1);
 		avail <<= PAGE_SHIFT;
-		error = put_user(avail, (loff_t __user *)arg);
+		error = put_user(avail, (u64 __user *)arg);
 		break;
 
 	case SNAPSHOT_GET_SWAP_PAGE:
@@ -304,7 +303,7 @@ static int snapshot_ioctl(struct inode *
 		offset = alloc_swapdev_block(data->swap, data->bitmap);
 		if (offset) {
 			offset <<= PAGE_SHIFT;
-			error = put_user(offset, (sector_t __user *)arg);
+			error = put_user(offset, (u64 __user *)arg);
 		} else {
 			error = -ENOSPC;
 		}

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related [flat|nested] 713+ messages in thread
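The point of switching to explicitly sized members can be checked at compile time. The following is a minimal user-space sketch, not code from the patch: the struct name ex_resume_swap_area and the helper ex_swap_area_size() are hypothetical, <stdint.h> types stand in for the kernel's u_int64_t/u_int32_t, and the packed attribute assumes GCC or Clang. With fixed-width members and packing, the size and member offsets do not depend on whether the build is 32-bit or 64-bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space mirror of the struct in the patch above.
 * uint64_t/uint32_t have the same width on every ABI, so the layout
 * shared between kernel and user space cannot drift.
 */
struct ex_resume_swap_area {
	uint64_t offset;	/* always 8 bytes */
	uint32_t dev;		/* always 4 bytes */
} __attribute__((packed));	/* GCC/Clang extension: no tail padding */

/* Fail the build if the layout ever differs from the expected one. */
_Static_assert(sizeof(struct ex_resume_swap_area) == 12,
	       "packed struct must be exactly 12 bytes");
_Static_assert(offsetof(struct ex_resume_swap_area, dev) == 8,
	       "dev must follow the 8-byte offset field");

/* Hypothetical helper so the size can also be inspected at run time. */
size_t ex_swap_area_size(void)
{
	return sizeof(struct ex_resume_swap_area);
}
```

Without the packed attribute, a 64-bit build would typically pad the struct to 16 bytes while a 32-bit x86 build would use 12, which is the kind of mismatch the attribute avoids here.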
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:40 ` Pavel Machek @ 2007-04-26 11:11 ` Johannes Berg 2007-04-26 11:16 ` Pavel Machek 2007-04-26 13:45 ` Johannes Berg 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-26 11:11 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 335 bytes --] On Thu, 2007-04-26 at 12:40 +0200, Pavel Machek wrote: > Does this seem to help? No idea, I haven't actually tried it yet, last time I tried uswsusp on my 32/32 machine it didn't work due to endian problems that were supposed to be resolved but I haven't had a chance to pick all the bits together that you need. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:11 ` Johannes Berg @ 2007-04-26 11:16 ` Pavel Machek 2007-04-26 11:27 ` Johannes Berg 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:16 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> > Does this seem to help?
>
> No idea, I haven't actually tried it yet, last time I tried uswsusp on
> my 32/32 machine it didn't work due to endian problems that were
> supposed to be resolved but I haven't had a chance to pick all the bits
> together that you need.

This one should prevent ioctl numbers changing, too.

diff --git a/kernel/power/power.h b/kernel/power/power.h
index eb461b8..a18b85a 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -114,23 +114,23 @@ extern int snapshot_image_loaded(struct
  * SNAPSHOT_SET_SWAP_AREA ioctl
  */
 struct resume_swap_area {
-	loff_t offset;
+	u_int64_t offset;
 	u_int32_t dev;
 } __attribute__((packed));
 
 #define SNAPSHOT_IOC_MAGIC '3'
 #define SNAPSHOT_FREEZE _IO(SNAPSHOT_IOC_MAGIC, 1)
 #define SNAPSHOT_UNFREEZE _IO(SNAPSHOT_IOC_MAGIC, 2)
-#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *)
+#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
 #define SNAPSHOT_ATOMIC_RESTORE _IO(SNAPSHOT_IOC_MAGIC, 4)
 #define SNAPSHOT_FREE _IO(SNAPSHOT_IOC_MAGIC, 5)
-#define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long)
-#define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, void *)
-#define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, void *)
+#define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
+#define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
+#define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
 #define SNAPSHOT_FREE_SWAP_PAGES _IO(SNAPSHOT_IOC_MAGIC, 9)
-#define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int)
+#define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
 #define SNAPSHOT_S2RAM _IO(SNAPSHOT_IOC_MAGIC, 11)
-#define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int)
+#define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
 #define SNAPSHOT_SET_SWAP_AREA _IOW(SNAPSHOT_IOC_MAGIC, 13, \
 				struct resume_swap_area)
 #define SNAPSHOT_IOC_MAXNR 13
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 558e18e..d0730c1 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -215,8 +215,7 @@ static int snapshot_ioctl(struct inode *
 {
 	int error = 0;
 	struct snapshot_data *data;
-	loff_t avail;
-	sector_t offset;
+	u64 avail, offset;
 
 	if (_IOC_TYPE(cmd) != SNAPSHOT_IOC_MAGIC)
 		return -ENOTTY;
@@ -286,7 +285,7 @@ static int snapshot_ioctl(struct inode *
 	case SNAPSHOT_AVAIL_SWAP:
 		avail = count_swap_pages(data->swap, 1);
 		avail <<= PAGE_SHIFT;
-		error = put_user(avail, (loff_t __user *)arg);
+		error = put_user(avail, (u64 __user *)arg);
 		break;
 
 	case SNAPSHOT_GET_SWAP_PAGE:
@@ -304,7 +303,7 @@ static int snapshot_ioctl(struct inode *
 		offset = alloc_swapdev_block(data->swap, data->bitmap);
 		if (offset) {
 			offset <<= PAGE_SHIFT;
-			error = put_user(offset, (sector_t __user *)arg);
+			error = put_user(offset, (u64 __user *)arg);
 		} else {
 			error = -ENOSPC;
 		}

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:16 ` Pavel Machek @ 2007-04-26 11:27 ` Johannes Berg 2007-04-26 11:26 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-26 11:27 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 369 bytes --] On Thu, 2007-04-26 at 13:16 +0200, Pavel Machek wrote: > This one should prevent ioctl numbers changing, too. > -#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *) > +#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */ Afaict that'll actually change ioctl numbers breaking existing 64-bit userspace. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
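Why the number changes for 64-bit user space can be seen from how an ioctl command is built: the size of the declared argument type is packed into the command value itself. The sketch below re-implements the encoding by hand so it stays self-contained. It assumes the asm-generic bit layout (8 bits of number, 8 bits of type, 14 bits of size, 2 bits of direction); some architectures use a different split. The helper name ex_iow() and the EX_ prefixed constants are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the asm-generic ioctl encoding (an assumption here;
 * a few architectures use 13 size bits and 3 direction bits instead). */
#define EX_IOC_NRSHIFT   0
#define EX_IOC_TYPESHIFT 8
#define EX_IOC_SIZESHIFT 16
#define EX_IOC_DIRSHIFT  30
#define EX_IOC_WRITE     1u

/*
 * Hypothetical stand-in for _IOW(type, nr, T) where "size" plays the
 * role of sizeof(T): the argument size ends up inside the number.
 */
uint32_t ex_iow(uint32_t type, uint32_t nr, uint32_t size)
{
	return (EX_IOC_WRITE << EX_IOC_DIRSHIFT) |
	       (size << EX_IOC_SIZESHIFT) |
	       (type << EX_IOC_TYPESHIFT) |
	       (nr << EX_IOC_NRSHIFT);
}
```

With magic '3' and number 3, a 4-byte argument type (u32, or void * on a 32-bit build) gives 0x40043303, while an 8-byte type (void * on a 64-bit build) gives 0x40083303. Replacing void * with u32 therefore leaves 32-bit user space untouched and changes the value that 64-bit binaries were compiled with.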
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:27 ` Johannes Berg @ 2007-04-26 11:26 ` Pavel Machek 2007-04-26 11:35 ` Johannes Berg 2007-04-26 15:56 ` Linus Torvalds 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:26 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Hi! > > This one should prevent ioctl numbers changing, too. > > > -#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *) > > +#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */ > > Afaict that'll actually change ioctl numbers breaking existing 64-bit > userspace. Yes, probably will. The other option is to break existing 32-bit userspace, which is a bit more common AFAICT. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:26 ` Pavel Machek @ 2007-04-26 11:35 ` Johannes Berg 2007-04-26 11:33 ` Pavel Machek 2007-04-26 16:14 ` Chris Friesen 2007-04-26 15:56 ` Linus Torvalds 1 sibling, 2 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 11:35 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 480 bytes --] On Thu, 2007-04-26 at 13:26 +0200, Pavel Machek wrote: > Yes, probably will. The other option is to break existing 32-bit > userspace, which is a bit more common AFAICT. Judging from experience with the wext 32/64 bit fiasco it seems to be rather uncommon to use 32-bit userspace on 64-bit machines. Rafael hinted that we could just add these numbers, keep the existing ones and then phase them out over time, but I haven't really given it much thought. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:35 ` Johannes Berg @ 2007-04-26 11:33 ` Pavel Machek 2007-04-26 16:14 ` Chris Friesen 1 sibling, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:33 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Hi! > > Yes, probably will. The other option is to break existing 32-bit > > userspace, which is a bit more common AFAICT. > > Judging from experience with the wext 32/64 bit fiasco it seems to be > rather uncommon to use 32-bit userspace on 64-bit machines. Well, it would break 32-bit userspace on 32-bit kernel, which is the most common version, AFAICT. > Rafael hinted that we could just add these numbers, keep the existing > ones and then phase them out over time, but I haven't really given it > much thought. We could probably do that... but it is slightly ugly. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
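The "add new numbers, keep the existing ones, phase them out later" idea could look roughly like the following. This is only a sketch of the dispatch pattern, not anything proposed as a patch in the thread: the enum, the EX_ constants and ex_decode() are hypothetical, and the numeric values assume the asm-generic ioctl encoding with the old numbers as a 64-bit build would have produced them.

```c
#include <assert.h>
#include <stdint.h>

/* Operations the handler understands, independent of the wire number. */
enum ex_op {
	EX_OP_UNKNOWN = 0,
	EX_OP_ATOMIC_SNAPSHOT,
	EX_OP_SET_IMAGE_SIZE,
};

/* Legacy values as encoded with 8-byte types on a 64-bit build
 * (void * and unsigned long), assuming the asm-generic layout. */
#define EX_OLD_ATOMIC_SNAPSHOT 0x40083303u
#define EX_OLD_SET_IMAGE_SIZE  0x40083306u

/* Values obtained when the argument is declared as a 4-byte type. */
#define EX_NEW_ATOMIC_SNAPSHOT 0x40043303u
#define EX_NEW_SET_IMAGE_SIZE  0x40043306u

/* Map both generations of command numbers onto one operation, so old
 * binaries keep working while new ones use the fixed encoding. */
enum ex_op ex_decode(uint32_t cmd)
{
	switch (cmd) {
	case EX_OLD_ATOMIC_SNAPSHOT:
	case EX_NEW_ATOMIC_SNAPSHOT:
		return EX_OP_ATOMIC_SNAPSHOT;
	case EX_OLD_SET_IMAGE_SIZE:
	case EX_NEW_SET_IMAGE_SIZE:
		return EX_OP_SET_IMAGE_SIZE;
	default:
		return EX_OP_UNKNOWN;
	}
}
```

The cost is the one Pavel alludes to: every renumbered command needs a second case label (and, in a real kernel, possibly a different way of reading the argument) for as long as the legacy numbers are supported.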
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:35 ` Johannes Berg 2007-04-26 11:33 ` Pavel Machek @ 2007-04-26 16:14 ` Chris Friesen 2007-04-26 16:27 ` Linus Torvalds 2007-04-26 17:11 ` Johannes Berg 1 sibling, 2 replies; 713+ messages in thread From: Chris Friesen @ 2007-04-26 16:14 UTC (permalink / raw) To: Johannes Berg Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Johannes Berg wrote: > Judging from experience with the wext 32/64 bit fiasco it seems to be > rather uncommon to use 32-bit userspace on 64-bit machines. I disagree...it's quite common. I think its the standard way of doing things for ppc64, for instance. Chris ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 16:14 ` Chris Friesen @ 2007-04-26 16:27 ` Linus Torvalds 2007-04-26 17:11 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 16:27 UTC (permalink / raw) To: Chris Friesen Cc: Johannes Berg, Pavel Machek, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu, 26 Apr 2007, Chris Friesen wrote: > > I disagree...it's quite common. I think its the standard way of doing things > for ppc64, for instance. It is, although most x86-64 installations seem to be 64-bit user space *if*you*install*from*scratch*. Of course, at least some users (yeah, I've done it) started with a 32-bit CD they had lying around, and upgraded just the kernel. And I'm sure some distro out there just defaults to 32-bit binaries just because (in practice, you have to use a 32-bit firefox anyway if you want flash etc, so you need all the 32-bit libraries, so the argument might go that you might as well use 32-bit stuff for all the common stuff, and only 64-bit binaries when actually needed). Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 16:14 ` Chris Friesen 2007-04-26 16:27 ` Linus Torvalds @ 2007-04-26 17:11 ` Johannes Berg 1 sibling, 0 replies; 713+ messages in thread From: Johannes Berg @ 2007-04-26 17:11 UTC (permalink / raw) To: Chris Friesen Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 581 bytes --] On Thu, 2007-04-26 at 10:14 -0600, Chris Friesen wrote: > Johannes Berg wrote: > > > Judging from experience with the wext 32/64 bit fiasco it seems to be > > rather uncommon to use 32-bit userspace on 64-bit machines. > > I disagree...it's quite common. I think its the standard way of doing > things for ppc64, for instance. I know. My only 64-bit machine is ppc64 :) But still nobody noticed the wext 32/64 bit compat bug for like forever. On the other hand, maybe that just means that most 64-bit machines are desktop machines without wireless. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:26 ` Pavel Machek 2007-04-26 11:35 ` Johannes Berg @ 2007-04-26 15:56 ` Linus Torvalds 2007-04-26 21:06 ` Theodore Tso 1 sibling, 1 reply; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 15:56 UTC (permalink / raw) To: Pavel Machek Cc: Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu, 26 Apr 2007, Pavel Machek wrote: > > Yes, probably will. The other option is to break existing 32-bit > userspace, which is a bit more common AFAICT. And *this* is why kernel/userspace things simply should not be done. It's simply better to do things entirely in the kernel. Because you add bugs and complications otherwise, and doing it in the kernel allows you to just switch things around. As it is, it appears that user-space suspend is just broken whichever way we turn. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 15:56 ` Linus Torvalds @ 2007-04-26 21:06 ` Theodore Tso 2007-04-26 21:12 ` Nigel Cunningham 0 siblings, 1 reply; 713+ messages in thread From: Theodore Tso @ 2007-04-26 21:06 UTC (permalink / raw) To: Linus Torvalds Cc: Pavel Machek, Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu, Apr 26, 2007 at 08:56:48AM -0700, Linus Torvalds wrote: > > > On Thu, 26 Apr 2007, Pavel Machek wrote: > > > > Yes, probably will. The other option is to break existing 32-bit > > userspace, which is a bit more common AFAICT. > > And *this* is why kernel/userspace things simply should not be done. > > It's simply better to do things entirely in the kernel. Because you add > bugs and complications otherwise, and doing it in the kernel allows you > to just switch things around. > > As it is, it appears that user-space suspend is just broken whichever way > we turn. Well, in that case maybe suspend2 should be very seriously considered, since it has the features of uswsusp --- basic features which every single Microsoft and MacOSX user are used to like, like progress bars --- and it's all done in the kernel. - Ted ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 21:06 ` Theodore Tso @ 2007-04-26 21:12 ` Nigel Cunningham 0 siblings, 0 replies; 713+ messages in thread From: Nigel Cunningham @ 2007-04-26 21:12 UTC (permalink / raw) To: Theodore Tso Cc: Linus Torvalds, Pavel Machek, Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 1372 bytes --] Hi. On Thu, 2007-04-26 at 17:06 -0400, Theodore Tso wrote: > On Thu, Apr 26, 2007 at 08:56:48AM -0700, Linus Torvalds wrote: > > > > > > On Thu, 26 Apr 2007, Pavel Machek wrote: > > > > > > Yes, probably will. The other option is to break existing 32-bit > > > userspace, which is a bit more common AFAICT. > > > > And *this* is why kernel/userspace things simply should not be done. > > > > It's simply better to do things entirely in the kernel. Because you add > > bugs and complications otherwise, and doing it in the kernel allows you > > to just switch things around. > > > > As it is, it appears that user-space suspend is just broken whichever way > > we turn. > > Well, in that case maybe suspend2 should be very seriously considered, > since it has the features of uswsusp --- basic features which every > single Microsoft and MacOSX user are used to like, like progress bars > --- and it's all done in the kernel. Umm. I don't want to be picky, but that's not quite true. The progress bar is done in userspace. There's also the possibility of using a userspace app to manage storage too (I did work on establishing/tearing down an NBD connection as necessary but didn't quite get it finished and have never released it). That said, this bit can be torn out by simply removing a file and the Makefile line. 
Nigel [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:40 ` Pavel Machek 2007-04-26 11:11 ` Johannes Berg @ 2007-04-26 13:45 ` Johannes Berg 2007-06-29 22:44 ` [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) Pavel Machek 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-26 13:45 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 324 bytes --] By the way. > diff --git a/kernel/power/power.h b/kernel/power/power.h > index eb461b8..dc13af5 100644 > --- a/kernel/power/power.h > +++ b/kernel/power/power.h ^^^^^^^^^^^^^^^^^^^^ Don't these definitions need to be exported to userspace? That definitely is not a header file for userspace. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) 2007-04-26 13:45 ` Johannes Berg @ 2007-06-29 22:44 ` Pavel Machek 2007-06-30 0:06 ` Adrian Bunk 0 siblings, 1 reply; 713+ messages in thread From: Pavel Machek @ 2007-06-29 22:44 UTC (permalink / raw) To: Johannes Berg, Andrew Morton Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki

Hi!

> By the way.
>
> > diff --git a/kernel/power/power.h b/kernel/power/power.h
> > index eb461b8..dc13af5 100644
> > --- a/kernel/power/power.h
> > +++ b/kernel/power/power.h
>       ^^^^^^^^^^^^^^^^^^^^
>
> Don't these definitions need to be exported to userspace? That
> definitely is not a header file for userspace.

Yes, they do. Does this look like a fix?
								Pavel

---

Split userinterface part of power.h into separate file.

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/include/linux/power.h b/include/linux/power.h
new file mode 100644
index 0000000..37bf890
--- /dev/null
+++ b/include/linux/power.h
@@ -0,0 +1,31 @@
+#ifndef INCLUDE_LINUX_POWER_H
+#define INCLUDE_LINUX_POWER_H
+
+/*
+ * This structure is used to pass the values needed for the identification
+ * of the resume swap area from a user space to the kernel via the
+ * SNAPSHOT_SET_SWAP_AREA ioctl
+ */
+struct resume_swap_area {
+	u_int64_t offset;
+	u_int32_t dev;
+} __attribute__((packed));
+
+#define SNAPSHOT_IOC_MAGIC '3'
+#define SNAPSHOT_FREEZE _IO(SNAPSHOT_IOC_MAGIC, 1)
+#define SNAPSHOT_UNFREEZE _IO(SNAPSHOT_IOC_MAGIC, 2)
+#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
+#define SNAPSHOT_ATOMIC_RESTORE _IO(SNAPSHOT_IOC_MAGIC, 4)
+#define SNAPSHOT_FREE _IO(SNAPSHOT_IOC_MAGIC, 5)
+#define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
+#define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
+#define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
+#define SNAPSHOT_FREE_SWAP_PAGES _IO(SNAPSHOT_IOC_MAGIC, 9)
+#define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
+#define SNAPSHOT_S2RAM _IO(SNAPSHOT_IOC_MAGIC, 11)
+#define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
+#define SNAPSHOT_SET_SWAP_AREA _IOW(SNAPSHOT_IOC_MAGIC, 13, \
+				struct resume_swap_area)
+#define SNAPSHOT_IOC_MAXNR 13
+
+#endif
diff --git a/kernel/power/power.h b/kernel/power/power.h
index 41d33eb..e68352b 100644
--- a/kernel/power/power.h
+++ b/kernel/power/power.h
@@ -1,5 +1,9 @@
+#ifndef KERNEL_POWER_POWER_H
+#define KERNEL_POWER_POWER_H
+
 #include <linux/suspend.h>
 #include <linux/utsname.h>
+#include <linux/power.h>
 
 struct swsusp_info {
 	struct new_utsname uts;
@@ -114,33 +118,6 @@ extern int snapshot_write_next(struct sn
 extern void snapshot_write_finalize(struct snapshot_handle *handle);
 extern int snapshot_image_loaded(struct snapshot_handle *handle);
 
-/*
- * This structure is used to pass the values needed for the identification
- * of the resume swap area from a user space to the kernel via the
- * SNAPSHOT_SET_SWAP_AREA ioctl
- */
-struct resume_swap_area {
-	u_int64_t offset;
-	u_int32_t dev;
-} __attribute__((packed));
-
-#define SNAPSHOT_IOC_MAGIC '3'
-#define SNAPSHOT_FREEZE _IO(SNAPSHOT_IOC_MAGIC, 1)
-#define SNAPSHOT_UNFREEZE _IO(SNAPSHOT_IOC_MAGIC, 2)
-#define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, u32) /* void * */
-#define SNAPSHOT_ATOMIC_RESTORE _IO(SNAPSHOT_IOC_MAGIC, 4)
-#define SNAPSHOT_FREE _IO(SNAPSHOT_IOC_MAGIC, 5)
-#define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, u32) /* unsigned long */
-#define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, u32) /* void * */
-#define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, u32) /* void * */
-#define SNAPSHOT_FREE_SWAP_PAGES _IO(SNAPSHOT_IOC_MAGIC, 9)
-#define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, u32) /* unsigned int */
-#define SNAPSHOT_S2RAM _IO(SNAPSHOT_IOC_MAGIC, 11)
-#define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, u32) /* unsigned int */
-#define SNAPSHOT_SET_SWAP_AREA _IOW(SNAPSHOT_IOC_MAGIC, 13, \
-				struct resume_swap_area)
-#define SNAPSHOT_IOC_MAXNR 13
-
 #define PMOPS_PREPARE 1
 #define PMOPS_ENTER 2
 #define PMOPS_FINISH 3
@@ -165,3 +142,5 @@ extern int suspend_enter(suspend_state_t
 struct timeval;
 extern void swsusp_show_speed(struct timeval *, struct timeval *,
 				unsigned int, char *);
+
+#endif

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related [flat|nested] 713+ messages in thread
* Re: [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) 2007-06-29 22:44 ` [PATCH] move suspend includes into right place (was Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy)) Pavel Machek @ 2007-06-30 0:06 ` Adrian Bunk 0 siblings, 0 replies; 713+ messages in thread From: Adrian Bunk @ 2007-06-30 0:06 UTC (permalink / raw) To: Pavel Machek Cc: Johannes Berg, Andrew Morton, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Sat, Jun 30, 2007 at 12:44:22AM +0200, Pavel Machek wrote: > Hi! > > > By the way. > > > > > diff --git a/kernel/power/power.h b/kernel/power/power.h > > > index eb461b8..dc13af5 100644 > > > --- a/kernel/power/power.h > > > +++ b/kernel/power/power.h > > ^^^^^^^^^^^^^^^^^^^^ > > > > Don't these definitions need to be exported to userspace? That > > definitely is not a header file for userspace. > > Yes, they do. Does this look like a fix? > Pavel > > --- > > Split userinterface part of power.h into separate file. >... You should also add it to include/linux/Kbuild. cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:30 ` Pavel Machek 2007-04-26 10:40 ` Pavel Machek @ 2007-04-26 11:04 ` Johannes Berg 2007-04-26 11:09 ` Pavel Machek 1 sibling, 1 reply; 713+ messages in thread From: Johannes Berg @ 2007-04-26 11:04 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki [-- Attachment #1: Type: text/plain, Size: 535 bytes --] On Thu, 2007-04-26 at 12:30 +0200, Pavel Machek wrote: > On Thu 2007-04-26 12:17:12, Johannes Berg wrote: > > On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote: > > > > > I believe uswsusp user/kernel separation is clean enough. Kernel > > > provides "snapshot image" and "resume image". (Thanks go to Rafael for > > > very clean interface). > > > > The interface isn't even 64/32-bit compatible... > > Which parts? ioctl numbers last time I talked about it with Rafael. No effort was made to fix it. johannes [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 190 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:04 ` suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) Johannes Berg @ 2007-04-26 11:09 ` Pavel Machek 2007-04-26 15:53 ` Linus Torvalds 2007-04-26 18:21 ` Olivier Galibert 0 siblings, 2 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 11:09 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Hi! > > > > I believe uswsusp user/kernel separation is clean enough. Kernel > > > > provides "snapshot image" and "resume image". (Thanks go to Rafael for > > > > very clean interface). > > > > > > The interface isn't even 64/32-bit compatible... > > > > Which parts? > > ioctl numbers last time I talked about it with Rafael. No effort was > made to fix it. #define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *) #define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long) #define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, void *) #define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, void *) #define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int) #define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int) Are these a problem? Do we need to just use u32 as a argument to keep ioctl numbers same between 32 and 64bit versions? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:09 ` Pavel Machek @ 2007-04-26 15:53 ` Linus Torvalds 2007-04-26 18:21 ` Olivier Galibert 1 sibling, 0 replies; 713+ messages in thread From: Linus Torvalds @ 2007-04-26 15:53 UTC (permalink / raw) To: Pavel Machek Cc: Johannes Berg, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu, 26 Apr 2007, Pavel Machek wrote: > > #define SNAPSHOT_ATOMIC_SNAPSHOT _IOW(SNAPSHOT_IOC_MAGIC, 3, void *) > #define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long) > #define SNAPSHOT_AVAIL_SWAP _IOR(SNAPSHOT_IOC_MAGIC, 7, void *) > #define SNAPSHOT_GET_SWAP_PAGE _IOR(SNAPSHOT_IOC_MAGIC, 8, void *) > #define SNAPSHOT_SET_SWAP_FILE _IOW(SNAPSHOT_IOC_MAGIC, 10, unsigned int) > #define SNAPSHOT_PMOPS _IOW(SNAPSHOT_IOC_MAGIC, 12, unsigned int) > > Are these a problem? Do we need to just use u32 as a argument to keep > ioctl numbers same between 32 and 64bit versions? No, you need to use the *proper* type as an argument, and assuming that type has the same representation in both 32-bit and 64-bit world, the numbers will automatically match. Using "void *" is totally bogus. It's supposed to be the actual argument you pass in, not the pointer to it. If your argument doesn't have a "struct xyz" kind of format, then you could use "int" (or u32 or something: but realistically int is 32-bit for the foreseeable future), but it's always wrong to pass in "void *" or "unsigned long", since either of those are just a sign of the interface being either (a) misunderstood or (b) broken. Linus ^ permalink raw reply [flat|nested] 713+ messages in thread
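One way to read the "proper type" rule is that the size encoded in the command should equal the number of bytes actually transferred. The sketch below illustrates that for an AVAIL_SWAP-like command that returns a 64-bit value. It is a hand-rolled approximation of the asm-generic encoding, not the kernel's macros: EX_IOC(), EX_IOC_SIZE(), the two EX_AVAIL_SWAP variants and ex_size_matches() are all hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdint.h>

#define EX_IOC_SIZESHIFT 16
#define EX_IOC_SIZEMASK  0x3fffu	/* 14 size bits, asm-generic layout */
#define EX_IOC_READ      2u

/* Hypothetical equivalent of _IOR/_IOW: the declared type's size is
 * stored in the command number. */
#define EX_IOC(dir, type, nr, argtype)					\
	(((uint32_t)(dir) << 30) |					\
	 ((uint32_t)sizeof(argtype) << EX_IOC_SIZESHIFT) |		\
	 ((uint32_t)(type) << 8) | (uint32_t)(nr))

/* Recover the encoded argument size from a command number. */
#define EX_IOC_SIZE(cmd) (((cmd) >> EX_IOC_SIZESHIFT) & EX_IOC_SIZEMASK)

/* Declared with the payload type: always encodes 8 bytes. */
#define EX_AVAIL_SWAP       EX_IOC(EX_IOC_READ, '3', 7, uint64_t)
/* Declared with a pointer type: encodes 4 or 8 depending on the ABI. */
#define EX_AVAIL_SWAP_VOIDP EX_IOC(EX_IOC_READ, '3', 7, void *)

/* Nonzero when the encoded size equals the bytes really copied. */
int ex_size_matches(uint32_t cmd, uint32_t payload_bytes)
{
	return EX_IOC_SIZE(cmd) == payload_bytes;
}
```

Under these assumptions EX_AVAIL_SWAP is 0x80083307 on every build and always agrees with an 8-byte payload, whereas the void * variant only agrees by accident on builds where pointers happen to be 8 bytes wide.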
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:09 ` Pavel Machek 2007-04-26 15:53 ` Linus Torvalds @ 2007-04-26 18:21 ` Olivier Galibert 2007-04-26 21:30 ` Pavel Machek 1 sibling, 1 reply; 713+ messages in thread From: Olivier Galibert @ 2007-04-26 18:21 UTC (permalink / raw) To: Pavel Machek Cc: Johannes Berg, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki On Thu, Apr 26, 2007 at 01:09:53PM +0200, Pavel Machek wrote: > #define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long) So I'm not supposed to be able to suspend the 16Gb-ram, 32bits servers I have here? OG. ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 18:21 ` Olivier Galibert @ 2007-04-26 21:30 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 21:30 UTC (permalink / raw) To: Olivier Galibert, Johannes Berg, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven, Rafael J. Wysocki Hi! > > #define SNAPSHOT_SET_IMAGE_SIZE _IOW(SNAPSHOT_IOC_MAGIC, 6, unsigned long) > > So I'm not supposed to be able to suspend the 16Gb-ram, 32bits servers > I have here? (You are right, this should have been u64) Snapshot image is by design limited by the amount of lowmem. If you want to change that, this unsigned long limit will be the least of your problems. (And no, I'd not expect a loaded 6GB box to suspend properly. It will just realize it does not have enough lowmem and refuse to suspend). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 10:17 ` Johannes Berg 2007-04-26 10:30 ` Pavel Machek @ 2007-04-26 11:35 ` Christoph Hellwig 2007-04-26 12:15 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Christoph Hellwig @ 2007-04-26 11:35 UTC To: Johannes Berg Cc: Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Ingo Molnar, Andrew Morton, Arjan van de Ven On Thu, Apr 26, 2007 at 12:17:12PM +0200, Johannes Berg wrote: > On Tue, 2007-04-24 at 23:24 +0200, Pavel Machek wrote: > > > I believe uswsusp user/kernel separation is clean enough. Kernel > > provides "snapshot image" and "resume image". (Thanks go to Rafael for > > very clean interface). > > The interface isn't even 64/32-bit compatible... It's not. And it's one of the worst interfaces I've seen lately. Did anyone actually review this crap before it went in? I completely agree with Linus that these kinds of boundaries that lead to horribly complex ioctl interfaces are totally wrong. Now suspend2 wasn't exactly nice either when I last reviewed it, but we should probably give it another attempt if we can sort out a proper incremental merge.
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 11:35 ` Christoph Hellwig @ 2007-04-26 12:15 ` Ingo Molnar 2007-04-26 12:41 ` Pavel Machek 0 siblings, 1 reply; 713+ messages in thread From: Ingo Molnar @ 2007-04-26 12:15 UTC To: Christoph Hellwig, Johannes Berg, Pavel Machek, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Andrew Morton, Arjan van de Ven * Christoph Hellwig <hch@infradead.org> wrote: > > The interface isn't even 64/32-bit compatible... > > It's not. And it's one of the worst interfaces I've seen lately. Did > anyone actually review this crap before it went in? I completely > agree with Linus that these kinds of boundaries that lead to horribly > complex ioctl interfaces are totally wrong. it's a bit hard to see the point of it anyway: the resume binary (much of the focus of the ioctls) fundamentally lives as an 'initrd binary' - and most of the stuff that wants to execute in an initrd is fundamentally tied to the kernel anyway. Perhaps we should allow "in-kernel userspace" that would be allowed to grow ad-hoc interfaces and linking that would only be compatible with the kernel they are embedded into: e.g. the klibc stuff in linux/usr/* could link to the kernel (via whatever method) and just be in essence another type of kernel code - but happening to execute in user-space, having access to the normal user-space facilities and being able to link to (GPL) user-space libraries. Perhaps this would bridge the "i want to tinker in user-space because it's technically easier/cleaner there" and "fine but that needs formalized ABIs for your connection to kernel-space" gap. > Now suspend2 wasn't exactly nice either when I last reviewed it, but > we should probably give it another attempt if we can sort out a proper > incremental merge. yeah, it still has quite a bit of work left, but it looked fundamentally split-uppable. 
Ingo
* Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2: hang in atomic copy) 2007-04-26 12:15 ` Ingo Molnar @ 2007-04-26 12:41 ` Pavel Machek 0 siblings, 0 replies; 713+ messages in thread From: Pavel Machek @ 2007-04-26 12:41 UTC To: Ingo Molnar Cc: Christoph Hellwig, Johannes Berg, Linus Torvalds, Nick Piggin, Mike Galbraith, linux-kernel, Thomas Gleixner, Con Kolivas, suspend2-devel, Andrew Morton, Arjan van de Ven Hi! > > > The interface isn't even 64/32-bit compatible... > > > > It's not. And it's one of the worst interfaces I've seen lately. Did > > anyone actually review this crap before it went in? I completely > > agree with Linus that these kinds of boundaries that lead to horribly > > complex ioctl interfaces are totally wrong. > > it's a bit hard to see the point of it anyway: the resume binary (much > of the focus of the ioctls) fundamentally lives as an 'initrd binary' - > and most of the stuff that wants to execute in an initrd is > fundamentally tied to the kernel anyway. Typically... yes, it needs to be in initrd. And yes, klibc would help here. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 21:57 ` Christian Hesse 2007-04-18 22:02 ` Ingo Molnar @ 2007-04-18 22:16 ` Ingo Molnar 2007-04-18 23:12 ` Christian Hesse 2007-04-19 6:41 ` Ingo Molnar 1 sibling, 2 replies; 713+ messages in thread From: Ingo Molnar @ 2007-04-18 22:16 UTC To: Christian Hesse Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel * Christian Hesse <mail@earthworm.de> wrote: > Linux 2.6.21-rc7 > Suspend2 2.2.9.11 (applies cleanly to -rc7) > CFS v3 (without any additional patches) > > And it still hangs on suspend. i just tried the same and it suspended+resumed just fine: Restarting tasks ... done. Suspend2 debugging info: - Suspend core : 2.2.9.12 - Kernel Version : 2.6.21-rc7-CFS-v3 - Compiler vers. : 4.0 - Attempt number : 2 - Parameters : 0 81920 0 0 0 0 - Overall expected compression percentage: 0. - Compressor is 'lzf'. Compressed 31133696 bytes into 14880587 (52 percent compression). - SwapAllocator active. Swap available for image: 512036 pages. - FileAllocator inactive. - I/O speed: Write 76 MB/s, Read 42 MB/s. - Extra pages : 18 used/500. could you send me your .config? Ingo
* Re: CFS and suspend2: hang in atomic copy 2007-04-18 22:16 ` CFS and suspend2: hang in atomic copy Ingo Molnar @ 2007-04-18 23:12 ` Christian Hesse 2007-04-19 6:28 ` Ingo Molnar 2007-04-19 6:41 ` Ingo Molnar 1 sibling, 1 reply; 713+ messages in thread From: Christian Hesse @ 2007-04-18 23:12 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, suspend2-devel [-- Attachment #1.1: Type: text/plain, Size: 1040 bytes --] On Thursday 19 April 2007, Ingo Molnar wrote: > * Christian Hesse <mail@earthworm.de> wrote: > > Linux 2.6.21-rc7 > > Suspend2 2.2.9.11 (applies cleanly to -rc7) > > CFS v3 (without any additional patches) > > > > And it still hangs on suspend. > > i just tried the same and it suspended+resumed just fine: > > Restarting tasks ... done. > Suspend2 debugging info: > - Suspend core : 2.2.9.12 > - Kernel Version : 2.6.21-rc7-CFS-v3 > - Compiler vers. : 4.0 > - Attempt number : 2 > - Parameters : 0 81920 0 0 0 0 > - Overall expected compression percentage: 0. > - Compressor is 'lzf'. > Compressed 31133696 bytes into 14880587 (52 percent compression). > - SwapAllocator active. > Swap available for image: 512036 pages. > - FileAllocator inactive. > - I/O speed: Write 76 MB/s, Read 42 MB/s. > - Extra pages : 18 used/500. > > could you send me your .config? My config is attached. 
I now got some error message from my system: http://www.eworm.de/tmp/cfs-suspend.jpg -- Regards, Chris [-- Attachment #1.2: config-2.6.21-rc7-r1 --] [-- Type: text/plain, Size: 49289 bytes --] # # Automatically generated make config: don't edit # Linux kernel version: 2.6.21-rc7-r1 # Wed Apr 18 22:25:20 2007 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y # CONFIG_POSIX_MQUEUE is not set # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_IKPATCHES=y CONFIG_IKPATCHES_PROC=y # CONFIG_CPUSETS is not set # CONFIG_SYSFS_DEPRECATED is not set # CONFIG_RELAY is not set # CONFIG_BLK_DEV_INITRD is not set # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set # CONFIG_MODVERSIONS is not set # 
CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_BLOCK=y # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y # CONFIG_IOSCHED_AS is not set # CONFIG_IOSCHED_DEADLINE is not set CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # # CONFIG_TICK_ONESHOT is not set # CONFIG_NO_HZ is not set # CONFIG_HIGH_RES_TIMERS is not set CONFIG_SMP=y CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_PARAVIRT is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set CONFIG_MPENTIUMM=y # CONFIG_MCORE2 is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MK8 is not set # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 is not set # CONFIG_X86_GENERIC is not set CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y # 
CONFIG_HPET_TIMER is not set CONFIG_NR_CPUS=2 # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set # CONFIG_PREEMPT_BKL is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y # CONFIG_X86_MCE is not set CONFIG_VM86=y # CONFIG_TOSHIBA is not set # CONFIG_I8K is not set # CONFIG_X86_REBOOTFIXUPS is not set CONFIG_MICROCODE=y CONFIG_MICROCODE_OLD_INTERFACE=y # CONFIG_X86_MSR is not set # CONFIG_X86_CPUID is not set # # Firmware Drivers # # CONFIG_EDD is not set # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set CONFIG_NOHIGHMEM=y # CONFIG_HIGHMEM4G is not set # CONFIG_HIGHMEM64G is not set # CONFIG_VMSPLIT_3G is not set CONFIG_VMSPLIT_3G_OPT=y # CONFIG_VMSPLIT_2G is not set # CONFIG_VMSPLIT_1G is not set CONFIG_PAGE_OFFSET=0xB0000000 CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y CONFIG_SPLIT_PTLOCK_CPUS=4 # CONFIG_RESOURCES_64BIT is not set CONFIG_ZONE_DMA_FLAG=1 # CONFIG_MATH_EMULATION is not set CONFIG_MTRR=y # CONFIG_EFI is not set CONFIG_IRQBALANCE=y # CONFIG_SECCOMP is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 # CONFIG_KEXEC is not set CONFIG_PHYSICAL_START=0x100000 # CONFIG_RELOCATABLE is not set CONFIG_PHYSICAL_ALIGN=0x100000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set # # Power management options (ACPI, APM) # CONFIG_PM=y # CONFIG_PM_LEGACY is not set # CONFIG_PM_DEBUG is not set # CONFIG_PRINTK_NOSAVE is not set # CONFIG_PM_SYSFS_DEPRECATED is not set # CONFIG_SOFTWARE_SUSPEND is not set CONFIG_SUSPEND_SMP=y CONFIG_SUSPEND2_CORE=y # # Image Storage (you need at least one allocator) # # CONFIG_SUSPEND2_FILE is 
not set CONFIG_SUSPEND2_SWAP=y # # General Options # CONFIG_SUSPEND2_CRYPTO=y CONFIG_SUSPEND2_DEFAULT_RESUME2="/dev/hda2" # CONFIG_SUSPEND2_KEEP_IMAGE is not set CONFIG_SUSPEND2_REPLACE_SWSUSP=y # CONFIG_SUSPEND2_CHECKSUM is not set CONFIG_SUSPEND_SHARED=y CONFIG_SUSPEND2=y # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set # CONFIG_ACPI_PROCFS is not set CONFIG_ACPI_AC=y CONFIG_ACPI_BATTERY=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_FAN=y # CONFIG_ACPI_DOCK is not set CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_HOTPLUG_CPU=y CONFIG_ACPI_THERMAL=y # CONFIG_ACPI_ASUS is not set # CONFIG_ACPI_IBM is not set # CONFIG_ACPI_TOSHIBA is not set # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y # CONFIG_ACPI_SBS is not set # # APM (Advanced Power Management) BIOS Support # # CONFIG_APM is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y # CONFIG_CPU_FREQ_DEBUG is not set # CONFIG_CPU_FREQ_STAT is not set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=y # CONFIG_CPU_FREQ_GOV_USERSPACE is not set CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y # # CPUFreq processor drivers # CONFIG_X86_ACPI_CPUFREQ=y # CONFIG_X86_POWERNOW_K6 is not set # CONFIG_X86_POWERNOW_K7 is not set # CONFIG_X86_POWERNOW_K8 is not set # CONFIG_X86_GX_SUSPMOD is not set # CONFIG_X86_SPEEDSTEP_CENTRINO is not set # CONFIG_X86_SPEEDSTEP_ICH is not set # CONFIG_X86_SPEEDSTEP_SMI is not set # CONFIG_X86_P4_CLOCKMOD is not set # CONFIG_X86_CPUFREQ_NFORCE2 is not set # CONFIG_X86_LONGRUN is not set # CONFIG_X86_LONGHAUL is not set # CONFIG_X86_E_POWERSAVER is not set # # shared options # # 
CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set # CONFIG_X86_SPEEDSTEP_LIB is not set # # Bus options (PCI, PCMCIA, EISA, MCA, ISA) # CONFIG_PCI=y # CONFIG_PCI_GOBIOS is not set # CONFIG_PCI_GOMMCONFIG is not set # CONFIG_PCI_GODIRECT is not set CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y CONFIG_PCIEPORTBUS=y CONFIG_PCIEAER=y # CONFIG_PCI_MSI is not set CONFIG_HT_IRQ=y CONFIG_ISA_DMA_API=y # CONFIG_ISA is not set # CONFIG_MCA is not set # CONFIG_SCx200 is not set # # PCCARD (PCMCIA/CardBus) support # CONFIG_PCCARD=y # CONFIG_PCMCIA_DEBUG is not set CONFIG_PCMCIA=y # CONFIG_PCMCIA_LOAD_CIS is not set # CONFIG_PCMCIA_IOCTL is not set CONFIG_CARDBUS=y # # PC-card bridges # CONFIG_YENTA=y CONFIG_YENTA_O2=y CONFIG_YENTA_RICOH=y CONFIG_YENTA_TI=y CONFIG_YENTA_ENE_TUNE=y CONFIG_YENTA_TOSHIBA=y # CONFIG_PD6729 is not set # CONFIG_I82092 is not set CONFIG_PCCARD_NONSTATIC=y # # PCI Hotplug Support # # CONFIG_HOTPLUG_PCI is not set # # Executable file formats # CONFIG_BINFMT_ELF=y # CONFIG_BINFMT_AOUT is not set # CONFIG_BINFMT_MISC is not set # # Networking # CONFIG_NET=y # # Networking options # # CONFIG_NETDEBUG is not set CONFIG_PACKET=y # CONFIG_PACKET_MMAP is not set CONFIG_UNIX=y # CONFIG_NET_KEY is not set CONFIG_INET=y # CONFIG_IP_MULTICAST is not set # CONFIG_IP_ADVANCED_ROUTER is not set CONFIG_IP_FIB_HASH=y # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_ARPD is not set # CONFIG_SYN_COOKIES is not set # CONFIG_INET_AH is not set # CONFIG_INET_ESP is not set # CONFIG_INET_IPCOMP is not set # CONFIG_INET_XFRM_TUNNEL is not set # CONFIG_INET_TUNNEL is not set # CONFIG_INET_XFRM_MODE_TRANSPORT is not set # CONFIG_INET_XFRM_MODE_TUNNEL is not set # CONFIG_INET_XFRM_MODE_BEET is not set CONFIG_INET_DIAG=y CONFIG_INET_TCP_DIAG=y # CONFIG_TCP_CONG_ADVANCED is not set CONFIG_TCP_CONG_CUBIC=y CONFIG_DEFAULT_TCP_CONG="cubic" # CONFIG_TCP_MD5SIG is not set # # IP: Virtual Server Configuration # # 
CONFIG_IP_VS is not set CONFIG_IPV6=y # CONFIG_IPV6_PRIVACY is not set # CONFIG_IPV6_ROUTER_PREF is not set # CONFIG_INET6_AH is not set # CONFIG_INET6_ESP is not set # CONFIG_INET6_IPCOMP is not set # CONFIG_IPV6_MIP6 is not set # CONFIG_INET6_XFRM_TUNNEL is not set # CONFIG_INET6_TUNNEL is not set # CONFIG_INET6_XFRM_MODE_TRANSPORT is not set # CONFIG_INET6_XFRM_MODE_TUNNEL is not set # CONFIG_INET6_XFRM_MODE_BEET is not set # CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set # CONFIG_IPV6_SIT is not set # CONFIG_IPV6_TUNNEL is not set # CONFIG_IPV6_MULTIPLE_TABLES is not set # CONFIG_NETWORK_SECMARK is not set CONFIG_NETFILTER=y # CONFIG_NETFILTER_DEBUG is not set # # Core Netfilter Configuration # # CONFIG_NETFILTER_NETLINK is not set CONFIG_NF_CONNTRACK_ENABLED=y CONFIG_NF_CONNTRACK_SUPPORT=y # CONFIG_IP_NF_CONNTRACK_SUPPORT is not set CONFIG_NF_CONNTRACK=y # CONFIG_NF_CT_ACCT is not set # CONFIG_NF_CONNTRACK_MARK is not set # CONFIG_NF_CONNTRACK_EVENTS is not set # CONFIG_NF_CT_PROTO_SCTP is not set # CONFIG_NF_CONNTRACK_AMANDA is not set # CONFIG_NF_CONNTRACK_FTP is not set # CONFIG_NF_CONNTRACK_H323 is not set # CONFIG_NF_CONNTRACK_IRC is not set # CONFIG_NF_CONNTRACK_NETBIOS_NS is not set # CONFIG_NF_CONNTRACK_PPTP is not set # CONFIG_NF_CONNTRACK_SANE is not set # CONFIG_NF_CONNTRACK_SIP is not set # CONFIG_NF_CONNTRACK_TFTP is not set CONFIG_NETFILTER_XTABLES=y # CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set # CONFIG_NETFILTER_XT_TARGET_MARK is not set # CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set # CONFIG_NETFILTER_XT_TARGET_NFLOG is not set # CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set # CONFIG_NETFILTER_XT_MATCH_COMMENT is not set # CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set # CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set # CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set # CONFIG_NETFILTER_XT_MATCH_DCCP is not set # CONFIG_NETFILTER_XT_MATCH_DSCP is not set # CONFIG_NETFILTER_XT_MATCH_ESP is not set # CONFIG_NETFILTER_XT_MATCH_HELPER is not 
set # CONFIG_NETFILTER_XT_MATCH_LENGTH is not set CONFIG_NETFILTER_XT_MATCH_LIMIT=y # CONFIG_NETFILTER_XT_MATCH_MAC is not set # CONFIG_NETFILTER_XT_MATCH_MARK is not set # CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y # CONFIG_NETFILTER_XT_MATCH_QUOTA is not set # CONFIG_NETFILTER_XT_MATCH_REALM is not set # CONFIG_NETFILTER_XT_MATCH_SCTP is not set CONFIG_NETFILTER_XT_MATCH_STATE=y # CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set # CONFIG_NETFILTER_XT_MATCH_STRING is not set # CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set # CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set # # IP: Netfilter Configuration # CONFIG_NF_CONNTRACK_IPV4=y # CONFIG_NF_CONNTRACK_PROC_COMPAT is not set # CONFIG_IP_NF_QUEUE is not set CONFIG_IP_NF_IPTABLES=y # CONFIG_IP_NF_MATCH_IPRANGE is not set # CONFIG_IP_NF_MATCH_TOS is not set CONFIG_IP_NF_MATCH_RECENT=y # CONFIG_IP_NF_MATCH_ECN is not set # CONFIG_IP_NF_MATCH_AH is not set # CONFIG_IP_NF_MATCH_TTL is not set # CONFIG_IP_NF_MATCH_OWNER is not set # CONFIG_IP_NF_MATCH_ADDRTYPE is not set CONFIG_IP_NF_FILTER=y CONFIG_IP_NF_TARGET_REJECT=y CONFIG_IP_NF_TARGET_LOG=y # CONFIG_IP_NF_TARGET_ULOG is not set CONFIG_NF_NAT=y CONFIG_NF_NAT_NEEDED=y CONFIG_IP_NF_TARGET_MASQUERADE=y # CONFIG_IP_NF_TARGET_REDIRECT is not set # CONFIG_IP_NF_TARGET_NETMAP is not set # CONFIG_IP_NF_TARGET_SAME is not set # CONFIG_NF_NAT_SNMP_BASIC is not set # CONFIG_NF_NAT_FTP is not set # CONFIG_NF_NAT_IRC is not set # CONFIG_NF_NAT_TFTP is not set # CONFIG_NF_NAT_AMANDA is not set # CONFIG_NF_NAT_PPTP is not set # CONFIG_NF_NAT_H323 is not set # CONFIG_NF_NAT_SIP is not set # CONFIG_IP_NF_MANGLE is not set # CONFIG_IP_NF_RAW is not set # CONFIG_IP_NF_ARPTABLES is not set # # IPv6: Netfilter Configuration (EXPERIMENTAL) # CONFIG_NF_CONNTRACK_IPV6=y # CONFIG_IP6_NF_QUEUE is not set CONFIG_IP6_NF_IPTABLES=y # CONFIG_IP6_NF_MATCH_RT is not set # CONFIG_IP6_NF_MATCH_OPTS is not set # CONFIG_IP6_NF_MATCH_FRAG is not set # 
CONFIG_IP6_NF_MATCH_HL is not set # CONFIG_IP6_NF_MATCH_OWNER is not set # CONFIG_IP6_NF_MATCH_IPV6HEADER is not set # CONFIG_IP6_NF_MATCH_AH is not set # CONFIG_IP6_NF_MATCH_MH is not set # CONFIG_IP6_NF_MATCH_EUI64 is not set CONFIG_IP6_NF_FILTER=y CONFIG_IP6_NF_TARGET_LOG=y CONFIG_IP6_NF_TARGET_REJECT=y # CONFIG_IP6_NF_MANGLE is not set # CONFIG_IP6_NF_RAW is not set # # DCCP Configuration (EXPERIMENTAL) # # CONFIG_IP_DCCP is not set # # SCTP Configuration (EXPERIMENTAL) # # CONFIG_IP_SCTP is not set # # TIPC Configuration (EXPERIMENTAL) # # CONFIG_TIPC is not set # CONFIG_ATM is not set # CONFIG_BRIDGE is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set # CONFIG_LLC2 is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # # QoS and/or fair queueing # # CONFIG_NET_SCHED is not set # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_HAMRADIO is not set # CONFIG_IRDA is not set CONFIG_BT=y CONFIG_BT_L2CAP=y # CONFIG_BT_SCO is not set CONFIG_BT_RFCOMM=y CONFIG_BT_RFCOMM_TTY=y # CONFIG_BT_BNEP is not set CONFIG_BT_HIDP=y # # Bluetooth device drivers # CONFIG_BT_HCIUSB=y # CONFIG_BT_HCIUSB_SCO is not set # CONFIG_BT_HCIUART is not set # CONFIG_BT_HCIBCM203X is not set # CONFIG_BT_HCIBPA10X is not set # CONFIG_BT_HCIBFUSB is not set # CONFIG_BT_HCIDTL1 is not set # CONFIG_BT_HCIBT3C is not set # CONFIG_BT_HCIBLUECARD is not set # CONFIG_BT_HCIBTUART is not set # CONFIG_BT_HCIVHCI is not set CONFIG_CFG80211=y CONFIG_CFG80211_WEXT_COMPAT=y CONFIG_NL80211=y CONFIG_WIRELESS_EXT=y # CONFIG_MAC80211 is not set CONFIG_IEEE80211=y # CONFIG_IEEE80211_DEBUG is not set CONFIG_IEEE80211_CRYPT_WEP=y # CONFIG_IEEE80211_CRYPT_CCMP is not set # CONFIG_IEEE80211_CRYPT_TKIP is not set # CONFIG_IEEE80211_SOFTMAC is not set # CONFIG_IEEE80211_RADIOTAP is not set # # Device Drivers # # # Generic Driver Options # # CONFIG_STANDALONE is not set 
CONFIG_PREVENT_FIRMWARE_BUILD=y CONFIG_FW_LOADER=y # CONFIG_SYS_HYPERVISOR is not set # # Connector - unified userspace <-> kernelspace linker # # CONFIG_CONNECTOR is not set # # Memory Technology Devices (MTD) # # CONFIG_MTD is not set # # Parallel port support # # CONFIG_PARPORT is not set # # Plug and Play support # CONFIG_PNP=y # CONFIG_PNP_DEBUG is not set # # Protocols # CONFIG_PNPACPI=y # # Block devices # # CONFIG_BLK_DEV_FD is not set # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is not set # CONFIG_BLK_DEV_DAC960 is not set # CONFIG_BLK_DEV_UMEM is not set # CONFIG_BLK_DEV_COW_COMMON is not set CONFIG_BLK_DEV_LOOP=y # CONFIG_BLK_DEV_CRYPTOLOOP is not set # CONFIG_BLK_DEV_NBD is not set # CONFIG_BLK_DEV_SX8 is not set # CONFIG_BLK_DEV_UB is not set # CONFIG_BLK_DEV_RAM is not set # CONFIG_CDROM_PKTCDVD is not set # CONFIG_ATA_OVER_ETH is not set # # Misc devices # # CONFIG_IBM_ASM is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set # CONFIG_SONY_LAPTOP is not set # # ATA/ATAPI/MFM/RLL support # CONFIG_IDE=y CONFIG_BLK_DEV_IDE=y # # Please see Documentation/ide.txt for help/info on IDE drives # # CONFIG_BLK_DEV_IDE_SATA is not set # CONFIG_BLK_DEV_HD_IDE is not set CONFIG_BLK_DEV_IDEDISK=y CONFIG_IDEDISK_MULTI_MODE=y CONFIG_BLK_DEV_IDECS=y # CONFIG_BLK_DEV_DELKIN is not set CONFIG_BLK_DEV_IDECD=y # CONFIG_BLK_DEV_IDETAPE is not set # CONFIG_BLK_DEV_IDEFLOPPY is not set # CONFIG_BLK_DEV_IDESCSI is not set # CONFIG_BLK_DEV_IDEACPI is not set # CONFIG_IDE_TASK_IOCTL is not set # # IDE chipset support/bugfixes # # CONFIG_IDE_GENERIC is not set # CONFIG_BLK_DEV_CMD640 is not set # CONFIG_BLK_DEV_IDEPNP is not set CONFIG_BLK_DEV_IDEPCI=y # CONFIG_IDEPCI_SHARE_IRQ is not set # CONFIG_BLK_DEV_OFFBOARD is not set # CONFIG_BLK_DEV_GENERIC is not set # CONFIG_BLK_DEV_OPTI621 is not set # CONFIG_BLK_DEV_RZ1000 is not set CONFIG_BLK_DEV_IDEDMA_PCI=y # CONFIG_BLK_DEV_IDEDMA_FORCED is not set # CONFIG_IDEDMA_ONLYDISK is not set # 
CONFIG_BLK_DEV_AEC62XX is not set # CONFIG_BLK_DEV_ALI15X3 is not set # CONFIG_BLK_DEV_AMD74XX is not set # CONFIG_BLK_DEV_ATIIXP is not set # CONFIG_BLK_DEV_CMD64X is not set # CONFIG_BLK_DEV_TRIFLEX is not set # CONFIG_BLK_DEV_CY82C693 is not set # CONFIG_BLK_DEV_CS5520 is not set # CONFIG_BLK_DEV_CS5530 is not set # CONFIG_BLK_DEV_CS5535 is not set # CONFIG_BLK_DEV_HPT34X is not set # CONFIG_BLK_DEV_HPT366 is not set # CONFIG_BLK_DEV_JMICRON is not set # CONFIG_BLK_DEV_SC1200 is not set CONFIG_BLK_DEV_PIIX=y # CONFIG_BLK_DEV_IT8213 is not set # CONFIG_BLK_DEV_IT821X is not set # CONFIG_BLK_DEV_NS87415 is not set # CONFIG_BLK_DEV_PDC202XX_OLD is not set # CONFIG_BLK_DEV_PDC202XX_NEW is not set # CONFIG_BLK_DEV_SVWKS is not set # CONFIG_BLK_DEV_SIIMAGE is not set # CONFIG_BLK_DEV_SIS5513 is not set # CONFIG_BLK_DEV_SLC90E66 is not set # CONFIG_BLK_DEV_TRM290 is not set # CONFIG_BLK_DEV_VIA82CXXX is not set # CONFIG_BLK_DEV_TC86C001 is not set # CONFIG_IDE_ARM is not set CONFIG_BLK_DEV_IDEDMA=y # CONFIG_IDEDMA_IVB is not set # CONFIG_BLK_DEV_HD is not set # # SCSI device support # # CONFIG_RAID_ATTRS is not set CONFIG_SCSI=y # CONFIG_SCSI_TGT is not set # CONFIG_SCSI_NETLINK is not set # CONFIG_SCSI_PROC_FS is not set # # SCSI support type (disk, tape, CD-ROM) # CONFIG_BLK_DEV_SD=y # CONFIG_CHR_DEV_ST is not set # CONFIG_CHR_DEV_OSST is not set # CONFIG_BLK_DEV_SR is not set # CONFIG_CHR_DEV_SG is not set # CONFIG_CHR_DEV_SCH is not set # # Some SCSI devices (e.g. 
CD jukebox) support multiple LUNs # CONFIG_SCSI_MULTI_LUN=y # CONFIG_SCSI_CONSTANTS is not set # CONFIG_SCSI_LOGGING is not set # CONFIG_SCSI_SCAN_ASYNC is not set # # SCSI Transports # # CONFIG_SCSI_SPI_ATTRS is not set # CONFIG_SCSI_FC_ATTRS is not set # CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # # SCSI low-level drivers # # CONFIG_ISCSI_TCP is not set # CONFIG_BLK_DEV_3W_XXXX_RAID is not set # CONFIG_SCSI_3W_9XXX is not set # CONFIG_SCSI_ACARD is not set # CONFIG_SCSI_AACRAID is not set # CONFIG_SCSI_AIC7XXX is not set # CONFIG_SCSI_AIC7XXX_OLD is not set # CONFIG_SCSI_AIC79XX is not set # CONFIG_SCSI_AIC94XX is not set # CONFIG_SCSI_DPT_I2O is not set # CONFIG_SCSI_ADVANSYS is not set # CONFIG_SCSI_ARCMSR is not set # CONFIG_MEGARAID_NEWGEN is not set # CONFIG_MEGARAID_LEGACY is not set # CONFIG_MEGARAID_SAS is not set # CONFIG_SCSI_HPTIOP is not set # CONFIG_SCSI_BUSLOGIC is not set # CONFIG_SCSI_DMX3191D is not set # CONFIG_SCSI_EATA is not set # CONFIG_SCSI_FUTURE_DOMAIN is not set # CONFIG_SCSI_GDTH is not set # CONFIG_SCSI_IPS is not set # CONFIG_SCSI_INITIO is not set # CONFIG_SCSI_INIA100 is not set # CONFIG_SCSI_STEX is not set # CONFIG_SCSI_SYM53C8XX_2 is not set # CONFIG_SCSI_QLOGIC_1280 is not set # CONFIG_SCSI_QLA_FC is not set # CONFIG_SCSI_QLA_ISCSI is not set # CONFIG_SCSI_LPFC is not set # CONFIG_SCSI_DC395x is not set # CONFIG_SCSI_DC390T is not set # CONFIG_SCSI_NSP32 is not set # CONFIG_SCSI_DEBUG is not set # CONFIG_SCSI_SRP is not set # # PCMCIA SCSI adapter support # # CONFIG_PCMCIA_AHA152X is not set # CONFIG_PCMCIA_FDOMAIN is not set # CONFIG_PCMCIA_NINJA_SCSI is not set # CONFIG_PCMCIA_QLOGIC is not set # CONFIG_PCMCIA_SYM53C500 is not set # # Serial ATA (prod) and Parallel ATA (experimental) drivers # # CONFIG_ATA is not set # # Multi-device support (RAID and LVM) # CONFIG_MD=y # CONFIG_BLK_DEV_MD is not set CONFIG_BLK_DEV_DM=y # CONFIG_DM_DEBUG is not set 
CONFIG_DM_CRYPT=y # CONFIG_DM_SNAPSHOT is not set # CONFIG_DM_MIRROR is not set # CONFIG_DM_ZERO is not set # CONFIG_DM_MULTIPATH is not set # # Fusion MPT device support # # CONFIG_FUSION is not set # CONFIG_FUSION_SPI is not set # CONFIG_FUSION_FC is not set # CONFIG_FUSION_SAS is not set # # IEEE 1394 (FireWire) support # CONFIG_IEEE1394=y # # Subsystem Options # # CONFIG_IEEE1394_VERBOSEDEBUG is not set CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y CONFIG_IEEE1394_CONFIG_ROM_IP1394=y # # Device Drivers # # CONFIG_IEEE1394_PCILYNX is not set CONFIG_IEEE1394_OHCI1394=y # # Protocol Drivers # # CONFIG_IEEE1394_VIDEO1394 is not set CONFIG_IEEE1394_SBP2=y CONFIG_IEEE1394_SBP2_PHYS_DMA=y CONFIG_IEEE1394_ETH1394=y # CONFIG_IEEE1394_DV1394 is not set CONFIG_IEEE1394_RAWIO=y # # I2O device support # # CONFIG_I2O is not set # # Macintosh device drivers # # CONFIG_MAC_EMUMOUSEBTN is not set # # Network device support # CONFIG_NETDEVICES=y CONFIG_DUMMY=y # CONFIG_BONDING is not set # CONFIG_EQUALIZER is not set CONFIG_TUN=y # CONFIG_NET_SB1000 is not set # # ARCnet devices # # CONFIG_ARCNET is not set # # PHY device support # # # Ethernet (10 or 100Mbit) # # CONFIG_NET_ETHERNET is not set # # Ethernet (1000 Mbit) # # CONFIG_ACENIC is not set # CONFIG_DL2K is not set # CONFIG_E1000 is not set # CONFIG_NS83820 is not set # CONFIG_HAMACHI is not set # CONFIG_YELLOWFIN is not set # CONFIG_R8169 is not set # CONFIG_SIS190 is not set # CONFIG_SKGE is not set # CONFIG_SKY2 is not set # CONFIG_SK98LIN is not set CONFIG_TIGON3=y # CONFIG_BNX2 is not set # CONFIG_QLA3XXX is not set # CONFIG_ATL1 is not set # # Ethernet (10000 Mbit) # # CONFIG_CHELSIO_T1 is not set # CONFIG_CHELSIO_T3 is not set # CONFIG_IXGB is not set # CONFIG_S2IO is not set # CONFIG_MYRI10GE is not set # CONFIG_NETXEN_NIC is not set # # Token Ring devices # # CONFIG_TR is not set # # Wireless LAN (non-hamradio) # CONFIG_NET_RADIO=y # CONFIG_NET_WIRELESS_RTNETLINK is not set # # Obsolete Wireless cards support (pre-802.11) 
# # CONFIG_STRIP is not set # CONFIG_PCMCIA_WAVELAN is not set # CONFIG_PCMCIA_NETWAVE is not set # # Wireless 802.11 Frequency Hopping cards support # # CONFIG_PCMCIA_RAYCS is not set # # Wireless 802.11b ISA/PCI cards support # # CONFIG_IPW2100 is not set # CONFIG_IPW2200 is not set # CONFIG_AIRO is not set # CONFIG_HERMES is not set # CONFIG_ATMEL is not set # # Wireless 802.11b Pcmcia/Cardbus cards support # # CONFIG_AIRO_CS is not set # CONFIG_PCMCIA_WL3501 is not set # # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support # # CONFIG_PRISM54 is not set # CONFIG_USB_ZD1201 is not set CONFIG_HOSTAP=y # CONFIG_HOSTAP_FIRMWARE is not set # CONFIG_HOSTAP_PLX is not set # CONFIG_HOSTAP_PCI is not set CONFIG_HOSTAP_CS=y CONFIG_NET_WIRELESS=y CONFIG_IPW3945=m # CONFIG_IPW3945_DEBUG is not set # CONFIG_IPW3945_MONITOR is not set # CONFIG_IPW3945_PROMISCUOUS is not set # # PCMCIA network device support # # CONFIG_NET_PCMCIA is not set # # Wan interfaces # # CONFIG_WAN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set CONFIG_PPP=y # CONFIG_PPP_MULTILINK is not set # CONFIG_PPP_FILTER is not set CONFIG_PPP_ASYNC=y # CONFIG_PPP_SYNC_TTY is not set CONFIG_PPP_DEFLATE=y # CONFIG_PPP_BSDCOMP is not set # CONFIG_PPP_MPPE is not set # CONFIG_PPPOE is not set # CONFIG_SLIP is not set CONFIG_SLHC=y # CONFIG_NET_FC is not set # CONFIG_SHAPER is not set CONFIG_NETCONSOLE=m CONFIG_NETPOLL=y # CONFIG_NETPOLL_RX is not set # CONFIG_NETPOLL_TRAP is not set CONFIG_NET_POLL_CONTROLLER=y # # ISDN subsystem # # CONFIG_ISDN is not set # # Telephony Support # # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set # # Userland interfaces # CONFIG_INPUT_MOUSEDEV=y # CONFIG_INPUT_MOUSEDEV_PSAUX is not set CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768 # CONFIG_INPUT_JOYDEV is not set # CONFIG_INPUT_TSDEV is not set CONFIG_INPUT_EVDEV=y # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # 
CONFIG_INPUT_KEYBOARD=y CONFIG_KEYBOARD_ATKBD=y # CONFIG_KEYBOARD_SUNKBD is not set # CONFIG_KEYBOARD_LKKBD is not set # CONFIG_KEYBOARD_XTKBD is not set # CONFIG_KEYBOARD_NEWTON is not set # CONFIG_KEYBOARD_STOWAWAY is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_VSXXXAA is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TOUCHSCREEN is not set CONFIG_INPUT_MISC=y CONFIG_INPUT_PCSPKR=y # CONFIG_INPUT_WISTRON_BTNS is not set # CONFIG_INPUT_ATLAS_BTNS is not set # CONFIG_INPUT_UINPUT is not set # # Hardware I/O ports # CONFIG_SERIO=y CONFIG_SERIO_I8042=y # CONFIG_SERIO_SERPORT is not set # CONFIG_SERIO_CT82C710 is not set # CONFIG_SERIO_PCIPS2 is not set CONFIG_SERIO_LIBPS2=y # CONFIG_SERIO_RAW is not set # CONFIG_GAMEPORT is not set # # Character devices # CONFIG_VT=y CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y # CONFIG_VT_HW_CONSOLE_BINDING is not set # CONFIG_SERIAL_NONSTANDARD is not set # # Serial drivers # CONFIG_SERIAL_8250=y # CONFIG_SERIAL_8250_CONSOLE is not set CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y # CONFIG_SERIAL_8250_CS is not set CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 # CONFIG_SERIAL_8250_EXTENDED is not set # # Non-8250 serial port support # CONFIG_SERIAL_CORE=y # CONFIG_SERIAL_JSM is not set CONFIG_UNIX98_PTYS=y CONFIG_LEGACY_PTYS=y CONFIG_LEGACY_PTY_COUNT=256 # # IPMI # # CONFIG_IPMI_HANDLER is not set # # Watchdog Cards # # CONFIG_WATCHDOG is not set # CONFIG_HW_RANDOM is not set # CONFIG_NVRAM is not set CONFIG_RTC=y # CONFIG_DTLK is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # CONFIG_SONYPI is not set CONFIG_AGP=y # CONFIG_AGP_ALI is not set # CONFIG_AGP_ATI is not set # CONFIG_AGP_AMD is not set # CONFIG_AGP_AMD64 is not set CONFIG_AGP_INTEL=y # CONFIG_AGP_NVIDIA is not set # CONFIG_AGP_SIS is not set # CONFIG_AGP_SWORKS is not set # CONFIG_AGP_VIA is not set # CONFIG_AGP_EFFICEON is not set CONFIG_DRM=y # CONFIG_DRM_TDFX is not set 
# CONFIG_DRM_R128 is not set # CONFIG_DRM_RADEON is not set # CONFIG_DRM_I810 is not set # CONFIG_DRM_I830 is not set CONFIG_DRM_I915=y # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # # PCMCIA character devices # # CONFIG_SYNCLINK_CS is not set # CONFIG_CARDMAN_4000 is not set # CONFIG_CARDMAN_4040 is not set # CONFIG_MWAVE is not set # CONFIG_PC8736x_GPIO is not set # CONFIG_NSC_GPIO is not set # CONFIG_CS5535_GPIO is not set # CONFIG_RAW_DRIVER is not set # CONFIG_HPET is not set # CONFIG_HANGCHECK_TIMER is not set # # TPM devices # # CONFIG_TCG_TPM is not set # CONFIG_TELCLOCK is not set # # I2C support # CONFIG_I2C=y # CONFIG_I2C_CHARDEV is not set # # I2C Algorithms # # CONFIG_I2C_ALGOBIT is not set # CONFIG_I2C_ALGOPCF is not set # CONFIG_I2C_ALGOPCA is not set # # I2C Hardware Bus support # # CONFIG_I2C_ALI1535 is not set # CONFIG_I2C_ALI1563 is not set # CONFIG_I2C_ALI15X3 is not set # CONFIG_I2C_AMD756 is not set # CONFIG_I2C_AMD8111 is not set # CONFIG_I2C_I801 is not set # CONFIG_I2C_I810 is not set # CONFIG_I2C_PIIX4 is not set # CONFIG_I2C_NFORCE2 is not set # CONFIG_I2C_OCORES is not set # CONFIG_I2C_PARPORT_LIGHT is not set # CONFIG_I2C_PASEMI is not set # CONFIG_I2C_PROSAVAGE is not set # CONFIG_I2C_SAVAGE4 is not set # CONFIG_SCx200_ACB is not set # CONFIG_I2C_SIS5595 is not set # CONFIG_I2C_SIS630 is not set # CONFIG_I2C_SIS96X is not set # CONFIG_I2C_STUB is not set # CONFIG_I2C_VIA is not set # CONFIG_I2C_VIAPRO is not set # CONFIG_I2C_VOODOO3 is not set # CONFIG_I2C_PCA_ISA is not set # # Miscellaneous I2C Chip support # # CONFIG_SENSORS_DS1337 is not set # CONFIG_SENSORS_DS1374 is not set # CONFIG_SENSORS_EEPROM is not set # CONFIG_SENSORS_PCF8574 is not set # CONFIG_SENSORS_PCA9539 is not set # CONFIG_SENSORS_PCF8591 is not set # CONFIG_SENSORS_MAX6875 is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set # 
CONFIG_I2C_DEBUG_CHIP is not set # # SPI support # # CONFIG_SPI is not set # CONFIG_SPI_MASTER is not set # # Dallas's 1-wire bus # # CONFIG_W1 is not set # # Hardware Monitoring support # # CONFIG_HWMON is not set # CONFIG_HWMON_VID is not set # # Multifunction device drivers # # CONFIG_MFD_SM501 is not set # # Multimedia devices # CONFIG_VIDEO_DEV=y # CONFIG_VIDEO_V4L1 is not set # CONFIG_VIDEO_V4L1_COMPAT is not set CONFIG_VIDEO_V4L2=y # # Video Capture Adapters # # # Video Capture Adapters # # CONFIG_VIDEO_ADV_DEBUG is not set # CONFIG_VIDEO_HELPER_CHIPS_AUTO is not set # # Encoders/decoders and other helper chips # # # Audio decoders # # CONFIG_VIDEO_TDA9840 is not set # CONFIG_VIDEO_TEA6415C is not set # CONFIG_VIDEO_TEA6420 is not set # CONFIG_VIDEO_MSP3400 is not set # CONFIG_VIDEO_CS53L32A is not set # CONFIG_VIDEO_TLV320AIC23B is not set # CONFIG_VIDEO_WM8775 is not set # CONFIG_VIDEO_WM8739 is not set # # Video decoders # # CONFIG_VIDEO_OV7670 is not set # CONFIG_VIDEO_SAA711X is not set # CONFIG_VIDEO_TVP5150 is not set # # Video and audio decoders # # CONFIG_VIDEO_CX25840 is not set # # MPEG video encoders # # CONFIG_VIDEO_CX2341X is not set # # Video encoders # # CONFIG_VIDEO_SAA7127 is not set # # Video improvement chips # # CONFIG_VIDEO_UPD64031A is not set # CONFIG_VIDEO_UPD64083 is not set # CONFIG_VIDEO_VIVI is not set # CONFIG_VIDEO_SAA5246A is not set # CONFIG_VIDEO_SAA5249 is not set # CONFIG_VIDEO_SAA7134 is not set # CONFIG_VIDEO_HEXIUM_ORION is not set # CONFIG_VIDEO_HEXIUM_GEMINI is not set # CONFIG_VIDEO_CX88 is not set # CONFIG_VIDEO_CAFE_CCIC is not set # # V4L USB devices # # CONFIG_VIDEO_PVRUSB2 is not set # CONFIG_VIDEO_USBVISION is not set # # Radio Adapters # # CONFIG_RADIO_GEMTEK_PCI is not set # CONFIG_RADIO_MAXIRADIO is not set # CONFIG_RADIO_MAESTRO is not set # CONFIG_USB_DSBR is not set # # Digital Video Broadcasting Devices # CONFIG_DVB=y CONFIG_DVB_CORE=y # CONFIG_DVB_CORE_ATTACH is not set # # Supported SAA7146 based PCI 
Adapters # # # Supported USB Adapters # # CONFIG_DVB_USB is not set # CONFIG_DVB_TTUSB_BUDGET is not set # CONFIG_DVB_TTUSB_DEC is not set CONFIG_DVB_CINERGYT2=y # CONFIG_DVB_CINERGYT2_TUNING is not set # # Supported FlexCopII (B2C2) Adapters # # CONFIG_DVB_B2C2_FLEXCOP is not set # # Supported BT878 Adapters # # # Supported Pluto2 Adapters # # CONFIG_DVB_PLUTO2 is not set # # Supported DVB Frontends # # # Customise DVB Frontends # # CONFIG_DVB_FE_CUSTOMISE is not set # # DVB-S (satellite) frontends # # CONFIG_DVB_STV0299 is not set # CONFIG_DVB_CX24110 is not set # CONFIG_DVB_CX24123 is not set # CONFIG_DVB_TDA8083 is not set # CONFIG_DVB_MT312 is not set # CONFIG_DVB_VES1X93 is not set # CONFIG_DVB_S5H1420 is not set # CONFIG_DVB_TDA10086 is not set # # DVB-T (terrestrial) frontends # # CONFIG_DVB_SP8870 is not set # CONFIG_DVB_SP887X is not set # CONFIG_DVB_CX22700 is not set # CONFIG_DVB_CX22702 is not set # CONFIG_DVB_L64781 is not set # CONFIG_DVB_TDA1004X is not set # CONFIG_DVB_NXT6000 is not set # CONFIG_DVB_MT352 is not set # CONFIG_DVB_ZL10353 is not set # CONFIG_DVB_DIB3000MB is not set # CONFIG_DVB_DIB3000MC is not set # CONFIG_DVB_DIB7000M is not set # CONFIG_DVB_DIB7000P is not set # # DVB-C (cable) frontends # # CONFIG_DVB_VES1820 is not set # CONFIG_DVB_TDA10021 is not set # CONFIG_DVB_STV0297 is not set # # ATSC (North American/Korean Terrestrial/Cable DTV) frontends # # CONFIG_DVB_NXT200X is not set # CONFIG_DVB_OR51211 is not set # CONFIG_DVB_OR51132 is not set # CONFIG_DVB_BCM3510 is not set # CONFIG_DVB_LGDT330X is not set # # Tuners/PLL support # # CONFIG_DVB_TDA826X is not set # CONFIG_DVB_TUNER_QT1010 is not set # CONFIG_DVB_TUNER_MT2060 is not set # CONFIG_DVB_TUNER_LGH06XF is not set # # Miscellaneous devices # # CONFIG_DVB_LNBP21 is not set # CONFIG_DVB_ISL6421 is not set # CONFIG_DVB_TUA6100 is not set # CONFIG_USB_DABUSB is not set # # Graphics support # # CONFIG_BACKLIGHT_LCD_SUPPORT is not set CONFIG_FB=y # CONFIG_FIRMWARE_EDID is 
not set # CONFIG_FB_DDC is not set CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y # CONFIG_FB_SVGALIB is not set # CONFIG_FB_MACMODES is not set # CONFIG_FB_BACKLIGHT is not set CONFIG_FB_MODE_HELPERS=y # CONFIG_FB_TILEBLITTING is not set # # Frambuffer hardware drivers # # CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set CONFIG_FB_VESA=y # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set # CONFIG_FB_RIVA is not set # CONFIG_FB_I810 is not set # CONFIG_FB_INTEL is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON is not set # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # CONFIG_FB_S3 is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_CYBLA is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y # CONFIG_VGACON_SOFT_SCROLLBACK is not set CONFIG_VIDEO_SELECT=y CONFIG_DUMMY_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE=y # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set # CONFIG_FONTS is not set CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y # # Logo configuration # CONFIG_LOGO=y # CONFIG_LOGO_LINUX_MONO is not set # CONFIG_LOGO_LINUX_VGA16 is not set CONFIG_LOGO_LINUX_CLUT224=y # CONFIG_FB_SPLASH is not set # # Sound # CONFIG_SOUND=y # # Advanced Linux Sound Architecture # CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_HWDEP=y CONFIG_SND_RAWMIDI=y # CONFIG_SND_SEQUENCER is not set CONFIG_SND_OSSEMUL=y # CONFIG_SND_MIXER_OSS is not set CONFIG_SND_PCM_OSS=y # CONFIG_SND_PCM_OSS_PLUGINS is not set CONFIG_SND_RTCTIMER=y # CONFIG_SND_DYNAMIC_MINORS is not set # CONFIG_SND_SUPPORT_OLD_API is not 
set # CONFIG_SND_VERBOSE_PROCFS is not set # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set # # Generic devices # # CONFIG_SND_DUMMY is not set # CONFIG_SND_MTPAV is not set # CONFIG_SND_SERIAL_U16550 is not set # CONFIG_SND_MPU401 is not set # # PCI devices # # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set # CONFIG_SND_ATIIXP is not set # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AZT3328 is not set # CONFIG_SND_BT87X is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_CS5535AUDIO is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set # CONFIG_SND_ENS1371 is not set # CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set # CONFIG_SND_INTEL8X0 is not set # CONFIG_SND_INTEL8X0M is not set # CONFIG_SND_KORG1212 is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SONICVIBES is not set # CONFIG_SND_TRIDENT is not set # CONFIG_SND_VIA82XX is not set # 
CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set # # USB devices # CONFIG_SND_USB_AUDIO=y # CONFIG_SND_USB_USX2Y is not set # # PCMCIA devices # # CONFIG_SND_VXPOCKET is not set # CONFIG_SND_PDAUDIOCF is not set # # SoC audio support # # CONFIG_SND_SOC is not set # # Open Sound System # # CONFIG_SOUND_PRIME is not set # # HID Devices # CONFIG_HID=y # CONFIG_HID_DEBUG is not set # # USB support # CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_SUSPEND is not set # CONFIG_USB_OTG is not set # # USB Host Controller Drivers # CONFIG_USB_EHCI_HCD=y # CONFIG_USB_EHCI_SPLIT_ISO is not set # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set # CONFIG_USB_EHCI_BIG_ENDIAN_MMIO is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_OHCI_HCD is not set CONFIG_USB_UHCI_HCD=y # CONFIG_USB_SL811_HCD is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set # CONFIG_USB_PRINTER is not set # # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' # # # may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set # CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_KARMA is not set # CONFIG_USB_LIBUSUAL is not set # # USB Input Devices # CONFIG_USB_HID=y # CONFIG_USB_HIDINPUT_POWERBOOK is not set # CONFIG_HID_FF is not set # CONFIG_USB_HIDDEV is not set # CONFIG_USB_AIPTEK is not set # CONFIG_USB_WACOM is not set # CONFIG_USB_ACECAD is 
not set # CONFIG_USB_KBTAB is not set # CONFIG_USB_POWERMATE is not set # CONFIG_USB_TOUCHSCREEN is not set # CONFIG_USB_YEALINK is not set # CONFIG_USB_XPAD is not set # CONFIG_USB_ATI_REMOTE is not set # CONFIG_USB_ATI_REMOTE2 is not set # CONFIG_USB_KEYSPAN_REMOTE is not set # CONFIG_USB_APPLETOUCH is not set # CONFIG_USB_GTCO is not set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # # USB Network Adapters # # CONFIG_USB_CATC is not set # CONFIG_USB_KAWETH is not set # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET_MII is not set CONFIG_USB_USBNET=y CONFIG_USB_NET_CDCETHER=y # CONFIG_USB_NET_DM9601 is not set # CONFIG_USB_NET_GL620A is not set # CONFIG_USB_NET_NET1080 is not set # CONFIG_USB_NET_PLUSB is not set # CONFIG_USB_NET_MCS7830 is not set # CONFIG_USB_NET_RNDIS_HOST is not set # CONFIG_USB_NET_CDC_SUBSET is not set # CONFIG_USB_NET_ZAURUS is not set # CONFIG_USB_MON is not set # # USB port drivers # # # USB Serial Converter support # CONFIG_USB_SERIAL=y # CONFIG_USB_SERIAL_CONSOLE is not set # CONFIG_USB_SERIAL_GENERIC is not set # CONFIG_USB_SERIAL_AIRCABLE is not set # CONFIG_USB_SERIAL_AIRPRIME is not set # CONFIG_USB_SERIAL_ARK3116 is not set # CONFIG_USB_SERIAL_BELKIN is not set # CONFIG_USB_SERIAL_WHITEHEAT is not set # CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set # CONFIG_USB_SERIAL_CP2101 is not set # CONFIG_USB_SERIAL_CYPRESS_M8 is not set # CONFIG_USB_SERIAL_EMPEG is not set # CONFIG_USB_SERIAL_FTDI_SIO is not set # CONFIG_USB_SERIAL_FUNSOFT is not set # CONFIG_USB_SERIAL_VISOR is not set # CONFIG_USB_SERIAL_IPAQ is not set # CONFIG_USB_SERIAL_IR is not set # CONFIG_USB_SERIAL_EDGEPORT is not set # CONFIG_USB_SERIAL_EDGEPORT_TI is not set CONFIG_USB_SERIAL_GARMIN=y # CONFIG_USB_SERIAL_IPW is not set # CONFIG_USB_SERIAL_KEYSPAN_PDA is not set # CONFIG_USB_SERIAL_KEYSPAN is not set # CONFIG_USB_SERIAL_KLSI is not set # CONFIG_USB_SERIAL_KOBIL_SCT is not set # 
CONFIG_USB_SERIAL_MCT_U232 is not set # CONFIG_USB_SERIAL_MOS7720 is not set # CONFIG_USB_SERIAL_MOS7840 is not set # CONFIG_USB_SERIAL_NAVMAN is not set CONFIG_USB_SERIAL_PL2303=y # CONFIG_USB_SERIAL_HP4X is not set # CONFIG_USB_SERIAL_SAFE is not set # CONFIG_USB_SERIAL_SIERRAWIRELESS is not set # CONFIG_USB_SERIAL_TI is not set # CONFIG_USB_SERIAL_CYBERJACK is not set # CONFIG_USB_SERIAL_XIRCOM is not set # CONFIG_USB_SERIAL_OPTION is not set # CONFIG_USB_SERIAL_OMNINET is not set # CONFIG_USB_SERIAL_DEBUG is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_AUERSWALD is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_BERRY_CHARGE is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_PHIDGET is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # CONFIG_USB_IOWARRIOR is not set # CONFIG_USB_TEST is not set # # USB DSL modem support # # # USB Gadget Support # # CONFIG_USB_GADGET is not set # # MMC/SD Card support # CONFIG_MMC=y # CONFIG_MMC_DEBUG is not set CONFIG_MMC_BLOCK=y CONFIG_MMC_SDHCI=y # CONFIG_MMC_WBSD is not set # CONFIG_MMC_TIFM_SD is not set # # LED devices # # CONFIG_NEW_LEDS is not set # # LED drivers # # # LED Triggers # # # InfiniBand support # # CONFIG_INFINIBAND is not set # # EDAC - error detection and reporting (RAS) (EXPERIMENTAL) # CONFIG_EDAC=y # # Reporting subsystems # # CONFIG_EDAC_DEBUG is not set CONFIG_EDAC_MM_EDAC=y # CONFIG_EDAC_AMD76X is not set # CONFIG_EDAC_E7XXX is not set # CONFIG_EDAC_E752X is not set # CONFIG_EDAC_I82875P is not set # CONFIG_EDAC_I82860 is not set # CONFIG_EDAC_R82600 is not set CONFIG_EDAC_POLL=y # # Real Time Clock # # 
CONFIG_RTC_CLASS is not set # # DMA Engine support # # CONFIG_DMA_ENGINE is not set # # DMA Clients # # # DMA Devices # # # Auxiliary Display support # # # Virtualization # CONFIG_KVM=y CONFIG_KVM_INTEL=y # CONFIG_KVM_AMD is not set # # File systems # CONFIG_EXT2_FS=y # CONFIG_EXT2_FS_XATTR is not set # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=y # CONFIG_EXT3_FS_XATTR is not set # CONFIG_EXT4DEV_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set # CONFIG_FS_POSIX_ACL is not set CONFIG_XFS_FS=m # CONFIG_XFS_QUOTA is not set # CONFIG_XFS_SECURITY is not set # CONFIG_XFS_POSIX_ACL is not set # CONFIG_XFS_RT is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set # CONFIG_MINIX_FS is not set # CONFIG_ROMFS_FS is not set CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y # CONFIG_QUOTA is not set CONFIG_DNOTIFY=y # CONFIG_AUTOFS_FS is not set # CONFIG_AUTOFS4_FS is not set CONFIG_FUSE_FS=y # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_UDF_FS=y CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=y CONFIG_MSDOS_FS=y CONFIG_VFAT_FS=y CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1" # CONFIG_NTFS_FS is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set # CONFIG_HUGETLBFS is not set # CONFIG_HUGETLB_PAGE is not set CONFIG_RAMFS=y # CONFIG_CONFIGFS_FS is not set # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_HFS_FS is not set # CONFIG_HFSPLUS_FS is not set # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set # CONFIG_CRAMFS is not set # CONFIG_VXFS_FS is not set # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set # CONFIG_UFS_FS is not set # # Network File Systems # # CONFIG_NFS_FS is not set # CONFIG_NFSD is not set # 
CONFIG_SMB_FS is not set # CONFIG_CIFS is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # CONFIG_9P_FS is not set # # Partition Types # # CONFIG_PARTITION_ADVANCED is not set CONFIG_MSDOS_PARTITION=y # # Native Language Support # CONFIG_NLS=y CONFIG_NLS_DEFAULT="iso8859-1" CONFIG_NLS_CODEPAGE_437=y # CONFIG_NLS_CODEPAGE_737 is not set # CONFIG_NLS_CODEPAGE_775 is not set CONFIG_NLS_CODEPAGE_850=y # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_950 is not set # CONFIG_NLS_CODEPAGE_932 is not set # CONFIG_NLS_CODEPAGE_949 is not set # CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_CODEPAGE_1250 is not set # CONFIG_NLS_CODEPAGE_1251 is not set # CONFIG_NLS_ASCII is not set CONFIG_NLS_ISO8859_1=y # CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_13 is not set # CONFIG_NLS_ISO8859_14 is not set CONFIG_NLS_ISO8859_15=y # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_KOI8_U is not set CONFIG_NLS_UTF8=y # # Distributed Lock Manager # # CONFIG_DLM is not set # # Instrumentation Support # # CONFIG_PROFILING is not set # CONFIG_KPROBES is not set # # Kernel hacking # CONFIG_TRACE_IRQFLAGS_SUPPORT=y # CONFIG_PRINTK_TIME is not set # CONFIG_ENABLE_MUST_CHECK is not set CONFIG_MAGIC_SYSRQ=y # CONFIG_UNUSED_SYMBOLS is not set # CONFIG_DEBUG_FS is not set # CONFIG_HEADERS_CHECK is not set # 
CONFIG_DEBUG_KERNEL is not set CONFIG_LOG_BUF_SHIFT=15 CONFIG_DEBUG_BUGVERBOSE=y CONFIG_EARLY_PRINTK=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y CONFIG_DOUBLEFAULT=y # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITY is not set # # Cryptographic options # CONFIG_CRYPTO=y CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_BLKCIPHER=y CONFIG_CRYPTO_MANAGER=y # CONFIG_CRYPTO_HMAC is not set # CONFIG_CRYPTO_XCBC is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_MD4 is not set CONFIG_CRYPTO_MD5=y # CONFIG_CRYPTO_SHA1 is not set # CONFIG_CRYPTO_SHA256 is not set # CONFIG_CRYPTO_SHA512 is not set # CONFIG_CRYPTO_WP512 is not set # CONFIG_CRYPTO_TGR192 is not set # CONFIG_CRYPTO_GF128MUL is not set CONFIG_CRYPTO_ECB=y CONFIG_CRYPTO_CBC=y # CONFIG_CRYPTO_PCBC is not set # CONFIG_CRYPTO_LRW is not set # CONFIG_CRYPTO_DES is not set # CONFIG_CRYPTO_FCRYPT is not set # CONFIG_CRYPTO_BLOWFISH is not set # CONFIG_CRYPTO_TWOFISH is not set # CONFIG_CRYPTO_TWOFISH_586 is not set # CONFIG_CRYPTO_SERPENT is not set CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_586=y # CONFIG_CRYPTO_CAST5 is not set # CONFIG_CRYPTO_CAST6 is not set # CONFIG_CRYPTO_TEA is not set CONFIG_CRYPTO_ARC4=y # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_ANUBIS is not set # CONFIG_CRYPTO_DEFLATE is not set CONFIG_CRYPTO_LZF=y CONFIG_CRYPTO_MICHAEL_MIC=y # CONFIG_CRYPTO_CRC32C is not set # CONFIG_CRYPTO_CAMELLIA is not set # CONFIG_CRYPTO_TEST is not set # # Hardware crypto devices # # CONFIG_CRYPTO_DEV_PADLOCK is not set # CONFIG_CRYPTO_DEV_GEODE is not set # # Library routines # CONFIG_BITREVERSE=y CONFIG_CRC_CCITT=y # CONFIG_CRC16 is not set CONFIG_CRC32=y # CONFIG_LIBCRC32C is not set CONFIG_DYN_PAGEFLAGS=y CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_PLIST=y CONFIG_HAS_IOMEM=y CONFIG_HAS_IOPORT=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y CONFIG_X86_TRAMPOLINE=y CONFIG_KTIME_SCALAR=y [-- 
Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 23:12 ` Christian Hesse
@ 2007-04-19 6:28 ` Ingo Molnar
  2007-04-19 20:32 ` Christian Hesse
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-19 6:28 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

* Christian Hesse <mail@earthworm.de> wrote:

> I now got some error message from my system:
>
> http://www.eworm.de/tmp/cfs-suspend.jpg

ah, this pinpoints a bug: for performance reasons pick_next_task()
assumes that the runqueue is not empty - which is true for schedule(),
but not in migrate_dead_tasks(). Does the patch below fix the crash for
you?

	Ingo

---
 kernel/sched.c | 2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned
 	struct task_struct *next;
 
 	for (;;) {
+		if (!rq->nr_running)
+			break;
 		next = pick_next_task(rq, rq->curr);
 		if (!next)
 			break;

^ permalink raw reply [flat|nested] 713+ messages in thread
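The guard in the patch above can be illustrated outside the kernel. What follows is only a minimal sketch under stated assumptions: toy_rq, toy_task, toy_pick_next() and toy_drain() are hypothetical stand-ins, not the kernel's struct rq, pick_next_task() or migrate_dead_tasks(). It shows a pick function that requires a non-empty queue, and a drain loop that checks nr_running before every call, as the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; not the kernel's struct rq / task_struct. */
struct toy_task {
	int pid;
};

struct toy_rq {
	unsigned int nr_running;	/* number of queued tasks */
	struct toy_task *tasks[8];	/* tiny fixed-size stack of tasks */
};

/*
 * Fast path: the caller must guarantee nr_running > 0, in the same
 * spirit as the assumption described for pick_next_task().
 */
static struct toy_task *toy_pick_next(struct toy_rq *rq)
{
	assert(rq->nr_running > 0);
	return rq->tasks[--rq->nr_running];
}

/*
 * Drain loop shaped like the patched loop: test for an empty queue
 * before picking, so the fast path never sees nr_running == 0.
 * Returns how many tasks were taken off the queue.
 */
static int toy_drain(struct toy_rq *rq)
{
	int drained = 0;

	for (;;) {
		struct toy_task *next;

		if (!rq->nr_running)
			break;
		next = toy_pick_next(rq);
		if (!next)
			break;
		drained++;
	}
	return drained;
}
```

Calling toy_drain() on an already empty queue simply returns 0; without the nr_running test the same call would reach the fast path with nothing to pick.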
* Re: CFS and suspend2: hang in atomic copy
  2007-04-19 6:28 ` Ingo Molnar
@ 2007-04-19 20:32 ` Christian Hesse
  0 siblings, 0 replies; 713+ messages in thread
From: Christian Hesse @ 2007-04-19 20:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

[-- Attachment #1: Type: text/plain, Size: 997 bytes --]

On Thursday 19 April 2007, Ingo Molnar wrote:
> * Christian Hesse <mail@earthworm.de> wrote:
> > I now got some error message from my system:
> >
> > http://www.eworm.de/tmp/cfs-suspend.jpg
>
> ah, this pinpoints a bug: for performance reasons pick_next_task()
> assumes that the runqueue is not empty - which is true for schedule(),
> but not in migrate_dead_tasks(). Does the patch below fix the crash for
> you?
>
>  kernel/sched.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -4425,6 +4425,8 @@ static void migrate_dead_tasks(unsigned
>  	struct task_struct *next;
>
>  	for (;;) {
> +		if (!rq->nr_running)
> +			break;
>  		next = pick_next_task(rq, rq->curr);
>  		if (!next)
>  			break;

Suspend works perfectly with this patch. Thanks a lot and keep up the
good work!
-- 
Regards,
Chris

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy
  2007-04-18 22:16 ` CFS and suspend2: hang in atomic copy Ingo Molnar
  2007-04-18 23:12 ` Christian Hesse
@ 2007-04-19 6:41 ` Ingo Molnar
  1 sibling, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-19 6:41 UTC (permalink / raw)
  To: Christian Hesse
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	suspend2-devel

* Ingo Molnar <mingo@elte.hu> wrote:

> i just tried the same and it suspended+resumed just fine:
>
> Restarting tasks ... done.
> Suspend2 debugging info:
> - Suspend core : 2.2.9.12
> - Kernel Version : 2.6.21-rc7-CFS-v3

the key difference was that i should have attempted to sw-suspend to
disk on an SMP box - that's where the bug triggered.

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-18 16:46 ` Ingo Molnar
  2007-04-18 20:45 ` CFS and suspend2: hang in atomic copy Christian Hesse
@ 2007-04-19 9:32 ` Esben Nielsen
  2007-04-19 10:11 ` Ingo Molnar
  1 sibling, 1 reply; 713+ messages in thread
From: Esben Nielsen @ 2007-04-19 9:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel

On Wed, 18 Apr 2007, Ingo Molnar wrote:

>
> * Christian Hesse <mail@earthworm.de> wrote:
>
>> Hi Ingo and all,
>>
>> On Friday 13 April 2007, Ingo Molnar wrote:
>>> as usual, any sort of feedback, bugreports, fixes and suggestions are
>>> more than welcome,
>>
>> I just gave CFS a try on my system. From a user's point of view it
>> looks good so far. Thanks for your work.
>
> you are welcome!
>
>> However I found a problem: When trying to suspend a system patched
>> with suspend2 2.2.9.11 it hangs with "doing atomic copy". Pressing the
>> ESC key results in a message that it tries to abort suspend, but then
>> still hangs.
>
> i took a quick look at suspend2 and it makes some use of yield().
> There's a bug in CFS's yield code, i've attached a patch that should fix
> it, does it make any difference to the hang?
>
> 	Ingo
>
> Index: linux/kernel/sched_fair.c
> ===================================================================
> --- linux.orig/kernel/sched_fair.c
> +++ linux/kernel/sched_fair.c
> @@ -264,15 +264,26 @@ static void dequeue_task_fair(struct rq
>
>  /*
>   * sched_yield() support is very simple via the rbtree, we just
> - * dequeue and enqueue the task, which causes the task to
> - * roundrobin to the end of the tree:
> + * dequeue the task and move it to the rightmost position, which
> + * causes the task to roundrobin to the end of the tree.
>   */
>  static void requeue_task_fair(struct rq *rq, struct task_struct *p)
>  {
>  	dequeue_task_fair(rq, p);
>  	p->on_rq = 0;
> -	enqueue_task_fair(rq, p);
> +	/*
> +	 * Temporarily insert at the last position of the tree:
> +	 */
> +	p->fair_key = LLONG_MAX;
> +	__enqueue_task_fair(rq, p);
>  	p->on_rq = 1;
> +
> +	/*
> +	 * Update the key to the real value, so that when all other
> +	 * tasks from before the rightmost position have executed,
> +	 * this task is picked up again:
> +	 */
> +	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;

I don't think it is safe to change the key after inserting the element
in the tree. You end up with an unsorted tree, where new entries end up
in wrong places "randomly".

I think a better approach would be to keep track of the rightmost entry,
set the key to the rightmost's key +1 and then simply insert it there.

Esben

>

^ permalink raw reply [flat|nested] 713+ messages in thread
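The objection above, that a key must not change once its node sits in a sorted tree, can be reproduced with a few lines of C. This is a sketch under stated assumptions: it uses a plain unbalanced binary search tree rather than the kernel's rbtree, and toy_node, toy_insert() and toy_is_sorted() are hypothetical names introduced only for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical node type; stands in for a task keyed in a timeline. */
struct toy_node {
	long long key;
	struct toy_node *left, *right;
};

/* Ordinary BST insert: smaller keys go left, others go right. */
static void toy_insert(struct toy_node **root, struct toy_node *n)
{
	while (*root)
		root = (n->key < (*root)->key) ? &(*root)->left
					       : &(*root)->right;
	n->left = n->right = NULL;
	*root = n;
}

/* In-order walk; returns 1 while keys come out in non-decreasing order. */
static int toy_sorted(const struct toy_node *n, long long *prev)
{
	if (!n)
		return 1;
	if (!toy_sorted(n->left, prev))
		return 0;
	if (n->key < *prev)
		return 0;
	*prev = n->key;
	return toy_sorted(n->right, prev);
}

/* Returns 1 if the tree still satisfies the search-tree ordering. */
static int toy_is_sorted(const struct toy_node *root)
{
	long long prev = LLONG_MIN;

	return toy_sorted(root, &prev);
}
```

Inserting keys 10, 20 and LLONG_MAX gives a sorted tree. Overwriting the LLONG_MAX key with 5 afterwards leaves that node in the rightmost position, so toy_is_sorted() reports 0 and a leftmost-first pick would no longer find the smallest key.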
* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-19 9:32 ` CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]) Esben Nielsen
@ 2007-04-19 10:11 ` Ingo Molnar
  2007-04-19 10:18 ` Ingo Molnar
  0 siblings, 1 reply; 713+ messages in thread
From: Ingo Molnar @ 2007-04-19 10:11 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel

* Esben Nielsen <nielsen.esben@googlemail.com> wrote:

> >+	/*
> >+	 * Temporarily insert at the last position of the tree:
> >+	 */
> >+	p->fair_key = LLONG_MAX;
> >+	__enqueue_task_fair(rq, p);
> > 	p->on_rq = 1;
> >+
> >+	/*
> >+	 * Update the key to the real value, so that when all other
> >+	 * tasks from before the rightmost position have executed,
> >+	 * this task is picked up again:
> >+	 */
> >+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
>
> I don't think it is safe to change the key after inserting the element
> in the tree. You end up with an unsorted tree, where new entries end up
> in wrong places "randomly".

yeah, indeed. I hoped that once this rightmost entry is removed (as soon
as it gets scheduled next time) the tree goes back to a correct shape,
but that's not the case - the left sub-tree and the right sub-tree are
merged by the rbtree code with the assumption that the entry had a
correct key.

> I think a better approach would be to keep track of the rightmost
> entry, set the key to the rightmost's key +1 and then simply insert it
> there.

yeah. I had that implemented at a stage but was trying to be too clever
for my own good ;-)

	Ingo

^ permalink raw reply [flat|nested] 713+ messages in thread
* Re: CFS and suspend2: hang in atomic copy (was: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS])
  2007-04-19 10:11 ` Ingo Molnar
@ 2007-04-19 10:18 ` Ingo Molnar
  0 siblings, 0 replies; 713+ messages in thread
From: Ingo Molnar @ 2007-04-19 10:18 UTC
  To: Esben Nielsen
  Cc: Christian Hesse, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, suspend2-devel

* Ingo Molnar <mingo@elte.hu> wrote:

> > I think a better approach would be to keep track of the rightmost
> > entry, set the key to the rightmost's key +1 and then simply insert
> > it there.
>
> yeah. I had that implemented at a stage but was trying to be too
> clever for my own good ;-)

i have fixed it via the patch below. (I'm using rb_last() because that
way the normal scheduling codepaths are not burdened with the
maintenance of a rightmost entry.)

	Ingo

---
 kernel/sched.c      |    3 ++-
 kernel/sched_fair.c |   24 +++++++++++++-----------
 2 files changed, 15 insertions(+), 12 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3806,7 +3806,8 @@ asmlinkage long sys_sched_yield(void)
 	schedstat_inc(rq, yld_cnt);
 	if (rq->nr_running == 1)
 		schedstat_inc(rq, yld_act_empty);
-	current->sched_class->yield_task(rq, current);
+	else
+		current->sched_class->yield_task(rq, current);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -275,21 +275,23 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq, struct task_struct *p)
 {
+	struct rb_node *entry;
+	struct task_struct *last;
+
 	dequeue_task_fair(rq, p);
 	p->on_rq = 0;
+
 	/*
-	 * Temporarily insert at the last position of the tree:
+	 * Temporarily insert at the last position of the tree.
+	 * The key will be updated back to (near) its old value
+	 * when the task gets scheduled.
 	 */
-	p->fair_key = LLONG_MAX;
+	entry = rb_last(&rq->tasks_timeline);
+	last = rb_entry(entry, struct task_struct, run_node);
+
+	p->fair_key = last->fair_key + 1;
 	__enqueue_task_fair(rq, p);
 	p->on_rq = 1;
-
-	/*
-	 * Update the key to the real value, so that when all other
-	 * tasks from before the rightmost position have executed,
-	 * this task is picked up again:
-	 */
-	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
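[ Editorial illustration, not part of the thread. The "rightmost key + 1"
idea used by the patch can be sketched outside the kernel as follows.
This is a minimal sketch on a plain unbalanced binary search tree, not
the kernel rbtree API; struct node, bst_insert(), bst_last(),
requeue_at_tail() and demo_requeue_keeps_order() are hypothetical names
for this example. Two details are assumptions of the sketch rather than
something the patch does: an empty tree leaves the key untouched, and
the key is clamped at LLONG_MAX instead of overflowing. ]

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task queued in the timeline tree. */
struct node {
	long long key;
	struct node *left, *right;
};

/* Insert n into a plain binary search tree; equal keys go to the right. */
static struct node *bst_insert(struct node *root, struct node *n)
{
	struct node **link = &root;

	assert(n != NULL);
	n->left = n->right = NULL;
	while (*link)
		link = (n->key < (*link)->key) ? &(*link)->left : &(*link)->right;
	*link = n;
	return root;
}

/* Rightmost node, i.e. the largest key; NULL for an empty tree. */
static struct node *bst_last(struct node *root)
{
	if (!root)
		return NULL;
	while (root->right)
		root = root->right;
	return root;
}

/* In-order walk; fails as soon as a key is smaller than its predecessor. */
static bool inorder_sorted(const struct node *n, const struct node **prev)
{
	if (!n)
		return true;
	if (!inorder_sorted(n->left, prev))
		return false;
	if (*prev && n->key < (*prev)->key)
		return false;
	*prev = n;
	return inorder_sorted(n->right, prev);
}

static bool tree_is_sorted(const struct node *root)
{
	const struct node *prev = NULL;

	return inorder_sorted(root, &prev);
}

/*
 * Queue n (assumed to be currently out of the tree) behind every other
 * node by choosing its key BEFORE inserting, so the tree stays sorted.
 */
static struct node *requeue_at_tail(struct node *root, struct node *n)
{
	struct node *last = bst_last(root);

	if (last)	/* assumption: an empty tree keeps n's current key */
		n->key = (last->key < LLONG_MAX) ? last->key + 1 : LLONG_MAX;
	return bst_insert(root, n);
}

/* Returns true if the requeued node ends up last and the tree is sorted. */
static bool demo_requeue_keeps_order(void)
{
	struct node a = { .key = 10 }, b = { .key = 20 }, y = { .key = 5 };
	struct node *root = NULL;

	root = bst_insert(root, &a);
	root = bst_insert(root, &b);
	root = requeue_at_tail(root, &y);

	return y.key == 21 && bst_last(root) == &y && tree_is_sorted(root);
}
```

[ Because the key is fixed before the insert and never touched while the
node is in the tree, the ordering invariant holds throughout. ]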
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
@ 2007-04-15 18:47 Tim Tassonis
  0 siblings, 0 replies; 713+ messages in thread
From: Tim Tassonis @ 2007-04-15 18:47 UTC
  To: linux-kernel

> + printk("Fair Scheduler: Copyright (c) 2007 Red Hat, Inc., Ingo Molnar\n");

So that's what all the fuss about the staircase scheduler is all about
then! At last, I see your point.

> i'd like to give credit to Con Kolivas for the general approach here:
> he has proven via RSDL/SD that 'fair scheduling' is possible and that
> it results in better desktop scheduling. Kudos Con!
>

How pathetic can you get?

Tim, really looking forward to the CL final where Liverpool will beat
the shit out of Scum (and there's a lot to be beaten out).