* Back to the future.
@ 2007-04-26  6:04 Nigel Cunningham
  2007-04-26  7:28 ` Pekka Enberg
                   ` (2 more replies)
  0 siblings, 3 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26  6:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML

[-- Attachment #1: Type: text/plain, Size: 1247 bytes --]

Hi again.

So - trying to get back to the original discussion - what (if anything)
do you see as the way ahead?

The options I can think of are (starting with things I can do):

1) I stop developing Suspend2, thereby pushing however many current
Suspend2 users to move to [u]swsusp and seek to get that up to speed.

2) I quit my day job, see if Redhat will take me full time and give me
the time to start trying to merge Suspend2 bit by bit. Alternatively,
days suddenly become 8 hours longer and I discover the boundless energy
and alertness needed to do this too :). Ok. Not going to happen.

3) Someone else steps up to the plate and tries to merge Suspend2 one
bit at a time.

4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

5) Everything gets dropped and we start from scratch.

6) The status quo - or some small variant of it - stays. Oh... I said
"way ahead". I guess that rules this one out, even though I'll be very
surprised if it's not the one that wins out.

7) Suspend2 gets merged and people get to choose which they like better.
Nearly forgot this as a conceivable possibility. Yeah, I know you said
you don't want it. I'm just trying to think of what might possibly
happen.

N.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  6:04 Back to the future Nigel Cunningham
@ 2007-04-26  7:28 ` Pekka Enberg
       [not found]   ` <1177573348.5025.224.camel@nigel.suspend2.net>
  2007-04-26  7:42   ` Nigel Cunningham
  2007-04-26  8:38 ` Jan Engelhardt
  2007-04-28  0:28 ` Bojan Smojver
  2 siblings, 2 replies; 136+ messages in thread
From: Pekka Enberg @ 2007-04-26  7:28 UTC (permalink / raw)
  To: nigel; +Cc: Linus Torvalds, LKML

On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> 3) Someone else steps up to the plate and tries to merge Suspend2 one
> bit at a time.

So which bits do we want to merge? For example, Suspend2
kernel/power/ui.c, kernel/power/compression.c, and
kernel/power/encryption.c seem pointless now that we have uswsusp.
Furthermore, being the shameless Linus cheerleader that I am, I got
the impression that we should fix the snapshot/shutdown logic in the
kernel which Suspend2 doesn't really address?

                                Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  7:28 ` Pekka Enberg
       [not found]   ` <1177573348.5025.224.camel@nigel.suspend2.net>
@ 2007-04-26  7:42   ` Nigel Cunningham
  2007-04-26  8:17     ` Pekka Enberg
  2007-04-26 16:56     ` Linus Torvalds
  1 sibling, 2 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26  7:42 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 1027 bytes --]

Hi.

On Thu, 2007-04-26 at 10:28 +0300, Pekka Enberg wrote:
> On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> > 3) Someone else steps up to the plate and tries to merge Suspend2 one
> > bit at a time.
> 
> So which bits do we want to merge? For example, Suspend2
> kernel/power/ui.c, kernel/power/compression.c, and
> kernel/power/encryption.c seem pointless now that we have uswsusp.
> Furthermore, being the shameless Linus cheerleader that I am, I got
> the impression that we should fix the snapshot/shutdown logic in the
> kernel which Suspend2 doesn't really address?

I agree that the driver logic could be addressed too, but to answer your
question...

* Doing things in the right order? (Prepare the image, then do the
atomic copy, then save).
* Multithreaded I/O (might as well use multiple cores to compress the
image, now that we're hotplugging later).
* Support for > 1 swap device.
* Support for ordinary files.
* Full image option.
* Modular design?

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  7:42   ` Nigel Cunningham
@ 2007-04-26  8:17     ` Pekka Enberg
  2007-04-26  9:28       ` Nigel Cunningham
  2007-04-26 16:56     ` Linus Torvalds
  1 sibling, 1 reply; 136+ messages in thread
From: Pekka Enberg @ 2007-04-26  8:17 UTC (permalink / raw)
  To: nigel; +Cc: Linus Torvalds, LKML

Hi Nigel,

On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

As I am a total newbie to the power management code, I am unable to
spot the conceptual difference in uswsusp suspend.c:suspend_system()
and suspend2 kernel/power/suspend.c:suspend_main(). How are they
different?

On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).

I assume this doesn't affect the kernel at all with uswsusp?

On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> * Modular design?

This is too broad. Please be more specific about the problems the current
suspend and snapshot/shutdown code in the kernel has.

Now to add to your list, as far as I can tell, suspend2 provides
better feedback to the user via the netlink mechanism (although the
kernel shouldn't be sending messages such as userui_redraw but instead
let the userspace know of the actual events, for example, that tasks
have now been frozen). However, I am unsure if this is still relevant
as most of the work (snapshot writing) is being done in userspace
where we explicitly know when processes have been frozen, when the
snapshot is finished, and when it's written to disk.

                                              Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  6:04 Back to the future Nigel Cunningham
  2007-04-26  7:28 ` Pekka Enberg
@ 2007-04-26  8:38 ` Jan Engelhardt
  2007-04-26  9:33   ` Nigel Cunningham
  2007-04-28  0:28 ` Bojan Smojver
  2 siblings, 1 reply; 136+ messages in thread
From: Jan Engelhardt @ 2007-04-26  8:38 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML


On Apr 26 2007 16:04, Nigel Cunningham wrote:
>
>Hi again.
>
>So - trying to get back to the original discussion - what (if anything)
>do you see as the way ahead?
>
>The options I can think of are (starting with things I can do):
>
>1) [...]
>2) [...]
>3) [...]
>4) [...]
>5) [...]
>6) [...]
>7) [...]

Perhaps do it the EVMS way? Do as much in userspace as possible, and
try having a simple kernel API at the same time.
Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)


Jan
-- 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  8:17     ` Pekka Enberg
@ 2007-04-26  9:28       ` Nigel Cunningham
  2007-04-26 17:29         ` Luca Tettamanti
  0 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26  9:28 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

Hi.

On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
> Hi Nigel,
> 
> On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
> 
> As I am a total newbie to the power management code, I am unable to
> spot the conceptual difference in uswsusp suspend.c:suspend_system()
> and suspend2 kernel/power/suspend.c:suspend_main(). How are they
> different?

Will discuss in irc since you've appeared there...

> On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
> 
> I assume this doesn't affect the kernel at all with uswsusp?

Well uswsusp would benefit from using multiple threads - if it can - to
do the work. I saw quite an improvement from implementing it.

> On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
> > * Modular design?
> 
> This is too broad. Please be more specific about the problems the current
> suspend and snapshot/shutdown code in the kernel has.

Did you see the 'Reasons to merge' email I sent? It has more detail on
this.

> Now to add to your list, as far as I can tell, suspend2 provides
> better feedback to the user via the netlink mechanism (although the
> kernel shouldn't be sending messages such as userui_redraw but instead
> let the userspace know of the actual events, for example, that tasks
> have now been frozen). However, I am unsure if this is still relevant
> as most of the work (snapshot writing) is being done in userspace
> where we explicitly know when processes have been frozen, when the
> snapshot is finished, and when it's written to disk.

From uswsusp's point of view, yeah. But I'm still coming from the 'doing
this in kernelspace makes far more sense' perspective.

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  8:38 ` Jan Engelhardt
@ 2007-04-26  9:33   ` Nigel Cunningham
  0 siblings, 0 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26  9:33 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

Hi.

On Thu, 2007-04-26 at 10:38 +0200, Jan Engelhardt wrote:
> On Apr 26 2007 16:04, Nigel Cunningham wrote:
> >
> >Hi again.
> >
> >So - trying to get back to the original discussion - what (if anything)
> >do you see as the way ahead?
> >
> >The options I can think of are (starting with things I can do):
> >
> >1) [...]
> >2) [...]
> >3) [...]
> >4) [...]
> >5) [...]
> >6) [...]
> >7) [...]
> 
> Perhaps do it the EVMS way? Do as much in userspace as possible, and
> try having a simple kernel API at the same time.
> Perhaps (3) would be it, but ask Redhat _first_ before quitting anything :)

:) Well, the EVMS way is swsusp. Personally, I agree with Linus that
putting suspend-to-disk code in userspace is just a broken idea.

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  7:42   ` Nigel Cunningham
  2007-04-26  8:17     ` Pekka Enberg
@ 2007-04-26 16:56     ` Linus Torvalds
  2007-04-26 17:03       ` Xavier Bestel
                         ` (6 more replies)
  1 sibling, 7 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-26 16:56 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Pekka Enberg, LKML



On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> 
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

I'd actually like to discuss this a bit..

I'm obviously not a huge fan of the whole user/kernel level split and 
interfaces, but I actually do think that there is *one* split that makes 
sense:

 - generate the (whole) snapshot image entirely inside the kernel

 - do nothing else (ie no IO at all), and just export it as a single image 
   to user space (literally just mapping the pages into user space). 
   *one* interface. None of the "pretty UI update" crap. Just a single 
   system call:

	void *snapshot_system(u32 *size);

   which will map in the snapshot, return the mapped address and the size 
   (and if you want to support snapshots > 4GB, be my guest, but I suspect 
   you're actually *better* off just admitting that if you cannot shrink 
   the snapshot to less than 32 bits, it's not worth doing)

User space gets a fully running system, with that one process having that 
one image mapped into its address space. It can then compress/write/do 
whatever to that snapshot.

You need one other system call, of course, which is

	int resume_snapshot(void *snapshot, u32 size);

and for testing, you should be able to basically do

	u32 size;
	void *buffer = snapshot_system(&size);
	if (buffer != MAP_FAILED)
		resume_snapshot(buffer, size);

and it should obviously work.
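
A minimal user-space sketch of what a consumer of that interface might look like, assuming the proposed call existed. Neither snapshot_system() nor resume_snapshot() is a real Linux system call; fake_snapshot_system() below is a hypothetical stand-in that maps an anonymous region filled with a dummy pattern, and write_all() and save_snapshot() are illustrative helper names. The only point is that, once the image is mapped, writing it out is ordinary user-space I/O.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FAKE_IMAGE_SIZE 4096u

/* Hypothetical stand-in for the proposed snapshot_system(): maps an
 * anonymous region and fills it with a dummy pattern. Not a real syscall. */
void *fake_snapshot_system(uint32_t *size)
{
	void *p = mmap(NULL, FAKE_IMAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return MAP_FAILED;
	memset(p, 0x5a, FAKE_IMAGE_SIZE);
	*size = FAKE_IMAGE_SIZE;
	return p;
}

/* Write the whole buffer to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, uint32_t size)
{
	const unsigned char *p = buf;

	while (size > 0) {
		ssize_t n = write(fd, p, size);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		size -= (uint32_t)n;
	}
	return 0;
}

/* Take the (fake) snapshot and stream it to fd; on success store the
 * number of bytes written in *written. Returns 0 on success, -1 on error. */
int save_snapshot(int fd, uint32_t *written)
{
	uint32_t size = 0;
	void *image;
	int ret;

	assert(written != NULL);
	image = fake_snapshot_system(&size);
	if (image == MAP_FAILED)
		return -1;
	ret = write_all(fd, image, size);
	munmap(image, size);
	if (ret == 0)
		*written = size;
	return ret;
}
```

Here fd could be a file, a pipe into a compressor, or a socket; the sketch does not care which.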

And btw, the device model changes are a big part of this. Because I don't 
think it's even remotely debuggable with the full suspend/resume of the 
devices being part of generating the image! That freeze/snapshot/unfreeze 
sequence is likely a lot more debuggable, if only because freeze/unfreeze 
is actually a no-op for most devices, and snapshotting is trivial too.

Once you have that snapshot image in user space you can do anything you 
want. And again: you'd have a fully working system: not any degradation 
*at*all*. If you're in X, then X will continue running etc even after the 
snapshotting, although obviously the snapshotting will have tried to page 
a lot of stuff out in order to make the snapshot smaller, so you'll likely 
be crawling.

> * Multithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).
> * Support for > 1 swap device.
> * Support for ordinary files.
> * Full image option.
> * Modular design?

I'd really suggest _just_ the "full image". Nothing else is probably ever 
worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as 
"echo disk > /sys/power/state", but it should not necessarily be much 
worse than

	snapshot_kernel | gzip -9 > /dev/snapshot

either (and resuming from the snapshot would just be the reverse)!

And if you want to send the snapshot over a TCP connection to another 
host, be my guest. With pretty images while it's transferring. Whatever.

			Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
@ 2007-04-26 17:03       ` Xavier Bestel
  2007-04-26 17:34         ` Linus Torvalds
  2007-04-26 17:07       ` Linus Torvalds
                         ` (5 subsequent siblings)
  6 siblings, 1 reply; 136+ messages in thread
From: Xavier Bestel @ 2007-04-26 17:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka Enberg, LKML

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> Once you have that snapshot image in user space you can do anything you 
> want. And again: you'd have a fully working system: not any degradation 
> *at*all*. If you're in X, then X will continue running etc even after the 
> snapshotting

Won't there be problems if e.g. X tries to write something to its
logfile after snapshot ?

	Xav



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
  2007-04-26 17:03       ` Xavier Bestel
@ 2007-04-26 17:07       ` Linus Torvalds
  2007-04-26 18:22       ` Chase Venters
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-26 17:07 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Pekka Enberg, LKML



On Thu, 26 Apr 2007, Linus Torvalds wrote:
> 
> Once you have that snapshot image in user space you can do anything you 
> want.

Side note: the exception, of course, is page out more. The swap device has 
to be read-only.

We actually have support for that mode (it's how "swapoff" works: it marks 
swap devices as not accepting _new_ entries, even though old entries are 
still valid). So you can have a fully running system, with 99% of memory 
swapped out, and still guarantee that you won't swap out anything *more* 
(which would destroy the swap image, which you don't want, since it's 
where a lot of the memory may end up being, in order to make the snapshot 
itself as small as possible)!

Anybody who cares can look at the code that messes with the 
SWP_WRITEOK flag. You'd basically swap out enough to make the snapshot 
image fit comfortably in memory, and then you'd clear SWP_WRITEOK on all 
swap devices and return to user space. Or something very close to that.
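
A rough user-space model of that last step, as a sketch only. The struct swap_dev_model type and the helpers freeze_swap() and alloc_swap_slot() are hypothetical names made up for illustration; the flag values are modeled on the kernel's swap flags but should be treated as assumptions here, and the real kernel bookkeeping is more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits, modeled on the kernel's SWP_* flags. */
enum {
	SWP_USED    = 1 << 0,	/* device is in use */
	SWP_WRITEOK = 1 << 1,	/* device accepts new swap entries */
};

/* Hypothetical, simplified stand-in for a swap device descriptor. */
struct swap_dev_model {
	unsigned int flags;
	unsigned long used_slots;
	unsigned long total_slots;
};

/* Stop every device from accepting new entries; existing entries
 * stay valid and readable because SWP_USED is left untouched. */
void freeze_swap(struct swap_dev_model *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		devs[i].flags &= ~(unsigned int)SWP_WRITEOK;
}

/* Try to allocate one new slot. Returns the index of the device used,
 * or -1 if no device is both writable and has free space. */
int alloc_swap_slot(struct swap_dev_model *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!(devs[i].flags & SWP_USED))
			continue;
		if (!(devs[i].flags & SWP_WRITEOK))
			continue;
		if (devs[i].used_slots >= devs[i].total_slots)
			continue;
		devs[i].used_slots++;
		return (int)i;
	}
	return -1;
}
```

After freeze_swap(), every alloc_swap_slot() call fails, so nothing new can overwrite what is already in swap, while the slots counted in used_slots remain in place.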

But the point here is that we should actually really be able to have a 
fully working system, even _after_ we created the snapshot. I don't even 
think you should need any "initrd only" kind of situation.

If somebody can do that, with just those two system calls, I'll remove 
every other suspend-to-disk wannabe from the kernel in a heartbeat. I may 
have missed something subtle, of course, but I really *think* it should be 
doable.

			Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  9:28       ` Nigel Cunningham
@ 2007-04-26 17:29         ` Luca Tettamanti
  0 siblings, 0 replies; 136+ messages in thread
From: Luca Tettamanti @ 2007-04-26 17:29 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Pekka Enberg, Linus Torvalds, linux-kernel

Nigel Cunningham <nigel@nigel.suspend2.net> ha scritto:
> On Thu, 2007-04-26 at 11:17 +0300, Pekka Enberg wrote:
>> On 4/26/07, Nigel Cunningham <nigel@nigel.suspend2.net> wrote:
>> > * Multithreaded I/O (might as well use multiple cores to compress the
>> > image, now that we're hotplugging later).
>> 
>> I assume this doesn't affect the kernel at all with uswsusp?
> 
> Well uswsusp would benefit from using multiple threads - if it can - to
> do the work. I saw quite an improvement from implementing it.

It's doable[1], but I'm not sure that the added complexity is worth it.
I'm surprised that you see a big improvement. I'd expect that the image
write is bottlenecked by the disk performance. On my PC (Core2, locked
at 1.6GHz) lzf can compress 250-280MB/s; even with an older CPU that can
do 1/3 it's still more than the disk can handle.

Luca
[1] We may even use MPI to compress over a Beowulf cluster, it's
userspace ;)
-- 
"Ricorda sempre che sei unico, esattamente come tutti gli altri".

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 17:03       ` Xavier Bestel
@ 2007-04-26 17:34         ` Linus Torvalds
  2007-04-26 20:08           ` Nigel Cunningham
  2007-04-27  7:51           ` Pekka Enberg
  0 siblings, 2 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-26 17:34 UTC (permalink / raw)
  To: Xavier Bestel; +Cc: Nigel Cunningham, Pekka Enberg, LKML



On Thu, 26 Apr 2007, Xavier Bestel wrote:
> 
> Won't there be problems if e.g. X tries to write something to its
> logfile after snapshot ?

Sure. But that's a user-level issue.

You do have to allow writing after snapshotting, since at a minimum, you'd 
want the snapshot itself to be written. So the kernel has to be fully 
running, and support full user space. No "degraded mode" like now.

So when I said "fully running user mode", I really meant it from the 
perspective of the kernel - not necessarily from the perspective of the 
"user". You do want to limit _what_ user mode does, but you must not limit 
it by making the kernel less capable.

Remounting mounted filesystems read-only sounds like a good idea, for 
example. We can do that. We have the technology. But we shouldn't limit 
user space from doing other things (for example, it might want to actually 
*mount* a new filesystem for writing the snapshot image).

For example, right now we try to "fix" that with the whole process freezer 
thing. And that code has *caused* more problems than it fixed, since it 
tries to freeze all the kernel threads etc, and you simply don't have a 
truly *working*system*.

I think it's fine to freeze processes if that is what you want to do (for 
example, send them SIGSTOP), but we freeze them *too*much* right now, and 
the suspend stuff has taken over policy way too much. We don't actually 
leave the system in a runnable state. I can almost guarantee that you'd be 
*better* off having the snapshot taking thing do a

	kill(-1, SIGSTOP);

in user space than our current broken process freezer. At least that 
wouldn't have screwed up those kernel threads as badly as swsusp did.
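
A small sketch of the user-space mechanism, deliberately limited to a single child process rather than kill(-1, ...), which would stop everything the caller is allowed to signal. The function name stop_and_resume_child() is made up for illustration; fork(), kill() and waitpid() with WUNTRACED are standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, stop it with SIGSTOP, confirm that it really stopped,
 * resume it with SIGCONT, then terminate and reap it.
 * Returns 0 on success, -1 on failure. */
int stop_and_resume_child(void)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: idle until a signal terminates us. */
		for (;;)
			pause();
	}

	/* SIGSTOP cannot be caught or ignored by the target. */
	if (kill(pid, SIGSTOP) < 0)
		goto fail;
	/* WUNTRACED makes waitpid() report stopped children too. */
	if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
		goto fail;
	/* Let the child run again, as a resume path would. */
	if (kill(pid, SIGCONT) < 0)
		goto fail;

	kill(pid, SIGKILL);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFSIGNALED(status) ? 0 : -1;

fail:
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return -1;
}
```

Kernel threads are not affected by signals sent this way, which is the contrast being drawn with the in-kernel freezer.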

And no, I'm not saying that my suggestion is the only way to do it. Go 
wild. But the *current* situation is just broken. Three different things, 
none of which people can agree on. I'd *much* rather see a conceptually 
simpler approach that then required, but even more important is that right 
now people aren't even discussing alternatives, they're just pushing one 
of the three existing things, and that's simply not viable. Because I'm 
not merging another one.

In fact, I personally feel that I shouldn't even have merged 
userspace-swsusp, but if Andrew thinks it needs to be merged, my personal 
feelings simply don't matter that much. I have to trust people. But yes, 
as far as *I* am personally concerned, I think it was a mistake to merge 
it.

			Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
  2007-04-26 17:03       ` Xavier Bestel
  2007-04-26 17:07       ` Linus Torvalds
@ 2007-04-26 18:22       ` Chase Venters
  2007-04-26 18:50         ` David Lang
  2007-04-26 19:56       ` Nigel Cunningham
                         ` (3 subsequent siblings)
  6 siblings, 1 reply; 136+ messages in thread
From: Chase Venters @ 2007-04-26 18:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka Enberg, LKML

On Thu, 26 Apr 2007, Linus Torvalds wrote:

>
> Once you have that snapshot image in user space you can do anything you
> want. And again: you'd have a fully working system: not any degradation
> *at*all*. If you're in X, then X will continue running etc even after the
> snapshotting, although obviously the snapshotting will have tried to page
> a lot of stuff out in order to make the snapshot smaller, so you'll likely
> be crawling.
>

In fact... If you're just paging out to make a smaller snapshot (ie, not
to free up memory), couldn't you just swap it out (if it's not backed by a
file) then mark it as "half-released"... ie, the snapshot writing code
ignores it knowing that it will be available on disk at resume, but then
when the snapshot is complete it's still available in physical RAM,
preventing user-space from crawling due to the necessity of paging it all
back in?

Thanks,
Chase



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 18:22       ` Chase Venters
@ 2007-04-26 18:50         ` David Lang
  0 siblings, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-26 18:50 UTC (permalink / raw)
  To: Chase Venters; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Thu, 26 Apr 2007, Chase Venters wrote:

> On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
>> 
>> Once you have that snapshot image in user space you can do anything you
>> want. And again: you'd have a fully working system: not any degradation
>> *at*all*. If you're in X, then X will continue running etc even after the
>> snapshotting, although obviously the snapshotting will have tried to page
>> a lot of stuff out in order to make the snapshot smaller, so you'll likely
>> be crawling.
>> 
>
> In fact... If you're just paging out to make a smaller snapshot (ie, not
> to free up memory), couldn't you just swap it out (if it's not backed by a
> file) then mark it as "half-released"... ie, the snapshot writing code
> ignores it knowing that it will be available on disk at resume, but then
> when the snapshot is complete it's still available in physical RAM,
> preventing user-space from crawling due to the necessity of paging it all
> back in?

your swap space may end up being re-used before you restore with std

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
                         ` (2 preceding siblings ...)
  2007-04-26 18:22       ` Chase Venters
@ 2007-04-26 19:56       ` Nigel Cunningham
  2007-04-27  4:52         ` Pekka J Enberg
  2007-04-28 19:09         ` Bill Davidsen
  2007-04-26 22:40       ` Pavel Machek
                         ` (2 subsequent siblings)
  6 siblings, 2 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26 19:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pekka Enberg, LKML

[-- Attachment #1: Type: text/plain, Size: 3879 bytes --]

Hi.

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> > 
> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
> 
> I'd actually like to discuss this a bit..
> 
> I'm obviously not a huge fan of the whole user/kernel level split and 
> interfaces, but I actually do think that there is *one* split that makes 
> sense:
> 
>  - generate the (whole) snapshot image entirely inside the kernel
> 
>  - do nothing else (ie no IO at all), and just export it as a single image 
>    to user space (literally just mapping the pages into user space). 
>    *one* interface. None of the "pretty UI update" crap. Just a single 
>    system call:
> 
> 	void *snapshot_system(u32 *size);
> 
>    which will map in the snapshot, return the mapped address and the size 
>    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
>    you're actually *better* off just admitting that if you cannot shrink 
>    the snapshot to less than 32 bits, it's not worth doing)

That inherently limits the image to half of available ram (you need
somewhere to store the snapshot), so you won't get the full image you
express interest in below.
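
The arithmetic behind that limit, as a sketch under a simplifying assumption: every page in the snapshot needs one extra page of RAM to hold its copy while the original stays in place. The function name max_snapshot_pages() and the reserved_pages parameter (pages that are neither snapshotted nor usable for copies) are hypothetical.

```c
#include <assert.h>

/* Upper bound on the number of pages an in-memory snapshot can hold
 * when each saved page needs a second page for its copy. */
unsigned long max_snapshot_pages(unsigned long total_pages,
				 unsigned long reserved_pages)
{
	if (reserved_pages >= total_pages)
		return 0;
	return (total_pages - reserved_pages) / 2;
}
```

With 1000 pages of RAM and nothing reserved, at most 500 pages fit in the snapshot; anything beyond that has to be freed or paged out first.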

> User space gets a fully running system, with that one process having that 
> one image mapped into its address space. It can then compress/write/do 
> whatever to that snapshot.

You're describing uswsusp! (At least in so far as I understand it!).

You can't get a fully running system though, because if anything changes
something on disk that was snapshotted (super blocks etc) your snapshot
is invalid and you risk on-disk corruption.

> And btw, the device model changes are a big part of this. Because I don't 
> think it's even remotely debuggable with the full suspend/resume of the 
> devices being part of generating the image! That freeze/snapshot/unfreeze 
> sequence is likely a lot more debuggable, if only because freeze/unfreeze 
> is actually a no-op for most devices, and snapshotting is trivial too.
> 
> Once you have that snapshot image in user space you can do anything you 
> want. And again: you'd have a fully working system: not any degradation 
> *at*all*. If you're in X, then X will continue running etc even after the 
> snapshotting, although obviously the snapshotting will have tried to page 
> a lot of stuff out in order to make the snapshot smaller, so you'll likely 
> be crawling.

Nooooooo! See above about disk corruption.

> > * Multithreaded I/O (might as well use multiple cores to compress the
> > image, now that we're hotplugging later).
> > * Support for > 1 swap device.
> > * Support for ordinary files.
> > * Full image option.
> > * Modular design?
> 
> I'd really suggest _just_ the "full image". Nothing else is probably ever 
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as 
> "echo disk > /sys/power/state", but it should not necessarily be much 
> worse than

Please, go apply that logic elsewhere, then cut out (or at least stop
adding) support for users with less common needs in other areas. I fully
acknowledge that most users have only one place to store their image and
it's a swap device. But that doesn't mean one size fits all.

A full image implies that you need to figure out what's not going to
change while you're writing it and save that separately. At the moment,
I'm treating most of the LRU contents as that list. If we're going to
start trying to let every man and his dog run while we're trying to
snapshot the system, that's not going to work anymore - or the logic
will get a lot more complicated.

Sorry. I never thought I'd say this, but I think you're being naive
about how simple the process of snapshotting a system is.

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 17:34         ` Linus Torvalds
@ 2007-04-26 20:08           ` Nigel Cunningham
  2007-04-26 20:45             ` Linus Torvalds
                               ` (2 more replies)
  2007-04-27  7:51           ` Pekka Enberg
  1 sibling, 3 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26 20:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Xavier Bestel, Pekka Enberg, LKML

[-- Attachment #1: Type: text/plain, Size: 4165 bytes --]

Hi.

On Thu, 2007-04-26 at 10:34 -0700, Linus Torvalds wrote:
> 
> On Thu, 26 Apr 2007, Xavier Bestel wrote:
> > 
> > Won't there be problems if e.g. X tries to write something to its
> > logfile after snapshot ?
> 
> Sure. But that's a user-level issue.
> 
> You do have to allow writing after snapshotting, since at a minimum, you'd 
> want the snapshot itself to be written. So the kernel has to be fully 
> running, and support full user space. No "degraded mode" like now.

It doesn't need a fully functional userspace (unless you want to write
to a fuse device, and even then that could be worked around - make it
like uswsusp or userui).... can I diverge for a second and say that from
this point of view, fuse is the lamest idea ever invented. Guaranteed to
break your ability to suspend^Wsnapshot.... Anyhow, if the kernel has
bmapped the pages it's going to write to beforehand, it knows where the
image needs to go. No need for userspace at all.

> So when I said "fully running user mode", I really meant it from the 
> perspective of the kernel - not necessarily from the perspective of the 
> "user". You do want to limit _what_ user mode does, but you must not limit 
> it by making the kernel less capable.
> 
> Remounting mounted filesystems read-only sounds like a good idea, for 
> example. We can do that. We have the technology. But we shouldn't limit 
> user space from doing other things (for example, it might want to actually 
> *mount* a new filesystem for writing the snapshot image).

We tried that. It would need some work. IIRC remounting filesystems
read-only makes files become marked read-only. Perfectly sensible,
except that if you then remount the filesystem rw at resume time, all
those files are still marked ro and userspace crashes and burns. Not
unfixable, I'll agree, but there is more work to do there.

As to the example, mounting a new filesystem for writing the snapshot
image should probably be done before we do the snapshot. Then it won't
be in danger of triggering anything that might require one of the other
fses to be rw (eg syslog).

> For example, right now we try to "fix" that with the whole process freezer 
> thing. And that code has *caused* more problems than it fixed, since it 
> tries to freeze all the kernel threads etc, and you simply don't have a 
> truly *working*system*.

Yes, it has been difficult. But so is bringing up a child.

> I think it's fine to freeze processes if that is what you want to do (for 
> example, send them SIGSTOP), but we freeze them *too*much* right now, and 
> the suspend stuff has taken over policy way too much. We don't actually 
> leave the system in a runnable state. I can almost guarantee that you'd be 
> *better* off having the snapshot taking thing do a
> 
> 	kill(-1, SIGSTOP);
> 
> in user space than our current broken process freezer. At least that 
> wouldn't have screwed up those kernel threads as badly as swsusp did.

I don't think it's fair to blame swsusp there. Maybe cpu hotplugging...

> And no, I'm not saying that my suggestion is the only way to do it. Go 
> wild. But the *current* situation is just broken. Three different things, 
> none of which people can agree on. I'd *much* rather see a conceptually 
> simpler approach that then required, but even more important is that right 
> now people aren't even discussing alternatives, they're just pushing one 
> of the three existing things, and that's simply not viable. Because I'm 
> not merging another one.
> 
> In fact, I personally feel that I shouldn't even have merged 
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal 
> feelings simply don't matter that much. I have to trust people. But yes, 
> as far as *I* am personally concerned, I think it was a mistake to merge 
> it.

Perhaps you should try to make an alternative yourself instead of
pushing us into making something we don't believe will work (my case) or
have already done but in a way you don't like (Rafael). Don't talk about
Pavel cutting code. He's just acking/nacking what Rafael sends him.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 20:08           ` Nigel Cunningham
@ 2007-04-26 20:45             ` Linus Torvalds
  2007-04-26 20:50               ` Nigel Cunningham
  2007-04-26 21:38             ` Theodore Tso
  2007-04-26 22:08             ` Rafael J. Wysocki
  2 siblings, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2007-04-26 20:45 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Xavier Bestel, Pekka Enberg, LKML



On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> 
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

I've done that in the past (USB, PCMCIA - screw the maintainers, redo 
it basically from scratch). But the thing is, I'm totally uninterested 
personally in the whole disk-snapshotting, so I'm not likely to do it 
there.

But yes, I'm actually hoping that some new person will come in with a new 
idea. The current people seem to be too set in "their" corners, and I 
don't expect that to really change.

Quite honestly, I don't foresee any of the current three approaches really 
doing something new and obviously better, unless somebody new steps in.

			Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 20:45             ` Linus Torvalds
@ 2007-04-26 20:50               ` Nigel Cunningham
  2007-04-27  0:10                 ` Olivier Galibert
  0 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26 20:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Xavier Bestel, Pekka Enberg, LKML

[-- Attachment #1: Type: text/plain, Size: 1462 bytes --]

Hi.

On Thu, 2007-04-26 at 13:45 -0700, Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > 
> > Perhaps you should try to make an alternative yourself instead of
> > pushing us into making something we don't believe will work (my case) or
> > have already done but in a way you don't like (Rafael). Don't talk about
> > Pavel cutting code. He's just acking/nacking what Rafael sends him.
> 
> I've done that in the past (USB, PCMCIA - screw the maintainers, redo 
> it basically from scratch). But the thing is, I'm totally uninterested 
> personally in the whole disk-snapshotting, so I'm not likely to do it 
> there.
> 
> But yes, I'm actually hoping that some new person will come in with a new 
> idea. The current people seem to be too set in "their" corners, and I 
> don't expect that to really change.
> 
> Quite honestly, I don't foresee any of the current three approaches really 
> doing something new and obviously better, unless somebody new steps in.

That's because there is no other possibility. Sooner or later you have
to do a snapshot, and somehow you have to save it. You're not going to
get a new solution, just one that does those basic things in new and
better ways.

I'm perfectly willing to think through some alternate approach if you
suggest something or prod my thinking in a new direction, but I'm afraid
I just can't see right now how we can achieve what you're after.

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 20:08           ` Nigel Cunningham
  2007-04-26 20:45             ` Linus Torvalds
@ 2007-04-26 21:38             ` Theodore Tso
  2007-04-27 10:10               ` Christoph Hellwig
  2007-04-26 22:08             ` Rafael J. Wysocki
  2 siblings, 1 reply; 136+ messages in thread
From: Theodore Tso @ 2007-04-26 21:38 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, Xavier Bestel, Pekka Enberg, LKML

On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> We tried that. It would need some work. IIRC remounting filesystems
> read-only makes files become marked read-only. Perfectly sensible,
> except that if you then remount the filesystem rw at resume time, all
> those files are still marked ro and userspace crashes and burns. Not
> unfixable, I'll agree, but there is more work to do there.

There are other solutions, though.  One is that we could export a
system call interface which freezes a filesystem and prevents any
further I/O.  We mostly have something like that right now (via the
write_super_lockfs function in the superblock operations
structure), but we haven't exported it to userspace.  And right now
not all filesystems support it, but in theory that could be fixed (or
you only support suspend/resume if all filesystems support lockfs).

We would also need a similar interface to freeze any block device I/O,
in case you have a database running and doing direct I/O to a block
device.  (Or again, we could simply not support that case; how many
people will be running a database accessing a block device on
their laptop?)

So in order to do this right, we would have to double the number of
new interfaces needed from the two proposed by Linus --- which is why
I think the userspace suspend solution is fundamentally NOT the right
one.  Rather the right one is the one which Linux ultimately used for
PCMCIA, which is to do it all in the kernel.

						- Ted

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 20:08           ` Nigel Cunningham
  2007-04-26 20:45             ` Linus Torvalds
  2007-04-26 21:38             ` Theodore Tso
@ 2007-04-26 22:08             ` Rafael J. Wysocki
  2007-04-26 22:20               ` Nigel Cunningham
  2007-04-26 23:15               ` Linus Torvalds
  2 siblings, 2 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-26 22:08 UTC (permalink / raw)
  To: nigel, Linus Torvalds, Andrew Morton
  Cc: Xavier Bestel, Pekka Enberg, LKML, Pavel Machek

On Thursday, 26 April 2007 22:08, Nigel Cunningham wrote:
[--snip--]
> > And no, I'm not saying that my suggestion is the only way to do it. Go 
> > wild. But the *current* situation is just broken. Three different things, 
> > none of which people can agree on. I'd *much* rather see a conceptually 
> > simpler approach that then required, but even more important is that right 
> > now people aren't even discussing alternatives, they're just pushing one 
> > of the three existing things, and that's simply not viable. Because I'm 
> > not merging another one.
> > 
> > In fact, I personally feel that I shouldn't even have merged 
> > userspace-swsusp, but if Andrew thinks it needs to be merged, my personal 
> > feelings simply don't matter that much. I have to trust people. But yes, 
> > as far as *I* am personally concerned, I think it was a mistake to merge 
> > it.
> 
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

Well, I think that much of what Linus is saying indicates that he hasn't tried
to write any such thing himself. ;-)

Anyway, I'm tired of all this thing.  Really.  I've just been trying to make
things _work_ more-or-less reliably in a way that Pavel liked and I really
didn't know that much about the kernel when I started.  In fact, I started as a
user who needed certain functionality from the kernel and that was not there
at the time.  I've made some mistakes because of that (like the definitions of
the ioctl numbers in suspend.h - this was just a rookie mistake, and I'm
ashamed of it, but _nobody_ caught it, although I believe many people were
looking at the patch).

Now that I know much more than before, I can say I agree with Linus on his
opinion about the separation of s2ram from the snapshot/restore functionality
(I'll call it 'hibernation' for simplicity from now on).  It should be done,
because it would make things simpler and cleaner.  Still, it will be difficult
to do without screwing users en masse and that's my main concern here.

I don't agree that we don't need the tasks freezer for suspending and
hibernation.  We need it, because we need to be sure that the (other) tasks
will not get us in the way, and that also applies to kernel threads (and I
don't think the tasks freezer is 'screwing' them, BTW).

I agree that the userland interface for swsusp is not very nice and I'm going
to do my best to clean that up.  I hope that someone will help me, but if not,
then that's fine.  OTOH, it's difficult, if not impossible, to do a
userland-driven hibernation in a completely clean way.  I've tried that and I'm
not exactly satisfied with the result, although it works and some distros use
it.  I wouldn't have done it again, but then I'm going to support the existing
users, as I promised.

Now, I think that the hibernation should better be done completely in the
kernel, because that's just conceptually simpler, although some data exchange
with the user land may be acceptable for some optional fancy stuff.  I'm also
tired of the endless "to merge or not to merge suspend2" discussions that just
lead to nowhere.  For these reasons I declare that I'm ready to cooperate with
Nigel on integrating as much of suspend2 as reasonably possible into the
existing infrastructure, under the following conditions:
- we don't remove the existing user-visible interfaces
- we work on one piece of code at a time
- we avoid code duplication, as much as possible
- we avoid using open-coded things, if possible
- if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

If that's acceptable, we can start tomorrow.  In the process, we can try to
separate the hibernation code paths from the s2ram ones, but that will require
a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
familiar with, like writing device drivers.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:08             ` Rafael J. Wysocki
@ 2007-04-26 22:20               ` Nigel Cunningham
  2007-04-26 23:15               ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-26 22:20 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Andrew Morton, Xavier Bestel, Pekka Enberg, LKML,
	Pavel Machek

[-- Attachment #1: Type: text/plain, Size: 5439 bytes --]

Hi Rafael.

On Fri, 2007-04-27 at 00:08 +0200, Rafael J. Wysocki wrote:
> On Thursday, 26 April 2007 22:08, Nigel Cunningham wrote:
> [--snip--]
> > > And no, I'm not saying that my suggestion is the only way to do it. Go 
> > > wild. But the *current* situation is just broken. Three different things, 
> > > none of which people can agree on. I'd *much* rather see a conceptually 
> > > simpler approach that then required, but even more important is that right 
> > > now people aren't even discussing alternatives, they're just pushing one 
> > > of the three existing things, and that's simply not viable. Because I'm 
> > > not merging another one.
> > > 
> > > In fact, I personally feel that I shouldn't even have merged 
> > > userspace-swsusp, but if Andrew thinks it needs to be merged, my personal 
> > > feelings simply don't matter that much. I have to trust people. But yes, 
> > > as far as *I* am personally concerned, I think it was a mistake to merge 
> > > it.
> > 
> > Perhaps you should try to make an alternative yourself instead of
> > pushing us into making something we don't believe will work (my case) or
> > have already done but in a way you don't like (Rafael). Don't talk about
> > Pavel cutting code. He's just acking/nacking what Rafael sends him.
> 
> Well, I think that much of what Linus is saying indicates that he hasn't tried
> to write any such thing himself. ;-)
> 
> Anyway, I'm tired of all this thing.  Really.  I've just been trying to make
> things _work_ more-or-less reliably in a way that Pavel liked and I really
> didn't know that much about the kernel when I started.  In fact, I started as a
> user who needed certain functionality from the kernel and that was not there
> at the time.  I've made some mistakes because of that (like the definitions of
> the ioctl numbers in suspend.h - this was just a rookie mistake, and I'm
> ashamed of it, but _nobody_ caught it, although I believe many people were
> looking at the patch).
> 
> Now that I know much more than before, I can say I agree with Linus on his
> opinion about the separation of s2ram from the snapshot/restore functionality
> (I'll call it 'hibernation' for simplicity from now on).  It should be done,
> because it would make things simpler and cleaner.  Still, it will be difficult
> to do without screwing users en masse and that's my main concern here.
> 
> I don't agree that we don't need the tasks freezer for suspending and
> hibernation.  We need it, because we need to be sure that the (other) tasks
> will not get us in the way, and that also applies to kernel threads (and I
> don't think the tasks freezer is 'screwing' them, BTW).
> 
> I agree that the userland interface for swsusp is not very nice and I'm going
> to do my best to clean that up.  I hope that someone will help me, but if not,
> then that's fine.  OTOH, it's difficult, if not impossible, to do a
> userland-driven hibernation in a completely clean way.  I've tried that and I'm
> not exactly satisfied with the result, although it works and some distros use
> it.  I wouldn't have done it again, but then I'm going to support the existing
> users, as I promised.
> 
> Now, I think that the hibernation should better be done completely in the
> kernel, because that's just conceptually simpler, although some data exchange
> with the user land may be acceptable for some optional fancy stuff.  I'm also
> tired of the endless "to merge or not to merge suspend2" discussions that just
> lead to nowhere.  For these reasons I declare that I'm ready to cooperate with
> Nigel on integrating as much of suspend2 as reasonably possible into the
> existing infrastructure, under the following conditions:
> - we don't remove the existing user-visible interfaces

I don't want to remove user visible interfaces either (I understand that
you mean the ioctls by that?). Perhaps we can find a way to make them
still usable with a more in-kernel solution (ie some things become
noops?).

> - we work on one piece of code at a time

Sure. We should spend some time discussing and planning beforehand so we
don't waste time and effort writing and rewriting.

> - we avoid code duplication, as much as possible

No problem there.

> - we avoid using open-coded things, if possible

Regarding open-coded things, I assume you're referring to the extents. I
would argue that they're not open-coded because list.h implements doubly
linked lists, and extents use a singly linked list. That said, I suppose
we could make the extents doubly linked and use list.h, even though that
would be a waste of 4/8 bytes per extent.
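
To put a number on that trade-off, here is a minimal sketch. The names
extent_single, extent_double and two_pointer_links are hypothetical and are
not taken from Suspend2; two_pointer_links only mimics the two-pointer layout
that list.h's struct list_head would add to each extent.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical extent using a hand-rolled singly linked list. */
struct extent_single {
	unsigned long start;
	unsigned long end;
	struct extent_single *next;
};

/* Stand-in for the kernel's struct list_head: two pointers. */
struct two_pointer_links {
	struct two_pointer_links *next, *prev;
};

/* The same extent, but chained through a doubly linked list node. */
struct extent_double {
	unsigned long start;
	unsigned long end;
	struct two_pointer_links list;
};

/* Extra bytes each extent pays for the second (prev) pointer. */
size_t extra_bytes_per_extent(void)
{
	return sizeof(struct extent_double) - sizeof(struct extent_single);
}

/* Print both sizes so the overhead can be checked on a given machine. */
void print_extent_overhead(void)
{
	printf("single: %zu bytes, double: %zu bytes, extra: %zu bytes\n",
	       sizeof(struct extent_single),
	       sizeof(struct extent_double),
	       extra_bytes_per_extent());
}
```

On common 32-bit and 64-bit Linux ABIs the extra cost should be exactly one
pointer per extent, which is where the 4/8 bytes figure comes from.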

> - if we don't agree on something, we ask someone wiser (volunteers welcome ;-))

Absolutely!

> If that's acceptable, we can start tomorrow.  In the process, we can try to
> separate the hibernation code paths from the s2ram ones, but that will require
> a lot of knowledge about things that neither me nor Nigel, AFAICT, are very
> familiar with, like writing device drivers.

Yes.

Thanks for this email. It's really encouraging, and I'm more than glad
to work with you. Unfortunately, as you've seen me keep saying already,
I have very limited time to work on this. Thankfully you seem to have
more, and Pekka has also stepped up to help, so maybe we can make good
forward progress despite my limitations.

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:42       ` Pavel Machek
@ 2007-04-26 22:24         ` David Lang
  2007-04-26 23:12           ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: David Lang @ 2007-04-26 22:24 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Fri, 27 Apr 2007, Pavel Machek wrote:

> This is basically the loop above, made complex by the fact that we do
> not want to have separate partition for snapshot; we just want to
> reuse free space in swap partition.

with the size of drives today is it really that bad to require a separate 
partition for this?

I also don't like the idea of storing this in the swap partition for a couple of 
reasons.

1. on many modern linux systems the swap partition is not large enough.

for example, on my boxes with 16G of RAM I only allocate 2G of swap space

2. it's too easy for other things to stomp on your swap partition.

   for example: booting from a live CD that finds and uses swap partitions

if you need space for your freeze, allocate it in an unambiguous way, not 
by re-using an existing partition.

David Lang



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
                         ` (3 preceding siblings ...)
  2007-04-26 19:56       ` Nigel Cunningham
@ 2007-04-26 22:40       ` Pavel Machek
  2007-04-27  5:41         ` Pekka Enberg
  2007-04-26 22:42       ` Pavel Machek
  2007-04-27 12:49       ` Pavel Machek
  6 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-26 22:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka Enberg, LKML

Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
> 
> I'd actually like to discuss this a bit..
> 
> I'm obviously not a huge fan of the whole user/kernel level split and 
> interfaces, but I actually do think that there is *one* split that makes 
> sense:
> 
>  - generate the (whole) snapshot image entirely inside the kernel
> 
>  - do nothing else (ie no IO at all), and just export it as a single image 
>    to user space (literally just mapping the pages into user space). 
>    *one* interface. None of the "pretty UI update" crap. Just a single 
>    system call:
> 
> 	void *snapshot_system(u32 *size);
> 
>    which will map in the snapshot, return the mapped address and the size 
>    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
>    you're actually *better* off just admitting that if you cannot shrink 
>    the snapshot to less than 32 bits, it's not worth doing)

This is basically how uswsusp is designed. (We do not use a system call;
you just read from /dev/snapshot, and you have to make a few ioctls to
stop the other tasks.)

> and for testing, you should be able to basically do
> 
> 	u32 size;
> 	void *buffer = snapshot_system(&size);
> 	if (buffer != MAP_FAILED)
> 		resume_snapshot(buffer, size);
> 
> and it should obviously work.

Which is what I did a long time ago, during uswsusp development.

> Once you have that snapshot image in user space you can do anything you 
> want. And again: you'd have a fully working system: not any degradation 
> *at*all*. If you're in X, then X will continue running etc even after the 
> snapshotting, although obviously the snapshotting will have tried to page 
> a lot of stuff out in order to make the snapshot smaller, so you'll likely 
> be crawling.

Well... We decided not to do this in the fully working system. SIGSTOP
is just not strong enough, and we want the snapshot atomic.

Now, it would be _very_ nice to be able to snapshot system and
continue running, but I just don't see how to do it without extensive
filesystem support.
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
                         ` (4 preceding siblings ...)
  2007-04-26 22:40       ` Pavel Machek
@ 2007-04-26 22:42       ` Pavel Machek
  2007-04-26 22:24         ` David Lang
  2007-04-27 12:49       ` Pavel Machek
  6 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-26 22:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka Enberg, LKML

Hi!

> I'd really suggest _just_ the "full image". Nothing else is probably ever 
> worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as 
> "echo disk > /sys/power/state", but it should not necessarily be much 
> worse than
> 
> 	snapshot_kernel | gzip -9 > /dev/snapshot

Yep, we "freeze too much", so we can't just use the shell and pipe
it. Too bad.

int write_image(char *resume_dev_name)
{
	static struct swap_map_handle handle;
	struct swsusp_info *header;
	unsigned long start;
	int fd;
	int error;

	fd = open(resume_dev_name, O_RDWR | O_SYNC);
	if (fd < 0) {
		/* save errno before printf; 'error' used to be returned
		 * uninitialized here */
		error = -errno;
		printf("suspend: Could not open resume device\n");
		return error;
	}
	/* 'dev' (the snapshot device) and 'buffer' are not declared in this
	 * excerpt; presumably they are defined elsewhere in the file */
	error = read(dev, buffer, PAGE_SIZE);
	if (error < PAGE_SIZE) {
		error = error < 0 ? error : -EFAULT;
		goto out;	/* do not leak fd on the early exits */
	}
	header = (struct swsusp_info *)buffer;
	if (!enough_swap(header->pages)) {
		printf("suspend: Not enough free swap\n");
		error = -ENOSPC;
		goto out;
	}
	error = init_swap_writer(&handle, fd);
	if (!error) {
		/* remember where the image starts so mark_swap() can record it */
		start = handle.cur_swap;
		error = swap_write_page(&handle, header);
	}
	if (!error)
		error = save_image(&handle, header->pages - 1);
	if (!error) {
		flush_swap_writer(&handle);
		printf( "S" );
		error = mark_swap(fd, start);
		printf( "|\n" );
	}
	fsync(fd);
out:
	close(fd);
	return error;
}

This is basically the loop above, made complex by the fact that we do
not want to have separate partition for snapshot; we just want to
reuse free space in swap partition.
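
For contrast, the plain copy loop that the shell pipeline would amount to,
with none of the swap bookkeeping, could look roughly like the sketch below.
copy_image, write_all and CHUNK_SIZE are hypothetical names, and ordinary
file descriptors stand in for the snapshot source and the destination; this
is an illustration under those assumptions, not uswsusp code.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define CHUNK_SIZE 4096	/* assumed page size, for illustration only */

/* Write exactly len bytes, retrying on short writes and EINTR. */
int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Copy in_fd to out_fd until EOF; returns bytes copied or -errno. */
long copy_image(int in_fd, int out_fd)
{
	char buf[CHUNK_SIZE];
	long total = 0;

	for (;;) {
		ssize_t n = read(in_fd, buf, sizeof(buf));
		int err;

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)
			return total;	/* end of image */
		err = write_all(out_fd, buf, (size_t)n);
		if (err)
			return err;
		total += n;
	}
}
```

Everything write_image() does beyond this (checking free swap, the swap map
handle, marking the swap header) exists only because the image has to share
a partition with live swap.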

I think you've just invented uswsusp.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 23:12           ` Pavel Machek
@ 2007-04-26 22:49             ` David Lang
  2007-04-26 23:27               ` Pavel Machek
  2007-04-27  0:23               ` Olivier Galibert
  0 siblings, 2 replies; 136+ messages in thread
From: David Lang @ 2007-04-26 22:49 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> This is basically the loop above, made complex by the fact that we do
>>> not want to have separate partition for snapshot; we just want to
>>> reuse free space in swap partition.
>>
>> with the size of drives today is it really that bad to require a separate
>> partition for this?
>
> Yes. You want uswsusp to work in situations where swsusp worked.
>
>> I also don't like the idea of storing this in the swap partition for a
>> couple of reasons.
>>
>> 1. on many modern linux systems the swap partition is not large enough.
>>
>> for example, on my boxes with 16G of RAM I only allocate 2G of swap
>> space
>
> WTF? So allocate a larger swap partition. You just told me disks are big
> enough.

swap partitions are limited to 2G (or at least they were a couple of months ago 
when I last checked). I also don't want to run the risk of having a box try to 
_use_ 16G worth of swap. I'd rather have the box hit OOM first.

>> 2. it's too easy for other things to stomp on your swap partition.
>>
>>   for example: booting from a live CD that finds and uses swap
>> partitions
>
> That's a feature. If you are booting from live CD, you _want_ to erase
> any hibernation image.

why?

it's been stated that doing a std and booting another OS (including windows) is 
a valid and common usage. saying that if you boot another OS you trash your 
suspended image doesn't sound reasonable.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 23:27               ` Pavel Machek
@ 2007-04-26 22:56                 ` David Lang
  0 siblings, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-26 22:56 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Fri, 27 Apr 2007, Pavel Machek wrote:

> Hi!
>
>>> That's a feature. If you are booting from live CD, you _want_ to erase
>>> any hibernation image.
>>
>> why?
>>
>> it's been stated that doing a std and booting another OS (including
>> windows) is a valid and common usage. saying that if you boot another OS
>> you trash your suspended image doesn't sound reasonable.
>
> If you hibernate your machine, boot from live cd, and change anything
> on any filesystem, you are pretty likely to lose that filesystem.

booting from a live CD doesn't mean that you are going to mount the filesystem, 
let alone change it. but swap is not supposed to be this sensitive.

David Lang

> Doing that with Windows is okay as Windows does not usually write to
> ext3 partitions.
> 									Pavel
>

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:24         ` David Lang
@ 2007-04-26 23:12           ` Pavel Machek
  2007-04-26 22:49             ` David Lang
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-26 23:12 UTC (permalink / raw)
  To: David Lang; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

Hi!

> >This is basically the loop above, made complex by the fact that we do
> >not want to have separate partition for snapshot; we just want to
> >reuse free space in swap partition.
> 
> with the size of drives today is it really that bad to require a separate 
> partition for this?

Yes. You want uswsusp to work in situations where swsusp worked.

> I also don't like the idea of storing this in the swap partition for a 
> couple of reasons.
> 
> 1. on many modern linux systems the swap partition is not large enough.
> 
> for example, on my boxes with 16G of RAM I only allocate 2G of swap
> space

WTF? So allocate a larger swap partition. You just told me disks are big
enough.

> 2. it's too easy for other things to stomp on your swap partition.
> 
>   for example: booting from a live CD that finds and uses swap
> partitions

That's a feature. If you are booting from live CD, you _want_ to erase
any hibernation image.

> if you need space for your freeze, allocate it in an unambiguous way, 
> not by re-using an existing partition.

Of course you have that option. Writing the image is done in userspace, so
you are free to write it to a raw partition (and the first versions indeed
did that).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:08             ` Rafael J. Wysocki
  2007-04-26 22:20               ` Nigel Cunningham
@ 2007-04-26 23:15               ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-26 23:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: nigel, Andrew Morton, Xavier Bestel, Pekka Enberg, LKML, Pavel Machek



On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> 
> Well, I think that much of what Linus is saying indicates that he hasn't tried
> to write any such thing himself. ;-)

That's definitely true. The only interaction I ever had with "hibernation" 
(and yes, we should just call it that) is when I was working on s2ram and 
cleaning up the PCI device suspend/resume in particular, and trying 
(_mostly_ successfully - I think I broke it once or twice mainly due to 
interactions with the console, but on the whole I think it mostly worked) 
to not break hibernation in the process without actually running it.

> Now that I know much more than before, I can say I agree with Linus on his
> opinion about the separation of s2ram from the snapshot/restore functionality
> (I'll call it 'hibernation' for simplicity from now on).

So my strong opinion on it literally comes from the other end (ie _not_ 
knowing about hibernation, but trying to work with s2ram, and cursing the 
mixups).

> It should be done, because it would make things simpler and cleaner.  
> Still, it will be difficult to do without screwing users en masse and 
> that's my main concern here.

I do agree. It will inevitably affect a lot of devices. That's always 
painful.

> I don't agree that we don't need the tasks freezer for suspending and
> hibernation.  We need it, because we need to be sure that the (other) tasks
> will not get us in the way, and that also applies to kernel threads (and I
> don't think the tasks freezer is 'screwing' them, BTW).

I actually feel much less strongly about that, because just separating out 
s2ram and hibernate entirely from each other would already really get the 
thing _I_ care about taken care of - being able to work on one or the 
other without fear of breaking the other one.

And besides, I actually came into the whole discussion because I'm not a 
huge fan of thinking that user-land is "better". If the thing can sanely 
be done in kernel, I'm actually all for that. What drives me wild is 
having three different things, and nobody driving.

It needs somebody who (a) cares (b) has good taste and (c) has enough time 
and personal karma to burn that he can actually take the (obviously) 
inevitable heat from just doing things right, and convincing people to 
select *one* implementation.

That kind of person is really really hard to find. And if you're it, 
you're in for some pain ;)

		Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:49             ` David Lang
@ 2007-04-26 23:27               ` Pavel Machek
  2007-04-26 22:56                 ` David Lang
  2007-04-27  0:23               ` Olivier Galibert
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-26 23:27 UTC (permalink / raw)
  To: David Lang; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

Hi!

> >That's a feature. If you are booting from live CD, you _want_ to erase
> >any hibernation image.
> 
> why?
> 
> it's been stated that doing a std and booting another OS (including 
> windows) is a valid and common usage. saying that if you boot another OS 
> you trash your suspended image doesn't sound reasonable.

If you hibernate your machine, boot from live cd, and change anything
on any filesystem, you are pretty likely to lose that filesystem.

Doing that with Windows is okay as Windows does not usually write to
ext3 partitions.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 20:50               ` Nigel Cunningham
@ 2007-04-27  0:10                 ` Olivier Galibert
  2007-04-27 10:21                   ` Daniel Pittman
  2007-04-27 23:19                   ` Nigel Cunningham
  0 siblings, 2 replies; 136+ messages in thread
From: Olivier Galibert @ 2007-04-27  0:10 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, Xavier Bestel, Pekka Enberg, LKML

On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
> I'm perfectly willing to think through some alternate approach if you
> suggest something or prod my thinking in a new direction, but I'm afraid
> I just can't see right now how we can achieve what you're after.

Ok, what about this approach I've been mulling about for a while:

Suspend-to-disk is pretty much an exercise in state saving.  There are
multiple ways to do state saving, but they tend to end up in two
categories: implicit and explicit.

In implicit state saving, you try to save the state of the
system/application/whatever "under its feet", more or less, and then
fix up what is not saved/saveable correctly.  A well-known example is
the undumping process Emacs goes (went?) through, where it tries to dump
the state of the memory as a new executable, with a lot of pleasure with
various executable formats and subtleties due to side effects in libc
code you don't control.

In explicit state saving each object saves what is needed from its
state to an independently defined format (instead of "whatever the
memory organization happens to be at that point").  When reloading the
state you have to parse it, and it usually requires
rebuilding/relocating all references/pointers/etc.  XEmacs currently
has a "portable dumper" that pretty much does just that.  We don't
have any redumping problems anymore, they're over.

Which one is the best depends heavily on the application.  The amount
of code in the implicit case depends on the amount of fixups to do.
In the kernel case it happens to be a lot, pretty much everything that
touches hardware has to save to memory the device state and reload it
on resume.  And bugs on hardware handling can be quite annoying to
debug.  And if some driver does not do saving/resume correctly, you
have no way outside of playing with modules to ensure the safety of
the suspend cycle.

The amount of code in the explicit case is an interesting variable in
the case of the kernel.  You have to save what is needed, but how do
you define what is needed?  It is, pretty much, what running processes
can observe from userspace.  Now, what can a process observe:
- its application text and anonymous memory pages
- its file handles
- its mapped files
- its mapped whatever else
- its sys5 IPC stuff
- futex stuff and friends, namespaces, etc
- its intrinsic characteristics it can reach through syscalls
  (i.e. the user-visible parts of current, like pid, uid...)
- its currently running system call, if any

So that's what we'd have to explicitly save.  Anonymous memory, sys5
IPC, futex and current structures, that's easy stuff in practice.  The
fun parts are pretty much:
- references to files
- references to active networking links
- references to devices and associated visible state
- currently running system call, aka the kernel stack for the process

The last one is the one I'm the most afraid of.  I hope that the
signal stuff and/or the asynchronous syscall stuff that was discussed
recently would allow us to "unwind" blocking system calls back to the
syscall level and then store the parameters for resume-time restart.
The non-blocking calls you can just let finish.

The first one is really interesting.  If you value your filesystems,
you'd rather have them clean after the suspend.  And also you pretty
much know that filesystems can move around when you're not looking, be
it USB hotplug stuff (discovery order is random-ish isn't it?), module
loading order issues or multithreaded device discovery.  So you're way
more happy *not* caching anything from the filesystem you can avoid.

But what is a file reference, really?  With the dcache handy, it's
pretty much a path, since inodes don't always exist reliably.  And if
you have the lists of paths used by the processes on a particular
filesystem, you can easily get an idea of where, if anywhere, the
filesystem is even if you don't have reliable serials.  More
interestingly, you cannot, in any case, instantly corrupt your
filesystem by having a mismatch between the in-memory cache and the
reality.

The processes which referenced files you can't find anywhere will
end-up with EBADF or segfault depending on whether it was fd or mmap,
ala revoke().  They'll probably die horribly.  I'd rather have
processes die than filesystems die, since in any case if the file
isn't here anymore in practice the process could only destroy things.
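
As a rough illustration of the path-based idea above: the following is
only a userspace sketch in Python, not kernel code, and the helper names
locate_filesystem() and restore_refs() are made up for this example.  It
guesses where a filesystem ended up by counting how many saved paths
exist under each candidate mount point, and treats every reference that
cannot be found as revoked (None) instead of trusting any cached data.

```python
import os


def locate_filesystem(relative_paths, candidate_roots):
    """Return the candidate root under which most saved paths exist.

    Returns None when no candidate contains any of the saved paths.
    """
    best_root, best_hits = None, 0
    for root in candidate_roots:
        hits = sum(os.path.exists(os.path.join(root, p)) for p in relative_paths)
        if hits > best_hits:
            best_root, best_hits = root, hits
    return best_root


def restore_refs(relative_paths, root):
    """Map each saved path to its new full path, or None if it is gone.

    A None entry stands for a revoked reference: the process would see
    EBADF (or a fault for a mapping) rather than stale file contents.
    """
    restored = {}
    for p in relative_paths:
        full = os.path.join(root, p) if root is not None else None
        restored[p] = full if full is not None and os.path.exists(full) else None
    return restored
```

Nothing here reads or writes filesystem data; it only checks whether
names exist, which is the property the paragraph above relies on.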

An interesting thing there: nothing in that touches either the
filesystem or the block devices.  Everything is done at the VFS level.
The devices don't need to care.  And the "this filesystem goes there"
can be done in userspace in an initramfs if people want to experiment
with kinky strategies.  After all, why not allow a sysadmin to regroup
two filesystems into one through a suspend, the processes mostly don't
need to care (well, tar may, but heh).  Deleted files would have to be
sillyrenamed or something.  Implementation details ;-)

Active networking links, you can consider them dead for a start.  The
networking guys can play with keepalives and stuff if they want to in
a second step.  Network seldom survives suspend anyway, too many
timeouts involved, especially with dynamic IPs.

That leaves references to devices.  null, ptys, random, log are not a
problem, they're virtual constructs.  In a first approximation you can
revoke() the rest brutally.  On a "standard" system that will kill X
(ouch), GPM and other input-interested devices, and everything with an
opened sound device.  Then you can add explicit state saving support
to the devices you want, one by one.  It may be possible to handle
sound collectively at the ALSA layer level, I don't really know.
Input shouldn't be too hard, not much state to save, X will be a pain
and will probably need special casing.  X is a big special case
anyway, no matter what happens.

For the less directly used devices you can always add explicit support
when you feel like it.  The interesting part is that either the device
supports the suspend and says so explicitly, or the process can't
access the device anymore using the previous fds/mmaps after resume.
No weird half-condition.  If (very) resilient, the process can even
close, reopen, reconfigure and go on its merry way.

And if you design the saving format correctly (attribute name/value
pairs as text work beautifully for such a case), you can be resilient
to extreme things including kernel version change or rsync-ing / and
the state file and resuming in another box.  And if a device gets
something it can't parse as the state to go back to for a given
fd/mmap for a process, it can always revoke() that one and go on.
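
To make the name/value idea concrete, here is a minimal sketch in
Python.  It is a userspace illustration under stated assumptions, not a
proposed kernel format: the FdState fields and the save_fd()/restore_fd()
helpers are hypothetical.  The state is written as text "name=value"
lines, unknown attributes are ignored on restore (so a newer writer does
not break an older reader), and anything that cannot be parsed yields
None, standing in for "revoke() that one and go on".

```python
from dataclasses import dataclass


@dataclass
class FdState:
    """Hypothetical user-visible state of one open file descriptor."""
    fd: int
    path: str
    offset: int
    flags: int


def save_fd(state):
    """Serialize to 'name=value' lines, independent of memory layout."""
    return "\n".join(f"{name}={value}" for name, value in vars(state).items())


def restore_fd(text):
    """Parse saved state; return None (revoke) if it cannot be understood."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            return None  # not a name=value line: refuse to guess
        fields[name] = value
    try:
        return FdState(
            fd=int(fields["fd"]),
            path=fields["path"],
            offset=int(fields["offset"]),
            flags=int(fields["flags"]),
        )
    except (KeyError, ValueError):
        return None  # missing or malformed attribute: revoke
```

The sketch assumes paths contain no newline characters; a real format
would need an escaping rule for that.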

The main point of that kind of state-saving is to be
trustable-by-design.  For each process, either its environment could
be restored correctly or the incorrect parts can not be accessed
anymore.  And the stability of the system and its filesystems is
ensured pretty much whatever happens.


There are a billion details to take into account in a real
implementation, but I'm sure you can get the gist of the idea.

  OG.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:49             ` David Lang
  2007-04-26 23:27               ` Pavel Machek
@ 2007-04-27  0:23               ` Olivier Galibert
  1 sibling, 0 replies; 136+ messages in thread
From: Olivier Galibert @ 2007-04-27  0:23 UTC (permalink / raw)
  To: David Lang
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Thu, Apr 26, 2007 at 03:49:51PM -0700, David Lang wrote:
> swap partitions are limited to 2G (or at least they were a couple of months 
> ago when I last checked). I also don't want to run the risk of having a box 
> try to _use_ 16G worth of swap. I'd rather have the box hit OOM first.

They aren't limited anymore, I have a number of machines with 20G swap
for experiments.

  OG.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 19:56       ` Nigel Cunningham
@ 2007-04-27  4:52         ` Pekka J Enberg
  2007-04-27  6:08           ` Nigel Cunningham
  2007-04-27 20:44           ` Rafael J. Wysocki
  2007-04-28 19:09         ` Bill Davidsen
  1 sibling, 2 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27  4:52 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML

On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> >    which will map in the snapshot, return the mapped address and the size 
> >    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
> >    you're actually *better* off just admitting that if you cannot shrink 
> >    the snapshot to less than 32 bits, it's not worth doing)
 
On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> That inherently limits the image to half of available ram (you need
> somewhere to store the snapshot), so you won't get the full image you
> express interest in below.

It doesn't. We can make the userspace mapped pages copy-on-write. As long 
as the userspace makes sure there's not much activity during 
snapshot/shutdown, we will be fine. What we probably do need to copy is 
kernel pages.
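
A userspace analogy may help here (this is only a sketch, in Python, and
it uses fork() rather than anything the kernel would do for
hibernation; snapshot_via_fork() is a made-up name).  After fork() the
child shares the parent's pages copy-on-write, so it keeps seeing memory
as it was at fork time even though the parent goes on writing, and no
full second copy is made up front.

```python
import os


def snapshot_via_fork(state, mutate):
    """Return the bytes of `state` as they were at fork time.

    The parent calls mutate(state) after forking; the child only reads
    its copy-on-write view once the parent has finished mutating.
    Assumes `state` is small enough for a single pipe write.
    """
    data_r, data_w = os.pipe()
    go_r, go_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: wait for the parent to dirty its pages, then report
        # what this process still sees.
        os.close(data_r)
        os.close(go_w)
        os.read(go_r, 1)
        os.write(data_w, bytes(state))
        os._exit(0)
    # Parent: keep running and modify memory after the "snapshot".
    os.close(data_w)
    os.close(go_r)
    mutate(state)
    os.write(go_w, b"x")
    os.close(go_w)
    chunks = []
    while True:
        chunk = os.read(data_r, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(data_r)
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

The cost shows up only for pages the parent actually writes to, which is
why the amount of activity after the snapshot matters.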

			Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 22:40       ` Pavel Machek
@ 2007-04-27  5:41         ` Pekka Enberg
  2007-04-27 14:55           ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: Pekka Enberg @ 2007-04-27  5:41 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linus Torvalds, Nigel Cunningham, LKML

On 4/27/07, Pavel Machek <pavel@ucw.cz> wrote:
> Now, it would be _very_ nice to be able to snapshot system and
> continue running, but I just don't see how to do it without extensive
> filesystem support.

So what kind of support do we need from the filesystem?

                                              Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  4:52         ` Pekka J Enberg
@ 2007-04-27  6:08           ` Nigel Cunningham
  2007-04-27  6:18             ` Pekka J Enberg
  2007-04-27 20:44           ` Rafael J. Wysocki
  1 sibling, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27  6:08 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

Hi.

On Fri, 2007-04-27 at 07:52 +0300, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > >    which will map in the snapshot, return the mapped address and the size 
> > >    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
> > >    you're actually *better* off just admitting that if you cannot shrink 
> > >    the snapshot to less than 32 bits, it's not worth doing)
>  
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
> 
> It doesn't. We can make the userspace mapped pages copy-on-write. As long 
> as the userspace makes sure there's not much activity during 
> snapshot/shutdown, we will be fine. What we probably do need to copy is 
> kernel pages.

COW is a possibility, but I understood (perhaps wrongly) that Linus was
thinking of a single syscall or such like to prepare the snapshot. If
you're going to start doing things like this, won't that mean you'd then
have to update/redo the snapshot or somehow nullify the effect of
anything the programs do so that doing it again after the snapshot is
restored doesn't cause problems?

I was going to leave it at that and press send, but perhaps that
wouldn't be wise. I feel I should also ask what you're thinking of as a
means of making sure userspace doesn't do much activity.

Thanks for your labours!

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:08           ` Nigel Cunningham
@ 2007-04-27  6:18             ` Pekka J Enberg
  2007-04-27  6:29               ` Pekka J Enberg
                                 ` (3 more replies)
  0 siblings, 4 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27  6:18 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> COW is a possibility, but I understood (perhaps wrongly) that Linus was
> thinking of a single syscall or such like to prepare the snapshot. If
> you're going to start doing things like this, won't that mean you'd then
> have to update/redo the snapshot or somehow nullify the effect of
> anything the programs do so that doing it again after the snapshot is
> restored doesn't cause problems?

No. The snapshot is just that. A snapshot in time. From the kernel's point 
of view, it doesn't matter one bit when you did it or if the state has 
changed before you resume. It's up to userspace to make sure the user 
doesn't do real work while the snapshot is being written to disk and 
the machine is shut down.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> I was going to leave it at that and press send, but perhaps that
> wouldn't be wise. I feel I should also ask what you're thinking of as a
> means of making sure userspace doesn't do much activity.

When the snapshot pages are COW, we will run out of memory if userspace 
writes to those pages too much. If userspace is blocked, say by 
displaying a "we are suspending" message in X that keeps the user from 
using other programs that could generate new writes, and by mounting 
filesystems read-only, we don't need to worry about running out of memory.
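
The memory argument can be put as simple arithmetic.  The sketch below
(Python, with an assumed 4096-byte page size and made-up helper names)
counts how many distinct snapshot pages a set of writes would touch;
each one needs a free page for its copy, so the snapshot only survives
while that count stays within the free memory available.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def cow_pages_needed(write_addresses):
    """Number of distinct snapshot pages the writes force a copy of."""
    return len({addr // PAGE_SIZE for addr in write_addresses})


def snapshot_survives(free_pages, write_addresses):
    """True if there are enough free pages to hold every COW copy."""
    return cow_pages_needed(write_addresses) <= free_pages
```

Repeated writes to the same page cost one copy only, so what matters is
how widely userspace writes, not how often.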

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:18             ` Pekka J Enberg
@ 2007-04-27  6:29               ` Pekka J Enberg
  2007-04-27  6:34               ` Nigel Cunningham
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27  6:29 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs do so that doing it again after the snapshot is
> > restored doesn't cause problems?

On Fri, 27 Apr 2007, Pekka J Enberg wrote:
> No. The snapshot is just that. A snapshot in time. From kernel point of 
> view, it doesn't matter one bit when you did it or if the state has 
> changed before you resume. It's up to userspace to make sure the user 
> doesn't do real work while the snapshot is being written to disk and 
> machine is shut down.

Btw, obviously we need to break the COW when resuming and not include the 
snapshot mapping. However, that should be trivially doable by snapshotting 
the page mappings before remapping them as COW.

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:18             ` Pekka J Enberg
  2007-04-27  6:29               ` Pekka J Enberg
@ 2007-04-27  6:34               ` Nigel Cunningham
  2007-04-27  6:50                 ` Pekka J Enberg
  2007-04-27  9:50               ` Oliver Neukum
  2007-04-27 21:24               ` Rafael J. Wysocki
  3 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27  6:34 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 2787 bytes --]

Hi.

On Fri, 2007-04-27 at 09:18 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs do so that doing it again after the snapshot is
> > restored doesn't cause problems?
> 
> No. The snapshot is just that. A snapshot in time. From kernel point of 
> view, it doesn't matter one bit when you did it or if the state has 
> changed before you resume. It's up to userspace to make sure the user 
> doesn't do real work while the snapshot is being written to disk and 
> machine is shut down.

Sorry Pekka, but that's just broken.

It implies firstly that we tell all userspace programs "I'm sorry, but
I'm suspending at the moment. Can you tip toe quietly around while I do
it?" You can't seriously expect every userspace program to be modified
to adjust its behaviour according to whether we're writing a snapshot
to disk at the moment or not.

It also implies that we can prepare a snapshot and then happily have the
contents of the disk change so that they don't match the superblock and
other filesystem details we just saved in the snapshot. We can't. At
least not without modifying all the filesystems so that (at a minimum)
they know how to throw away all the metadata they have at resume time
and reread it from disk.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > I was going to leave it at that and press send, but perhaps that
> > wouldn't be wise. I feel I should also ask what you're thinking of as a
> > means of making sure userspace doesn't do much activity.
> 
> When the snapshot pages are COW, we will run out of memory if userspace 
> writes to those pages too much. If userspace is blocked, say like 
> displaying a "we are suspending" in X which blocks the user from using 
> other programs that could generate new writes and mounting filesystems 
> read-only, we don't need to worry about running out of memory.

This sounds feasible, but it's only really acceptable if you're willing to
have hibernation fail or restart multiple times. If your battery is
running out or you need to rush to put a lappy in your bag because the
train just came early, that's not an option. It's for that very reason
that I've put a lot of effort into trying to make it work first time,
every time. Not there yet, but it's a priority.

By the way, sorry. This email feels like it is pouring a lot of cold
water on your ideas. I don't want to be negative!

Regards,

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:34               ` Nigel Cunningham
@ 2007-04-27  6:50                 ` Pekka J Enberg
  2007-04-27  7:03                   ` Nigel Cunningham
  0 siblings, 1 reply; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27  6:50 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Sorry Pekka, but that's just broken.

It certainly isn't.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It implies firstly that we tell all userspace programs "I'm sorry, but
> I'm suspending at the moment. Can you tip toe quietly around while I do
> it?" You can't seriously expect every userspace program to be modified
> to adjust its behaviour according to whether we're writing a snapshot
> to disk at the moment or not.

You don't need to modify other programs. You just need to display the 
progress bar and block _user input_. I don't even claim to know X, but I 
would be extremely surprised if you technically can't say "don't let 
the user touch any other windows except this one." The user couldn't care 
less whether tasks are frozen or not by the kernel. What matters is that 
the user can't shoot himself in the foot while snapshotting.

Furthermore, we probably do need to do other things to ensure safety, like 
remounting filesystems read-only but again, this has nothing to do with 
snapshotting per se. What the kernel needs to worry about is (1) providing 
an atomic snapshot that is consistent and (2) resuming to that snapshot 
safely. If the _user_ loses data that was generated between snapshot + 
shutdown, it's absolutely no concern for the snapshot operation!

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> It also implies that we can prepare a snapshot and then happily have the
> contents of the disk change so that they don't match the superblock and
> other filesystem details we just saved in the snapshot. We can't. At
> least not without modifying all the filesystems so that (at a minimum)
> they know how to throw away all the metadata they have at resume time
> and reread it from disk.

But you just explained how we can! We shouldn't bend over backwards for 
snapshotting just because the filesystems don't currently support 
something we need.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> By the way, sorry. This email feels like it is pouring a lot of cold
> water on your ideas. I don't want to be negative!

Don't worry, I am used to cold water :-).

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:50                 ` Pekka J Enberg
@ 2007-04-27  7:03                   ` Nigel Cunningham
  2007-04-27  7:24                     ` Pekka J Enberg
  0 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27  7:03 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Linus Torvalds, LKML

[-- Attachment #1: Type: text/plain, Size: 3483 bytes --]

Hi.

On Fri, 2007-04-27 at 09:50 +0300, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > Sorry Pekka, but that's just broken.
> 
> It certainly isn't.
> 
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It implies firstly that we tell all userspace programs "I'm sorry, but
> > I'm suspending at the moment. Can you tip toe quietly around while I do
> > it?" You can't seriously expect every userspace program to be modified
> > to adjust its behaviour according to whether we're writing a snapshot
> > to disk at the moment or not.
> 
> You don't need to modify other programs. You just need to display the 
> progress bar and block _user input_. I don't even claim to know X, but I 
> would be extremely surprised if you technically can't say "don't let 
> the user touch any other windows except this one." The user couldn't care 
> less whether tasks are frozen or not by the kernel. What matters is that 
> the user can't shoot himself in the foot while snapshotting.

User input doesn't account for all system activity. Think of cron jobs
or user initiated jobs that may have started before the cycle began.

> Furthermore, we probably do need to do other things to ensure safety, like 
> remounting filesystems read-only but again, this has nothing to do with 
> snapshotting per se. What the kernel needs to worry about is (1) providing 
> an atomic snapshot that is consistent and (2) resuming to that snapshot 
> safely. If the _user_ loses data that was generated between snapshot + 
> shutdown, it's absolutely no concern for the snapshot operation!

Noooo! If the user loses data, the user will be concerned and we should
be. I for one would do my best to avoid using software that loses my
data for me. I wouldn't care if you said "Well, it's your fault. You
lost the data." From my perspective as a user, I didn't lose the data,
some part of the computer's OS did.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > It also implies that we can prepare a snapshot and then happily have the
> > contents of the disk change so that they don't match the superblock and
> > other filesystem details we just saved in the snapshot. We can't. At
> > least not without modifying all the filesystems so that (at a minimum)
> > they know how to throw away all the metadata they have at resume time
> > and reread it from disk.
> 
> But you just explained how we can! We shouldn't bend over backwards for 
> snapshotting just because the filesystems don't currently support 
> something we need.

Sorry, but I just don't believe filesystems should need to throw away
metadata post resume. If we let data be changed after snapshotting (or
ourselves cause it to be changed), we're the ones that are broken. Our
snapshot is out of date and the expectations of userspace programs that
were snapshotted will be out of date. Just imagine, for example, a
userspace program that is snapshotted, then reads and deletes a
temporary file. After the snapshot restore, it's running again. But
wait, we can't read or delete the file again because it's already gone.
Life just gets more complicated and confusing this way.

> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > By the way, sorry. This email feels like it is pouring a lot of cold
> > water on your ideas. I don't want to be negative!
> 
> Don't worry, I am used to cold water :-).

Maybe, but I'd still rather be encouraging!

Nigel

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  7:03                   ` Nigel Cunningham
@ 2007-04-27  7:24                     ` Pekka J Enberg
  0 siblings, 0 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27  7:24 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Linus Torvalds, LKML

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> User input doesn't account for all system activity. Think of cron jobs
> or user initiated jobs that may have started before the cycle began.

Yes, but the _user_ did not start them so they didn't lose any work. See, 
it might or might not be important but that's something the _userspace_ 
knows much more about than the kernel ever will.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Noooo! If the user loses data, the user will be concerned and we should
> be. I for one would do my best to avoid using software that loses my
> data for me. I wouldn't care if you said "Well, it's your fault. You
> lost the data." From my perspective as a user, I didn't lose the data,
> some part of the computer's OS did.

You are looking at snapshot/shutdown from the kernel and user experience 
points of view at the same time, which causes confusion here.

Let me repeat: it is _absolutely no concern_ of the _kernel_ whether you 
resume to a snapshot that does not contain all your precious data. The 
kernel doesn't care one bit!

That being said, the _userspace solution_ obviously needs to take this 
into account by blocking user input, making filesystems read-only, and 
maybe even blocking certain background processes (cron and beagle indexing 
come to mind).

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Sorry, but I just don't believe filesystems should need to throw away
> metadata post resume. If we let data be changed after snapshotting (or
> ourselves cause it to be changed), we're the ones that are broken. Our
> snapshot is out of date and the expectations of userspace programs that
> were snapshotted will be out of date. Just imagine, for example, a
> userspace program that is snapshotted, then reads and deletes a
> temporary file. After the snapshot restore, it's running again. But
> wait, we can't read or delete the file again because it's already gone.
> Life just gets more complicated and confusing this way.

It doesn't. We can either make the filesystem read-only or, surprise, 
surprise, make a _snapshot_ of the filesystem!

And while the points you raised are important for the full 
end-user solution, they are absolutely not interesting to snapshot_system(). 
The only thing it needs to guarantee is a consistent snapshot that we can 
resume later.

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> Maybe, but I'd still rather be encouraging!

You are. Perhaps you just don't know it yet. ;-)

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 17:34         ` Linus Torvalds
  2007-04-26 20:08           ` Nigel Cunningham
@ 2007-04-27  7:51           ` Pekka Enberg
  1 sibling, 0 replies; 136+ messages in thread
From: Pekka Enberg @ 2007-04-27  7:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Xavier Bestel, Nigel Cunningham, LKML

On 4/26/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> In fact, I personally feel that I shouldn't even have merged
> userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
> feelings simply don't matter that much. I have to trust people. But yes,
> as far as *I* am personally concerned, I think it was a mistake to merge
> it.

While the ioctl() interface is horrid, I think it's actually in
principle pretty close to your snapshot_system()/resume_snapshot().
The ugliness probably comes from the fact that suspend to RAM and
snapshot/shutdown are interleaved there too.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:18             ` Pekka J Enberg
  2007-04-27  6:29               ` Pekka J Enberg
  2007-04-27  6:34               ` Nigel Cunningham
@ 2007-04-27  9:50               ` Oliver Neukum
  2007-04-27 10:12                 ` Pekka J Enberg
  2007-04-27 21:24               ` Rafael J. Wysocki
  3 siblings, 1 reply; 136+ messages in thread
From: Oliver Neukum @ 2007-04-27  9:50 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Nigel Cunningham, Linus Torvalds, LKML

Am Freitag, 27. April 2007 08:18 schrieb Pekka J Enberg:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the programs do so that doing it again after the snapshot is
> > restored doesn't cause problems?
> 
> No. The snapshot is just that. A snapshot in time. From kernel point of 
> view, it doesn't matter one bit when you did it or if the state has 
> changed before you resume. It's up to userspace to make sure the user 
> doesn't do real work while the snapshot is being written to disk and 
> machine is shut down.

And where is the benefit in that? How is such user space freezing logic
simpler than having the kernel do the write?
What can you do in user space if all filesystems are r/o that is worth the
hassle?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 21:38             ` Theodore Tso
@ 2007-04-27 10:10               ` Christoph Hellwig
  0 siblings, 0 replies; 136+ messages in thread
From: Christoph Hellwig @ 2007-04-27 10:10 UTC (permalink / raw)
  To: Theodore Tso, Nigel Cunningham, Linus Torvalds, Xavier Bestel,
	Pekka Enberg, LKML

On Thu, Apr 26, 2007 at 05:38:07PM -0400, Theodore Tso wrote:
> On Fri, Apr 27, 2007 at 06:08:01AM +1000, Nigel Cunningham wrote:
> > We tried that. It would need some work. IIRC remounting filesystems
> > read-only makes files become marked read-only. Perfectly sensible,
> > except that if you then remount the filesystem rw at resume time, all
> > those files are still marked ro and userspace crashes and burns. Not
> > unfixable, I'll agree, but there is more work to do there.
> 
> There are other solutions, though.  One is that we could export a
> system call interface which freezes a filesystem and prevents any
> further I/O.  We mostly have something like that right now (via the
> write_super_lockfs function in the superblock operations
> structure), but we haven't exported it to userspace.

It is exported on XFS ;-)

> We would also need a similar interface to freeze any block device I/O,
> in case you have a database running and doing direct I/O to a block
> device.  (Or again, we could simply not support that case; how many
> people will be running a database accessing a block device on
> their laptop?)

block device I/O uses generic_file*whateveriscurrenthere*_write, which
checks for the freeze flag, so the infrastructure for that is there
as well.


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  9:50               ` Oliver Neukum
@ 2007-04-27 10:12                 ` Pekka J Enberg
  2007-04-27 19:07                   ` Oliver Neukum
  2007-04-28 10:35                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-27 10:12 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Nigel Cunningham, Linus Torvalds, LKML

On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> > No. The snapshot is just that. A snapshot in time. From kernel point of 
> > view, it doesn't matter one bit when you did it or if the state has 
> > changed before you resume. It's up to userspace to make sure the user 
> > doesn't do real work while the snapshot is being written to disk and 
> > machine is shut down.

On Fri, 27 Apr 2007, Oliver Neukum wrote:
> And where is the benefit in that? How is such user space freezing logic
> simpler than having the kernel do the write?
>
> What can you do in user space if all filesystems are r/o that is worth the
> hassle?

I am talking about snapshot_system() here. It's not given that the 
filesystems need to be read-only (you can snapshot them too). The benefit 
here is that you can do whatever you want with the snapshot (encrypt, 
compress, send over the network)  and have a clean well-defined interface 
in the kernel. In addition, aborting the snapshot is simpler, simply 
munmap() the snapshot.
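
A minimal userspace sketch of the flow described above, under stated
assumptions: snapshot_system() is only a proposal in this thread, so
fake_snapshot_system() below is a hypothetical stand-in that maps
anonymous memory, and the run-length encoder stands in for whatever
compression or encryption userspace would choose. What it is meant to
show is that dropping the image needs nothing beyond munmap().

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the proposed snapshot_system(): maps `size`
 * bytes and fills them with a test pattern. Returns NULL on failure. */
static void *fake_snapshot_system(size_t size)
{
    void *image = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (image == MAP_FAILED)
        return NULL;
    memset(image, 0xAB, size);
    return image;
}

/* Byte-wise run-length encoder emitting (count, value) pairs, with runs
 * capped at 255. Returns bytes written, or 0 if dst is too small. */
static size_t rle_encode(const unsigned char *src, size_t n,
                         unsigned char *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && run < 255 && src[i + run] == src[i])
            run++;
        if (out + 2 > cap)
            return 0;
        dst[out++] = (unsigned char)run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}

/* Take the (fake) snapshot, process it in userspace, then drop it.
 * Returns the compressed size, or 0 on failure. */
size_t snapshot_compress_and_drop(size_t size, unsigned char *dst, size_t cap)
{
    void *image = fake_snapshot_system(size);
    if (image == NULL)
        return 0;
    size_t out = rle_encode(image, size, dst, cap);
    munmap(image, size);         /* aborting or finishing is just this */
    return out;
}
```

Swapping the encoder for encryption or a network send would not touch
the kernel side at all, which is the point being argued.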

The problem with writing in the kernel is obvious: we need to add new code 
to the kernel for compression, encryption, and userspace interaction 
(graphical progress bar) that are important for user experience.

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  0:10                 ` Olivier Galibert
@ 2007-04-27 10:21                   ` Daniel Pittman
  2007-04-27 23:19                   ` Nigel Cunningham
  1 sibling, 0 replies; 136+ messages in thread
From: Daniel Pittman @ 2007-04-27 10:21 UTC (permalink / raw)
  To: Olivier Galibert
  Cc: Nigel Cunningham, Linus Torvalds, Xavier Bestel, Pekka Enberg, LKML

Olivier Galibert <galibert@pobox.com> writes:
> On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote:
>
>> I'm perfectly willing to think through some alternate approach if you
>> suggest something or prod my thinking in a new direction, but I'm
>> afraid I just can't see right now how we can achieve what you're
>> after.
>
> Ok, what about this approach I've been mulling about for a while:
>
> Suspend-to-disk is pretty much an exercise in state saving.  There are
> multiple ways to do state saving, but they tend to end up in two
> categories: implicit and explicit.

[...]

> In explicit state saving each object saves what is needed from its
> state to an independently defined format (instead of "whatever the
> memory organization happens to be at that point").  When reloading the
> state you have to parse it, and it usually requires
> rebuilding/relocating all references/pointers/etc.  

If you are looking seriously at this you might want to start with the
code in the OpenVZ kernel (http://openvz.org) that allows a VE to
"checkpoint" to disk and "restore" on the same or a different machine.

This is, as far as I can tell, a portable implementation of this that
already handles real live userspace applications moving transparently
between two machines.

It has the advantage that it lives in an orderly world where most
devices and the file system are virtual but, hey, it works right now.

Regards,
        Daniel
-- 
Digital Infrastructure Solutions -- making IT simple, stable and secure
Phone: 0401 155 707        email: contact@digital-infrastructure.com.au
                 http://digital-infrastructure.com.au/

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26 16:56     ` Linus Torvalds
                         ` (5 preceding siblings ...)
  2007-04-26 22:42       ` Pavel Machek
@ 2007-04-27 12:49       ` Pavel Machek
  2007-04-27 21:26         ` Rafael J. Wysocki
  6 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-27 12:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka Enberg, LKML

Hi!

> > * Doing things in the right order? (Prepare the image, then do the
> > atomic copy, then save).
> 
> I'd actually like to discuss this a bit..
> 
> I'm obviously not a huge fan of the whole user/kernel level split and 
> interfaces, but I actually do think that there is *one* split that makes 
> sense:
> 
>  - generate the (whole) snapshot image entirely inside the kernel
> 
>  - do nothing else (ie no IO at all), and just export it as a single image 
>    to user space (literally just mapping the pages into user space). 
>    *one* interface. None of the "pretty UI update" crap. Just a single 
>    system call:
> 
> 	void *snapshot_system(u32 *size);
> 
>    which will map in the snapshot, return the mapped address and the size 
>    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
>    you're actually *better* off just admitting that if you cannot shrink 
>    the snapshot to less than 32 bits, it's not worth doing)

I think this is very similar to current uswsusp design; except that we
are using read on /dev/snapshot to read the snapshot (not memory
mapping) and that we freeze the system (because I do not think killall
_SIGSTOP is enough).
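
For comparison, the read()-based approach amounts to draining a file
descriptor. The sketch below is only the generic copy loop, assuming
the descriptor is already open and positioned; with uswsusp it would be
/dev/snapshot after the freeze and snapshot steps, which are not shown
here. copy_image() is a name made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Copy everything readable from in_fd to out_fd.
 * Returns the number of bytes copied, or -1 on error. */
long copy_image(int in_fd, int out_fd)
{
    unsigned char buf[4096];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;                 /* end of image */
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted, retry */
            return -1;
        }
        /* write() may be partial, so loop until the chunk is out */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

The difference from the mmap() proposal is that every page crosses the
kernel/user boundary through a copy instead of being mapped in place.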

Can you confirm that it is indeed similar design, or tell me why I'm
wrong? You had some pretty strong words for uswsusp before, so I'd
like to understand your position here. ("Ouch, I do not know, I am out
of time" is still better reply than silence.)
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  5:41         ` Pekka Enberg
@ 2007-04-27 14:55           ` Pavel Machek
  2007-04-27 21:39             ` Nigel Cunningham
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-27 14:55 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Linus Torvalds, Nigel Cunningham, LKML

On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> On 4/27/07, Pavel Machek <pavel@ucw.cz> wrote:
> >Now, it would be _very_ nice to be able to snapshot system and
> >continue running, but I just don't see how to do it without extensive
> >filesystem support.
> 
> So what kind of support do we need from the filesystem?

"forcedremount ro, not telling anyone, not killing processes" would do
the trick. FS snapshots might do.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 10:12                 ` Pekka J Enberg
@ 2007-04-27 19:07                   ` Oliver Neukum
  2007-04-28  9:22                     ` Pekka Enberg
  2007-04-28 10:35                   ` Rafael J. Wysocki
  1 sibling, 1 reply; 136+ messages in thread
From: Oliver Neukum @ 2007-04-27 19:07 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Nigel Cunningham, Linus Torvalds, LKML

On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> I am talking about snapshot_system() here. It's not given that the 
> filesystems need to be read-only (you can snapshot them too). The benefit 
> here is that you can do whatever you want with the snapshot (encrypt, 
> compress, send over the network)  and have a clean well-defined interface 
> in the kernel. In addition, aborting the snapshot is simpler, simply 
> munmap() the snapshot.

But is that worth the trade off?

> The problem with writing in the kernel is obvious: we need to add new code 
> to the kernel for compression, encryption, and userspace interaction 
> (graphical progress bar) that are important for user experience.

The kernel can already do compression and encryption.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  4:52         ` Pekka J Enberg
  2007-04-27  6:08           ` Nigel Cunningham
@ 2007-04-27 20:44           ` Rafael J. Wysocki
  1 sibling, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 20:44 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Nigel Cunningham, Linus Torvalds, LKML

On Friday, 27 April 2007 06:52, Pekka J Enberg wrote:
> On Thu, 2007-04-26 at 09:56 -0700, Linus Torvalds wrote:
> > >    which will map in the snapshot, return the mapped address and the size 
> > >    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
> > >    you're actually *better* off just admitting that if you cannot shrink 
> > >    the snapshot to less than 32 bits, it's not worth doing)
>  
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > That inherently limits the image to half of available ram (you need
> > somewhere to store the snapshot), so you won't get the full image you
> > express interest in below.
> 
> It doesn't. We can make the userspace mapped pages copy-on-write. As long 
> as the userspace makes sure there's not much activity during 
> snapshot/shutdown, we will be fine. What we probably do need to copy is 
> kernel pages.

The user space is (and IMHO should be) frozen way before that and what you're
suggesting here is what I wanted to implement some time ago.  The problem with
this was that the user space pages may be updated, for example, by device
drivers as a result of some deferred I/O after we've snapshotted the system.

I didn't know how to find out which pages owned by the user space could be
updated this way, so I gave up at that time.
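
A toy model of the problem, not kernel code: every name below is made
up for illustration. CPU writes go through a copy-on-write hook that
preserves the old page for the snapshot, while a device write models
DMA, which never takes a page fault and therefore changes what the
snapshot sees.

```c
#include <assert.h>
#include <string.h>

#define NPAGES  4
#define PAGE_SZ 8

struct cow_model {
    unsigned char live[NPAGES][PAGE_SZ];   /* pages userspace keeps using */
    unsigned char saved[NPAGES][PAGE_SZ];  /* copies made on first CPU write */
    int copied[NPAGES];                    /* 1 once a page has been copied */
};

/* Take the snapshot: no page has been copied yet. */
void cow_snapshot(struct cow_model *m)
{
    memset(m->copied, 0, sizeof m->copied);
}

/* CPU write: the write-protect fault preserves the old content first. */
void cow_cpu_write(struct cow_model *m, int page, int off, unsigned char v)
{
    assert(page >= 0 && page < NPAGES && off >= 0 && off < PAGE_SZ);
    if (!m->copied[page]) {
        memcpy(m->saved[page], m->live[page], PAGE_SZ);
        m->copied[page] = 1;
    }
    m->live[page][off] = v;
}

/* Device write (DMA): bypasses page protection, so nothing is copied. */
void cow_dma_write(struct cow_model *m, int page, int off, unsigned char v)
{
    assert(page >= 0 && page < NPAGES && off >= 0 && off < PAGE_SZ);
    m->live[page][off] = v;
}

/* What a reader of the snapshot sees for a given byte. */
unsigned char cow_snapshot_read(const struct cow_model *m, int page, int off)
{
    return m->copied[page] ? m->saved[page][off] : m->live[page][off];
}
```

After a CPU write the snapshot still returns the old byte; after a DMA
write to an untouched page it returns the new one, which is the
inconsistency described above.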

Greetings,
Rafael


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27  6:18             ` Pekka J Enberg
                                 ` (2 preceding siblings ...)
  2007-04-27  9:50               ` Oliver Neukum
@ 2007-04-27 21:24               ` Rafael J. Wysocki
  2007-04-27 21:44                 ` Linus Torvalds
  3 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 21:24 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Nigel Cunningham, Linus Torvalds, LKML

On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> On Fri, 27 Apr 2007, Nigel Cunningham wrote:
> > COW is a possibility, but I understood (perhaps wrongly) that Linus was
> > thinking of a single syscall or such like to prepare the snapshot. If
> > you're going to start doing things like this, won't that mean you'd then
> > have to update/redo the snapshot or somehow nullify the effect of
> > anything the program does so that doing it again after the snapshot is
> > restored doesn't cause problems?
> 
> No. The snapshot is just that. A snapshot in time. From kernel point of 
> view, it doesn't matter one bit when you did it or if the state has 
> changed before you resume. It's up to userspace to make sure the user 
> doesn't do real work while the snapshot is being written to disk and 
> machine is shut down.

Why do you think that keeping the user space frozen after 'snapshot' is a bad
idea?  I think that solves many of the problems you're discussing.

Greetings,
Rafael


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 12:49       ` Pavel Machek
@ 2007-04-27 21:26         ` Rafael J. Wysocki
  2007-04-27 22:12           ` David Lang
  0 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 21:26 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Friday, 27 April 2007 14:49, Pavel Machek wrote:
> Hi!
> 
> > > * Doing things in the right order? (Prepare the image, then do the
> > > atomic copy, then save).
> > 
> > I'd actually like to discuss this a bit..
> > 
> > I'm obviously not a huge fan of the whole user/kernel level split and 
> > interfaces, but I actually do think that there is *one* split that makes 
> > sense:
> > 
> >  - generate the (whole) snapshot image entirely inside the kernel
> > 
> >  - do nothing else (ie no IO at all), and just export it as a single image 
> >    to user space (literally just mapping the pages into user space). 
> >    *one* interface. None of the "pretty UI update" crap. Just a single 
> >    system call:
> > 
> > 	void *snapshot_system(u32 *size);
> > 
> >    which will map in the snapshot, return the mapped address and the size 
> >    (and if you want to support snapshots > 4GB, be my guest, but I suspect 
> >    you're actually *better* off just admitting that if you cannot shrink 
> >    the snapshot to less than 32 bits, it's not worth doing)
> 
> I think this is very similar to current uswsusp design; except that we
> are using read on /dev/snapshot to read the snapshot (not memory
> mapping) and that we freeze the system

Yes, it seems so.

> (because I do not think killall _SIGSTOP is enough).

Agreed.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 14:55           ` Pavel Machek
@ 2007-04-27 21:39             ` Nigel Cunningham
  0 siblings, 0 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27 21:39 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Pekka Enberg, Linus Torvalds, LKML

Hi.

On Fri, 2007-04-27 at 16:55 +0200, Pavel Machek wrote:
> On Fri 2007-04-27 08:41:56, Pekka Enberg wrote:
> > On 4/27/07, Pavel Machek <pavel@ucw.cz> wrote:
> > >Now, it would be _very_ nice to be able to snapshot system and
> > >continue running, but I just don't see how to do it without extensive
> > >filesystem support.
> > 
> > So what kind of support do we need from the filesystem?
> 
> "forcedremount ro, not telling anyone, not killing processes" would do
> the trick. FS snapshots might do.

It sounds to me more like Pekka is thinking of checkpointing support. If
that's the case, then remounting filesystems isn't going to be an
option. You want to freeze them for just long enough so that you can
determine what needs saving in the checkpoint. You certainly don't want
to make rw file handles ro and so on.

Nigel

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 21:24               ` Rafael J. Wysocki
@ 2007-04-27 21:44                 ` Linus Torvalds
  2007-04-27 22:04                   ` Rafael J. Wysocki
                                     ` (2 more replies)
  0 siblings, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-27 21:44 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pekka J Enberg, Nigel Cunningham, LKML



On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> 
> Why do you think that keeping the user space frozen after 'snapshot' is a bad
> idea?  I think that solves many of the problems you're discussing.

It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do

	gdb -p <snapshotter>

when something goes wrong?) but we also *depend* on user space for various 
things (the same way we depend on kernel threads, and why it has been such 
a total disaster to try to freeze the kernel threads too!). For example, 
if you want to do graphical stuff, just using X would be quite nice, 
wouldn't it?

But I do agree that doing everything in the kernel is likely to just be a 
hell of a lot simpler for everybody.

		Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 21:44                 ` Linus Torvalds
@ 2007-04-27 22:04                   ` Rafael J. Wysocki
  2007-04-27 22:08                     ` Linus Torvalds
  2007-04-27 22:07                   ` Nigel Cunningham
  2007-04-28  0:18                   ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 22:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pekka J Enberg, Nigel Cunningham, LKML

On Friday, 27 April 2007 23:44, Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea?  I think that solves many of the problems you're discussing.
> 
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
> 
> 	gdb -p <snapshotter>
> 
> when something goes wrong?) but we also *depend* on user space for various 
> things (the same way we depend on kernel threads, and why it has been such 
> a total disaster to try to freeze the kernel threads too!).

We're freezing many of them just fine. ;-)

> For example, if you want to do graphical stuff, just using X would be quite
> nice, wouldn't it?

Yes, it would, but as long as we can't protect mounted filesystems from being
touched, it's just dangerous to let the user space run at that point.

> But I do agree that doing everything in the kernel is likely to just be a 
> hell of a lot simpler for everybody.

:-)

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 21:44                 ` Linus Torvalds
  2007-04-27 22:04                   ` Rafael J. Wysocki
@ 2007-04-27 22:07                   ` Nigel Cunningham
  2007-04-28  1:03                     ` Kyle Moffett
  2007-04-28  0:18                   ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27 22:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Pekka J Enberg, LKML

Hi.

On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
> 
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > Why do you think that keeping the user space frozen after 'snapshot' is a bad
> > idea?  I think that solves many of the problems you're discussing.
> 
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
> 
> 	gdb -p <snapshotter>

Make the machine being suspended a VM and you can already do that.

> when something goes wrong?) but we also *depend* on user space for various 
> things (the same way we depend on kernel threads, and why it has been such 
> a total disaster to try to freeze the kernel threads too!). For example, 
> if you want to do graphical stuff, just using X would be quite nice, 
> wouldn't it?

It would be nice, yes.

But in doing so you make the contents of the disk inconsistent with the
state you've just snapshotted, leading to filesystem corruption. Even if
you modify filesystems to do checkpointing (which is what we're really
talking about), you still also have the problem that your snapshot has
to be stored somewhere before you write it to disk, so you also have to
either

1) write some known static memory to disk before the snapshot and reuse
it for the snapshot,
2) ensure up to half the RAM is free for your snapshot,
3) compress the snapshot as you take it, guessing beforehand how much
memory the compressed snapshot might take and freeing that much, or
4) reserve memory at boot time for the atomic copy so that 2) or 3) is
still done, but without having to free the memory. (Yuk!).
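
The arithmetic behind options 2) and 3) can be written down as a small
sketch. Assume T pages of RAM, an image of U pages, and a copy that
shrinks to a fraction c of the image (c = 1 means no compression). The
copy has to fit in the T - U pages that are left, so c*U <= T - U, i.e.
U <= T / (1 + c). The function name and the percent-based ratio are
choices made for this example, and it ignores memory that cannot be
freed at all.

```c
#include <assert.h>
#include <limits.h>

/* Largest image (in pages) whose atomic copy still fits in the memory
 * left over. ratio_percent is the copy size as a percentage of the
 * image size: 100 means uncompressed, 50 means it halves. */
unsigned long long max_image_pages(unsigned long long total_pages,
                                   unsigned int ratio_percent)
{
    assert(total_pages <= ULLONG_MAX / 100);   /* avoid overflow below */
    return total_pages * 100 / (100ULL + ratio_percent);
}
```

With no compression this gives half of RAM, matching option 2); if the
copy compresses to half its size the limit rises to two thirds, but
only if the guess about the ratio made beforehand turns out to hold.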

> But I do agree that doing everything in the kernel is likely to just be a 
> hell of a lot simpler for everybody.

Indeed.

Nigel

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 22:04                   ` Rafael J. Wysocki
@ 2007-04-27 22:08                     ` Linus Torvalds
  2007-04-27 22:41                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2007-04-27 22:08 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pekka J Enberg, Nigel Cunningham, LKML



On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> We're freezing many of them just fine. ;-)

And can you name a _single_ advantage of doing so?

It so happens, that most people wouldn't notice or care that kmirrord got 
frozen (kernel thread picked at random - it might be one of the threads 
that has gotten special-cased to not do that), but I have yet to hear a 
single coherent explanation for why it's actually a good idea in the first 
place.

And it has added totally idiotic code to every single kernel thread main 
loop. For _no_ reason, except that the concept was broken, and needed more 
breakage to just make it work.

		Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 21:26         ` Rafael J. Wysocki
@ 2007-04-27 22:12           ` David Lang
  0 siblings, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-27 22:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Pekka Enberg, LKML

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:

> On Friday, 27 April 2007 14:49, Pavel Machek wrote:
>>
>> I think this is very similar to current uswsusp design; except that we
>> are using read on /dev/snapshot to read the snapshot (not memory
>> mapping) and that we freeze the system
>
> Yes, it seems so.
>
>> (because I do not think killall _SIGSTOP is enough).
>

remember, this is being done inside the kernel. the kernel can do things like 
saving off the scheduler queue to prevent any userspace from running during the 
snapshot. it could then move selected pids over to a new queue to selectively 
'unfreeze' whatever you need (like the X processes for example) and then proceed 
normally, allowing processes to be spawned, forked, etc. without activating the 
rest of userspace (because the rest just won't be available to be scheduled). 
userspace can tell the kernel the list of pids to unfreeze so the kernel doesn't 
need to try and guess.
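
A toy model of that idea, with every name invented for the
illustration; the real scheduler has no such interface. All tasks are
parked, userspace names the pids it wants back, and anything forked by
a runnable task is runnable too, while parked tasks never get the CPU
and so cannot fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 16

struct task { int pid; bool in_use; bool runnable; };
struct sched_model { struct task tasks[MAX_TASKS]; int next_pid; };

static struct task *find_task(struct sched_model *s, int pid)
{
    for (size_t i = 0; i < MAX_TASKS; i++)
        if (s->tasks[i].in_use && s->tasks[i].pid == pid)
            return &s->tasks[i];
    return NULL;
}

/* Create a runnable task; returns its pid, or -1 if the table is full. */
int add_task(struct sched_model *s)
{
    for (size_t i = 0; i < MAX_TASKS; i++) {
        if (!s->tasks[i].in_use) {
            s->tasks[i] = (struct task){ ++s->next_pid, true, true };
            return s->tasks[i].pid;
        }
    }
    return -1;
}

/* "Save off the queue": nothing is schedulable any more. */
void freeze_all(struct sched_model *s)
{
    for (size_t i = 0; i < MAX_TASKS; i++)
        s->tasks[i].runnable = false;
}

/* Move the pids named by userspace back; returns how many were thawed. */
int thaw_pids(struct sched_model *s, const int *pids, size_t n)
{
    int thawed = 0;

    assert(pids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct task *t = find_task(s, pids[i]);
        if (t != NULL && !t->runnable) {
            t->runnable = true;
            thawed++;
        }
    }
    return thawed;
}

/* Only a runnable parent can fork; the child starts out runnable. */
int fork_task(struct sched_model *s, int parent_pid)
{
    struct task *p = find_task(s, parent_pid);
    if (p == NULL || !p->runnable)
        return -1;
    return add_task(s);
}

bool is_runnable(struct sched_model *s, int pid)
{
    struct task *t = find_task(s, pid);
    return t != NULL && t->runnable;
}
```

Whether the thawed set can really avoid depending on something still
parked is the open question; the model only shows the bookkeeping.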

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 22:41                       ` Rafael J. Wysocki
@ 2007-04-27 22:26                         ` David Lang
  2007-04-27 23:21                           ` Rafael J. Wysocki
  2007-04-27 23:17                         ` Linus Torvalds
  1 sibling, 1 reply; 136+ messages in thread
From: David Lang @ 2007-04-27 22:26 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linus Torvalds, Pekka J Enberg, Nigel Cunningham, LKML

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

>>> We're freezing many of them just fine. ;-)
>>
>> And can you name a _single_ advantage of doing so?
>
> Yes.  We have a lot less interdependencies to worry about during the whole
> operation.
>
>> It so happens, that most people wouldn't notice or care that kmirrord got
>> frozen (kernel thread picked at random - it might be one of the threads
>> that has gotten special-cased to not do that), but I have yet to hear a
>> single coherent explanation for why it's actually a good idea in the first
>> place.
>
> Well, I don't know if that's a 'coherent' explanation from your point of view
> (probably not), but I'll try nevertheless:
> 1) if the kernel threads are frozen, we know that they don't hold any locks
> that could interfere with the freezing of device drivers,

does the process of freezing really wait until all locks have been released?

> 2) if they are frozen, we know, for example, that they won't call user mode
> helpers or do similar things,

this won't matter unless the user mode helpers are going to do I/O or other 
permanent changes

> 3) if they are frozen, we know that they won't submit I/O to disks and
> potentially damage filesystems (suspend2 has much more problems with that
> than swsusp, but still.  And yes, there have been bug reports related to it,
> so it's not just my fantasy).

if you have the filesystems checkpointed then I/O after the freeze won't matter 
as you just revert to the checkpoint (and since this is going to be thrown away 
it can stay in ram)

if we are willing to make a break with the past to implement the new snapshot 
capability, we should be able to use the LVM snapshot code to handle the 
filesystem
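
A simplified model of "revert to the checkpoint", assuming a block
device reduced to an array of ints. It is not how LVM snapshots are
implemented; the names are invented here. Writes made after the
checkpoint land in a RAM overlay, and resuming throws the overlay away.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8

struct ckpt_dev {
    int  disk[NBLOCKS];     /* on-disk blocks, untouched after the checkpoint */
    int  overlay[NBLOCKS];  /* post-checkpoint writes, held in RAM only */
    bool dirty[NBLOCKS];    /* which blocks the overlay currently shadows */
    bool checkpointed;
};

/* Start redirecting writes to the overlay. */
void ckpt_begin(struct ckpt_dev *d)
{
    memset(d->dirty, 0, sizeof d->dirty);
    d->checkpointed = true;
}

void ckpt_write(struct ckpt_dev *d, int blk, int value)
{
    assert(blk >= 0 && blk < NBLOCKS);
    if (d->checkpointed) {
        d->overlay[blk] = value;
        d->dirty[blk] = true;
    } else {
        d->disk[blk] = value;
    }
}

/* Reads see the overlay first, so the running system stays coherent. */
int ckpt_read(const struct ckpt_dev *d, int blk)
{
    assert(blk >= 0 && blk < NBLOCKS);
    return (d->checkpointed && d->dirty[blk]) ? d->overlay[blk] : d->disk[blk];
}

/* Resume path: discard everything written since the checkpoint. */
void ckpt_revert(struct ckpt_dev *d)
{
    memset(d->dirty, 0, sizeof d->dirty);
    d->checkpointed = false;
}
```

The disk contents then match what the snapshot image expects, no matter
what userspace wrote while the image was being saved.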

David Lang

> Probably some other people can say more about it.
>
>> And it has added totally idiotic code to every single kernel thread main
>> loop. For _no_ reason, except that the concept was broken, and needed more
>> breakage to just make it work.
>
> It is actually useful for some things other than the hibernation/suspend, the
> code is not idiotic (it's one line of code in the majority of cases) and you
> should take that "I hate everything even remotely related to hibernation" hat
> off, really.
>
> Greetings,
> Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 22:08                     ` Linus Torvalds
@ 2007-04-27 22:41                       ` Rafael J. Wysocki
  2007-04-27 22:26                         ` David Lang
  2007-04-27 23:17                         ` Linus Torvalds
  0 siblings, 2 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 22:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Pekka J Enberg, Nigel Cunningham, LKML

On Saturday, 28 April 2007 00:08, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > We're freezing many of them just fine. ;-)
> 
> And can you name a _single_ advantage of doing so?

Yes.  We have far fewer interdependencies to worry about during the whole
operation.

> It so happens, that most people wouldn't notice or care that kmirrord got 
> frozen (kernel thread picked at random - it might be one of the threads 
> that has gotten special-cased to not do that), but I have yet to hear a 
> single coherent explanation for why it's actually a good idea in the first 
> place.

Well, I don't know if that's a 'coherent' explanation from your point of view
(probably not), but I'll try nevertheless:
1) if the kernel threads are frozen, we know that they don't hold any locks
that could interfere with the freezing of device drivers,
2) if they are frozen, we know, for example, that they won't call user mode
helpers or do similar things,
3) if they are frozen, we know that they won't submit I/O to disks and
potentially damage filesystems (suspend2 has many more problems with that
than swsusp, but still.  And yes, there have been bug reports related to it,
so it's not just my fantasy).

Probably some other people can say more about it.

> And it has added totally idiotic code to every single kernel thread main 
> loop. For _no_ reason, except that the concept was broken, and needed more 
> breakage to just make it work.

It is actually useful for some things other than the hibernation/suspend, the
code is not idiotic (it's one line of code in the majority of cases) and you
should take that "I hate everything even remotely related to hibernation" hat
off, really.
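
A single-threaded userspace model of the "one line" in a thread's main
loop, for illustration only. The kernel helper being referred to is
try_to_freeze(); unlike model_try_to_freeze() below, the real one
blocks until the task is thawed. The struct and its fields are invented
for this sketch. It also shows why point 1) holds: the freeze point
sits where no lock is held.

```c
#include <assert.h>
#include <stdbool.h>

struct worker {
    bool freeze_requested;  /* set by the suspend path */
    bool frozen;            /* true while parked at the freeze point */
    bool holds_lock;        /* true only inside the critical section */
    int  items_done;
};

/* Model of a freeze point: parks the worker if a freeze was requested.
 * Returns true if the worker is parked. */
static bool model_try_to_freeze(struct worker *w)
{
    assert(!w->holds_lock);   /* freeze points sit outside critical sections */
    w->frozen = w->freeze_requested;
    return w->frozen;
}

/* One pass of the thread's main loop; returns true if work was done. */
bool worker_iteration(struct worker *w)
{
    if (model_try_to_freeze(w))   /* the one added line */
        return false;

    w->holds_lock = true;         /* critical section: touch shared state */
    w->items_done++;
    w->holds_lock = false;
    return true;
}
```

A worker parked this way makes no progress and holds nothing, which is
the property the device-freezing code relies on.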

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:21                           ` Rafael J. Wysocki
@ 2007-04-27 23:01                             ` David Lang
  2007-04-28  0:02                               ` Rafael J. Wysocki
  0 siblings, 1 reply; 136+ messages in thread
From: David Lang @ 2007-04-27 23:01 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linus Torvalds, Pekka J Enberg, Nigel Cunningham, LKML

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

> On Saturday, 28 April 2007 00:26, David Lang wrote:
>> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>>
>>>>> We're freezing many of them just fine. ;-)
>>>>
>>>> And can you name a _single_ advantage of doing so?
>>>
>>> Yes.  We have a lot less interdependencies to worry about during the whole
>>> operation.
>>>
>>>> It so happens, that most people wouldn't notice or care that kmirrord got
>>>> frozen (kernel thread picked at random - it might be one of the threads
>>>> that has gotten special-cased to not do that), but I have yet to hear a
>>>> single coherent explanation for why it's actually a good idea in the first
>>>> place.
>>>
>>> Well, I don't know if that's a 'coherent' explanation from your point of view
>>> (probably not), but I'll try nevertheless:
>>> 1) if the kernel threads are frozen, we know that they don't hold any locks
>>> that could interfere with the freezing of device drivers,
>>
>> does the process of freezing really wait until all locks have been released?
>
> Yes, it does.
>
>>> 2) if they are frozen, we know, for example, that they won't call user mode
>>> helpers or do similar things,
>>
>> this won't matter unless the user mode helpers are going to do I/O or other
>> permanent changes
>
> Please note that even accessing a file may be a permanent change.

if accessing a file on a read-only filesystem changes that filesystem it's a bug

see the recent thread about ext3 journal replays when mounting read-only as an 
example.

>>> 3) if they are frozen, we know that they won't submit I/O to disks and
>>> potentially damage filesystems (suspend2 has much more problems with that
>>> than swsusp, but still.  And yes, there have been bug reports related to it,
>>> so it's not just my fantasy).
>>
>> if you have the filesystems checkpointed then I/O after the freeze won't matter
>> as you just revert to the checkpoint (and since this is going to be thrown away
>> it can stay in ram)
>
> In that case, I would agree.  Currently, however, we're not even close to this
> point.
>
> The checkpointing of filesystems would be a very welcome feature, but no one
> is working on it right now, AFAICT.
>
>> if we are willing to make a break with the past to implement the new snapshot
>> capability, we should be able to use the LVM snapshot code to handle the
>> filesystem
>
> Yes, we can do that, in principle, and screw all of the current users in the
> process.  And finally we'd end up with something similar to what is done now,
> IMHO.

however, the result may be a lot less 'special case power management' code and a
lot more re-use of code that's in place for other uses.

if work on the current versions was stopped (other than trying to avoid
regressions) and a new version (with new userspace tools) was built in a way
that satisfies everyone the old version could be phased out in a year or two
(per the normal feature removal process)

> And no, the things are not just totally broken, as it may follow from these
> discussions.  The problem is that the people who are discussing them so
> viciously have never tried to write anything like the hibernation code.
>
> This is as though I were discussing the design of the CPU schedulers,
> although I only know how they work on a general level.
>
> Actually, the really problematic thing with the hibernation _right_ _now_ is
> what Linus is so concerned about (and rightfully so) - that we use the
> same device drivers' callbacks for the hibernation and suspend (aka s2ram).
> The other things work quite well and are really robust.

if simply splitting the functions cleans everything up enough to satisfy 
everyone then we're almost done right? ;-)

however I think that there are other fundamental disagreements here, and neither
the 'do absolutely everything in the kernel' nor the 'do almost nothing in the
kernel' approaches are going to fly in the long run. I think the
userspace<->kernel interface is going to be different than either approach is
doing now, and as such it's an opportunity to make more drastic changes if they
are appropriate.

for example, why should we have LVM snapshot code and hibernate
snapshot/filesystem checkpoint code instead of just using the LVM code (which
gets exercised and tested far more than the other code ever would be)? saying
that if you want to suspend to disk you need to use LVM is a change, but it's
a change that people could probably live with.

David Lang


* Re: Back to the future.
  2007-04-27 22:41                       ` Rafael J. Wysocki
  2007-04-27 22:26                         ` David Lang
@ 2007-04-27 23:17                         ` Linus Torvalds
  2007-04-27 23:45                           ` Rafael J. Wysocki
  2007-05-03 15:25                           ` Pavel Machek
  1 sibling, 2 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-27 23:17 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pekka J Enberg, Nigel Cunningham, LKML



On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> > And can you name a _single_ advantage of doing so?
> 
> Yes.  We have a lot less interdependencies to worry about during the whole
> operation.

That's not an advantage. That's why it has *sucked*.

Trying to freeze kernel threads has _caused_ problems. It has _added_ 
these interdependencies. It hasn't removed a single dependency at any 
time, it has just added new problems!

> 1) if the kernel threads are frozen, we know that they don't hold any locks
> that could interfere with the freezing of device drivers,
> 2) if they are frozen, we know, for example, that they won't call user mode
> helpers or do similar things,
> 3) if they are frozen, we know that they won't submit I/O to disks and
> potentially damage filesystems (suspend2 has much more problems with that
> than swsusp, but still.  And yes, there have been bug reports related to it,
> so it's not just my fantasy).

NONE of these are valid explanations at all. You're listing totally 
theoretical problems, and ignoring all the _real_ problems that trying to 
freeze kernel threads has _caused_.

If you want to control user-mode helpers, you do that - you do not freeze 
kernel threads!

And no, kernel threads do not submit IO to disks on their own. You just 
made that up. Yes, they can be involved in that whole disk submission 
thing, but in a good way - they can be required in order to make disk 
writing work!

The problem that suspend has had is that it's done everything totally the 
wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 
For example, kernel threads can be involved in md etc, but that's a *good* 
thing. The way to shut them up is not to freeze the threads, but to freeze 
the *disk*.

		Linus


* Re: Back to the future.
  2007-04-27  0:10                 ` Olivier Galibert
  2007-04-27 10:21                   ` Daniel Pittman
@ 2007-04-27 23:19                   ` Nigel Cunningham
  1 sibling, 0 replies; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27 23:19 UTC (permalink / raw)
  To: Olivier Galibert; +Cc: Linus Torvalds, Xavier Bestel, Pekka Enberg, LKML


Hi.

Just to let you know - I'm not ignoring your message. It's just taking
some time to think through the issues and try to formulate a good reply.
Oh, and of course there are a gazillion other messages flying about at
the moment that need attention too.

Regards,

Nigel



* Re: Back to the future.
  2007-04-27 22:26                         ` David Lang
@ 2007-04-27 23:21                           ` Rafael J. Wysocki
  2007-04-27 23:01                             ` David Lang
  0 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 23:21 UTC (permalink / raw)
  To: David Lang; +Cc: Linus Torvalds, Pekka J Enberg, Nigel Cunningham, LKML

On Saturday, 28 April 2007 00:26, David Lang wrote:
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> >>> We're freezing many of them just fine. ;-)
> >>
> >> And can you name a _single_ advantage of doing so?
> >
> > Yes.  We have a lot less interdependencies to worry about during the whole
> > operation.
> >
> >> It so happens, that most people wouldn't notice or care that kmirrord got
> >> frozen (kernel thread picked at random - it might be one of the threads
> >> that has gotten special-cased to not do that), but I have yet to hear a
> >> single coherent explanation for why it's actually a good idea in the first
> >> place.
> >
> > Well, I don't know if that's a 'coherent' explanation from your point of view
> > (probably not), but I'll try nevertheless:
> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
> 
> does the process of freezing really wait until all locks have been released?

Yes, it does.

> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
> 
> this won't matter unless the user mode helpers are going to do I/O or other 
> permanent changes

Please note that even accessing a file may be a permanent change.

> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still.  And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
> 
> if you have the filesystems checkpointed then I/O after the freeze won't matter 
> as you just revert to the checkpoint (and since this is going to be thrown away 
> it can stay in ram)

In that case, I would agree.  Currently, however, we're not even close to this
point.

The checkpointing of filesystems would be a very welcome feature, but there's
no one working on it right now, AFAICT.

> if we are willing to make a break with the past to implement the new snapshot 
> capability, we should be able to use the LVM snapshot code to handle the 
> filesystem

Yes, we can do that, in principle, and screw all of the current users in the
process.  And finally we'd end up with something similar to what is done now,
IMHO.

And no, the things are not just totally broken, as it may follow from these
discussions.  The problem is that the people who are discussing them so
viciously have never tried to write anything like the hibernation code.

This is as though I were discussing the design of the CPU schedulers,
although I only know how they work on a general level.

Actually, the really problematic thing with the hibernation _right_ _now_ is
what Linus is so concerned about (and rightfully so) - that we use the
same device drivers' callbacks for the hibernation and suspend (aka s2ram).
The other things work quite well and are really robust.

Greetings,
Rafael


* Re: Back to the future.
  2007-04-27 23:17                         ` Linus Torvalds
@ 2007-04-27 23:45                           ` Rafael J. Wysocki
  2007-04-27 23:57                             ` Nigel Cunningham
  2007-04-27 23:59                             ` Linus Torvalds
  2007-05-03 15:25                           ` Pavel Machek
  1 sibling, 2 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-27 23:45 UTC (permalink / raw)
  To: Linus Torvalds, Nigel Cunningham; +Cc: Pekka J Enberg, LKML

On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > > And can you name a _single_ advantage of doing so?
> > 
> > Yes.  We have a lot less interdependencies to worry about during the whole
> > operation.
> 
> That's not an advantage. That's why it has *sucked*.

Actually, the less things happen while we're creating and saving the image,
the less sources of potential problems there are and by freezing the kernel
threads (not all of them), we cause less things to happen at that time.

To make you happy, we could stop doing that, but what actual _advantage_
would that bring?

> Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> these interdependencies. It hasn't removed a single dependency at any 
> time, it has just added new problems!

What problems are you talking about?

> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still.  And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
> 
> NONE of these are valid explanations at all. You're listing totally 
> theoretical problems, and ignoring all the _real_ problems that trying to 
> freeze kernel threads has _caused_.

Example, please?

> If you want to control user-mode helpers, you do that - you do not freeze 
> kernel threads!
> 
> And no, kernel threads do not submit IO to disks on their own. You just 
> made that up.

No, I didn't.  Nigel can confirm, I think.

> Yes, they can be involved in that whole disk submission thing, but in a good
> way - they can be required in order to make disk writing work!

Some of them can be, some others need not be.  We don't need any fs-related
kernel threads for saving the image, for example.

> The problem that suspend has had is that it's done everything totally the 
> wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 

They can be asked before we do the snapshot and complete the operation
afterwards, no?

> For example, kernel threads can be involved in md etc, but that's a *good* 
> thing.

We don't freeze these threads.

> The way to shut them up is not to freeze the threads, but to freeze the *disk*.

In principle, you're right.  In practice, go and try it.

Anyway, why is it so important that _all_ of the kernel threads be running
while the snapshot is created and saved?


* Re: Back to the future.
  2007-04-27 23:57                             ` Nigel Cunningham
@ 2007-04-27 23:50                               ` David Lang
  2007-04-28  0:40                                 ` Linus Torvalds
                                                   ` (2 more replies)
  0 siblings, 3 replies; 136+ messages in thread
From: David Lang @ 2007-04-27 23:50 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Rafael J. Wysocki, Linus Torvalds, Pekka J Enberg, LKML

On Sat, 28 Apr 2007, Nigel Cunningham wrote:

> Hi.
>
> On Sat, 2007-04-28 at 01:45 +0200, Rafael J. Wysocki wrote:
>> On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
>>>
>>> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>>>>
>>>>> And can you name a _single_ advantage of doing so?
>>>>
>>>> Yes.  We have a lot less interdependencies to worry about during the whole
>>>> operation.
>>>
>>> That's not an advantage. That's why it has *sucked*.
>>
>> Actually, the less things happen while we're creating and saving the image,
>> the less sources of potential problems there are and by freezing the kernel
>> threads (not all of them), we cause less things to happen at that time.
>>
>> To make you happy, we could stop doing that, but what actual _advantage_
>> would that bring?
>
> A couple of other advantages to freezing other processes:
>
> 1) It makes predicting how much memory is available for making and
> saving snapshot a tractable problem. It therefore makes hibernation
> _much_ more reliable.
> 2) Racing against other processes would also make hibernation slower,
> increasing the chances of your battery running out before the save is
> complete.
> 3) It makes finding potential memory leaks in the code possible. It was
> ages ago now, but at one stage I could display a table saying exactly
> how many pages had been allocated and freed by different sections of the
> process and compare the number of free pages at the start and end of the
> cycle to ensure there were no memory leaks at all.

nobody is suggesting that you leave processes running while you do the snapshot,
what is being proposed is

1. pause userspace (prevent scheduling)
2. make snapshot image of memory
3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
4. unpause
5. save image (with full userspace available, including network)
6. shutdown system (throw away all userspace memory, no need to do graceful
    shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
    needed)
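
as a rough illustration only, the six steps above can be sketched as a toy userspace model. This is not kernel code: the class `ToySystem` and every method name in it are made up for this sketch, memory and files are plain dictionaries, and the read-only remount stands in for the snapshot/checkpoint variant.

```python
import copy


class ToySystem:
    """Toy model of the proposed sequence (hypothetical names, not kernel code)."""

    def __init__(self, memory, files):
        self.memory = dict(memory)     # stands in for RAM contents
        self.files = dict(files)       # stands in for on-disk filesystem state
        self.userspace_running = True
        self.fs_read_only = False
        self.image = None              # snapshot held in RAM, not yet saved
        self.saved_image = None        # what ends up on the hibernation storage

    def userspace_write_memory(self, key, value):
        # Userspace can only change memory while it is allowed to be scheduled.
        if not self.userspace_running:
            raise RuntimeError("userspace is paused")
        self.memory[key] = value

    def userspace_write_file(self, name, data):
        if not self.userspace_running:
            raise RuntimeError("userspace is paused")
        if self.fs_read_only:
            raise PermissionError("filesystem is read-only")
        self.files[name] = data

    def snapshot(self):
        self.userspace_running = False            # 1. pause userspace
        self.image = copy.deepcopy(self.memory)   # 2. make snapshot image of memory
        self.fs_read_only = True                  # 3. make filesystems read-only
        self.userspace_running = True             # 4. unpause

    def save_and_power_off(self):
        self.saved_image = self.image             # 5. save image (userspace still runs)
        self.userspace_running = False            # 6. shut down, throw away live memory
        self.memory = None
        return self.saved_image
```

in this model anything userspace does to memory between `snapshot()` and `save_and_power_off()` is simply lost, and writes to the filesystem are refused, which is the property the proposal relies on.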

>>> NONE of these are valid explanations at all. You're listing totally
>>> theoretical problems, and ignoring all the _real_ problems that trying to
>>> freeze kernel threads has _caused_.
>>
>> Example, please?
>
> I agree with Rafael. Freezing processes greatly helps in ensuring we
> have a consistent image. He's right, too, in asserting that it's even
> more important for Suspend2. Freezing processes is essential to being
> able to know that those LRU pages won't change and therefore being able
> to save them separately and then reuse them for the atomic copy.

all that's needed for the snapshot is to prevent userspace from scheduling, and 
prevent media from being written to in a permanent way (writing to a LVM volume 
after invoking a snapshot doesn't count, just revert to the snapshot)

David Lang


* Re: Back to the future.
  2007-04-27 23:45                           ` Rafael J. Wysocki
@ 2007-04-27 23:57                             ` Nigel Cunningham
  2007-04-27 23:50                               ` David Lang
  2007-04-27 23:59                             ` Linus Torvalds
  1 sibling, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-27 23:57 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linus Torvalds, Pekka J Enberg, LKML


Hi.

On Sat, 2007-04-28 at 01:45 +0200, Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 01:17, Linus Torvalds wrote:
> > 
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > >
> > > > And can you name a _single_ advantage of doing so?
> > > 
> > > Yes.  We have a lot less interdependencies to worry about during the whole
> > > operation.
> > 
> > That's not an advantage. That's why it has *sucked*.
> 
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.
> 
> To make you happy, we could stop doing that, but what actual _advantage_
> would that bring?

A couple of other advantages to freezing other processes:

1) It makes predicting how much memory is available for making and
saving snapshot a tractable problem. It therefore makes hibernation
_much_ more reliable.
2) Racing against other processes would also make hibernation slower,
increasing the chances of your battery running out before the save is
complete.
3) It makes finding potential memory leaks in the code possible. It was
ages ago now, but at one stage I could display a table saying exactly
how many pages had been allocated and freed by different sections of the
process and compare the number of free pages at the start and end of the
cycle to ensure there were no memory leaks at all.
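
To illustrate the kind of bookkeeping meant in 3), here is a minimal sketch. It is a toy model, not the Suspend2 code: the class `PageAccounting` and its methods are invented for this example, and it only works because nothing else is assumed to allocate or free pages during the cycle, which is exactly what freezing buys.

```python
from collections import defaultdict


class PageAccounting:
    """Toy per-section page counter (hypothetical, for illustration only)."""

    def __init__(self, free_pages):
        self.start_free = free_pages   # free pages at the start of the cycle
        self.free_pages = free_pages   # free pages right now
        self.by_section = defaultdict(lambda: {"allocated": 0, "freed": 0})

    def alloc(self, section, pages):
        if pages > self.free_pages:
            raise MemoryError("not enough free pages")
        self.free_pages -= pages
        self.by_section[section]["allocated"] += pages

    def free(self, section, pages):
        self.free_pages += pages
        self.by_section[section]["freed"] += pages

    def leaked(self):
        # With everything else frozen, any difference between the start and
        # end of the cycle must come from the hibernation code itself.
        return self.start_free - self.free_pages

    def table(self):
        # One row per section: (name, pages allocated, pages freed).
        return sorted(
            (name, c["allocated"], c["freed"]) for name, c in self.by_section.items()
        )
```

If `leaked()` is non-zero at the end of a cycle, the rows of `table()` where allocated and freed differ point at the section responsible.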

> > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > these interdependencies. It hasn't removed a single dependency at any 
> > time, it has just added new problems!
> 
> What problems are you talking about?
> 
> > > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > > that could interfere with the freezing of device drivers,
> > > 2) if they are frozen, we know, for example, that they won't call user mode
> > > helpers or do similar things,
> > > 3) if they are frozen, we know that they won't submit I/O to disks and
> > > potentially damage filesystems (suspend2 has much more problems with that
> > > than swsusp, but still.  And yes, there have been bug reports related to it,
> > > so it's not just my fantasy).
> > 
> > NONE of these are valid explanations at all. You're listing totally 
> > theoretical problems, and ignoring all the _real_ problems that trying to 
> > freeze kernel threads has _caused_.
> 
> Example, please?

I agree with Rafael. Freezing processes greatly helps in ensuring we
have a consistent image. He's right, too, in asserting that it's even
more important for Suspend2. Freezing processes is essential to being
able to know that those LRU pages won't change and therefore being able
to save them separately and then reuse them for the atomic copy.

> > If you want to control user-mode helpers, you do that - you do not freeze 
> > kernel threads!
> > 
> > And no, kernel threads do not submit IO to disks on their own. You just 
> > made that up.
> 
> No, I didn't.  Nigel can confirm, I think.

I have had problems with MD threads generating I/O that I couldn't
account for - after userspace had been frozen, filesystems had been
nicely synced and so on. I have to speak with reservations though,
because I haven't yet gotten to the bottom of where the I/O is coming
from... too many things, too small time slices.

> > Yes, they can be involved in that whole disk submission thing, but in a good
> > way - they can be required in order to make disk writing work!
> 
> Some of them can be, some others need not be.  We don't need any fs-related
> kernel threads for saving the image, for example.

Yeah, so long as we bmap the storage we want to use beforehand (thinking
of swap files and ordinary files).

> > The problem that suspend has had is that it's done everything totally the 
> > wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 
> 
> They can be asked before we do the snapshot and complete the operation
> afterwards, no?
> 
> > For example, kernel threads can be involved in md etc, but that's a *good* 
> > thing.
> 
> We don't freeze these threads.
> 
> > The way to shut them up is not to freeze the threads, but to freeze the *disk*.
> 
> In principle, you're right.  In practice, go and try it.

I have to disagree here. Freezing the disk instead of the threads is
dealing with the symptoms instead of the cause.

Regards,

Nigel



* Re: Back to the future.
  2007-04-27 23:45                           ` Rafael J. Wysocki
  2007-04-27 23:57                             ` Nigel Cunningham
@ 2007-04-27 23:59                             ` Linus Torvalds
  2007-04-28  0:18                               ` Linus Torvalds
                                                 ` (2 more replies)
  1 sibling, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-27 23:59 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Nigel Cunningham, Pekka J Enberg, LKML



On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.

That makes no sense.

You have to create the snapshot image with interrupts disabled *anyway*.

I really don't see how you can say that stopping threads etc can make any 
difference what-so-ever. If you don't create the snapshot with interrupts 
disabled (and just with a single CPU running) you have so many other 
problems that it's not even remotely funny.

So there's *by*definition* nothing at all that can happen while you 
snapshot the system. Claiming otherwise is just silly.

> To make you happy, we could stop doing that, but what actual _advantage_
> would that bring?

Like getting rid of all the magic "I don't want you to freeze me" crud? 

Or getting rid of this horribly idiotic "three times widdershins" kind of 
black magic mentality! It looks like the main reason for the process 
freezing has nothing to do with technology, but some irrational fear of 
other things happening at the same time, even though they CANNOT happen if 
you do things even half-way sanely.

The "let's stop all kernel threads" is superstition. It's the same kind of 
superstition that made people write "sync" three times before turning off 
the power in the olden times. It's the kind of superstition that comes 
from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
that it works when we are being quiet".

That's bad.

It's doubly bad, because that idiocy has also infected s2ram. Again, 
another thing that really makes no sense at all - and we do it not just 
for snapshotting, but for s2ram too. Can you tell me *why*?

> > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > these interdependencies. It hasn't removed a single dependency at any 
> > time, it has just added new problems!
> 
> What problems are you talking about?

Like you wouldn't know. Look at commit b43376927a that you yourself are 
credited with, just a month ago. 

Then, do something as simple as

	git grep create_freezeable_workthread

and ponder the end results of that grep. If you don't see something wrong, 
you're blind.

> > NONE of these are valid explanations at all. You're listing totally 
> > theoretical problems, and ignoring all the _real_ problems that trying to 
> > freeze kernel threads has _caused_.
> 
> Example, please?

Who do you think you are kidding? See above.

And if you think that's an isolated example, look again. And start 
grepping for PF_NOFREEZE, and other examples.

The fact is, there is not a *single* reason to freeze kernel threads. But 
some rocket scientist decided to, and then screwed everybody else over.

			Linus


* Re: Back to the future.
  2007-04-27 23:01                             ` David Lang
@ 2007-04-28  0:02                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  0:02 UTC (permalink / raw)
  To: David Lang; +Cc: Linus Torvalds, Pekka J Enberg, Nigel Cunningham, LKML

On Saturday, 28 April 2007 01:01, David Lang wrote:
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> > On Saturday, 28 April 2007 00:26, David Lang wrote:
> >> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >>
> >>>>> We're freezing many of them just fine. ;-)
> >>>>
> >>>> And can you name a _single_ advantage of doing so?
> >>>
> >>> Yes.  We have a lot less interdependencies to worry about during the whole
> >>> operation.
> >>>
> >>>> It so happens, that most people wouldn't notice or care that kmirrord got
> >>>> frozen (kernel thread picked at random - it might be one of the threads
> >>>> that has gotten special-cased to not do that), but I have yet to hear a
> >>>> single coherent explanation for why it's actually a good idea in the first
> >>>> place.
> >>>
> >>> Well, I don't know if that's a 'coherent' explanation from your point of view
> >>> (probably not), but I'll try nevertheless:
> >>> 1) if the kernel threads are frozen, we know that they don't hold any locks
> >>> that could interfere with the freezing of device drivers,
> >>
> >> does the process of freezing really wait until all locks have been released?
> >
> > Yes, it does.
> >
> >>> 2) if they are frozen, we know, for example, that they won't call user mode
> >>> helpers or do similar things,
> >>
> >> this won't matter unless the user mode helpers are going to do I/O or other
> >> permanent changes
> >
> > Please note that even accessing a file may be a permanent change.
> 
> if accessing a file on a read-only filesystem changes that filesystem it's a bug
> 
> see the recent thread about ext3 journal replays when mounting read-only as an 
> example.

Oh well.  Is it really wrong to protect users from such bugs, if we can do
that?

> >>> 3) if they are frozen, we know that they won't submit I/O to disks and
> >>> potentially damage filesystems (suspend2 has much more problems with that
> >>> than swsusp, but still.  And yes, there have been bug reports related to it,
> >>> so it's not just my fantasy).
> >>
> >> if you have the filesystems checkpointed then I/O after the freeze won't matter
> >> as you just revert to the checkpoint (and since this is going to be thrown away
> >> it can stay in ram)
> >
> > In that case, I would agree.  Currently, however, we're not even close to this
> > point.
> >
> > The checkpointing of filesystems would be a very welcome feature, but there's
> > no one working on it right now, AFAICT.
> >
> >> if we are willing to make a break with the past to implement the new snapshot
> >> capability, we should be able to use the LVM snapshot code to handle the
> >> filesystem
> >
> > Yes, we can do that, in principle, and screw all of the current users in the
> > process.  And finally we'd end up with something similar to what is done now,
> > IMHO.
> 
> however, the result may be a lot less 'special case power management' code and a

Are you referring to some specific code?

> lot more re-use of code that's in place for other uses.

This already is happening.

> if work on the current versions was stopped (other than trying to avoid
> regressions) and a new version (with new userspace tools) was built in a way 
> that satisfies everyone the old version could be phased out in a year or two 
> (per the normal feature removal process)

May I say it's not realistic?

> > And no, the things are not just totally broken, as it may follow from these
> > discussions.  The problem is that the people who are discussing them so
> > viciously have never tried to write anything like the hibernation code.
> >
> > This is as though I were discussing the design of the CPU schedulers,
> > although I only know how they work on a general level.
> >
> > Actually, the really problematic thing with the hibernation _right_ _now_ is
> > what Linus is so concerned about (and rightfully so) - that we use the
> > same device drivers' callbacks for the hibernation and suspend (aka s2ram).
> > The other things work quite well and are really robust.
> 
> if simply splitting the functions cleans everything up enough to satisfy 
> everyone then we're almost done right? ;-)

Practically, yes.  Theoretically, there's no software you can't improve
(except, probably, TeX), but that might not be worth the effort.

> however I think that there are other fundamental disagreements here, and neither 
> the 'do absolutely everything in the kernel' nor the 'do almost nothing in the
> kernel' approaches are going to fly in the long run.

I think we'll have an agreement, though.

> I think the userspace<->kernel interface is going to be different than
> either approach is doing now,

You're probably right

> and as such it's an opportunity to make more drastic changes if they are
> appropriate.

Well, maybe.

> for example, why should we have LVM snapshot code and hibernate 
> snapshot/filesystem checkpoint code instead of just using the LVM code (which
> gets exercised and tested far more than the other code ever would be)? saying
> that if you want to suspend to disk you need to use LVM is a change, but it's 
> a change that people could probably live with.

Well, that's a theory.  Probably a good one, but still. :-)

The positive aspect of all this is that people have started to pay attention to
what we're doing, and gradually they will learn about the problems that they're
just not seeing right now.

Greetings,
Rafael


* Re: Back to the future.
  2007-04-27 21:44                 ` Linus Torvalds
  2007-04-27 22:04                   ` Rafael J. Wysocki
  2007-04-27 22:07                   ` Nigel Cunningham
@ 2007-04-28  0:18                   ` Jeremy Fitzhardinge
  2007-04-28  1:00                     ` Matthew Garrett
  2 siblings, 1 reply; 136+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-28  0:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Pekka J Enberg, Nigel Cunningham, LKML

Linus Torvalds wrote:
> On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>   
>> Why do you think that keeping the user space frozen after 'snapshot' is a bad
>> idea?  I think that solves many of the problems you're discussing.
>>     
>
> It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do
>
> 	gdb -p <snapshotter>
>
> when something goes wrong?)

Yeah, or gdb vmlinux snapshot

Then you could use kexec for resume...

    J


* Re: Back to the future.
  2007-04-27 23:59                             ` Linus Torvalds
@ 2007-04-28  0:18                               ` Linus Torvalds
  2007-05-05 11:42                                 ` Pavel Machek
  2007-04-28  0:50                               ` Paul Mackerras
  2007-04-28  1:00                               ` Rafael J. Wysocki
  2 siblings, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28  0:18 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Nigel Cunningham, Pekka J Enberg, LKML



On Fri, 27 Apr 2007, Linus Torvalds wrote:
> 
> The "let's stop all kernel threads" is superstition. It's the same kind of 
> superstition that made people write "sync" three times before turning off 
> the power in the olden times. It's the kind of superstition that comes 
> from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
> that it works when we are being quiet".

Side note: while I think things should probably *work* even with user 
processes going full bore while a snapshot is taken, I'll freely admit 
that I'll follow that superstition far enough that I think it's probably a 
good idea to try to quiesce the system to _some_ degree, and that stopping 
user programs is a good idea. Partly because the whole memory shrinking 
thing, and partly just because we should do the snapshot with hw IO queues 
empty.

But I don't think it would necessarily be wrong (and in many ways it would 
probably be *right*) to do that IO queue stopping at the queue level 
rather than at a process level. Why stop processes just because you want 
to clean out IO queues? They are two totally different things!

		Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-26  6:04 Back to the future Nigel Cunningham
  2007-04-26  7:28 ` Pekka Enberg
  2007-04-26  8:38 ` Jan Engelhardt
@ 2007-04-28  0:28 ` Bojan Smojver
  2 siblings, 0 replies; 136+ messages in thread
From: Bojan Smojver @ 2007-04-28  0:28 UTC (permalink / raw)
  To: linux-kernel

Nigel Cunningham <nigel <at> nigel.suspend2.net> writes:

> 4) uswsusp and swsusp get dropped and Suspend2 goes into mainline.

After reading most of this thread, it seems that Linus is of the view that all
three of these suck in one way or another. Suspend2 has the most features and is
the fastest of the lot. It can behave like swsusp from the user's point of view
(i.e. echo disk > /sys/power/state), so the migration should be seamless for
most distros. It isn't complicated to set up. It's been proven in the field. It
looks pretty.

So, while we're waiting for the next STD technology, why not have the best and
develop from there?

--
Bojan


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:50                               ` David Lang
@ 2007-04-28  0:40                                 ` Linus Torvalds
  2007-04-28  6:58                                 ` Oliver Neukum
  2007-05-03 17:18                                 ` Pavel Machek
  2 siblings, 0 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28  0:40 UTC (permalink / raw)
  To: David Lang; +Cc: Nigel Cunningham, Rafael J. Wysocki, Pekka J Enberg, LKML



On Fri, 27 Apr 2007, David Lang wrote:
> 
> all that's needed for the snapshot is to prevent userspace from scheduling,

Strictly speaking, all you *really* want to make sure is not so much that 
user-space isn't scheduling, as the fact that all device IO buffers must 
be empty.

We can trivially snapshot an active user-space, and in fact it would 
probably be hard to do a snapshot in a way that it could even *know* or 
care about whether there are user-space processes running at the time of 
the snapshot.

So that's not the real problem.

What we obviously *cannot* snapshot is if some particular device is in the 
middle of being written to or read from, and has outstanding commands on 
the device itself (as opposed to just queued to the driver). So what we do 
want to make sure happens is that there are no IO queues that are active.

And the best way to make sure that there are no IO queues active is to 
make sure that there are no new read or write-requests. And *that* you can 
do two ways:

 - actually intercepting the read/write requests. Probably not too hard, 
   we could literally do it in the IO scheduler (and probably much more 
   easily than doing it in the process scheduler), but the easy cases will 
   only cover the block device layer, and character devices don't have the 
   same kind of scheduler you can trap IO in.

 - we also don't want to generate new data that needs to be snapshotted, 
   so we want to trap people who write even just to the page cache and 
   turn pages dirty. Again, we could probably do it at *that* point (ie 
   trapping them when they try to dirty a page), and it would be more 
   logical, but again, there are other cases of people who generate more 
   data (just any memory allocation obviously is a special case of 
   generating more data to be snapshotted),

so I do agree that we want to stop producing new data to be snapshotted, 
and we want to stop producing new read-requests. But kernel threads really 
do neither: in an idle system, kernel threads are idle too. A kernel 
thread is not like a user program that actually generates data - they only 
tend to act on behalf of other processes' needs.

So I think that what snapshotting really *wants* to stop is not scheduling 
per se, but IO. And stopping user processes (as opposed to kernel threads) 
is probably a good way to get there.

In fact, I'd argue that you want to stop user space and then encourage 
some kernel threads to *start* running, notably things like bdflush should 
probably be kicked to clean up some dirty stuff as part of the "shrink 
data to be snapshotted" part. Trying to free memory will do that on its 
own, of course.
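
The idea of stopping IO at the queue rather than at the process level can be
sketched in user-space C. This is only an illustration under stated
assumptions, not kernel code: `struct io_queue` and the `queue_*` helpers are
hypothetical names, and a deferred request is modelled as a plain counter
rather than a real block-layer request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one device request queue (not a kernel structure). */
struct io_queue {
    bool frozen;     /* set while a snapshot is being prepared */
    int  in_flight;  /* requests already issued to the device */
    int  deferred;   /* requests held back while the queue is frozen */
};

/* Submit one request. Returns true if it was issued to the device,
 * false if it was deferred because the queue is frozen. */
static bool queue_submit(struct io_queue *q)
{
    if (q->frozen) {
        q->deferred++;
        return false;
    }
    q->in_flight++;
    return true;
}

/* The device reports completion of one outstanding request. */
static void queue_complete(struct io_queue *q)
{
    assert(q->in_flight > 0);
    q->in_flight--;
}

/* Stop accepting new requests; outstanding ones are left to drain. */
static void queue_freeze(struct io_queue *q)
{
    q->frozen = true;
}

/* Safe to snapshot only when frozen and nothing is outstanding. */
static bool queue_is_quiescent(const struct io_queue *q)
{
    return q->frozen && q->in_flight == 0;
}

/* Resume normal operation and issue everything that was held back. */
static void queue_thaw(struct io_queue *q)
{
    q->frozen = false;
    q->in_flight += q->deferred;
    q->deferred = 0;
}
```

In this model the snapshot path would call queue_freeze(), wait until
queue_is_quiescent() holds, take the snapshot, and then call queue_thaw().
Submitters are held at the queue and the process scheduler is never involved.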

			Linus


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:59                             ` Linus Torvalds
  2007-04-28  0:18                               ` Linus Torvalds
@ 2007-04-28  0:50                               ` Paul Mackerras
  2007-04-28  1:00                               ` Rafael J. Wysocki
  2 siblings, 0 replies; 136+ messages in thread
From: Paul Mackerras @ 2007-04-28  0:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Nigel Cunningham, Pekka J Enberg, LKML

Linus Torvalds writes:

> I really don't see how you can say that stopping threads etc can make any 
> difference what-so-ever. If you don't create the snapshot with interrupts 
> disabled (and just with a single CPU running) you have so many other 
> problems that it's not even remotely funny.

I agree.  I don't like the freezer.  We have had working
kernel-controlled suspend to RAM on powerbooks for almost 10 years
now, and we never needed to freeze processes.

That said, I can see two attractions in freezing processes:

1. It provides a way to stop new I/O requests coming in, and thus
   somewhat makes up for the lack of a way to freeze device request
   queues (at least, we didn't have one last time I looked).

2. Systems do sometimes die while suspended (e.g. run out of battery,
   or the resume process fails), and to make the next boot painless,
   you want the filesystems on disk to be as clean as possible.
   Freezing processes and then doing a sync provides one way to
   achieve that.  Of course, you have to make sure you don't freeze
   any kernel threads that are needed for doing the sync...  And if
   one of your filesystems is using FUSE, it's not going to get very
   far.

Paul.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:15                       ` Rafael J. Wysocki
@ 2007-04-28  0:51                         ` David Lang
  2007-04-28  1:25                         ` Kyle Moffett
  1 sibling, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-28  0:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kyle Moffett, nigel, Linus Torvalds, Pekka J Enberg, LKML

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

> On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:
>> On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
>>> Hi.
>>>
>>> On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
>>>> It makes it harder to debug (wouldn't it be *nice* to just ssh in,
>>>> and do
>>>> 	gdb -p <snapshotter>
>>>
>>> Make the machine being suspended a VM and you can already do that.
>>
>>>> when something goes wrong?) but we also *depend* on user space for
>>>> various things (the same way we depend on kernel threads, and why
>>>> it has been such a total disaster to try to freeze the kernel
>>>> threads too!). For example, if you want to do graphical stuff,
>>>> just using X would be quite nice,  wouldn't it?
>>>
>>> But in doing so you make the contents of the disk inconsistent with
>>> the state you've just snapshotted, leading to filesystem
>>> corruption. Even if you modify filesystems to do checkpointing
>>> (which is what we're really talking about), you still also have the
>>> problem that your snapshot has to be stored somewhere before you
>>> write it to disk, so you also have to either [snip]
>>
>> Actually, it's a lot simpler than that.  We can just combine the
>> device-mapper snapshot with a VM+kernel snapshot system call and be
>> almost done:
>>
>>    sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);
>>
>> When sys_snapshot is run, the kernel does:
>>
>> 1)  Sequentially freeze mounted filesystems using blockdev freezing.
>> If it's an fs that doesn't support freezing then either fail or force-
>> remount-ro that fs and downgrade all its filedescriptors to RO.
>> Doesn't need extra locking since processes which try to do IO either
>> succeed before the freeze call returns for that blockdev or sleep on
>> the unfreeze of that blockdev.  Filesystems are synchronized and made
>> clean.
>> 2)  Iterate over the userspace process list, freezing each process
>> and remapping all of its pages copy-on-write.  Any device-specific
>> pages need to have state saved by that device.
>
> Why do you want to do 2) after 1) and not vice versa?

it doesn't really need to matter. if you care, just arrange to not schedule user 
processes while you are doing both steps.

>> 3)  All processes (except kernel threads) are now frozen.
>> 4)  Kernel should save internal state corresponding to current
>> userspace state.  The kernel also swaps out excess pages to free up
>> enough RAM and prepares the snapshot file-descriptor with copies of
>> kernel memory and the original (pre-COW) mapped userspace pages.
>> 5)  Kernel substitutes filesystems for either a device-mapper
>> snapshot with snapblockdev as backing storage or union with tmpfs and
>> remounts the underlying filesystems as read-only.
>> 6)  Kernel unfreezes all userspace processes and returns the snapshot
>> FD to userspace (where it can be read from).
>
> Okay, but how do we do the error recovery if, for example, the image cannot
> be saved?

give the user an error message telling him this, wait for confirmation, and then 
jump directly to the restore step. revert everything to the snapshot image(s), 
restart it.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:12                                 ` Linus Torvalds
@ 2007-04-28  0:54                                   ` David Lang
  2007-04-28  1:44                                   ` Rafael J. Wysocki
  1 sibling, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-28  0:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Nigel Cunningham, Pekka J Enberg, LKML

On Fri, 27 Apr 2007, Linus Torvalds wrote:

> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>>
>>> It's doubly bad, because that idiocy has also infected s2ram. Again,
>>> another thing that really makes no sense at all - and we do it not just
>>> for snapshotting, but for s2ram too. Can you tell me *why*?
>>
>> Why we freeze tasks at all or why we freeze kernel threads?
>
> In many ways, "at all".
>
> I _do_ realize the IO request queue issues, and that we cannot actually do
> s2ram with some devices in the middle of a DMA. So we want to be able to
> avoid *that*, there's no question about that. And I suspect that stopping
> user threads and then waiting for a sync is practically one of the easier
> ways to do so.
>
> So in practice, the "at all" may become a "why freeze kernel threads?" and
> freezing user threads I don't find really objectionable.

there was a thread last week (or so) about splitting up the process list, one 
list for normal user processes, one for kernel threads, and one for dead 
processes waiting to be reaped.

it almost sounds like what you want to do is to act as if the normal user 
threads weren't there for a short time (while you make the snapshot) and then 
recover them to continue and save the snapshot.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:59                             ` Linus Torvalds
  2007-04-28  0:18                               ` Linus Torvalds
  2007-04-28  0:50                               ` Paul Mackerras
@ 2007-04-28  1:00                               ` Rafael J. Wysocki
  2007-04-28  1:12                                 ` Linus Torvalds
  2 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  1:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nigel Cunningham, Pekka J Enberg, LKML

On Saturday, 28 April 2007 01:59, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > Actually, the fewer things happen while we're creating and saving the image,
> > the fewer sources of potential problems there are, and by freezing the kernel
> > threads (not all of them), we cause less things to happen at that time.
> 
> That makes no sense.
> 
> You have to create the snapshot image with interrupts disabled *anyway*.
> 
> I really don't see how you can say that stopping threads etc can make any 
> difference what-so-ever. If you don't create the snapshot with interrupts 
> disabled (and just with a single CPU running) you have so many other 
> problems that it's not even remotely funny.
> 
> So there's *by*definition* nothing at all that can happen while you 
> snapshot the system. Claiming otherwise is just silly.

For creating the snapshot alone, it doesn't matter.  Except that the restore is
a bit cleaner (we know exactly what all of these threads will be doing when
we restore the image and enable the IRQs after that).

Still, I think that kernel threads can potentially hold locks across the
freezing of devices and image creation and that is fishy.  Also I believe,
although I'm not 100% sure, that some of them may cause problems to
appear after we've created the image and while we are saving it.

> > To make you happy, we could stop doing that, but what actual _advantage_
> > that would bring?
> 
> Like getting rid of all the magic "I don't want you to freeze me" crud?

And what exactly is wrong with it?

> Or getting rid of this horribly idiotic "three times widdershins" kind of 
> black magic mentality! It looks like the main reason for the process 
> freezing has nothing to do with technology, but some irrational fear of 
> other things happening at the same time, even though they CANNOT happen if 
> you do things even half-way sanely.
> 
> The "let's stop all kernel threads" is superstition. It's the same kind of 
> superstition that made people write "sync" three times before turning off 
> the power in the olden times. It's the kind of superstition that comes 
> from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
> that it works when we are being quiet".
>
> That's bad.

Okay.  As it happens, I'm working on a freezer patch, so I'll probably drop
the freezing of kernel threads from swsusp in it and we'll see what happens.

Let's do the experiment, shall we?

> It's doubly bad, because that idiocy has also infected s2ram. Again, 
> another thing that really makes no sense at all - and we do it not just 
> for snapshotting, but for s2ram too. Can you tell me *why*?

Why we freeze tasks at all or why we freeze kernel threads?

> > > Trying to freeze kernel threads has _caused_ problems. It has _added_ 
> > > these interdependencies. It hasn't removed a single dependency at any 
> > > time, it has just added new problems!
> > 
> > What problems are you talking about?
> 
> Like you wouldn't know. Look at commit b43376927a that you yourself are 
> credited with, just a month ago. 
> 
> Then, do something as simple as
> 
> 	git grep create_freezeable_workthread

s/workthread/workqueue/

> and ponder the end results of that grep. If you don't see something wrong, 
> you're blind.

This was a mistake, quite unrelated to the point you're making.  And actually,
I was trying to fix a problem with two kernel threads that we thought might
submit I/O to disk after the image had been created.  Otherwise I wouldn't
have thought of doing that change.

> > > NONE of these are valid explanations at all. You're listing totally 
> > > theoretical problems, and ignoring all the _real_ problems that trying to 
> > > freeze kernel threads has _caused_.
> > 
> > Example, please?
> 
> Who do you think you are kidding? See above.

Well, if someone does something in a wrong way, that need not mean the
thing he was trying to do was wrong.

Somehow, I knew you would point at this ...

> And if you think that's an isolated example, look again. And start 
> grepping for PF_NOFREEZE, and other examples.

May I say I'm not convinced?

> The fact is, there is not a *single* reason to freeze kernel threads. But 
> some rocket scientist decided to, and then screwed everybody else over.

At least _that_ wasn't me. :-)

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  0:18                   ` Jeremy Fitzhardinge
@ 2007-04-28  1:00                     ` Matthew Garrett
  2007-04-28  1:05                       ` Jeremy Fitzhardinge
  2007-04-28  1:08                       ` Rafael J. Wysocki
  0 siblings, 2 replies; 136+ messages in thread
From: Matthew Garrett @ 2007-04-28  1:00 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linus Torvalds, Rafael J. Wysocki, Pekka J Enberg,
	Nigel Cunningham, LKML

On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:

> Then you could use kexec for resume...

While that would certainly be nifty, I think we're arguably starting 
from the wrong point here. Why are we booting a kernel, trying to poke 
the hardware back into some sort of mock-quiescent state, freeing memory 
and then (finally) overwriting the entire contents of RAM rather than 
just doing all of this from the bootloader? Given the time spent in 
kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd 
still be faster even if you're stuck using int 13 on x86.

http://apcmag.com/5873/page14 suggests that Intel is looking into this, 
but I haven't heard anything more yet. To the best of my knowledge, this 
is also how Windows manages things.
-- 
Matthew Garrett | mjg59@srcf.ucam.org

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 22:07                   ` Nigel Cunningham
@ 2007-04-28  1:03                     ` Kyle Moffett
  2007-04-28  1:15                       ` Rafael J. Wysocki
  2007-05-03 15:10                       ` Pavel Machek
  0 siblings, 2 replies; 136+ messages in thread
From: Kyle Moffett @ 2007-04-28  1:03 UTC (permalink / raw)
  To: nigel; +Cc: Linus Torvalds, Rafael J. Wysocki, Pekka J Enberg, LKML

On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
> Hi.
>
> On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
>> It makes it harder to debug (wouldn't it be *nice* to just ssh in,  
>> and do
>> 	gdb -p <snapshotter>
>
> Make the machine being suspended a VM and you can already do that.

>> when something goes wrong?) but we also *depend* on user space for  
>> various things (the same way we depend on kernel threads, and why  
>> it has been such a total disaster to try to freeze the kernel  
>> threads too!). For example, if you want to do graphical stuff,  
>> just using X would be quite nice,  wouldn't it?
>
> But in doing so you make the contents of the disk inconsistent with  
> the state you've just snapshotted, leading to filesystem  
> corruption. Even if you modify filesystems to do checkpointing  
> (which is what we're really talking about), you still also have the  
> problem that your snapshot has to be stored somewhere before you  
> write it to disk, so you also have to either [snip]

Actually, it's a lot simpler than that.  We can just combine the  
device-mapper snapshot with a VM+kernel snapshot system call and be  
almost done:

   sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);

When sys_snapshot is run, the kernel does:

1)  Sequentially freeze mounted filesystems using blockdev freezing.   
If it's an fs that doesn't support freezing then either fail or force- 
remount-ro that fs and downgrade all its filedescriptors to RO.   
Doesn't need extra locking since processes which try to do IO either  
succeed before the freeze call returns for that blockdev or sleep on  
the unfreeze of that blockdev.  Filesystems are synchronized and made  
clean.
2)  Iterate over the userspace process list, freezing each process  
and remapping all of its pages copy-on-write.  Any device-specific  
pages need to have state saved by that device.
3)  All processes (except kernel threads) are now frozen.
4)  Kernel should save internal state corresponding to current  
userspace state.  The kernel also swaps out excess pages to free up  
enough RAM and prepares the snapshot file-descriptor with copies of  
kernel memory and the original (pre-COW) mapped userspace pages.
5)  Kernel substitutes filesystems for either a device-mapper  
snapshot with snapblockdev as backing storage or union with tmpfs and  
remounts the underlying filesystems as read-only.
6)  Kernel unfreezes all userspace processes and returns the snapshot  
FD to userspace (where it can be read from).

Then userspace can do whatever it wants.  Any changes to filesystems  
mounted at the time of snapshot will be discarded at shutdown.   
Freshly mounted filesystems won't have the union or COW thing done,  
and so you can write your snapshot to a compressed encrypted file on  
a USB key if you want to, you just have to unmount it before the  
snapshot() syscall and remount it right afterwards.
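
The ordering of the six steps above can be sketched as a small user-space C
driver. This is only a rough illustration: the `snap_phase` enum, the
callback type and `snapshot_run()` are hypothetical names, and the
unwind-in-reverse behaviour on failure is an assumption added for the sketch,
not part of the proposal itself.

```c
#include <assert.h>

/* Hypothetical phase list; the numbers refer to the steps above. */
enum snap_phase {
    SNAP_FREEZE_FS,     /* 1) freeze mounted filesystems */
    SNAP_FREEZE_TASKS,  /* 2)-3) freeze userspace, remap pages COW */
    SNAP_SAVE_STATE,    /* 4) save kernel state, prepare the image */
    SNAP_REDIRECT_FS,   /* 5) switch to dm-snapshot or tmpfs union */
    SNAP_THAW_TASKS,    /* 6) unfreeze userspace, hand back the fd */
    SNAP_NPHASES
};

/* Hypothetical per-phase callback: returns 0 on success, nonzero on failure. */
typedef int (*phase_fn)(enum snap_phase phase, void *ctx);

/* Run the phases in order. If one fails, undo the completed phases in
 * reverse order (an assumption of this sketch). Returns the number of
 * phases that completed, i.e. SNAP_NPHASES on full success. */
static int snapshot_run(phase_fn do_phase, phase_fn undo_phase, void *ctx)
{
    int done;

    for (done = 0; done < SNAP_NPHASES; done++) {
        if (do_phase((enum snap_phase)done, ctx) != 0)
            break;
    }
    if (done < SNAP_NPHASES) {
        for (int i = done - 1; i >= 0; i--)
            undo_phase((enum snap_phase)i, ctx);
    }
    return done;
}

/* Example callbacks that only count calls, so the ordering can be checked. */
struct snap_log {
    int fail_at;  /* phase that should fail, or -1 for none */
    int applied;  /* phases that succeeded */
    int undone;   /* phases that were rolled back */
};

static int log_do(enum snap_phase phase, void *ctx)
{
    struct snap_log *log = ctx;

    if ((int)phase == log->fail_at)
        return -1;
    log->applied++;
    return 0;
}

static int log_undo(enum snap_phase phase, void *ctx)
{
    struct snap_log *log = ctx;

    (void)phase;
    log->undone++;
    return 0;
}
```

With fail_at set to SNAP_SAVE_STATE, the two freeze phases are applied and
then undone again, so userspace would end up running as before.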

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:00                     ` Matthew Garrett
@ 2007-04-28  1:05                       ` Jeremy Fitzhardinge
  2007-05-03 15:14                         ` Pavel Machek
  2007-04-28  1:08                       ` Rafael J. Wysocki
  1 sibling, 1 reply; 136+ messages in thread
From: Jeremy Fitzhardinge @ 2007-04-28  1:05 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Linus Torvalds, Rafael J. Wysocki, Pekka J Enberg,
	Nigel Cunningham, LKML

Matthew Garrett wrote:
> While that would certainly be nifty, I think we're arguably starting 
> from the wrong point here. Why are we booting a kernel, trying to poke 
> the hardware back into some sort of mock-quiescent state, freeing memory 
> and then (finally) overwriting the entire contents of RAM rather than 
> just doing all of this from the bootloader?

Sure, you could make suspend generate a complete bootable kernel image
containing all RAM.  Doesn't sound too hard to me.  You know, from over
here on the sidelines.

    J

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:00                     ` Matthew Garrett
  2007-04-28  1:05                       ` Jeremy Fitzhardinge
@ 2007-04-28  1:08                       ` Rafael J. Wysocki
  1 sibling, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  1:08 UTC (permalink / raw)
  To: Matthew Garrett
  Cc: Jeremy Fitzhardinge, Linus Torvalds, Pekka J Enberg,
	Nigel Cunningham, LKML

On Saturday, 28 April 2007 03:00, Matthew Garrett wrote:
> On Fri, Apr 27, 2007 at 05:18:16PM -0700, Jeremy Fitzhardinge wrote:
> 
> > Then you could use kexec for resume...
> 
> While that would certainly be nifty, I think we're arguably starting 
> from the wrong point here. Why are we booting a kernel, trying to poke 
> the hardware back into some sort of mock-quiescent state, freeing memory 
> and then (finally) overwriting the entire contents of RAM rather than 
> just doing all of this from the bootloader? Given the time spent in 
> kernel setup and unpacking initramfs nowadays, I'm willing to bet it'd 
> still be faster even if you're stuck using int 13 on x86.

Yes, that would be faster.

> http://apcmag.com/5873/page14 suggests that Intel is looking into this, 
> but I haven't heard anything more yet. To the best of my knowledge, this 
> is also how Windows manages things.

I think you're right.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:00                               ` Rafael J. Wysocki
@ 2007-04-28  1:12                                 ` Linus Torvalds
  2007-04-28  0:54                                   ` David Lang
  2007-04-28  1:44                                   ` Rafael J. Wysocki
  0 siblings, 2 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28  1:12 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Nigel Cunningham, Pekka J Enberg, LKML



On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> 
> > It's doubly bad, because that idiocy has also infected s2ram. Again, 
> > another thing that really makes no sense at all - and we do it not just 
> > for snapshotting, but for s2ram too. Can you tell me *why*?
> 
> Why we freeze tasks at all or why we freeze kernel threads?

In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do 
s2ram with some devices in the middle of a DMA. So we want to be able to 
avoid *that*, there's no question about that. And I suspect that stopping 
user threads and then waiting for a sync is practically one of the easier 
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and 
freezing user threads I don't find really objectionable.

But as Paul pointed out, Linux on the old powerpc Mac hardware was 
actually rather famous for having working (and reliable) suspend long 
before it worked even remotely reliably on PC's. And they didn't do even
that.

(They didn't have ACPI, and they had a much more limited set of devices, 
but the whole process freezer is really about neither of those issues. The 
wild and wacky PC hardware has its problems, but that's _one_ thing we 
can't blame PC hardware for ;)

> > 	git grep create_freezeable_workthread
> 
> s/workthread/workqueue/

Yes.

> > and ponder the end results of that grep. If you don't see something wrong, 
> > you're blind.
> 
> This was a mistake, quite unrelated to the point you're making.

Did you actually _do_ the "grep" (with the fixed argument)?

I had two totally independent points. #1 was that you yourself have been 
fixing bugs in this area. #2 was the result of that grep. It's absolutely 
_empty_ except for the define to add that interface.

NOBODY USES IT!

Now, grep for the same interface that creates _non_freezeable workqueues.

Put another way:

	[torvalds@woody linux]$ git grep create_workqueue | wc -l
	35

	[torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
	1

and that _one_ hit you get for the "freezeable" case is not actually a 
user, it's the definition!

Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

Yet we have all this support for freezing them (or rather, we freeze them 
by default, and then we have all this support for _not_ doing that wrong 
default thing!)

So yes, I think it would be interesting to just stop freezing kernel 
threads. Totally.

		Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:03                     ` Kyle Moffett
@ 2007-04-28  1:15                       ` Rafael J. Wysocki
  2007-04-28  0:51                         ` David Lang
  2007-04-28  1:25                         ` Kyle Moffett
  2007-05-03 15:10                       ` Pavel Machek
  1 sibling, 2 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  1:15 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: nigel, Linus Torvalds, Pekka J Enberg, LKML

On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:
> On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
> > Hi.
> >
> > On Fri, 2007-04-27 at 14:44 -0700, Linus Torvalds wrote:
> >> It makes it harder to debug (wouldn't it be *nice* to just ssh in,  
> >> and do
> >> 	gdb -p <snapshotter>
> >
> > Make the machine being suspended a VM and you can already do that.
> 
> >> when something goes wrong?) but we also *depend* on user space for  
> >> various things (the same way we depend on kernel threads, and why  
> >> it has been such a total disaster to try to freeze the kernel  
> >> threads too!). For example, if you want to do graphical stuff,  
> >> just using X would be quite nice,  wouldn't it?
> >
> > But in doing so you make the contents of the disk inconsistent with  
> > the state you've just snapshotted, leading to filesystem  
> > corruption. Even if you modify filesystems to do checkpointing  
> > (which is what we're really talking about), you still also have the  
> > problem that your snapshot has to be stored somewhere before you  
> > write it to disk, so you also have to either [snip]
> 
> Actually, it's a lot simpler than that.  We can just combine the  
> device-mapper snapshot with a VM+kernel snapshot system call and be  
> almost done:
> 
>    sys_snapshot(dev_t snapblockdev, int __user *snapshotfd);
> 
> When sys_snapshot is run, the kernel does:
> 
> 1)  Sequentially freeze mounted filesystems using blockdev freezing.   
> If it's an fs that doesn't support freezing then either fail or force- 
> remount-ro that fs and downgrade all its filedescriptors to RO.   
> Doesn't need extra locking since processes which try to do IO either  
> succeed before the freeze call returns for that blockdev or sleep on  
> the unfreeze of that blockdev.  Filesystems are synchronized and made  
> clean.
> 2)  Iterate over the userspace process list, freezing each process  
> and remapping all of its pages copy-on-write.  Any device-specific  
> pages need to have state saved by that device.

Why do you want to do 2) after 1) and not vice versa?

> 3)  All processes (except kernel threads) are now frozen.
> 4)  Kernel should save internal state corresponding to current  
> userspace state.  The kernel also swaps out excess pages to free up  
> enough RAM and prepares the snapshot file-descriptor with copies of  
> kernel memory and the original (pre-COW) mapped userspace pages.
> 5)  Kernel substitutes filesystems for either a device-mapper  
> snapshot with snapblockdev as backing storage or union with tmpfs and  
> remounts the underlying filesystems as read-only.
> 6)  Kernel unfreezes all userspace processes and returns the snapshot  
> FD to userspace (where it can be read from).

Okay, but how do we do the error recovery if, for example, the image cannot
be saved?

> Then userspace can do whatever it wants.  Any changes to filesystems  
> mounted at the time of snapshot will be discarded at shutdown.   
> Freshly mounted filesystems won't have the union or COW thing done,  
> and so you can write your snapshot to a compressed encrypted file on  
> a USB key if you want to, you just have to unmount it before the  
> snapshot() syscall and remount it right afterwards.

This seems to be a good idea.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:15                       ` Rafael J. Wysocki
  2007-04-28  0:51                         ` David Lang
@ 2007-04-28  1:25                         ` Kyle Moffett
  1 sibling, 0 replies; 136+ messages in thread
From: Kyle Moffett @ 2007-04-28  1:25 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: nigel, Linus Torvalds, Pekka J Enberg, LKML

On Apr 27, 2007, at 21:15:28, Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 03:03, Kyle Moffett wrote:
>> On Apr 27, 2007, at 18:07:46, Nigel Cunningham wrote:
>>> But in doing so you make the contents of the disk inconsistent  
>>> with the state you've just snapshotted, leading to filesystem  
>>> corruption. Even if you modify filesystems to do checkpointing  
>>> (which is what we're really talking about), you still also have  
>>> the problem that your snapshot has to be stored somewhere before  
>>> you write it to disk, so you also have to either [snip]
>>
>> When sys_snapshot is run, the kernel does:
>>
>> 1)  Sequentially freeze mounted filesystems using blockdev  
>> freezing.  If it's an fs that doesn't support freezing then either  
>> fail or force-remount-ro that fs and downgrade all its  
>> filedescriptors to RO. Doesn't need extra locking since processes  
>> which try to do IO either succeed before the freeze call returns  
>> for that blockdev or sleep on the unfreeze of that blockdev.   
>> Filesystems are synchronized and made clean.
>> 2)  Iterate over the userspace process list, freezing each process  
>> and remapping all of its pages copy-on-write.  Any device-specific  
>> pages need to have state saved by that device.
>
> Why do you want to do 2) after 1) and not vice versa?

(1) can be done without extra locking.  Device-mapper already has  
code to freeze filesystems and that makes a natural process-stopping  
point.  Any threads doing IO will very quickly put themselves to  
sleep at (1) and save us some effort during step 2.

>> 6)  Kernel unfreezes all userspace processes and returns the  
>> snapshot FD to userspace (where it can be read from).
>
> Okay, but how do we do the error recovery if, for example, the  
> image cannot be saved?

If the image can't be saved then there are 2 options:
   (1)  Call sys_restore() with the image
   (2)  Pass your snapshot file-descriptor to sys_unsnapshot()

In the former case, the system will be restored to the state it was  
at a few seconds earlier, right as it took the snapshot.  In the  
latter case the modified-in-memory snapshot pages will be synced back  
to the disk filesystems, the copy-on-write data-structures torn down  
(think of merging an LVM snapshot back into its base device), and the  
memory allocated for the snapshot will be freed.  Either way the  
system is properly in sync with disk again, the only difference is  
whether you want to preserve the userspace state from during the  
attempted snapshot (IE: any error status).  You could also save the  
error state in case (1) by just auto-posting a bug-report on  
http://bugs.$VENDOR.com/ of course :-D.
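
As a rough illustration of the two recovery paths, here is a toy
Python model. It is not kernel code: the class and method names are
hypothetical, the "disk" is just a dict, and the overlay stands in for
the copy-on-write layer described above.

```python
class CowSnapshot:
    """Toy model: 'disk' is the state at snapshot time, 'overlay' holds later writes."""

    def __init__(self, disk):
        self.disk = dict(disk)   # on-disk state as of the snapshot
        self.overlay = {}        # writes made after the snapshot (COW layer)

    def write(self, key, value):
        # Post-snapshot writes never touch the base; they land in the overlay.
        self.overlay[key] = value

    def read(self, key):
        # Reads see the overlay first, then fall back to the base.
        return self.overlay.get(key, self.disk.get(key))

    def restore(self):
        """Option (1): go back to the snapshot; post-snapshot writes are lost."""
        self.overlay.clear()
        return dict(self.disk)

    def unsnapshot(self):
        """Option (2): merge post-snapshot writes into the base, drop the overlay."""
        self.disk.update(self.overlay)
        self.overlay.clear()
        return dict(self.disk)
```

In this sketch restore() discards whatever userspace wrote while trying
to save the image, whereas unsnapshot() keeps it, which is the only
difference between the two options.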

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:12                                 ` Linus Torvalds
  2007-04-28  0:54                                   ` David Lang
@ 2007-04-28  1:44                                   ` Rafael J. Wysocki
  2007-04-28  2:51                                     ` Daniel Hazelton
  2007-04-28  8:50                                     ` Back to the future Pavel Machek
  1 sibling, 2 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  1:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov, Pavel Machek

On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > > It's doubly bad, because that idiocy has also infected s2ram. Again, 
> > > another thing that really makes no sense at all - and we do it not just 
> > > for snapshotting, but for s2ram too. Can you tell me *why*?
> > 
> > Why we freeze tasks at all or why we freeze kernel threads?
> 
> In many ways, "at all".
> 
> I _do_ realize the IO request queue issues, and that we cannot actually do 
> s2ram with some devices in the middle of a DMA. So we want to be able to 
> avoid *that*, there's no question about that. And I suspect that stopping 
> user threads and then waiting for a sync is practically one of the easier 
> ways to do so.
> 
> So in practice, the "at all" may become a "why freeze kernel threads?" and 
> freezing user threads I don't find really objectionable.
> 
> But as Paul pointed out, Linux on the old powerpc Mac hardware was 
> actually rather famous for having working (and reliable) suspend long 
> before it worked even remotely reliably on PC's. And they didn't do even
> that.
> 
> (They didn't have ACPI, and they had a much more limited set of devices, 
> but the whole process freezer is really about neither of those issues. The 
> wild and wacky PC hardware has its problems, but that's _one_ thing we 
> can't blame PC hardware for ;)

We freeze user space processes for the reasons that you have quoted above.

Why we freeze kernel threads in there too is a good question, but not for me to
answer.  I don't know.  Pavel should know, I think.

> > > 	git grep create_freezeable_workthread
> > 
> > s/workthread/workqueue/
> 
> Yes.
> 
> > > and ponder the end results of that grep. If you don't see something wrong, 
> > > you're blind.
> > 
> > This was a mistake, quite unrelated to the point you're making.
> 
> Did you actually _do_ the "grep" (with the fixed argument)?
> 
> I had two totally independent points. #1 was that you yourself have been 
> fixing bugs in this area. #2 was the result of that grep. It's absolutely 
> _empty_ except for the define to add that interface.
> 
> NOBODY USES IT!

The reason is pretty simple.

We wanted to drop that interface altogether, because it was broken (my fault),
but Oleg suggested that we keep it so that we could fix and use it in the
future (for purposes other than the hibernation, though).

> Now, grep for the same interface that creates _non_freezeable workqueues.
> 
> Put another way:
> 
> 	[torvalds@woody linux]$ git grep create_workqueue | wc -l
> 	35
> 
> 	[torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
> 	1
> 
> and that _one_ hit you get for the "freezeable" case is not actually a 
> user, it's the definition!
> 
> Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

That's freezable workqueues only. :-)

> Yet we have all this support for freezing them (or rather, we freeze them 
> by default, and then we have all this support for _not_ doing that wrong 
> default thing!)
> 
> So yes, I think it would be interesting to just stop freezing kernel 
> threads. Totally.

Okay, I'll do that.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:44                                   ` Rafael J. Wysocki
@ 2007-04-28  2:51                                     ` Daniel Hazelton
  2007-04-28  7:00                                       ` progress meter in s2disk (was Re: Back to the future.) Pavel Machek
  2007-04-28  8:50                                     ` Back to the future Pavel Machek
  1 sibling, 1 reply; 136+ messages in thread
From: Daniel Hazelton @ 2007-04-28  2:51 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML,
	Oleg Nesterov, Pavel Machek

On Friday 27 April 2007 21:44:48 Rafael J. Wysocki wrote:
> On Saturday, 28 April 2007 03:12, Linus Torvalds wrote:
> > On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > > > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > > > another thing that really makes no sense at all - and we do it not
> > > > just for snapshotting, but for s2ram too. Can you tell me *why*?
> > >
> > > Why we freeze tasks at all or why we freeze kernel threads?
> >
> > In many ways, "at all".
> >
> > I _do_ realize the IO request queue issues, and that we cannot actually
> > do s2ram with some devices in the middle of a DMA. So we want to be able
> > to avoid *that*, there's no question about that. And I suspect that
> > stopping user threads and then waiting for a sync is practically one of
> > the easier ways to do so.
> >
<snip>

Apparently I *CANNOT* wrap my head around this - if only because my laptop, 
running a vendor 2.6.17 kernel does s2ram perfectly, at least, it does when 
using the "Upstart" init system rather than the classical SysV init system. I 
have tried it with the classical init and the suspend isn't triggered by the 
buttons that used to do it. I didn't try 'echo ram > /sys/power/state', but I 
have a feeling that would have worked as well. I have problems with s2disk, 
but that's because I keep my swap partition small - I try to keep it at or 
around 256M when I have more than half a gig of RAM in a system. Perhaps one 
of these days I'll grab a multi-gig flash disk, set it up as a swap partition 
and try it again. (every time I've tried s2disk I wind up running out of disk 
space - and this is with nothing but X running. Any kind of progress meter 
for when the system is doing s2disk would be nice - every time I've tried it 
all I see for the nearly 2 minutes before the s2disk attempt ends is a black 
screen. I say 2 minutes because that's how long it takes for it to learn that 
there isn't enough space on the swap-partition to save the image)

DRH

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:50                               ` David Lang
  2007-04-28  0:40                                 ` Linus Torvalds
@ 2007-04-28  6:58                                 ` Oliver Neukum
  2007-04-28  9:16                                   ` Pekka J Enberg
  2007-04-28 18:28                                   ` David Lang
  2007-05-03 17:18                                 ` Pavel Machek
  2 siblings, 2 replies; 136+ messages in thread
From: Oliver Neukum @ 2007-04-28  6:58 UTC (permalink / raw)
  To: David Lang
  Cc: Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

On Saturday, 28 April 2007 01:50, David Lang wrote:
> 3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
> 4. unpause
> 5. save image (with full userspace available, including network)
> 6. shutdown system (throw away all userspace memory, no need to do graceful
>     shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
>     needed)

And then you'll have people wonder why the server which sent out all
those files has no log entries. You'd have to selectively unfreeze user
space, which is a cure worse than the disease.

Simply throwing away user space work is a bug. And no, you cannot say that
it'll be redone away, as you are throwing away accepted input, too.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 136+ messages in thread

* progress meter in s2disk (was Re: Back to the future.)
  2007-04-28  2:51                                     ` Daniel Hazelton
@ 2007-04-28  7:00                                       ` Pavel Machek
  0 siblings, 0 replies; 136+ messages in thread
From: Pavel Machek @ 2007-04-28  7:00 UTC (permalink / raw)
  To: Daniel Hazelton
  Cc: Rafael J. Wysocki, Linus Torvalds, Nigel Cunningham,
	Pekka J Enberg, LKML, Oleg Nesterov

Hi!

> > > > > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > > > > another thing that really makes no sense at all - and we do it not
> > > > > just for snapshotting, but for s2ram too. Can you tell me *why*?
> > > >
> > > > Why we freeze tasks at all or why we freeze kernel threads?
> > >
> > > In many ways, "at all".
> > >
> > > I _do_ realize the IO request queue issues, and that we cannot actually
> > > do s2ram with some devices in the middle of a DMA. So we want to be able
> > > to avoid *that*, there's no question about that. And I suspect that
> > > stopping user threads and then waiting for a sync is practically one of
> > > the easier ways to do so.
> > >
> <snip>
> 
> Apparently I *CANNOT* wrap my head around this - if only because my laptop, 
> running a vendor 2.6.17 kernel does s2ram perfectly, at least, it does when 
> using the "Upstart" init system rather than the classical SysV init system. I 
> have tried it with the classical init and the suspend isn't triggered by the 
> buttons that used to do it. I didn't try 'echo ram > /sys/power/state', but I 
> have a feeling that would have worked as well. I have problems with s2disk, 
> but that's because I keep my swap partition small - I try to keep it at or 
> around 256M when I have more than half a gig of RAM in a system. Perhaps one 
> of these days I'll grab a multi-gig flash disk, set it up as a swap partition 
> and try it again. (every time I've tried s2disk I wind up running out of disk 
> space - and this is with nothing but X running. Any kind of progress meter 
> for when the system is doing s2disk would be nice - every time I've tried it 
> all I see for the nearly 2 minutes before the s2disk attempt ends is a black 
> screen. I say 2 minutes because that's how long it takes for it to learn that 
> there isn't enough space on the swap-partition to save the image)

Just turn up console loglevel to see the messages.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:44                                   ` Rafael J. Wysocki
  2007-04-28  2:51                                     ` Daniel Hazelton
@ 2007-04-28  8:50                                     ` Pavel Machek
  2007-04-28  9:24                                       ` Rafael J. Wysocki
                                                         ` (2 more replies)
  1 sibling, 3 replies; 136+ messages in thread
From: Pavel Machek @ 2007-04-28  8:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

Hi!

> > In many ways, "at all".
> > 
> > I _do_ realize the IO request queue issues, and that we cannot actually do 
> > s2ram with some devices in the middle of a DMA. So we want to be able to 
> > avoid *that*, there's no question about that. And I suspect that stopping 
> > user threads and then waiting for a sync is practically one of the easier 
> > ways to do so.
> > 
> > So in practice, the "at all" may become a "why freeze kernel threads?" and 
> > freezing user threads I don't find really objectionable.
> > 
> > But as Paul pointed out, Linux on the old powerpc Mac hardware was 
> > actually rather famous for having working (and reliable) suspend long 
> > before it worked even remotely reliably on PC's. And they didn't do even
> > that.
> > 
> > (They didn't have ACPI, and they had a much more limited set of devices, 
> > but the whole process freezer is really about neither of those issues. The 
> > wild and wacky PC hardware has its problems, but that's _one_ thing we 
> > can't blame PC hardware for ;)
> 
> We freeze user space processes for the reasons that you have quoted above.
> 
> Why we freeze kernel threads in there too is a good question, but not for me to
> answer.  I don't know.  Pavel should know, I think.

We do not want kernel threads running:

a) they may hold some locks and deadlock suspend

b) they may do some writes to disk, leading to corruption

We could solve a) by carefully auditing suspend lock usage to make
sure deadlocks are impossible even with kernel threads running.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  6:58                                 ` Oliver Neukum
@ 2007-04-28  9:16                                   ` Pekka J Enberg
  2007-04-28 18:28                                   ` David Lang
  1 sibling, 0 replies; 136+ messages in thread
From: Pekka J Enberg @ 2007-04-28  9:16 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: David Lang, Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds, LKML

On Sat, 28 Apr 2007, Oliver Neukum wrote:
> And then you'll have people wonder why the server which sent out all
> those files has no log entries. You'd have to selectively unfreeze user
> space, which is a cure worse than the disease.
> 
> Simply throwing away user space work is a bug. And no, you cannot say that
> it'll be redone away, as you are throwing away accepted input, too.

It's not a bug, it's a feature =). While I totally agree with you that for 
the common case, you probably do want to avoid work in the userspace after 
taking the snapshot, it is something that should be solved separately. 
There is absolutely nothing wrong with taking a snapshot, doing some work, 
and then resuming to the snapshot and thus "losing" some of the work (this 
is useful for debugging, for example).

				Pekka

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 19:07                   ` Oliver Neukum
@ 2007-04-28  9:22                     ` Pekka Enberg
  2007-04-28 13:37                       ` Oliver Neukum
  0 siblings, 1 reply; 136+ messages in thread
From: Pekka Enberg @ 2007-04-28  9:22 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Nigel Cunningham, Linus Torvalds, LKML

Hi Oliver,

On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> > The problem with writing in the kernel is obvious: we need to add new code
> > to the kernel for compression, encryption, and userspace interaction
> > (graphical progress bar) that are important for user experience.

On 4/27/07, Oliver Neukum <oliver@neukum.org> wrote:
> The kernel can already do compression and encryption.

Yes, if we all could agree on _which_ compression and encryption
algorithm(s) we want to use. It goes beyond that too, where do you
want to save the image? In the swap device or a regular file? And
don't forget about debuggability either. It's faster to do a
snapshot/resume without shutdown/restart in the middle or just do a
snapshot, and examine its contents.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  8:50                                     ` Back to the future Pavel Machek
@ 2007-04-28  9:24                                       ` Rafael J. Wysocki
  2007-04-28 16:28                                       ` Linus Torvalds
  2007-04-28 18:32                                       ` David Lang
  2 siblings, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28  9:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Saturday, 28 April 2007 10:50, Pavel Machek wrote:
> Hi!
> 
> > > In many ways, "at all".
> > > 
> > > I _do_ realize the IO request queue issues, and that we cannot actually do 
> > > s2ram with some devices in the middle of a DMA. So we want to be able to 
> > > avoid *that*, there's no question about that. And I suspect that stopping 
> > > user threads and then waiting for a sync is practically one of the easier 
> > > ways to do so.
> > > 
> > > So in practice, the "at all" may become a "why freeze kernel threads?" and 
> > > freezing user threads I don't find really objectionable.
> > > 
> > > But as Paul pointed out, Linux on the old powerpc Mac hardware was 
> > > actually rather famous for having working (and reliable) suspend long 
> > > before it worked even remotely reliably on PC's. And they didn't do even
> > > that.
> > > 
> > > (They didn't have ACPI, and they had a much more limited set of devices, 
> > > but the whole process freezer is really about neither of those issues. The 
> > > wild and wacky PC hardware has its problems, but that's _one_ thing we 
> > > can't blame PC hardware for ;)
> > 
> > We freeze user space processes for the reasons that you have quoted above.
> > 
> > Why we freeze kernel threads in there too is a good question, but not for me to
> > answer.  I don't know.  Pavel should know, I think.
> 
> We do not want kernel threads running:
> 
> a) they may hold some locks and deadlock suspend

Yeah, the same issue as with the hibernation and I do think it's _real_.

> b) they may do some writes to disk, leading to corruption

Hmm, is that an issue in the suspend (aka s2ram) case?

> We could solve a) by carefully auditing suspend lock usage to make
> sure deadlocks are impossible even with kernel threads running.

Yes, we can, but for now it's not been done yet.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 10:12                 ` Pekka J Enberg
  2007-04-27 19:07                   ` Oliver Neukum
@ 2007-04-28 10:35                   ` Rafael J. Wysocki
  2007-04-28 18:43                     ` David Lang
  1 sibling, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28 10:35 UTC (permalink / raw)
  To: Pekka J Enberg; +Cc: Oliver Neukum, Nigel Cunningham, Linus Torvalds, LKML

On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> On Friday, 27 April 2007 08:18, Pekka J Enberg wrote:
> > > No. The snapshot is just that. A snapshot in time. From kernel point of 
> > > view, it doesn't matter one bit what when you did it or if the state has 
> > > changed before you resume. It's up to userspace to make sure the user 
> > > doesn't do real work while the snapshot is being written to disk and 
> > > machine is shut down.
> 
> On Fri, 27 Apr 2007, Oliver Neukum wrote:
> > And where is the benefit in that? How is such user space freezing logic
> > simpler than having the kernel do the write?
> >
> > What can you do in user space if all filesystems are r/o that is worth the
> > hassle?
> 
> I am talking about snapshot_system() here. It's not given that the 
> filesystems need to be read-only (you can snapshot them too). The benefit 
> here is that you can do whatever you want with the snapshot (encrypt, 
> compress, send over the network)  and have a clean well-defined interface 
> in the kernel. In addition, aborting the snapshot is simpler, simply 
> munmap() the snapshot.

Well, swsusp currently does almost the same, except that you can read the
image from the kernel as a stream of bytes, using read() and, during the
restore phase, upload the same image using write().  The advantage of this
is that the interface is symmetrical from the user space's point of view.
[You're cancelling the hibernation by closing /dev/snapshot, which also is
quite natural.]

If you look at the interface in user.c, there are only two ioctls really needed
for that in there, SNAPSHOT_ATOMIC_SNAPSHOT and
SNAPSHOT_ATOMIC_RESTORE.  Two more are handy for freezing
tasks, SNAPSHOT_FREEZE and SNAPSHOT_UNFREEZE.  The others were added
later, to make the user space part simpler or capable of doing some fancy
stuff, which I am ready to admit was a mistake.
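
To illustrate the symmetry, here is a minimal Python sketch of a
userland helper that treats the image as an opaque byte stream. It is
only a model: in-memory buffers stand in for /dev/snapshot, no ioctls
are issued, and the function names, chunk size and the choice of zlib
are assumptions made for the example.

```python
import io
import zlib

CHUNK = 4096  # arbitrary read size for this sketch


def save_image(src, dst):
    """Hibernation side: read() the image stream, compress it, store it."""
    comp = zlib.compressobj()
    while chunk := src.read(CHUNK):
        dst.write(comp.compress(chunk))
    dst.write(comp.flush())


def load_image(src, dst):
    """Restore side: read the stored file, decompress it, write() it back."""
    decomp = zlib.decompressobj()
    while chunk := src.read(CHUNK):
        dst.write(decomp.decompress(chunk))
    dst.write(decomp.flush())


# Stand-ins for the snapshot device and the storage file.
image = bytes(range(256)) * 64
stored = io.BytesIO()
save_image(io.BytesIO(image), stored)

restored = io.BytesIO()
load_image(io.BytesIO(stored.getvalue()), restored)
```

Because both directions are plain read()/write() loops, anything the
userland part does to the stream on the way out (compression here) only
has to be undone on the way back in.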

> The problem with writing in the kernel is obvious: we need to add new code 
> to the kernel for compression, encryption, and userspace interaction 
> (graphical progress bar) that are important for user experience.

Yes, and that's why we wanted to introduce the userland part.  The problem
with this approach, as it's turned out, is that the userland part must be a
very specialized piece of software, really careful of what it's doing, mainly
because of the inability to checkpoint filesystems.  If we could checkpoint
filesystems and were able to unfreeze the user space after creating the
snapshot without the risk of corrupting filesystems in the restore phase,
the userland part could be much simpler (even as simple as Linus suggested).

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  9:22                     ` Pekka Enberg
@ 2007-04-28 13:37                       ` Oliver Neukum
  2007-05-03 12:06                         ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: Oliver Neukum @ 2007-04-28 13:37 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Nigel Cunningham, Linus Torvalds, LKML

On Saturday, 28 April 2007 11:22, Pekka Enberg wrote:
> Hi Oliver,
> 
> On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> > > The problem with writing in the kernel is obvious: we need to add new code
> > > to the kernel for compression, encryption, and userspace interaction
> > > (graphical progress bar) that are important for user experience.
> 
> On 4/27/07, Oliver Neukum <oliver@neukum.org> wrote:
> > The kernel can already do compression and encryption.
> 
> Yes, if we all could agree on _which_ compression and encryption

Any of those available in the kernel. Where's the problem?

> algorithm(s) we want to use. It goes beyond that too, where do you
> want to save the image? In the swap device or a regular file? And

A swap device is doubtlessly easier. But isn't the problem of using
a swap file already fixed? The writeout seems the easiest part of
hibernation.

> don't forget about debuggability either. It's faster to do a
> snapshot/resume without shutdown/restart in the middle or just do a
> snapshot, and examine its contents.

Then use a "fake reboot" option and save the image to a ramdisk.
It isn't that hard. You must be able to survive that, as io errors during
write out are possible.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  8:50                                     ` Back to the future Pavel Machek
  2007-04-28  9:24                                       ` Rafael J. Wysocki
@ 2007-04-28 16:28                                       ` Linus Torvalds
  2007-04-28 17:50                                         ` Rafael J. Wysocki
  2007-04-28 18:32                                       ` David Lang
  2 siblings, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28 16:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov



On Sat, 28 Apr 2007, Pavel Machek wrote:
> 
> We do not want kernel threads running:
> 
> a) they may hold some locks and deadlock suspend
> 
> b) they may do some writes to disk, leading to corruption

You're really just making both of those up.

If a kernel thread holds a lock and deadlocks suspend, that would deadlock 
anything else _too_. Suspend isn't *that* special. Everything it does are 
things other people do too.

And no, kernel threads do not write to disk on their own. Name one. They 
help *others* write to disk, but those disk writes need to happen.

The freezer has *caused* those deadlocks (eg by stopping threads that were 
needed for the suspend writeouts to succeed!), not solved them.
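
A deliberately tiny Python model of that failure mode, purely as an
illustration: the function name, the tick loop and the single "helper"
are hypothetical and do not correspond to any kernel interface.

```python
def run_writeout(pending_writes, helper_frozen, max_ticks=10):
    """Return True if every pending write completes within max_ticks.

    Each tick, the helper thread (if it is not frozen) completes one write.
    A frozen helper makes no progress, so the writeout never finishes.
    """
    for _ in range(max_ticks):
        if pending_writes == 0:
            return True
        if not helper_frozen:
            pending_writes -= 1
    return pending_writes == 0
```

With the helper running, three pending writes drain in three ticks;
with the helper frozen, the same call gives up with the writes still
pending.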

So stop making these totally bogus arguments up.

			Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28 16:28                                       ` Linus Torvalds
@ 2007-04-28 17:50                                         ` Rafael J. Wysocki
  2007-04-28 21:25                                           ` Linus Torvalds
  0 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Saturday, 28 April 2007 18:28, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Pavel Machek wrote:
> > 
> > We do not want kernel threads running:
> > 
> > a) they may hold some locks and deadlock suspend
> > 
> > b) they may do some writes to disk, leading to corruption
> 
> You're really just making both of those up.
> 
> If a kernel thread holds a lock and deadlocks suspend, that would deadlock 
> anything else _too_. Suspend isn't *that* special. Everything it does are 
> things other people do too.
> 
> And no, kernel threads do not write to disk on their own. Name one.

xfssyncd, or at least it seems so at a quick look.

> They help *others* write to disk, but those disk writes need to happen.
> 
> The freezer has *caused* those deadlocks (eg by stopping threads that were 
> needed for the suspend writeouts to succeed!), not solved them.

I can't remember anything like this, but I believe you have a specific test
case in mind.

> So stop making these totally bogus arguments up.

Well, they may be bogus, but there's something else.

I have reviewed some kernel threads used by device drivers that currently are
frozen to see if it would be safe not to freeze them, and I'm worried.

What, for example, if such a thread schedules a timeout and waits for
something to happen (eg. the airo driver does something like this), but instead
the hibernation/suspend happens and the device is frozen/suspended under it?

Shouldn't the thread be notified by the driver's freeze/suspend callback?

Moreover, what if after the restore the device is not present (for example, it
may be a pcmcia card that the user has removed) and the thread is scheduled
before the device's unfreeze callback has a chance to run?  Shouldn't the
thread check that the device is present?  In that case it would have to be
notified by someone that the check is necessary, but who can do that?

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  6:58                                 ` Oliver Neukum
  2007-04-28  9:16                                   ` Pekka J Enberg
@ 2007-04-28 18:28                                   ` David Lang
  1 sibling, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-28 18:28 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

On Sat, 28 Apr 2007, Oliver Neukum wrote:

> On Saturday, 28 April 2007 01:50, David Lang wrote:
>> 3. make mounted filesystems read-only (possibly with snapshot/checkpoint)
>> 4. unpause
>> 5. save image (with full userspace available, including network)
>> 6. shutdown system (throw away all userspace memory, no need to do graceful
>>     shutdown or nice kill signals, revert filesystem to snapshot/checkpoint if
>>     needed)
>
> And then you'll have people wonder why the server which sent out all
> those files has no log entries. You'd have to selectively unfreeze user
> space, which is a cure worse than the disease.
>
> Simply throwing away user space work is a bug. And no, you cannot say that
> it'll be redone away, as you are throwing away accepted input, too.

when you are doing a suspend-to-disk I disagree with you. whoever is doing the 
suspend knows what is going on, and they can decide what needs to be done.

the only case where you have 'unexpected' work being thrown away is if you are 
suspending a network server, and the process of suspending it is going to cut 
all the network connections anyway so it's not a seamless process. In this case 
it's fair to let the sysadmin choose between losing some logs or doing some 
other step to prevent this from happening (which could be to shut down the 
network service, or load an iptables rule to block the service)

however, most of the uses of suspend-to-disk are going to be single-user 
machines and in that case telling the user that anything that they do after 
issuing the suspend is going to be lost is a perfectly sane thing to do.

and for that matter, if the snapshot is cheap enough, some people may choose to 
cron the snapshot portion of a suspend-to-disk every few min as a safety net 
for something going wrong. In this case they really do want all of userspace to 
keep working after the snapshot.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  8:50                                     ` Back to the future Pavel Machek
  2007-04-28  9:24                                       ` Rafael J. Wysocki
  2007-04-28 16:28                                       ` Linus Torvalds
@ 2007-04-28 18:32                                       ` David Lang
  2007-04-28 19:14                                         ` Rafael J. Wysocki
  2 siblings, 1 reply; 136+ messages in thread
From: David Lang @ 2007-04-28 18:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Rafael J. Wysocki, Linus Torvalds, Nigel Cunningham,
	Pekka J Enberg, LKML, Oleg Nesterov

On Sat, 28 Apr 2007, Pavel Machek wrote:

>>
>> We freeze user space processes for the reasons that you have quoted above.
>>
>> Why we freeze kernel threads in there too is a good question, but not for me to
>> answer.  I don't know.  Pavel should know, I think.
>
> We do not want kernel threads running:
>
> a) they may hold some locks and deadlock suspend
>
> b) they may do some writes to disk, leading to corruption
>
> We could solve a) by carefully auditing suspend lock usage to make
> sure deadlocks are impossible even with kernel threads running.

remember that we are doing suspend-to-disk, after we do the snapshot we will be 
doing a shutdown. that should simplify the locking issues

David Lang

* Re: Back to the future.
  2007-04-28 10:35                   ` Rafael J. Wysocki
@ 2007-04-28 18:43                     ` David Lang
  2007-04-28 19:37                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 136+ messages in thread
From: David Lang @ 2007-04-28 18:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pekka J Enberg, Oliver Neukum, Nigel Cunningham, Linus Torvalds, LKML

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
>> The problem with writing in the kernel is obvious: we need to add new code
>> to the kernel for compression, encryption, and userspace interaction
>> (graphical progress bar) that are important for user experience.
>
> Yes, and that's why we wanted to introduce the userland part.  The problem
> with this approach, as it's turned out, is that the userland part must be a
> very specialized piece of software, really careful of what it's doing, mainly
> because of the inability to checkpoint filesystems.  If we could checkpoint
> filesystems and were able to unfreeze the user space after creating the
> snapshot without the risk of corrupting filesystems in the restore phase,
> the userland part could be much simpler (even as simple as Linus suggested).

this sounds like a really good argument for having a useable userspace running. 
we already have the LVM snapshot code in the kernel, so we have the pieces 
available to protect the filesystems, we just need to figure out how to put them 
together. (the simplest way would be to make a new suspend package that 
required the user to use LVM so that snapshots are available, but this is also 
the most disruptive approach)

David Lang

* Re: Back to the future.
  2007-04-28 19:14                                         ` Rafael J. Wysocki
@ 2007-04-28 18:44                                           ` David Lang
  0 siblings, 0 replies; 136+ messages in thread
From: David Lang @ 2007-04-28 18:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Pekka J Enberg,
	LKML, Oleg Nesterov

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:

> On Saturday, 28 April 2007 20:32, David Lang wrote:
>> On Sat, 28 Apr 2007, Pavel Machek wrote:
>>
>>>>
>>>> We freeze user space processes for the reasons that you have quoted above.
>>>>
>>>> Why we freeze kernel threads in there too is a good question, but not for me to
>>>> answer.  I don't know.  Pavel should know, I think.
>>>
>>> We do not want kernel threads running:
>>>
>>> a) they may hold some locks and deadlock suspend
>>>
>>> b) they may do some writes to disk, leading to corruption
>>>
>>> We could solve a) by carefully auditing suspend lock usage to make
>>> sure deadlocks are impossible even with kernel threads running.
>>
>> remember that we are doing suspend-to-disk, after we do the snapshot we will be
>> doing a shutdown. that should simplify the locking issues
>
> That's assuming that we won't need to cancel the hibernation.

true, but if we cancel the hibernation then why are the locks an issue? they are 
appropriate for the system state.

David Lang

* Re: Back to the future.
  2007-04-26 19:56       ` Nigel Cunningham
  2007-04-27  4:52         ` Pekka J Enberg
@ 2007-04-28 19:09         ` Bill Davidsen
  1 sibling, 0 replies; 136+ messages in thread
From: Bill Davidsen @ 2007-04-28 19:09 UTC (permalink / raw)
  To: nigel; +Cc: Linus Torvalds, Pekka Enberg, LKML

Nigel Cunningham wrote:

> Please, go apply that logic elsewhere, then cut out (or at least stop
> adding) support for users with less common needs in other areas. I fully
> acknowledge that most users have only one place to store their image and
> it's a swap device. But that doesn't mean one size fits all.
> 
I think to some extent that's part of the problem. Consider for a moment 
that a /dev/hibernate would be required, and that it must be (a) a disk, 
or (b) a partition, or (c) other devices in the future, like an nbd, USB 
flash or DVD.

Don't have a device like that, then can't hibernate. Stop trying to be 
smart and use swap for two different things. Stop trying to have an 
interface between user space and kernel which does things not required 
to preserve the system. A progress indicator is not needed, power off is 
my progress indicator, and should be the sole valid end of a hibernate.

> A full image implies that you need to figure out what's not going to
> change while you're writing it and save that separately. At the moment,
> I'm treating most of the LRU contents as that list. If we're going to
> start trying to let every man and his dog run while we're trying to
> snapshot the system, that's not going to work anymore - or the logic
> will get a lot more complicated.
> 
> Sorry. I never thought I'd say this, but I think you're being naive
> about how simple the process of snapshotting a system is.

Hibernate is useful to avoid complex boot, it's useful as the UPS gets 
tired, and putting features in the process beyond saving the snap 
(possibly compressed and/or encrypted) just adds complexity. Put it all 
in the kernel and use /sys/power/state as the user interface. Stop 
oversolving the problem.

No, that doesn't avoid other hard issues, but for the most part suspend2 
has addressed them.


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

* Re: Back to the future.
  2007-04-28 18:32                                       ` David Lang
@ 2007-04-28 19:14                                         ` Rafael J. Wysocki
  2007-04-28 18:44                                           ` David Lang
  0 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28 19:14 UTC (permalink / raw)
  To: David Lang
  Cc: Pavel Machek, Linus Torvalds, Nigel Cunningham, Pekka J Enberg,
	LKML, Oleg Nesterov

On Saturday, 28 April 2007 20:32, David Lang wrote:
> On Sat, 28 Apr 2007, Pavel Machek wrote:
> 
> >>
> >> We freeze user space processes for the reasons that you have quoted above.
> >>
> >> Why we freeze kernel threads in there too is a good question, but not for me to
> >> answer.  I don't know.  Pavel should know, I think.
> >
> > We do not want kernel threads running:
> >
> > a) they may hold some locks and deadlock suspend
> >
> > b) they may do some writes to disk, leading to corruption
> >
> > We could solve a) by carefully auditing suspend lock usage to make
> > sure deadlocks are impossible even with kernel threads running.
> 
> remember that we are doing suspend-to-disk, after we do the snapshot we will be 
> doing a shutdown. that should simplify the locking issues

That's assuming that we won't need to cancel the hibernation.

Greetings,
Rafael

* Re: Back to the future.
  2007-04-28 18:43                     ` David Lang
@ 2007-04-28 19:37                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28 19:37 UTC (permalink / raw)
  To: David Lang
  Cc: Pekka J Enberg, Oliver Neukum, Nigel Cunningham, Linus Torvalds, LKML

On Saturday, 28 April 2007 20:43, David Lang wrote:
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > On Friday, 27 April 2007 12:12, Pekka J Enberg wrote:
> >> The problem with writing in the kernel is obvious: we need to add new code
> >> to the kernel for compression, encryption, and userspace interaction
> >> (graphical progress bar) that are important for user experience.
> >
> > Yes, and that's why we wanted to introduce the userland part.  The problem
> > with this approach, as it's turned out, is that the userland part must be a
> > very specialized piece of software, really careful of what it's doing, mainly
> > because of the inability to checkpoint filesystems.  If we could checkpoint
> > filesystems and were able to unfreeze the user space after creating the
> > snapshot without the risk of corrupting filesystems in the restore phase,
> > the userland part could be much simpler (even as simple as Linus suggested).
> 
> this sounds like a really good argument for having a useable userspace running. 
> we already have the LVM snapshot code in the kernel, so we have the pieces 
> available to protect the filesystems, we just need to figure out how to put them 
> together. (the simplest way would be to make a new suspend package that 
> required the user to use LVM so that snapshots are available, but this is also 
> the most disruptive approach)

Yes.  I personally know very little about the LVM snapshot code and I wasn't
aware of its capabilities.  If we can make it possible to run the user space
safely after we've created the memory snapshot, I'm all for it.

As far as the package is concerned, we can just add the new user space tools
to the suspend package containing our existing userland part.

Greetings,
Rafael

* Re: Back to the future.
  2007-04-28 17:50                                         ` Rafael J. Wysocki
@ 2007-04-28 21:25                                           ` Linus Torvalds
  2007-04-28 23:03                                             ` Rafael J. Wysocki
  2007-04-29  8:23                                             ` Pavel Machek
  0 siblings, 2 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28 21:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov



On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > The freezer has *caused* those deadlocks (eg by stopping threads that were 
> > needed for the suspend writeouts to succeed!), not solved them.
> 
> I can't remember anything like this, but I believe you have a specific test
> case in mind.

Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?

Rafael, you really don't know what you're talking about, do you?

Just _look_ at them. It's the IO threads etc that shouldn't be frozen, 
exactly *because* they do IO. You claim that kernel threads shouldn't do 
IO, but that's the point: if you cannot do IO when snapshotting to disk, 
here's a damn big clue for you: how do you think that snapshot is going to 
get written?

I *guarantee* you that we've had a lot more problems with threads that 
should *not* have been frozen than with those hypothetical threads that 
you think should have been frozen.

			Linus

* Re: Back to the future.
  2007-04-28 21:25                                           ` Linus Torvalds
@ 2007-04-28 23:03                                             ` Rafael J. Wysocki
  2007-04-28 23:45                                               ` Linus Torvalds
  2007-04-29  8:23                                             ` Pavel Machek
  1 sibling, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-28 23:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Saturday, 28 April 2007 23:25, Linus Torvalds wrote:
> 
> On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> > > 
> > > The freezer has *caused* those deadlocks (eg by stopping threads that were 
> > > needed for the suspend writeouts to succeed!), not solved them.
> > 
> > I can't remember anything like this, but I believe you have a specific test
> > case in mind.
> 
> Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?

Well, I don't know why exactly it had been originally introduced.  Currently,
it is used by the threads that should be running after the snapshot is done
(they are not only I/O threads).

> Rafael, you really don't know what you're talking about, do you?

I think I know.

> Just _look_ at them. It's the IO threads etc that shouldn't be frozen, 
> exactly *because* they do IO. You claim that kernel threads shouldn't do 
> IO, but that's the point: if you cannot do IO when snapshotting to disk,  
> here's a damn big clue for you: how do you think that snapshot is going to 
> get written?

OK, more precisely: fs-related threads should not try to process their queues,
etc., after the snapshot is done, because that may cause some fs data to be
written at that time and then the fs in question may be corrupted after the
restore.  Not all of the I/O in general, fs data.

Still, that alone probably is not a good enough reason for freezing all kernel
threads.

> I *guarantee* you that we've had a lot more problems with threads that 
> should *not* have been frozen than with those hypothetical threads that 
> you think should have been frozen.

Well, I'm not sure whether or not that still would have been the case if we had
stopped freezing kernel threads for the hibernation/suspend.  I just see
potential problems that I've mentioned in the previous message and I don't see
any evidence that they cannot occur.

Greetings,
Rafael

* Re: Back to the future.
  2007-04-28 23:03                                             ` Rafael J. Wysocki
@ 2007-04-28 23:45                                               ` Linus Torvalds
  2007-04-29  0:01                                                 ` Nigel Cunningham
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2007-04-28 23:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Pavel Machek, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov



On Sun, 29 Apr 2007, Rafael J. Wysocki wrote:
> 
> OK, more precisely: fs-related threads should not try to process their queues,
> etc., after the snapshot is done, because that may cause some fs data to be
> written at that time and then the fs in question may be corrupted after the
> restore.  Not all of the I/O in general, fs data.

But that's not true _either_. That's only true because right now I think 
we cannot even suspend to a swapfile (I might be wrong). 

If you have a swapfile on a filesystem, you'd need those fs queues 
running!

> Well, I'm not sure whether or not that still would have been the case if we had
> stopped freezing kernel threads for the hibernation/suspend.

Did you miss the email where Paul pointed out that Mac/PowerPC didn't use 
to do any of this? And apparently never had any issues with it? And 
probably worked more reliably several years ago than suspend/hibernation 
does _today_?

Ie we do have history of _not_ freezing things.  The freezing came later, 
and came with the subsystem that had more problems..

		Linus

* Re: Back to the future.
  2007-04-28 23:45                                               ` Linus Torvalds
@ 2007-04-29  0:01                                                 ` Nigel Cunningham
  2007-04-29  5:01                                                   ` Bojan Smojver
  2007-04-29  3:43                                                 ` Kyle Moffett
  2007-04-29  8:57                                                 ` Rafael J. Wysocki
  2 siblings, 1 reply; 136+ messages in thread
From: Nigel Cunningham @ 2007-04-29  0:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Pavel Machek, Pekka J Enberg, LKML, Oleg Nesterov

Hi.

On Sat, 2007-04-28 at 16:45 -0700, Linus Torvalds wrote:
> 
> On Sun, 29 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > OK, more precisely: fs-related threads should not try to process their queues,
> > etc., after the snapshot is done, because that may cause some fs data to be
> > written at that time and then the fs in question may be corrupted after the
> > restore.  Not all of the I/O in general, fs data.
> 
> But that's not true _either_. That's only true because right now I think 
> we cannot even suspend to a swapfile (I might be wrong). 
> 
> If you have a swapfile on a filesystem, you'd need those fs queues 
> running!

For Suspend2, and I think for swsusp too, we bmap the locations when
allocating the storage, and then submit our own bios. Even if swsusp
isn't using this method, I'm pretty sure the swap code does bmapping at
swapon time to avoid raciness later.

> > Well, I'm not sure whether or not that still would have been the case if we had
> > stopped freezing kernel threads for the hibernation/suspend.
> 
> Did you miss the email where Paul pointed out that Mac/PowerPC didn't use 
> to do any of this? And apparently never had any issues with it? And 
> probably worked more reliably several years ago than suspend/hibernation 
> does _today_?
> 
> Ie we do have history of _not_ freezing things.  The freezing came later, 
> and came with the subsystem that had more problems..

It also came because of problems. Not working perfectly isn't
necessarily a sign of a faulty reason for being added in the first
place.

I should also add, not freezing things is fine if you're happy with
getting half an image at most. If you want a full
just-as-if-I'd-never-turned-the-power-off image, you need freezing so
that you can have some pages which can be saved before others are
atomically copied, to ensure the whole image is consistent.

Nigel

* Re: Back to the future.
  2007-04-28 23:45                                               ` Linus Torvalds
  2007-04-29  0:01                                                 ` Nigel Cunningham
@ 2007-04-29  3:43                                                 ` Kyle Moffett
  2007-04-29  8:57                                                 ` Rafael J. Wysocki
  2 siblings, 0 replies; 136+ messages in thread
From: Kyle Moffett @ 2007-04-29  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Pavel Machek, Nigel Cunningham,
	Pekka J Enberg, LKML, Oleg Nesterov

On Apr 28, 2007, at 19:45:01, Linus Torvalds wrote:
> On Sun, 29 Apr 2007, Rafael J. Wysocki wrote:
>> Well, I'm not sure whether or not that still would have been the  
>> case if we had stopped freezing kernel threads for the  
>> hibernation/suspend.
>
> Did you miss the email where Paul pointed out that Mac/PowerPC  
> didn't use to do any of this? And apparently never had any issues  
> with it? And probably worked more reliably several years ago than  
> suspend/hibernation
> does _today_?

Still works pretty reliably; the last time my PowerBook G4 was  
rebooted was 6 weeks ago.  Once every 60 suspends or so the kernel  
USB driver gets really confused and doesn't wake up the USB  
controller properly, leading to dead keyboard/mouse, but other than  
that I never have problems.  I wouldn't be surprised if I could  
comment out 90% of the "suspend" code and still have it work, the  
hardware in it is incredibly robust.  I can even swap batteries while  
it's in suspend-to-RAM, as long as I do it in less than 45 sec or so;  
I get around 6-7 days of suspend-to-RAM time on a full charge.

Cheers,
Kyle Moffett


* Re: Back to the future.
  2007-04-29  0:01                                                 ` Nigel Cunningham
@ 2007-04-29  5:01                                                   ` Bojan Smojver
  0 siblings, 0 replies; 136+ messages in thread
From: Bojan Smojver @ 2007-04-29  5:01 UTC (permalink / raw)
  To: linux-kernel

Nigel Cunningham <nigel <at> nigel.suspend2.net> writes:

> If you want a full
> just-as-if-I'd-never-turned-the-power-off image,

Which (full images save) makes the system most responsive on resume. Coupled
with compression and async I/O also keeps Suspend2 very, very fast, even with a
slow disk and large amounts of RAM (as tested on one of my crappy old
notebooks). From my (user) point of view, this is a brilliant feature to have.

--
Bojan


* Re: Back to the future.
  2007-04-28 21:25                                           ` Linus Torvalds
  2007-04-28 23:03                                             ` Rafael J. Wysocki
@ 2007-04-29  8:23                                             ` Pavel Machek
  2007-04-29  9:22                                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-29  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

Hi!

> > > The freezer has *caused* those deadlocks (eg by stopping threads that were 
> > > needed for the suspend writeouts to succeed!), not solved them.
> > 
> > I can't remember anything like this, but I believe you have a specific test
> > case in mind.
> 
> Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?
> 
> Rafael, you really don't know what you're talking about, do you?
> 
> Just _look_ at them. It's the IO threads etc that shouldn't be frozen, 
> exactly *because* they do IO. You claim that kernel threads shouldn't do 
> IO, but that's the point: if you cannot do IO when snapshotting to disk, 
> here's a damn big clue for you: how do you think that snapshot is going to 
> get written?
> 
> I *guarantee* you that we've had a lot more problems with threads that 
> should *not* have been frozen than with those hypothetical threads that 
> you think should have been frozen.

Well, we had nasty corruption on XFS, caused by a thread that was not
frozen and should have been. (While the other case leads "only" to deadlocks,
so it is easier to debug.)

The locking point.. when I added freezing to swsusp, I knew very
little about kernel locking, so I "simply" decided to avoid the
problem altogether... using the freezer.

You may be right that locks are not a big problem for the hibernation
after all; I just do not know.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Back to the future.
  2007-04-28 23:45                                               ` Linus Torvalds
  2007-04-29  0:01                                                 ` Nigel Cunningham
  2007-04-29  3:43                                                 ` Kyle Moffett
@ 2007-04-29  8:57                                                 ` Rafael J. Wysocki
  2007-04-29  8:59                                                   ` Pavel Machek
  2 siblings, 1 reply; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-29  8:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pavel Machek, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Sunday, 29 April 2007 01:45, Linus Torvalds wrote:
> 
> On Sun, 29 Apr 2007, Rafael J. Wysocki wrote:
> > 
> > OK, more precisely: fs-related threads should not try to process their queues,
> > etc., after the snapshot is done, because that may cause some fs data to be
> > written at that time and then the fs in question may be corrupted after the
> > restore.  Not all of the I/O in general, fs data.
> 
> But that's not true _either_. That's only true because right now I think 
> we cannot even suspend to a swapfile (I might be wrong). 

You are.
 
> If you have a swapfile on a filesystem, you'd need those fs queues 
> running!

No, I don't.  It's done by bmapping the file and writing directly to the
underlying blockdev.  Otherwise we'd have corrupted filesystems after the
restore.

Swapfiles are handled this way anyway, so we just use the same code.
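
A toy illustration of the bmap idea described above, sketched in Python under
loud assumptions: ToyDisk, ToyFS, map_swapfile() and write_image() are
hypothetical names invented for this sketch and are not kernel interfaces.
The point it tries to show is only that the file-to-block mapping is resolved
once, up front, and the image is then written straight to the block device
without going through the filesystem's write path.

```python
BLOCK = 4096  # assumed block size for the toy device


class ToyDisk:
    """Stand-in for a block device: a flat array of fixed-size blocks."""

    def __init__(self, nblocks):
        self.data = bytearray(nblocks * BLOCK)

    def write_block(self, phys, payload):
        assert len(payload) == BLOCK
        self.data[phys * BLOCK:(phys + 1) * BLOCK] = payload

    def read_block(self, phys):
        return bytes(self.data[phys * BLOCK:(phys + 1) * BLOCK])


class ToyFS:
    """Stand-in for a filesystem: maps (file, logical block) to a physical block."""

    def __init__(self, disk):
        self.disk = disk
        self.extents = {}
        self.fs_writes = 0  # counts writes that went through the FS path

    def create(self, name, phys_blocks):
        self.extents[name] = list(phys_blocks)

    def bmap(self, name, logical):
        return self.extents[name][logical]

    def write(self, name, logical, payload):
        self.fs_writes += 1
        self.disk.write_block(self.bmap(name, logical), payload)


def map_swapfile(fs, name, nblocks):
    """Resolve the whole file to physical block numbers ahead of time."""
    return [fs.bmap(name, i) for i in range(nblocks)]


def write_image(disk, block_map, pages):
    """Write image pages directly to the device, bypassing the filesystem."""
    for phys, page in zip(block_map, pages):
        disk.write_block(phys, page)
```

In this sketch the filesystem is consulted only while building the map; once
write_image() runs, no filesystem code is involved, which is the property the
message above relies on.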

> > Well, I'm not sure whether or not that still would have been the case if we had
> > stopped freezing kernel threads for the hibernation/suspend.
> 
> Did you miss the email where Paul pointed out that Mac/PowerPC didn't use 
> to do any of this?

No, I didn't.

> And apparently never had any issues with it?

On one platform with a limited subset of device drivers.

> And probably worked more reliably several years ago than suspend/hibernation 
> does _today_?

I have no problems with the hibernation on my test boxes (six of them), except
for one network driver that doesn't bother to define a .suspend() callback.

There are problems with the suspend (s2ram), but they are _not_ related to the
freezing of kernel threads.  Some of them are related to the other issue that
you have raised, which is that the same callbacks should not be used for the
suspend and hibernation, and which I think is absolutely valid.  The remaining
ones are related to the fact that graphic card vendors don't care for us at
all.

> Ie we do have history of _not_ freezing things.  The freezing came later, 
> and came with the subsystem that had more problems..

It doesn't have as many problems as you are trying to suggest.  At present,
the only problems with it happen if someone tries to "improve" it in the way
I did with the workqueues.

Anyway, the freezing of tasks, including kernel threads, is one of the few
things that Pavel, Nigel and I completely agree should be done,
so perhaps you could accept that?

Greetings,
Rafael

* Re: Back to the future.
  2007-04-29  8:57                                                 ` Rafael J. Wysocki
@ 2007-04-29  8:59                                                   ` Pavel Machek
  2007-04-29  9:32                                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-04-29  8:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

Hi!

> > Ie we do have history of _not_ freezing things.  The freezing came later, 
> > and came with the subsystem that had more problems..
> 
> It doesn't have as many problems as you are trying to suggest.  At present,
> the only problems with it happen if someone tries to "improve" it in the way
> I did with the workqueues.
> 
> Anyway, the freezing of tasks, including kernel threads, is one of the few
> things that Pavel, Nigel and I completely agree should be done,
> so perhaps you could accept that?

Actually, if we want to support OLPC _nicely_, we'll need to get rid
of freezer from suspend-to-RAM. Of course, that _will_ put more
pressure on the drivers -- and break a few of them...

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Back to the future.
  2007-04-29  8:23                                             ` Pavel Machek
@ 2007-04-29  9:22                                               ` Rafael J. Wysocki
  0 siblings, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-29  9:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Sunday, 29 April 2007 10:23, Pavel Machek wrote:
> Hi!
> 
> > > > The freezer has *caused* those deadlocks (eg by stopping threads that were 
> > > > needed for the suspend writeouts to succeed!), not solved them.
> > > 
> > > I can't remember anything like this, but I believe you have a specific test
> > > case in mind.
> > 
> > > Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?
> > 
> > Rafael, you really don't know what you're talking about, do you?
> > 
> > Just _look_ at them. It's the IO threads etc that shouldn't be frozen, 
> > exactly *because* they do IO. You claim that kernel threads shouldn't do 
> > IO, but that's the point: if you cannot do IO when snapshotting to disk, 
> > here's a damn big clue for you: how do you think that snapshot is going to 
> > get written?
> > 
> > I *guarantee* you that we've had a lot more problems with threads that 
> > should *not* have been frozen than with those hypothetical threads that 
> > you think should have been frozen.
> 
> Well, we had nasty corruption on XFS, caused by a thread that was not
> frozen and should have been. (While the other case leads "only" to deadlocks,
> so it is easier to debug.)
> 
> The locking point.. when I added freezing to swsusp, I knew very
> little about kernel locking, so I "simply" decided to avoid the
> problem altogether... using the freezer.
> 
> You may be right that locks are not a big problem for the hibernation
> after all; I just do not know.

Still, I think, if a kernel thread is a part of a device driver, then _in_
_principle_ it needs _some_ synchronization with the driver's suspend/freeze
and resume/thaw callbacks.  For example, it's reasonable to assume that the
thread should be quiet between suspend/freeze and resume/thaw.

With the freezing of kernel threads we provide a simple means of such
synchronization: use try_to_freeze() in a suitable place of your kernel thread
and you're done.  [Well, there should be a second part for making the thread
die if the thaw callback doesn't find the device, but that's in the works.]

Without it, there may be race conditions that we are not even aware of and that
may trigger in, say, 1 in 10 suspends or so and I wish you good luck with
debugging such things.
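
A minimal userspace sketch of the cooperative pattern described above, written
in Python purely as an analogy. The Freezer class and its freeze(), thaw() and
try_to_freeze() methods are hypothetical names for this sketch; only the name
try_to_freeze() is borrowed from the kernel, and nothing here is the kernel's
implementation. Each worker calls try_to_freeze() at a point where it is safe
to stop, and the suspending side waits until every worker has parked there.

```python
import threading
import time


class Freezer:
    """Toy stand-in for a freezer: workers park at a safe point on request."""

    def __init__(self):
        self._cond = threading.Condition()
        self._freezing = False
        self._frozen = 0

    def try_to_freeze(self):
        """Called by a worker where it holds no locks and has no work in flight."""
        with self._cond:
            if not self._freezing:
                return False
            self._frozen += 1
            self._cond.notify_all()
            while self._freezing:
                self._cond.wait()
            self._frozen -= 1
            self._cond.notify_all()
            return True

    def freeze(self, nr_workers, timeout=5.0):
        """Ask all workers to park; returns True once all of them have."""
        with self._cond:
            self._freezing = True
            return self._cond.wait_for(
                lambda: self._frozen == nr_workers, timeout)

    def thaw(self):
        """Release parked workers and wait until they have all resumed."""
        with self._cond:
            self._freezing = False
            self._cond.notify_all()
            self._cond.wait_for(lambda: self._frozen == 0)


def worker(freezer, stop, log):
    while not stop.is_set():
        log.append(time.monotonic())  # stands in for the thread's real work
        freezer.try_to_freeze()       # the single, well-defined freeze point
        time.sleep(0.001)


def demo():
    freezer, stop, log = Freezer(), threading.Event(), []
    threads = [threading.Thread(target=worker, args=(freezer, stop, log))
               for _ in range(3)]
    for t in threads:
        t.start()
    frozen = freezer.freeze(nr_workers=3)
    before = len(log)
    time.sleep(0.05)                  # the "snapshot" would be taken here
    after = len(log)
    freezer.thaw()
    stop.set()
    for t in threads:
        t.join()
    return frozen, before, after
```

While the workers are parked, the log stops growing, which is the "quiet
between suspend/freeze and resume/thaw" property in miniature. The sketch does
not model a thread that must keep running to write the image out.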

Greetings,
Rafael

* Re: Back to the future.
  2007-04-29  8:59                                                   ` Pavel Machek
@ 2007-04-29  9:32                                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 136+ messages in thread
From: Rafael J. Wysocki @ 2007-04-29  9:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linus Torvalds, Nigel Cunningham, Pekka J Enberg, LKML, Oleg Nesterov

On Sunday, 29 April 2007 10:59, Pavel Machek wrote:
> Hi!
> 
> > > Ie we do have history of _not_ freezing things.  The freezing came later, 
> > > and came with the subsystem that had more problems..
> > 
> > It doesn't have as many problems as you are trying to suggest.  At present,
> > the only problems with it happen if someone tries to "improve" it in the way
> > I did with the workqueues.
> > 
> > Anyway, the freezing of tasks, including kernel threads, is one of the few
> > things that Pavel, Nigel and I completely agree should be done,
> > so perhaps you could accept that?
> 
> Actually, if we want to support OLPC _nicely_, we'll need to get rid
> of freezer from suspend-to-RAM. Of course, that _will_ put more
> pressure on the drivers -- and break a few of them...

I think the removal of sys_sync() from freeze_processes() in the s2ram case
might help.

I'm really afraid of dropping the freezing of kernel threads from the
hibernation/suspend altogether before we know we won't break drivers, because
we can introduce some very subtle and difficult to debug problems this way.

Moreover, apart from speeding up the suspend slightly (kernel threads are
frozen very quickly) this won't buy us anything, since kprobes uses the freezer
and all of the infrastructure is needed anyway.

Greetings,
Rafael

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28 13:37                       ` Oliver Neukum
@ 2007-05-03 12:06                         ` Pavel Machek
  2007-05-04 21:52                           ` Indan Zupancic
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-03 12:06 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: Pekka Enberg, Nigel Cunningham, Linus Torvalds, LKML

Hi!

> > > The kernel can already do compression and encryption.
> > 
> > Yes, if we all could agree on _which_ compression and encryption
> 
> Any of those available in the kernel. Where's the problem?

gzip is too slow for this. lzf works okay. Oh, and swsusp wants RSA
crypto.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:03                     ` Kyle Moffett
  2007-04-28  1:15                       ` Rafael J. Wysocki
@ 2007-05-03 15:10                       ` Pavel Machek
  2007-05-03 16:53                         ` Kyle Moffett
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-03 15:10 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: nigel, Linus Torvalds, Rafael J. Wysocki, Pekka J Enberg, LKML

Hi!

> >>It makes it harder to debug (wouldn't it be *nice* to 
> >>just ssh in,  and do
> >>	gdb -p <snapshotter>
> >
> >Make the machine being suspended a VM and you can 
> >already do that.
> 
> >>when something goes wrong?) but we also *depend* on 
> >>user space for  various things (the same way we depend 
> >>on kernel threads, and why  it has been such a total 
> >>disaster to try to freeze the kernel  threads too!). 
> >>For example, if you want to do graphical stuff,  just 
> >>using X would be quite nice,  wouldn't it?
> >
> >But in doing so you make the contents of the disk 
> >inconsistent with  the state you've just snapshotted, 
> >leading to filesystem  corruption. Even if you modify 
> >filesystems to do checkpointing  (which is what we're 
> >really talking about), you still also have the  problem 
> >that your snapshot has to be stored somewhere before 
> >you  write it to disk, so you also have to either [snip]
> 
> Actually, it's a lot simpler than that.  We can just 
> combine the  device-mapper snapshot with a VM+kernel 
> snapshot system call and be  almost done:
> 
>   sys_snapshot(dev_t snapblockdev, int __user 
>   *snapshotfd);
> 
> When sys_snapshot is run, the kernel does:
> 
> 1)  Sequentially freeze mounted filesystems using 
> blockdev freezing.   If it's an fs that doesn't support 
> freezing then either fail or force- remount-ro that fs 
> and downgrade all its filedescriptors to RO.   Doesn't 
> need extra locking since process which try to do IO 
> either  succeed before the freeze call returns for that 
> blockdev or sleep on  the unfreeze of that blockdev.  
> Filesystems are synchronized and made  clean.

How mature is freezing filesystems -- will it work on at least ext2/3
and vfat?

What happens if you try to boot and filesystems are frozen from
previous run?

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  1:05                       ` Jeremy Fitzhardinge
@ 2007-05-03 15:14                         ` Pavel Machek
  2007-06-01 19:00                           ` Eric W. Biederman
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-03 15:14 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Matthew Garrett, Linus Torvalds, Rafael J. Wysocki,
	Pekka J Enberg, Nigel Cunningham, LKML

Hi!

> > While that would certainly be nifty, I think we're arguably starting 
> > from the wrong point here. Why are we booting a kernel, trying to poke 
> > the hardware back into some sort of mock-quiescent state, freeing memory 
> > and then (finally) overwriting the entire contents of RAM rather than 
> > just doing all of this from the bootloader?

Doing it from the bootloader sounds attractive... but it is a lot of
work. I'm essentially using Linux as a bootloader.

Patch for grub welcome.

> Sure, you could make suspend generate a complete bootable kernel image
> containing all RAM.  Doesn't sound too hard to me.  You know, from over
> here on the sidelines.

Ah, so we have a volunteer :-).
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:17                         ` Linus Torvalds
  2007-04-27 23:45                           ` Rafael J. Wysocki
@ 2007-05-03 15:25                           ` Pavel Machek
  1 sibling, 0 replies; 136+ messages in thread
From: Pavel Machek @ 2007-05-03 15:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Pekka J Enberg, Nigel Cunningham, LKML

Hi!

> > 1) if the kernel threads are frozen, we know that they don't hold any locks
> > that could interfere with the freezing of device drivers,
> > 2) if they are frozen, we know, for example, that they won't call user mode
> > helpers or do similar things,
> > 3) if they are frozen, we know that they won't submit I/O to disks and
> > potentially damage filesystems (suspend2 has much more problems with that
> > than swsusp, but still.  And yes, there have been bug reports related to it,
> > so it's not just my fantasy).
> 
> NONE of these are valid explanations at all. You're listing totally 
> theoretical problems, and ignoring all the _real_ problems that trying to 
> freeze kernel threads has _caused_.

The xfs problem was real. And I do not see that many problems caused by
freezing kernel threads: at least you get deadlocks, not silent fs
corruption.

> And no, kernel threads do not submit IO to disks on their own. You just 
> made that up. Yes, they can be involved in that whole disk submission 
> thing, but in a good way - they can be required in order to make disk 
> writing work!

Yep, so we have md doing I/O while we are doing the atomic copy. That
probably means it will continue when the atomic copy is done... getting
the image out of sync with the disk.

(Plus we used to have bdflush, doing periodic writes to disk).

> The problem that suspend has had is that it's done everything totally the 
> wrong way around. Do kernel threads do disk IO? Sure, if asked to do so. 
> For example, kernel threads can be involved in md etc, but that's a *good* 
> thing. The way to shut them up is not to freeze the threads, but to freeze 
> the *disk*.

Well, if freezing the disk was available, I'd gladly do it. Is there
an easy way to implement that?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-03 15:10                       ` Pavel Machek
@ 2007-05-03 16:53                         ` Kyle Moffett
  2007-05-04  7:52                           ` David Greaves
  0 siblings, 1 reply; 136+ messages in thread
From: Kyle Moffett @ 2007-05-03 16:53 UTC (permalink / raw)
  To: Pavel Machek
  Cc: nigel, Linus Torvalds, Rafael J. Wysocki, Pekka J Enberg, LKML

On May 03, 2007, at 11:10:47, Pavel Machek wrote:
> How mature is freezing filesystems -- will it work on at least  
> ext2/3 and vfat?

I'm pretty sure it works on ext2/3 and xfs and possibly others, I  
don't know either way about VFAT though.  Essentially the "freeze"  
part involves telling the filesystem to sync all data, flush the  
journal, and mark the filesystem clean.  The intent under dm/LVM was  
to allow you to make snapshots without having to fsck the
just-created snapshot before you mounted it.

> What happens if you try to boot and filesystems are frozen from  
> previous run?

If you're just doing a fresh boot then the filesystem is already  
clean due to the dm freeze and so it mounts up normally.  All you  
need to do then is have a little startup script which purges the  
saved image before you fsck or remount things read-write since either  
case means the image is no longer safe to resume.

If the kernel is later modified to purge all filesystem data
(dcache/pagecache) during snapshot and effectively remount and reopen all the
files by path during restore then you could remove that requirement.   
You'd just need to make sure that the restore-from-disk scripts did  
an fsck or journal-restore before reloading the old kernel data.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-27 23:50                               ` David Lang
  2007-04-28  0:40                                 ` Linus Torvalds
  2007-04-28  6:58                                 ` Oliver Neukum
@ 2007-05-03 17:18                                 ` Pavel Machek
  2007-05-07  2:13                                   ` David Lang
  2 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-03 17:18 UTC (permalink / raw)
  To: David Lang
  Cc: Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

Hi!

> nobody is suggesting that you leave processes running
> while you do the snapshot, what is being proposed is
> 
> 1. pause userspace (prevent scheduling)
> 2. make snapshot image of memory
> 3. make mounted filesystems read-only (possibly with 
> snapshot/checkpoint)
> 4. unpause
> 5. save image (with full userspace available, including 
> network)

Including network? Your TCP peers will be really confused, then, if
you ACK packets and then claim you did not get them. No, you do not want
to start the network.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-03 16:53                         ` Kyle Moffett
@ 2007-05-04  7:52                           ` David Greaves
  2007-05-04 13:27                             ` Kyle Moffett
  0 siblings, 1 reply; 136+ messages in thread
From: David Greaves @ 2007-05-04  7:52 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Pavel Machek, nigel, Linus Torvalds, Rafael J. Wysocki,
	Pekka J Enberg, LKML

Kyle Moffett wrote:
> On May 03, 2007, at 11:10:47, Pavel Machek wrote:
>> How mature is freezing filesystems -- will it work on at least ext2/3
>> and vfat?
> 
> I'm pretty sure it works on ext2/3 and xfs and possibly others, I don't
> know either way about VFAT though.  Essentially the "freeze" part
> involves telling the filesystem to sync all data, flush the journal, and
> mark the filesystem clean.  The intent under dm/LVM was to allow you to
> make snapshots without having to fsck the just-created snapshot before
> you mounted it.
> 
>> What happens if you try to boot and filesystems are frozen from
>> previous run?
> 
> If you're just doing a fresh boot then the filesystem is already clean
> due to the dm freeze and so it mounts up normally.  All you need to do
> then is have a little startup script which purges the saved image before
> you fsck or remount things read-write since either case means the image
> is no longer safe to resume.

Wouldn't it be better if freeze wrote a freeze-ID to the fs and returned it?
This would naturally be kept in the image and a UUID mismatch would be
detectable - seems safer and more flexible than 'a script'.
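
To make the freeze-ID idea concrete, here is a small Python sketch. Everything
in it is hypothetical: the Filesystem class stands in for an on-disk
superblock, and the function names are invented for illustration. It only
shows the check itself: freeze stamps a fresh ID, the image records it, and
resume is refused when any filesystem no longer carries the recorded ID.

```python
import uuid


class Filesystem:
    """Hypothetical stand-in for a filesystem's on-disk superblock."""

    def __init__(self):
        self.freeze_id = None

    def freeze(self):
        # Syncing and marking clean would happen here; then stamp a fresh ID.
        self.freeze_id = uuid.uuid4().hex
        return self.freeze_id

    def mount_read_write(self):
        # Any write after the freeze invalidates images taken against it.
        self.freeze_id = None


def take_snapshot(filesystems):
    """Freeze every filesystem and record its ID for the image header."""
    return {name: fs.freeze() for name, fs in filesystems.items()}


def image_is_resumable(image_ids, filesystems):
    """Resume only if every filesystem still carries the recorded ID."""
    return all(
        name in filesystems and filesystems[name].freeze_id == fid
        for name, fid in image_ids.items()
    )
```

With this shape, a later freeze or a read-write mount both change the stored
ID, so a stale image is detected without relying on a startup script.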

"This isn't the freeze you're looking for, move along"

David

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-04  7:52                           ` David Greaves
@ 2007-05-04 13:27                             ` Kyle Moffett
  0 siblings, 0 replies; 136+ messages in thread
From: Kyle Moffett @ 2007-05-04 13:27 UTC (permalink / raw)
  To: David Greaves
  Cc: Pavel Machek, nigel, Linus Torvalds, Rafael J. Wysocki,
	Pekka J Enberg, LKML

On May 04, 2007, at 03:52:03, David Greaves wrote:
> Kyle Moffett wrote:
>> On May 03, 2007, at 11:10:47, Pavel Machek wrote:
>>> What happens if you try to boot and filesystems are frozen from  
>>> previous run?
>>
>> If you're just doing a fresh boot then the filesystem is already  
>> clean due to the dm freeze and so it mounts up normally.  All you  
>> need to do then is have a little startup script which purges the  
>> saved image before you fsck or remount things read-write since  
>> either case means the image is no longer safe to resume.
>
> Wouldn't it be better if freeze wrote a freeze-ID to the fs and  
> returned it? This would naturally be kept in the image and a UUID  
> mismatch would be detectable - seems safer and more flexible than  
> 'a script'.
>
> "This isn't the freeze you're looking for, move along"

Possibly, but I was referring to the _current_ behavior of the
device-mapper freezing.  While perhaps not ideal, it's currently very easily
usable.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-03 12:06                         ` Pavel Machek
@ 2007-05-04 21:52                           ` Indan Zupancic
  2007-05-05  9:16                             ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: Indan Zupancic @ 2007-05-04 21:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Oliver Neukum, Pekka Enberg, Nigel Cunningham, Linus Torvalds, LKML

On Thu, May 3, 2007 14:06, Pavel Machek wrote:
>> > > The kernel can already do compression and encryption.
>> >
>> > Yes, if we all could agree on _which_ compression and encryption
>>
>> Any of those available in the kernel. Where's the problem?
>
> gzip is too slow for this. lzf works okay. Oh and swsusp wants rsa
> crypto.

Then port lzf to the kernel, or help with the lzo port.

Swsusp might want RSA crypto, but it doesn't really need it. Currently
it only uses it to be able to suspend without asking for a passphrase.

So the current sequence is:

1) Generate RSA keys + ask for a passphrase. (Once)

...

2) Suspend. (Encrypt snapshot with public RSA key).

...

3) Ask for the passphrase.

4) Resume.

RSA is used so that the passphrase can be thrown away between 1 and 2.


But the same functionality can be achieved by doing:

1) Define a user password (e.g. /etc/shadow thing). (Once)

2) When a user logs in: get random data and encrypt it with the password,
this becomes the AES key. Store both the data and key in a secure way in
memory, e.g. using the existing kernel key infrastructure.

...

3) Suspend.
   (Encrypt snapshot with the AES key and store the random data.)

...

4) Ask for the passphrase.
   (To get the AES key, encrypt the stored random data with it.)

5) Resume.

Variants are possible of course, but this is the main idea.
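
A minimal Python sketch of the key handling in this scheme follows. Two things
are assumptions rather than part of the proposal: PBKDF2 is swapped in for
"encrypt the random data with the password" (the standard library has no AES),
and the key_check tag is an invented way for resume to notice a wrong
passphrase. The function names are hypothetical, and the snapshot encryption
itself is left out.

```python
import hashlib
import hmac
import os


def derive_key(password, random_data):
    # Stand-in for "encrypt the random data with the password":
    # a 16-byte key derived from both inputs (assumption: PBKDF2-SHA256).
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), random_data, 100_000, dklen=16
    )


def on_login(password):
    """Step 2: derive the key once; the password need not be kept."""
    random_data = os.urandom(16)
    return random_data, derive_key(password, random_data)


def on_suspend(random_data, key):
    """Step 3: the image header stores the random data, never the key."""
    tag = hmac.new(key, b"suspend-image", hashlib.sha256).digest()
    return {"random_data": random_data, "key_check": tag}


def on_resume(header, password):
    """Step 4: rebuild the key from the passphrase and the stored data."""
    key = derive_key(password, header["random_data"])
    tag = hmac.new(key, b"suspend-image", hashlib.sha256).digest()
    return key if hmac.compare_digest(tag, header["key_check"]) else None
```

The 16 bytes of random data plus the 16-byte key are the state that has to
stay in memory while the system runs.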


This is secure because the key infrastructure is secure, and even if
it isn't the system must be compromised to get the suspend key before
the suspend is done. But at that point the attacker already has all
information that can be found in the suspend image, and could have done
all kinds of things to inflict damage (like installing a key logger).

Advantage of this scheme is that it only needs AES and can be done (mostly)
in kernel space. It's also faster and simpler than the current RSA scheme.
Disadvantage is that it wastes at least 32 bytes of memory when the system
is running, to store the data and key.

The only thing that needs to be done in userspace is setting the random data
and AES key, but there exists a suitable interface for that (the key system).
As user login is already done in user space, this can be integrated with
that in a nice way.

Greetings,

Indan



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-04 21:52                           ` Indan Zupancic
@ 2007-05-05  9:16                             ` Pavel Machek
  2007-05-05 12:02                               ` Indan Zupancic
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-05  9:16 UTC (permalink / raw)
  To: Indan Zupancic
  Cc: Oliver Neukum, Pekka Enberg, Nigel Cunningham, Linus Torvalds, LKML

Hi!

> But the same functionality can be achieved by doing:
> 
> 1) Define a user password (e.g. /etc/shadow thing). (Once)
> 
> 2) When a user logs in: get random data and encrypt it with the password,
> this becomes the AES key. Store both the data and key in a secure way in
> memory, e.g. using the existing kernel key infrastructure.



> Advantage of this scheme is that it only need AES and can be done (mostly)
> in kernel space. It's also faster and simpler than the current RSA scheme.
> Disadvantage is that it wastes at least 32 bytes of memory when the system
> is running, to store the data and key.

Another disadvantage is that you need to hack into PAM infrastructure,
that your suspend password needs to be same as someone's login
password, and that it will really only work with single-user machine.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-04-28  0:18                               ` Linus Torvalds
@ 2007-05-05 11:42                                 ` Pavel Machek
  0 siblings, 0 replies; 136+ messages in thread
From: Pavel Machek @ 2007-05-05 11:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rafael J. Wysocki, Nigel Cunningham, Pekka J Enberg, LKML

Hi!

> > The "let's stop all kernel threads" is superstition. It's the same kind of 
> > superstition that made people write "sync" three times before turning off 
> > the power in the olden times. It's the kind of superstition that comes 
> > from "we don't do things right, so let's be vewy vewy quiet and _pray_ 
> > that it works when we are being quiet".
> 
> Side note: while I think things should probably *work* even with user 
> processes going full bore while a snapshot is taken, I'll freely admit
> that I'll follow that superstition far enough that I think it's probably a 
> good idea to try to quiesce the system to _some_ degree, and that stopping 
> user programs is a good idea. Partly because the whole memory shrinking 
> thing, and partly just because we should do the snapshot with hw IO queues 
> empty.
> 
> But I don't think it would necessarily be wrong (and in many ways it would 
> probably be *right*) to do that IO queue stopping at the queue level 
> rather than at a process level. Why stop processes just because you want
> to clean out IO queues? They are two totally different things!

Actually, I'd like to stop I/O queues; if there were an easy way to do
that, I'd happily switch. Notice that we'll need to stop the 'I/O queues'
of the char devices, too...
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-05  9:16                             ` Pavel Machek
@ 2007-05-05 12:02                               ` Indan Zupancic
  0 siblings, 0 replies; 136+ messages in thread
From: Indan Zupancic @ 2007-05-05 12:02 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Oliver Neukum, Pekka Enberg, Nigel Cunningham, Linus Torvalds, LKML

Hello,

On Sat, May 5, 2007 11:16, Pavel Machek wrote:
>> But the same functionality can be achieved by doing:
>>
>> 1) Define a user password (e.g. /etc/shadow thing). (Once)
>>
>> 2) When a user logs in: get random data and encrypt it with the password,
>> this becomes the AES key. Store both the data and key in a secure way in
>> memory, e.g. using the existing kernel key infrastructure.
>
>
>
>> Advantage of this scheme is that it only need AES and can be done (mostly)
>> in kernel space. It's also faster and simpler than the current RSA scheme.
>> Disadvantage is that it wastes at least 32 bytes of memory when the system
>> is running, to store the data and key.
>
> Another disadvantage is that you need to hack into PAM infrastructure,
> that your suspend password needs to be same as someone's login
> password, and that it will really only work with single-user machine.

The first two are only true if you want to integrate it with user login, so
that a user only needs to sign in once, which seems like a convenient thing.
But if you don't want to integrate with the existing login infrastructure,
then just don't. And those disadvantages are true for any system that wants
users to login once.

Then the disadvantage is reduced to a user needing to provide the password
at suspend if the system wasn't booted from a snapshot. But no need for
users to generate any files, just to choose a resume password.

If the resume key is stored per user instead of a single global instance, it
will work with a multi-user system too. A more interesting question is what
should happen when one user did the suspend and the other wants to resume.
Throw away the snapshot? Refuse booting? Or boot and switch "active user"?

If you don't want people to resume each other's suspends then a key per user
works. If you want them to, then it becomes a bit tricky, especially if you
don't integrate with the login system. You don't want a user to be able to
resume someone else's snapshot and have access to everything that other user
left open. Nor do you want users to give a password twice.

If you want users to be able to resume each other's snapshots, you probably
also want the system to switch users after the resume. No matter what scheme
is used, this becomes hairy and hard to get watertight. (Perhaps "impossible"
is more realistic: how can you read the suspend image and copy it to RAM
again without having access to all the data within?)

But if it's an "us" against "them" case, and you want users to resume each
other's snapshots, you're right that the scheme I proposed will fall apart.
In which case it needs to be adjusted a bit to handle this case:

Have one global suspend/resume key, and for each user store it on disk,
encrypted with that user's password. Also store the key in memory as
before. Now when the system is suspended any user needs to have provided
his password once for everyone to be able to suspend without giving a
password. Also everyone can resume, if they have access to the file with
the list of encrypted keys and provide the right password. (Notice that
this looks more like the current scheme, where the private part of the
RSA key is encrypted with a passphrase and all stored in a file.)
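
The per-user key list could look roughly like the Python sketch below. This is
a toy under stated assumptions: the function names are hypothetical, PBKDF2
derives each user's wrapping key, and plain XOR stands in for a real key-wrap
cipher. XOR gives no integrity check, so a wrong password silently yields a
wrong key; a real design would want authenticated wrapping.

```python
import hashlib
import os


def _kek(password, salt):
    """Key-encryption key derived from one user's password (assumption)."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt, 100_000, dklen=16
    )


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def wrap_for_users(global_key, passwords):
    """Build the on-disk list: one wrapped copy of the key per user."""
    table = {}
    for user, password in passwords.items():
        salt = os.urandom(16)
        table[user] = (salt, _xor(global_key, _kek(password, salt)))
    return table


def unwrap(table, user, password):
    """Recover the global key from one user's entry and password."""
    salt, wrapped = table[user]
    return _xor(wrapped, _kek(password, salt))
```

Any listed user can then recover the same global suspend/resume key from
their own password, which is the property this variant relies on.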

Though it seems that using suspend to disk on a real multi-user system is
always asking for problems, because the suspend image may contain valuable
data which shouldn't be thrown away, but easily can be by other users. Nor do
you want users to claim the machine, so it's a lose/lose situation. Also
with resume every user effectively gets root access, because of all the
memory access. So inter-user security is down the drain anyway.

The only sane usage I can see is when the users trust each other, in which case
they might as well agree on one resume password. ;-)

Greetings,

Indan



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-03 17:18                                 ` Pavel Machek
@ 2007-05-07  2:13                                   ` David Lang
  2007-05-07  3:33                                     ` Kyle Moffett
  2007-05-07 12:48                                     ` Pavel Machek
  0 siblings, 2 replies; 136+ messages in thread
From: David Lang @ 2007-05-07  2:13 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

On Thu, 3 May 2007, Pavel Machek wrote:

> Hi!
>
>> nobody is suggesting that you leave processes running
>> while you do the snapshot, what is being proposed is
>>
>> 1. pause userspace (prevent scheduling)
>> 2. make snapshot image of memory
>> 3. make mounted filesystems read-only (possibly with
>> snapshot/checkpoint)
>> 4. unpause
>> 5. save image (with full userspace available, including
>> network)
>
> Including network? Your tcp peers will be really confused, then, if
> you ACK packets then claim you did not get them. No, you do not want
> to start network.

anyone who is doing a hibernate or suspend who expects all the network
connections to be working afterwards is dreaming or smoking something.

this is just another way that the failure can show up.

in fact, I would say that it would probably be a nice thing to do for
intervening firewalls and external servers if a suspend closed all external TCP
connections rather than leaving them dangling (eating up resources until they
time out)

if your software can't tolerate the network connection going away on you it will
have problems in normal operation anyway, let alone when you suspend/hibernate
your machine.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-07  2:13                                   ` David Lang
@ 2007-05-07  3:33                                     ` Kyle Moffett
  2007-05-07 12:48                                     ` Pavel Machek
  1 sibling, 0 replies; 136+ messages in thread
From: Kyle Moffett @ 2007-05-07  3:33 UTC (permalink / raw)
  To: David Lang
  Cc: Pavel Machek, Nigel Cunningham, Rafael J. Wysocki,
	Linus Torvalds, Pekka J Enberg, LKML

On May 06, 2007, at 22:13:51, David Lang wrote:
> anyone who is doing a hibernate or suspend who expects all the
> network connections to be working afterwards is dreaming or
> smoking something.
>
> this is just another way that the failure can show up.
>
> in fact, I would say that it would probably be a nice thing to do
> for intervening firewalls and external servers if a suspend closed
> all external TCP connections rather than leaving them dangling
> (eating up resources until they time out)
>
> if your software can't tolerate the network connection going away on
> you it will have problems in normal operation anyway, let alone
> when you suspend/hibernate your machine.

Yeah, for suspend-to-ram+resume and for snapshot+restore you probably  
want userspace to support some kind of initscript-like mechanism  
which is triggered by the lid-switch or something before calling into  
the kernel.  That way it can close network connections mostly-nicely  
and down network interfaces before suspending, then re-run
DHCP/802.11/whatever configuration after resume/restore.  That might not
be a bad place to handle NFS mounts and such too.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-07  2:13                                   ` David Lang
  2007-05-07  3:33                                     ` Kyle Moffett
@ 2007-05-07 12:48                                     ` Pavel Machek
  2007-05-07 12:52                                       ` Oliver Neukum
  1 sibling, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-07 12:48 UTC (permalink / raw)
  To: David Lang
  Cc: Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

Hi!

> >>nobody is suggesting that you leave processes running
> >>while you do the snapshot, what is being proposed is
> >>
> >>1. pause userspace (prevent scheduling)
> >>2. make snapshot image of memory
> >>3. make mounted filesystems read-only (possibly with
> >>snapshot/checkpoint)
> >>4. unpause
> >>5. save image (with full userspace available, including
> >>network)
> >
> >Including network? Your tcp peers will be really confused, then, if
> >you ACK packets then claim you did not get them. No, you do not want
> >to start network.
> 
> anyone who is doing a hibernate or suspend who expects all the network
> connections to be working afterwards is dreaming or smoking
> something.

Really? It works today... if the suspend is short enough. And that's
how it should be.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-07 12:48                                     ` Pavel Machek
@ 2007-05-07 12:52                                       ` Oliver Neukum
  2007-05-07 14:37                                         ` david
  0 siblings, 1 reply; 136+ messages in thread
From: Oliver Neukum @ 2007-05-07 12:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Lang, Nigel Cunningham, Rafael J. Wysocki, Linus Torvalds,
	Pekka J Enberg, LKML

Am Montag, 7. Mai 2007 14:48 schrieb Pavel Machek:
> > >Including network? Your tcp peers will be really confused, then, if
> > >you ACK packets then claim you did not get them. No, you do not want
> > >to start network.
> > 
> > anyone who is doing a hibernate or suspend who expects all the network
> > connections to be working afterwards is dreaming or smoking
> > something.
> 
> Really? It works today... if the suspend is short enough. And that's
> how it should be.

If we get very good at Wake-on-Lan it should work for any length
of time.

	Regards
		Oliver


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-07 12:52                                       ` Oliver Neukum
@ 2007-05-07 14:37                                         ` david
  2007-05-07 19:51                                           ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: david @ 2007-05-07 14:37 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Pavel Machek, David Lang, Nigel Cunningham, Rafael J. Wysocki,
	Linus Torvalds, Pekka J Enberg, LKML

On Mon, 7 May 2007, Oliver Neukum wrote:

> Am Montag, 7. Mai 2007 14:48 schrieb Pavel Machek:
>>>> Including network? Your tcp peers will be really confused, then, if
>>>> you ACK packets then claim you did not get them. No, you do not want
>>>> to start network.
>>>
>>> anyone who is doing a hibernate or suspend who expects all the network
>>> connections to be working afterwards is dreaming or smoking
>>> something.
>>
>> Really? It works today... if the suspend is short enough. And that's
>> how it should be.
>
> If we get very good at Wake-on-Lan it should work for any length
> of time.

for suspend-to-ram this would work, I stand corrected.

for hibernate this would almost certainly not work, and I don't think that
it's worth raising false hopes.

David Lang

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: Back to the future.
  2007-05-07 14:37                                         ` david
@ 2007-05-07 19:51                                           ` Pavel Machek
  2007-05-07 19:55                                             ` david
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-07 19:51 UTC (permalink / raw)
  To: david
  Cc: Oliver Neukum, David Lang, Nigel Cunningham, Rafael J. Wysocki,
	Linus Torvalds, Pekka J Enberg, LKML

Hi!

> >>Really? It works today... if the suspend is short 
> >>enough. And that's
> >>how it should be.
> >
> >If we get very good at Wake-on-Lan it should work for 
> >any length
> >of time.
> 
> for suspend-to-ram this would work, I stand corrected.
> 
> for hibernate this would almost certainly not work, and I 
> don't think that it's worth raising false hopes.

Check the facts. It used to work, and it should work today.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Back to the future.
  2007-05-07 19:51                                           ` Pavel Machek
@ 2007-05-07 19:55                                             ` david
  2007-05-07 20:38                                               ` Pavel Machek
  0 siblings, 1 reply; 136+ messages in thread
From: david @ 2007-05-07 19:55 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Oliver Neukum, David Lang, Nigel Cunningham, Rafael J. Wysocki,
	Linus Torvalds, Pekka J Enberg, LKML

On Mon, 7 May 2007, Pavel Machek wrote:

>>>> Really? It works today... if the suspend is short
>>>> enough. And that's
>>>> how it should be.
>>>
>>> If we get very good at Wake-on-Lan it should work for
>>> any length
>>> of time.
>>
>> for suspend-to-ram this would work, I stand corrected.
>>
>> for hibernate this would almost certainly not work, and I
>> don't think that it's worth raising false hopes.
>
> Check the facts. It used to work, and it should work today.

I don't dispute that it sometimes works today.

what I dispute is that making it work should be a constraint on a cleaner 
design that happens to cause TCP connections to fail on suspend-to-disk 
(hibernate).

if you are doing suspend-to-disk for such a short period that TCP 
connections are able to recover (typically <15 min for most firewalls, in 
some cases <2 min for connections with keep-alive) is it really worth it?

and once you pass the timeframes where the connections are still alive 
then it shouldn't matter, and in fact the server should gracefully close 
the connections to be nice to other devices and servers on the network.

I dispute the idea that doing a suspend-to-disk and expecting that your 
network connections will recover when you wake up is a sane expectation.

David Lang

* Re: Back to the future.
  2007-05-07 19:55                                             ` david
@ 2007-05-07 20:38                                               ` Pavel Machek
  2007-05-08 17:36                                                 ` Disconnect
  0 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2007-05-07 20:38 UTC (permalink / raw)
  To: david
  Cc: Oliver Neukum, David Lang, Nigel Cunningham, Rafael J. Wysocki,
	Linus Torvalds, Pekka J Enberg, LKML

Hi!

> I don't dispute that it sometimes works today.
> 
> what I dispute is that making it work should be a constraint on a cleaner 
> design that happens to cause TCP connections to fail on suspend-to-disk 
> (hibernate).
> 
> if you are doing suspend-to-disk for such a short period that TCP 
> connections are able to recover (typically <15 min for most firewalls, in 
> some cases <2 min for connections with keep-alive) is it really
> worth it?

People were using swsusp to move a server from one room to another.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Back to the future.
  2007-05-07 20:38                                               ` Pavel Machek
@ 2007-05-08 17:36                                                 ` Disconnect
  0 siblings, 0 replies; 136+ messages in thread
From: Disconnect @ 2007-05-08 17:36 UTC (permalink / raw)
  To: linux-kernel

We used it (with great success) to replace bad UPSs on single-PSU
database servers under (light) load. No need for scheduled downtime,
etc.

The whole point of hibernation (or suspend to disk, or whatever you
call it) is that the system goes to a zero-power state and then can be
brought back to its original state. Closing in-progress network
connections has nothing to do with pausing a machine any more than
setting IM clients to 'away' would, or locking an X session. That sort
of side-effect needs to be handled outside the core of "put state out
to disk and read it back".

On 5/7/07, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
> > I don't dispute that it sometimes works today.
> >
> > what I dispute is that making it work should be a constraint on a cleaner
> > design that happens to cause TCP connections to fail on suspend-to-disk
> > (hibernate).
> >
> > if you are doing suspend-to-disk for such a short period that TCP
> > connections are able to recover (typically <15 min for most firewalls, in
> > some cases <2 min for connections with keep-alive) is it really
> > worth it?
>
> People were using swsusp to move a server from one room to another.
>                                                                         Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

* Re: Back to the future.
  2007-05-03 15:14                         ` Pavel Machek
@ 2007-06-01 19:00                           ` Eric W. Biederman
  0 siblings, 0 replies; 136+ messages in thread
From: Eric W. Biederman @ 2007-06-01 19:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jeremy Fitzhardinge, Matthew Garrett, Linus Torvalds,
	Rafael J. Wysocki, Pekka J Enberg, Nigel Cunningham, LKML

Pavel Machek <pavel@ucw.cz> writes:

> Hi!
>
>> > While that would certainly be nifty, I think we're arguably starting 
>> > from the wrong point here. Why are we booting a kernel, trying to poke 
>> > the hardware back into some sort of mock-quiescent state, freeing memory 
>> > and then (finally) overwriting the entire contents of RAM rather than 
>> > just doing all of this from the bootloader?
>
> Doing it from the bootloader sounds attractive... but it is a lot of
> work. I'm essentially using linux as a bootloader.
>
> Patch for grub welcome.

Well.  We actually have first class support for using linux as a
bootloader.  So you could use linux and do whatever dance you are
doing from a bootloader if you felt the desire.

That might make the dance a little easier.

Eric

* Re: Back to the future.
       [not found]           ` <8elpT-7wY-21@gated-at.bofh.it>
@ 2007-04-28 11:04             ` Bodo Eggert
  0 siblings, 0 replies; 136+ messages in thread
From: Bodo Eggert @ 2007-04-28 11:04 UTC (permalink / raw)
  To: Pavel Machek, David Lang, Linus Torvalds, Nigel Cunningham,
	Pekka Enberg, LKML

Pavel Machek <pavel@ucw.cz> wrote:

>> I also don't like the idea of storing this in the swap partition for a
>> couple of reasons.
>> 
>> 1. on many modern linux systems the swap partition is not large enough.
>> 
>> for example, on my boxes with 16G of RAM I only allocate 2G of swap
>> space
> 
> WTF? So allocate a larger swap partition. You just told me disks are big
> enough.

1) Repartitioning is sometimes not an option.
2) What happens if the swap space gets used?

I want to be sure I can suspend my {server,laptop} in case of power running
out. Using swap is only an option for desktops.

>> 2. it's too easy for other things to stomp on your swap partition.
>> 
>>   for example: booting from a live CD that finds and uses swap
>> partitions
> 
> That's a feature. If you are booting from live CD, you _want_ to erase
> any hibernation image.

NACK. You want to keep all partitions related to the hibernated system
read-only. That's completely different from destroying all your unsaved
data and possibly long-running tasks.
-- 
Top 100 things you don't want the sysadmin to say:
51. YEEEHA!!!  What a CRASH!!!

Friß, Spammer: C@rzlmn.7eggert.dyndns.org D9GLNDg@Zk.7eggert.dyndns.org

Thread overview: 136+ messages
2007-04-26  6:04 Back to the future Nigel Cunningham
2007-04-26  7:28 ` Pekka Enberg
     [not found]   ` <1177573348.50 25.224.camel@nigel.suspend2.net>
2007-04-26  7:42   ` Nigel Cunningham
2007-04-26  8:17     ` Pekka Enberg
2007-04-26  9:28       ` Nigel Cunningham
2007-04-26 17:29         ` Luca Tettamanti
2007-04-26 16:56     ` Linus Torvalds
2007-04-26 17:03       ` Xavier Bestel
2007-04-26 17:34         ` Linus Torvalds
2007-04-26 20:08           ` Nigel Cunningham
2007-04-26 20:45             ` Linus Torvalds
2007-04-26 20:50               ` Nigel Cunningham
2007-04-27  0:10                 ` Olivier Galibert
2007-04-27 10:21                   ` Daniel Pittman
2007-04-27 23:19                   ` Nigel Cunningham
2007-04-26 21:38             ` Theodore Tso
2007-04-27 10:10               ` Christoph Hellwig
2007-04-26 22:08             ` Rafael J. Wysocki
2007-04-26 22:20               ` Nigel Cunningham
2007-04-26 23:15               ` Linus Torvalds
2007-04-27  7:51           ` Pekka Enberg
2007-04-26 17:07       ` Linus Torvalds
2007-04-26 18:22       ` Chase Venters
2007-04-26 18:50         ` David Lang
2007-04-26 19:56       ` Nigel Cunningham
2007-04-27  4:52         ` Pekka J Enberg
2007-04-27  6:08           ` Nigel Cunningham
2007-04-27  6:18             ` Pekka J Enberg
2007-04-27  6:29               ` Pekka J Enberg
2007-04-27  6:34               ` Nigel Cunningham
2007-04-27  6:50                 ` Pekka J Enberg
2007-04-27  7:03                   ` Nigel Cunningham
2007-04-27  7:24                     ` Pekka J Enberg
2007-04-27  9:50               ` Oliver Neukum
2007-04-27 10:12                 ` Pekka J Enberg
2007-04-27 19:07                   ` Oliver Neukum
2007-04-28  9:22                     ` Pekka Enberg
2007-04-28 13:37                       ` Oliver Neukum
2007-05-03 12:06                         ` Pavel Machek
2007-05-04 21:52                           ` Indan Zupancic
2007-05-05  9:16                             ` Pavel Machek
2007-05-05 12:02                               ` Indan Zupancic
2007-04-28 10:35                   ` Rafael J. Wysocki
2007-04-28 18:43                     ` David Lang
2007-04-28 19:37                       ` Rafael J. Wysocki
2007-04-27 21:24               ` Rafael J. Wysocki
2007-04-27 21:44                 ` Linus Torvalds
2007-04-27 22:04                   ` Rafael J. Wysocki
2007-04-27 22:08                     ` Linus Torvalds
2007-04-27 22:41                       ` Rafael J. Wysocki
2007-04-27 22:26                         ` David Lang
2007-04-27 23:21                           ` Rafael J. Wysocki
2007-04-27 23:01                             ` David Lang
2007-04-28  0:02                               ` Rafael J. Wysocki
2007-04-27 23:17                         ` Linus Torvalds
2007-04-27 23:45                           ` Rafael J. Wysocki
2007-04-27 23:57                             ` Nigel Cunningham
2007-04-27 23:50                               ` David Lang
2007-04-28  0:40                                 ` Linus Torvalds
2007-04-28  6:58                                 ` Oliver Neukum
2007-04-28  9:16                                   ` Pekka J Enberg
2007-04-28 18:28                                   ` David Lang
2007-05-03 17:18                                 ` Pavel Machek
2007-05-07  2:13                                   ` David Lang
2007-05-07  3:33                                     ` Kyle Moffett
2007-05-07 12:48                                     ` Pavel Machek
2007-05-07 12:52                                       ` Oliver Neukum
2007-05-07 14:37                                         ` david
2007-05-07 19:51                                           ` Pavel Machek
2007-05-07 19:55                                             ` david
2007-05-07 20:38                                               ` Pavel Machek
2007-05-08 17:36                                                 ` Disconnect
2007-04-27 23:59                             ` Linus Torvalds
2007-04-28  0:18                               ` Linus Torvalds
2007-05-05 11:42                                 ` Pavel Machek
2007-04-28  0:50                               ` Paul Mackerras
2007-04-28  1:00                               ` Rafael J. Wysocki
2007-04-28  1:12                                 ` Linus Torvalds
2007-04-28  0:54                                   ` David Lang
2007-04-28  1:44                                   ` Rafael J. Wysocki
2007-04-28  2:51                                     ` Daniel Hazelton
2007-04-28  7:00                                       ` progress meter in s2disk (was Re: Back to the future.) Pavel Machek
2007-04-28  8:50                                     ` Back to the future Pavel Machek
2007-04-28  9:24                                       ` Rafael J. Wysocki
2007-04-28 16:28                                       ` Linus Torvalds
2007-04-28 17:50                                         ` Rafael J. Wysocki
2007-04-28 21:25                                           ` Linus Torvalds
2007-04-28 23:03                                             ` Rafael J. Wysocki
2007-04-28 23:45                                               ` Linus Torvalds
2007-04-29  0:01                                                 ` Nigel Cunningham
2007-04-29  5:01                                                   ` Bojan Smojver
2007-04-29  3:43                                                 ` Kyle Moffett
2007-04-29  8:57                                                 ` Rafael J. Wysocki
2007-04-29  8:59                                                   ` Pavel Machek
2007-04-29  9:32                                                     ` Rafael J. Wysocki
2007-04-29  8:23                                             ` Pavel Machek
2007-04-29  9:22                                               ` Rafael J. Wysocki
2007-04-28 18:32                                       ` David Lang
2007-04-28 19:14                                         ` Rafael J. Wysocki
2007-04-28 18:44                                           ` David Lang
2007-05-03 15:25                           ` Pavel Machek
2007-04-27 22:07                   ` Nigel Cunningham
2007-04-28  1:03                     ` Kyle Moffett
2007-04-28  1:15                       ` Rafael J. Wysocki
2007-04-28  0:51                         ` David Lang
2007-04-28  1:25                         ` Kyle Moffett
2007-05-03 15:10                       ` Pavel Machek
2007-05-03 16:53                         ` Kyle Moffett
2007-05-04  7:52                           ` David Greaves
2007-05-04 13:27                             ` Kyle Moffett
2007-04-28  0:18                   ` Jeremy Fitzhardinge
2007-04-28  1:00                     ` Matthew Garrett
2007-04-28  1:05                       ` Jeremy Fitzhardinge
2007-05-03 15:14                         ` Pavel Machek
2007-06-01 19:00                           ` Eric W. Biederman
2007-04-28  1:08                       ` Rafael J. Wysocki
2007-04-27 20:44           ` Rafael J. Wysocki
2007-04-28 19:09         ` Bill Davidsen
2007-04-26 22:40       ` Pavel Machek
2007-04-27  5:41         ` Pekka Enberg
2007-04-27 14:55           ` Pavel Machek
2007-04-27 21:39             ` Nigel Cunningham
2007-04-26 22:42       ` Pavel Machek
2007-04-26 22:24         ` David Lang
2007-04-26 23:12           ` Pavel Machek
2007-04-26 22:49             ` David Lang
2007-04-26 23:27               ` Pavel Machek
2007-04-26 22:56                 ` David Lang
2007-04-27  0:23               ` Olivier Galibert
2007-04-27 12:49       ` Pavel Machek
2007-04-27 21:26         ` Rafael J. Wysocki
2007-04-27 22:12           ` David Lang
2007-04-26  8:38 ` Jan Engelhardt
2007-04-26  9:33   ` Nigel Cunningham
2007-04-28  0:28 ` Bojan Smojver
     [not found] <8e5l8-7SD-21@gated-at.bofh.it>
     [not found] ` <8e6Ka-1uR-3@gated-at.bofh.it>
     [not found]   ` <8e6TS-1Id-11@gated-at.bofh.it>
     [not found]     ` <8efu9-6mF-1@gated-at.bofh.it>
     [not found]       ` <8ekWV-6FF-33@gated-at.bofh.it>
     [not found]         ` <8el6y-6Sj-5@gated-at.bofh.it>
     [not found]           ` <8elpT-7wY-21@gated-at.bofh.it>
2007-04-28 11:04             ` Bodo Eggert
