From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753695AbXD0AKn (ORCPT ); Thu, 26 Apr 2007 20:10:43 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753692AbXD0AKn (ORCPT ); Thu, 26 Apr 2007 20:10:43 -0400 Received: from dspnet.fr.eu.org ([213.186.44.138]:3629 "EHLO dspnet.fr.eu.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751861AbXD0AKk (ORCPT ); Thu, 26 Apr 2007 20:10:40 -0400 Date: Fri, 27 Apr 2007 02:10:38 +0200 From: Olivier Galibert To: Nigel Cunningham Cc: Linus Torvalds , Xavier Bestel , Pekka Enberg , LKML Subject: Re: Back to the future. Message-ID: <20070427001038.GA11265@dspnet.fr.eu.org> Mail-Followup-To: Olivier Galibert , Nigel Cunningham , Linus Torvalds , Xavier Bestel , Pekka Enberg , LKML References: <1177567481.5025.211.camel@nigel.suspend2.net> <84144f020704260028q190fc90fs8f9ea703e42e7910@mail.gmail.com> <1177573348.5025.224.camel@nigel.suspend2.net> <1177607026.30284.197.camel@frg-rhel40-em64t-04> <1177618081.4737.42.camel@nigel.suspend2.net> <1177620656.4737.56.camel@nigel.suspend2.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1177620656.4737.56.camel@nigel.suspend2.net> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Apr 27, 2007 at 06:50:56AM +1000, Nigel Cunningham wrote: > I'm perfectly willing to think through some alternate approach if you > suggest something or prod my thinking in a new direction, but I'm afraid > I just can't see right now how we can achieve what you're after. Ok, what about this approach I've been mulling about for a while: Suspend-to-disk is pretty much an exercise in state saving. There are multiple ways to do state saving, but they tend to end up in two categories: implicit and explicit. In implicit state saving, you try to save the state of the system/application/whatever "under its feet", more or less, and then fixup what is no saved/saveable correctly. A well-known example is the undumping process Emacs goes (went?) where it tries to dump the state of the memory as a new executable, with a lot of pleasure with various executable formats and subtleties due to side effects in libc code you don't control. In explicit state saving each object saves what is needed from its state to an independently defined format (instead of "whatever the memory organization happens to be at that point"). When reloading the state you have to parse it, and it usually requires rebuilding/relocating all references/pointers/etc. XEmacs currently has a "portable dumper" that pretty much does just that. We don't have any redumping problems anymore, they're over. Which one is the best depends heavily on the application. The amount of code in the implicit case depends on the amount of fixups to do. In the kernel case it happens to be a lot, pretty much everything that touches hardware has to save to memory the device state and reload it on resume. And bugs on hardware handling can be quite annoying to debug. And if some driver does not to saving/resume correctly, you have no way outside of playing with modules to ensure the safety of the suspend cycle. The amount of code in the explicit case is an interesting variable in the case of the kernel. You have to save what is needed, but how do you define what is needed? It is, pretty much, what running processes can observe from userspace. Now, what can a process observe: - its application text and anonymous memory pages - its file handles - its mapped files - its mapped whatever else - its sys5 IPC stuff - futex stuff and friends, namespaces, etc - its intrinsic characteristics it can reach through syscalls (i.e. the user-visible parts of current, like pid, uid...) - its currently running system call, if any So that's what we'd have to explicitely save. Anonymous memory, sys5 IPC, futex and current structures, that's easy stuff in practice. The fun part are pretty much: - references to files - references to active networking links - references to devices and associated visible state - currently running system call, aka the kernel stack for the process The last one is the one I'm the most afraid of. I hope that the signal stuff and/or the asynchronous syscall stuff that was discussed recently would allow to "unwind" blocking system calls back to the syscall level and then store the parameters for resume-time restart. The non-blocking calls you can just let finish. The first one is really interesting. If you value your filesystems, you'd rather have them clean after the suspend. And also you pretty much know that filesystems can move around when you're not looking, be it USB hotplug stuff (discovery order is random-ish isn't it?), module loading order issues or multithreaded device discovery. So you're way more happy *not* caching anything from the filesystem you can avoid. But what is a file reference, really? With the dcache handy, it's pretty much a path, since inodes don't always exist reliably. And if you have the lists of paths used by the processes on a particular filesystem, you can easily get an idea of where, if anywhere, the filesystem is even if you don't have reliable serials. More interestingly, you cannot, in any case, instantly corrupt your filesystem by having a mismatch between the in-memory cache and the reality. The processes which referenced files you can't find anywhere will end-up with EBADF or segfault depending on whether it was fd or mmap, ala revoke(). They'll probably die horribly. I'd rather have processes die than filesystems die, since in any case if the file isn't here anymore in practice the process could only destroy things. An interesting things there, nothing in that touches either the filesystem or the block devices. Everything is done at the VFS level. The devices don't need to care. And the "this filesystem goes there" can be done in userspace in an initramfs if people want to experiment with kinky strategies. After all, why not allow a sysadmin to regroup two filesystems into one though a suspend, the processes mostly don't need to care (well, tar may, but heh). Deleted files would have to be sillyrenamed or something. Implementation details ;-) Active networking links, you can consider them dead for a start. The networking guys can play with keepalives and stuff if they want to in a second step. Network seldom survives suspend anyway, too many timeouts involved, especially with dynamic IPs. That leaves references to devices. null, ptys, random, log are not a problem, they're virtual constructs. In a first approximation you can revoke() the rest brutally. On a "standard" system that will kill X (ouch), GPM and other input-interested devices, and everything with an opened sound device. Then you can add explicit state saving support to the devices you want, one by one. It may be possible to handle sound collectively at the ALSA layer level, I don't really know. Input shouldn't be too hard, not much state to save, X will be a pain and will probably need special casing. X is a big special case anyway, no matter what happens. For the less directly used devices you can always all explicit support when you feel like it. The interesting part is that either the device supports the suspend and says so explicitely, or the process can't access the device anymore using the previous fds/mmaps after resume. No weird half-condition. If (very) resilient, the process can even close, reopen, reconfigure and go on its merry way. And if you design the saving format correctly (attribute name/value pairs as text work beautifully for such a case), you can be resilient to extreme things including kernel version change or rsync-ing / and the state file and resuming in another box. And if a device gets something it can't parse as the state to go back to for a given fd/mmap for a process, it can always revoke() that one and go on. The main point of that kind of state-saving is to be trustable-by-design. For each process, either its environment could be restored correctly or the incorrect parts can not be accessed anymore. And the stability of the system and its filesystems is ensured pretty much whatever happens. There are a billion details to take into account in a real implementation, but I'm sure you can get the gist of the idea. OG.