From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S262175AbTEENIK (ORCPT ); Mon, 5 May 2003 09:08:10 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S262177AbTEENIK (ORCPT ); Mon, 5 May 2003 09:08:10 -0400
Received: from mailrelay2.lanl.gov ([128.165.4.103]:33928 "EHLO mailrelay2.lanl.gov")
	by vger.kernel.org with ESMTP id S262175AbTEENID (ORCPT ); Mon, 5 May 2003 09:08:03 -0400
Subject: Kernel hot-swap using Kexec, BProc and CC/SMP Clusters.
From: Steven Cole
To: linux-kernel@vger.kernel.org
Cc: "Eric W. Biederman", Larry McVoy, Linus Torvalds
Content-Type: text/plain
Organization:
Message-Id: <1052140733.2163.93.camel@spc9.esa.lanl.gov>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.2.4-1.1mdk
Date: 05 May 2003 07:18:55 -0600
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

This posting was sent yesterday but had a forbidden subject line, borrowed from a Jonathan Swift essay (A Modest Proposal). Here it is again, with apologies to those on the cc list for the duplicate.

The subject of loading a new kernel without restarting surfaces from time to time and has come up again on linux-kernel,
http://marc.theaimsgroup.com/?l=linux-kernel&m=105198997207784&w=2
so here are my thoughts on the matter. Comments, alternatives and reasons why this can't or shouldn't be done are all welcome, especially comments about how this could or should be done differently.

Here is yet another road-map for changing the kernel on a machine while minimizing the disruption to user processes. This method has the advantage that all of the major pieces here have either been proposed or are in various stages of development. However, the glue that holds it all together does not yet exist in any form, as far as I am aware.
At least the major pieces all have merits of their own, regardless of whether they are used together as described here.

Some disclaimers up front: This may be over-engineered, not possible, or just a horrible way to accomplish something no one really needs. This method would only work on a two-way or greater SMP box, and may not be feasible on 32-bit arches due to the difficulty (or impossibility) of squeezing more than one kernel into ZONE_NORMAL at the same time, although it's possible that techniques relevant to using very large amounts of memory (like page clustering on NUMA) could be adapted to come to the rescue here. Piece C) may be much easier on 64-bit arches for that reason, and hopefully 32-bit systems will take the place of 16-bit systems in our vague memories before the decade is out. Better to plan ahead now.

The major pieces are:

A) Kexec, now in 2.5.68-mm4. Kexec provides a way of Linux loading Linux. The relieving hot-swap kernel might be given command line arguments to come up without tweaking or probing any hardware and without running init. Information normally gained from hardware probes would be made available from the still running old kernel (or designated leader in the case of multiple old kernels on multiple nodes). The new kernel might be told on which CPU or set of CPUs to boot, or there might be some way for the appropriate CPU or set of CPUs to be reliably detected.

B) BProc, implemented for 2.4 but not in mainline and not yet ported to 2.5. Beowulf Distributed Process Space (BProc) is described here:
http://bproc.sourceforge.net/
and is used to manage this 1024-node machine:
http://www.lanl.gov/projects/pink/
which is located a few miles down the road from where I work on much more pedestrian projects.
In addition to managing user processes across machines in a traditional cluster, perhaps this could be developed to manage processes across nodes in a CC-cluster (and to transfer the functionality of the Master BProc Node to another), which brings us to C).

C) Cache coherent clustering proposed by Larry McVoy, described here:
http://www.bitmover.com/ml/slide01.html
and rather long threads on linux-kernel start here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=100751282125562&w=2
and
http://marc.theaimsgroup.com/?l=linux-kernel&m=100752000911911&w=2

This was proposed as a way to scale Linux to machines with large numbers of CPUs. With advances in multiple chip on die and more extreme hyper-threading, Linux may some day have to deal with, for example, a 512 CPU system. Think of a CC-cluster of 32 nodes of 16 CPUs each, with a separate kernel running each node. Obviously, there are many very difficult issues (like how the kernels interact and don't interfere with one another over i/o buses, etc) to be solved, so this piece is nowhere near being implemented, at least as far as I know. Many major problems with this have been pointed out before, so this could turn out to be infeasible. I hope that is not the case. The degenerate case is a two-way box with separate kernels on each CPU.

Putting these three pieces together, we could hot-swap the kernel, with user processes being minimally affected and external connections perhaps not even noticing.

For the simplest case of a dual-CPU box:

1) One of the CPUs is halted and declared unavailable. The user processes now have only one CPU on which to run, but this disruption will be temporary.

2) Using Kexec, the new kernel is booted by the old kernel on the halted CPU, with command line arguments to come up in a relieving mode, not probing hardware, not running init, getting necessary hardware configuration details from the old kernel.
The new kernel also has to come up shoe-horned into the same space as the old kernel in a CC-cluster mode (this is the acknowledged really hard part again).

3) Once up as a separate and autonomous kernel, the new kernel checks to see that it is properly configured for the hardware which it has just been told about and presents its qualifications to the old kernel. If it passes these tests, the old kernel uses BProc to transfer all user processes to the new kernel. A human analog for this exists in the formal transfer of authority between the on-coming and off-going officer of the deck on a naval vessel:

New kernel: "I am ready to relieve you"
(After assessing the situation. This would include determining which file systems and drivers were needed, which modules need to be loaded, etc. and perhaps which daemons need to be running prior to user process transfer)

Old kernel: "I am ready to be relieved"
(After assessing the relief's ability to take over. Same as above, but a double check on the new kernel's configuration. If either of these two steps fails, the kernel-swap is aborted, the old kernel tells the new kernel to shut down and takes back control of the halted CPU.)

New kernel: "I relieve you"
(user processes and daemons are now transferred with BProc)

Old kernel: "I stand relieved"
(all user processes are verified to be successfully transferred)

Now the old kernel can exit (or be told to shut down properly).

4) The new kernel assimilates the CPU on which the old kernel was running (resistance is futile) and marks it as available, so that user processes and the new kernel can now be scheduled on it. The hot-swap of kernels is now complete and for this simple case, the system is back to a regular single SMP kernel. The CC-cluster configuration was just an intermediate step.

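The relief handshake in step 3 can be read as a small state machine. What follows is only a minimal sketch under stated assumptions, not a design for the real thing: `SwapState`, `Kernel`, `hot_swap` and the two check callbacks are hypothetical names invented for illustration, and the BProc transfer is stood in for by moving a set of PIDs. No actual Kexec or BProc interface is used or implied.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Set


class SwapState(Enum):
    """Hypothetical phases of the relief handshake (illustrative names)."""
    OLD_IN_CONTROL = auto()
    NEW_READY = auto()      # new kernel: "I am ready to relieve you"
    OLD_READY = auto()      # old kernel: "I am ready to be relieved"
    TRANSFERRED = auto()    # new kernel: "I relieve you"
    RELIEVED = auto()       # old kernel: "I stand relieved"
    ABORTED = auto()        # swap cancelled, old kernel keeps control


@dataclass
class Kernel:
    """Toy stand-in for a running kernel: a name plus the PIDs it owns."""
    name: str
    processes: Set[int] = field(default_factory=set)


def hot_swap(old: Kernel, new: Kernel,
             new_self_check: Callable[[], bool],
             old_cross_check: Callable[[], bool]) -> List[SwapState]:
    """Walk the handshake and return the sequence of states visited."""
    history = [SwapState.OLD_IN_CONTROL]

    # New kernel assesses the situation (file systems, drivers, modules).
    if not new_self_check():
        history.append(SwapState.ABORTED)
        return history
    history.append(SwapState.NEW_READY)

    # Old kernel double-checks the new kernel's configuration.
    if not old_cross_check():
        history.append(SwapState.ABORTED)
        return history
    history.append(SwapState.OLD_READY)

    # Assumption: a set move stands in for the BProc process transfer.
    expected = set(old.processes)
    new.processes |= old.processes
    old.processes.clear()
    history.append(SwapState.TRANSFERRED)

    # Old kernel verifies that every process arrived before standing down.
    # The proposal does not say how to recover from a failure at this point.
    if not (expected <= new.processes and not old.processes):
        raise RuntimeError("transfer verification failed")
    history.append(SwapState.RELIEVED)
    return history
```

Calling `hot_swap(Kernel("old", {101, 102}), Kernel("new"), lambda: True, lambda: True)` walks all five states and leaves both PIDs with the new kernel. If either check returns False, the history ends in `ABORTED` and the old kernel still owns its processes, matching the abort rule described above.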
Since the new kernel doesn't have to probe any hardware, this hot-swap could in principle be very fast, so the time that the system's capability is degraded could be very short, on the order of a few seconds or less. For a 2-CPU system, this temporary degradation would be at least 50%. For the extreme case of the 512 CPU system with 32 nodes, the new kernels could be brought up on one node at a time, so the degradation might be as little as 1/32, or about 3%.

The interfaces for doing all this should probably remain stable during a major release cycle, so that any properly configured 2.8.x kernel would be able to hot-swap with any other properly configured 2.8.x kernel. The earliest that this could be done is 2.7.something, but perhaps even later, judging from all the problems identified in the cc/smp cluster threads.

Perhaps a brief note about why a kernel hot-swap is even desired might be in order here. As systems become more and more complex (and therefore important), their boot times seem to increase. My experience with production systems is that taking them down for even a short time can be hard to schedule. Rebooting for a needed upgrade with even a well-tested vendor-kernel is a hard sell sometimes. Yes, I know that these are issues which are orthogonal to this discussion, but having the ability to install a new kernel with almost no disruption to the customer could be worth it for some customers.

Simpler and easier strategies for almost accomplishing a hot-swap involving user process check-pointing have been suggested, but those involve halting the system for a short time. If the gain of a no-halt hot-swap is worth the considerable pain, then perhaps this road-map is worth investigating further.

If you made it this far, thanks in advance for reading this to the end.

Steven