* Re: [lkcd-devel] Re: What's left over.
@ 2002-10-31 20:22 Andreas Herrmann
2002-10-31 20:40 ` Linus Torvalds
0 siblings, 1 reply; 72+ messages in thread
From: Andreas Herrmann @ 2002-10-31 20:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, lkcd-devel, lkcd-devel-admin, lkcd-general,
Rusty Russell, Matt D. Robinson
Linus Torvalds <torvalds@transmeta.com>
Sent by: lkcd-devel-admin@lists.sourceforge.net
10/31/02 04:46 PM
On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> People have to realize that my kernel is not for random new
> features. The stuff I consider important are things that people
> use on their own, or stuff that is the base for other work.
A dump mechanism within the kernel is a base for much easier
kernel debugging.
IMHO, analyzing a dump is much more effective than guessing
a kernel bug solely with help of an oops message.
Using lkcd/lcrash, I've debugged enough problems in
kernel modules that were otherwise quite hard to determine.
It is hard to understand why developers do not want the
aid of dump/dump-analysis for kernel development.
Regards,
Andreas
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 20:22 [lkcd-devel] Re: What's left over Andreas Herrmann
@ 2002-10-31 20:40 ` Linus Torvalds
2002-10-31 20:54 ` Patrick Finnegan
2002-10-31 21:08 ` Benjamin LaHaise
0 siblings, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2002-10-31 20:40 UTC (permalink / raw)
To: Andreas Herrmann
Cc: linux-kernel, lkcd-devel, lkcd-devel-admin, lkcd-general,
Rusty Russell, Matt D. Robinson
On Thu, 31 Oct 2002, Andreas Herrmann wrote:
>
> A dump mechanism within the kernel is a base for much easier
> kernel debugging.
> IMHO, analyzing a dump is much more effective than guessing
> a kernel bug solely with help of an oops message.
And imnsho, debugging the kernel on a source level is the way to do it.
Which is why it's not going to be me who merges it.
Read my emails.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 20:40 ` Linus Torvalds
@ 2002-10-31 20:54 ` Patrick Finnegan
2002-10-31 21:08 ` Benjamin LaHaise
1 sibling, 0 replies; 72+ messages in thread
From: Patrick Finnegan @ 2002-10-31 20:54 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Herrmann, linux-kernel, lkcd-devel, lkcd-devel-admin,
lkcd-general, Rusty Russell, Matt D. Robinson
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Andreas Herrmann wrote:
> >
> > A dump mechanism within the kernel is a base for much easier
> > kernel debugging.
> > IMHO, analyzing a dump is much more effective than guessing
> > a kernel bug solely with help of an oops message.
>
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.
But, LKCD is useful also for tracing crashes back to hardware that causes
it. It's really hard to find problems in hardware using source code,
since the source code DOENS'T have anything to do with the problems.
Pat
--
Purdue Universtiy ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu
http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 20:40 ` Linus Torvalds
2002-10-31 20:54 ` Patrick Finnegan
@ 2002-10-31 21:08 ` Benjamin LaHaise
2002-10-31 22:04 ` Bernhard Kaindl
1 sibling, 1 reply; 72+ messages in thread
From: Benjamin LaHaise @ 2002-10-31 21:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andreas Herrmann, linux-kernel, lkcd-devel, lkcd-devel-admin,
lkcd-general, Rusty Russell, Matt D. Robinson
On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.
>
> Read my emails.
That is one of the reasons that crash dumps are useful. Quite a few
problems that customers hit are not easy to reproduce, but when they
provide a dump file that can be loaded into gdb with the original
kernel debugging info and the backtrace command issued and various
bits of internal structures examined, usually a good hypothesis can
be made for the cause. Feed that back into a code audit and you end
up fixing problems that are decidedly challenging.
-ben
--
"Do you seek knowledge in time travel?"
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 21:08 ` Benjamin LaHaise
@ 2002-10-31 22:04 ` Bernhard Kaindl
2002-11-01 0:33 ` Werner Almesberger
0 siblings, 1 reply; 72+ messages in thread
From: Bernhard Kaindl @ 2002-10-31 22:04 UTC (permalink / raw)
To: linux-kernel; +Cc: Linus Torvalds, lkcd-general
On Thu, 31 Oct 2002, Benjamin LaHaise wrote:
> On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> > And imnsho, debugging the kernel on a source level is the way to do it.
> >
> > Which is why it's not going to be me who merges it.
> >
> > Read my emails.
>
> That is one of the reasons that crash dumps are useful. Quite a few
> problems that customers hit are not easy to reproduce, but when they
> provide a dump file that can be loaded into gdb with the original
> kernel debugging info and the backtrace command issued and various
> bits of internal structures examined, usually a good hypothesis can
> be made for the cause. Feed that back into a code audit and you end
> up fixing problems that are decidedly challenging.
>
> -ben
I could not have said it better. I've a good real-life example for it,
one which really happened and one just as example to give an image.
[ I'm not an expert, I'm just writing about my experiance ]
[ in order to try to make linux even better than it is ]
About debugging at source level:
Dump analysis does not say that you are not debugging on a source level,
with a vmlinux compiled with -g, (which could be stripped before making
the image) crash analysis tools could operate at source level(depending
on the compiler's reorderings of course, the assumtion that -O2 maps
source:binary 1:1 is of course not from this world)
An analogy to doctors, hospitals and patients:
dump analysis says you don't need to have a living patient
in order to cure a disease. It says you may have slept on the
other side of the world while the disease murdered your fellow
at home. But as you don't like that it happens again to another
fellow, you want to have a remote lab which gives you every info
you need to have in order to know what might have murdered him.
The dump tools are this remote lab. If you don't have it, you
may need to fly over to the site where the disease is, monitor
the patient and try to find out what's happening and you can't
find out what's up without at least one another dead patient at
the end.
But the hospital may not like to even have one single dead
patient more than neccesary(best 0) and would choose a doctor
who has the remote lab where he can quickly check what's up
and find a cure *before* the next patient gets ill.
Back to the computer world, this would mean that an OS having
the remote lab(dump tools) would be favoured over on OS that
don't has. The same goes for LTT and Dynamic Probes.
Back to crash dump: In some environments like laboratory or blood
bank information systems you need to use computers in order to
efficiently process, store and distribute data, and organize
the handling of blood. In such environments, the life of people
can change on a fast, efficiently and stably working organsation.
Of course you need to be able to recover and continue such
organisation even with the laboratory information system being
down for a reboot or maintenance.
But you simply cannot go there, halt all the distributed information
retrieval and automated job control with the laboratory apparatuses,
block all the users(maybe thousands) for debugging the kernel and
check what is going on while the whole hospital is waiting for you.
Of course you can do this, but only once or only in at a time
where every use of the system can be organized to bypass it und
use paper, in-house mail and phone to do the things the system
is normally doing. A hospital with thousands of patients cannot
wait while debugging.
> Which is why it's not going to be me who merges it.
Sure, but it would help Linux World Domination if the base
kernel would support it also.
Bernd
PS: Sorry for the extreme example but this is an example
I know from my previous work and I've just tried to describe
it as real as possible.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 22:04 ` Bernhard Kaindl
@ 2002-11-01 0:33 ` Werner Almesberger
0 siblings, 0 replies; 72+ messages in thread
From: Werner Almesberger @ 2002-11-01 0:33 UTC (permalink / raw)
To: Bernhard Kaindl; +Cc: linux-kernel, Linus Torvalds, lkcd-general
Bernhard Kaindl wrote:
> An analogy to doctors, hospitals and patients:
I have a simpler medical analogy:
- in many cases, all you know is that the patient died
(e.g. think of a router - it has no console, no user
interacting with it, etc.)
- the Oops tells you the the patient died of a heart failure
(NULL pointer dereferenced in this or that function, called
from ...)
- but it's only the autopsy (the crash dump) that reveals that
the patient was poisoned, and that this is not a routine
case
I view crash dumps as a tool that helps me imagine what the
machine was doing. Without that, I can learn many interesting
things about the code, but I won't necessarily find the actual
bug.
Examples of non-obvious bugs can be found in the various module
unload race discussions. There, usually competent people
suggested incorrect designs, simply because they failed to
imagine some constellations, and no amount of staring at the
source could have helped this lack of imagination.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-11-02 10:36 Brad Hards
2002-11-02 19:28 ` [lkcd-devel] " Matt D. Robinson
0 siblings, 1 reply; 72+ messages in thread
From: Brad Hards @ 2002-11-02 10:36 UTC (permalink / raw)
To: Matt D. Robinson; +Cc: Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Fri, 1 Nov 2002 13:01, Matt D. Robinson wrote:
<snip>
> Uh ... have you read the patches? Do you see how few the
> changes are to non-dump code? Do you know that most of those
> changes only get triggered in a crash situation anyway?
I applied the patches, and reported some issues.
http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
I see no signs that any of them have been addressed, although I haven't tried
a really recent set.
> Breakage occurs when people change code areas that are used
> all the time, like VM, network, block layer, etc.
Actually, this is the area that Linux is best at. If you break it, some poor
sod will hit the problem, and you'll know really soon.
> Look at the patches and tell me where we are causing overhead
> and and seriously potential breakage. If you find problems,
> then tell us, don't just comment on breakage scenarios.
I'm a fairly typical user - I just have a couple of desktop machines and a
server/firewall.
I don't have 700 nodes in a cluster, and when my machines break, its normally
something I did. Sometimes the desktop locks up (say every second month,
unless I'm dicking with the kernel), but I reboot and everything is happy.
LKCD doesn't really seem to do anything for me - it wouldn't really worry me
if it went in (since I don't have to maintain it - it isn't near any of my
code), but I'd really prefer that having the _CONFIG option set to N didn't
make the kernel any bigger, or change any code paths.
Is this unreasonable?
Brad
BTW: I admit that I'd be pretty pissed if Linus said that my code was
"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for
a few walks, and worry about how to get it integrated after that.
- --
http://linux.conf.au. 22-25Jan2003. Perth, Aust. I'm registered. Are you?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org
iD8DBQE9w6rCW6pHgIdAuOMRAlI5AJ48ELVdExIeCr5C5HtDpU5+1ZnuBQCdEji0
t4q2NjZQVGEumrz6b+CqEEs=
=xtYY
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-02 10:36 Brad Hards
@ 2002-11-02 19:28 ` Matt D. Robinson
0 siblings, 0 replies; 72+ messages in thread
From: Matt D. Robinson @ 2002-11-02 19:28 UTC (permalink / raw)
To: Brad Hards; +Cc: Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel
On Sat, 2 Nov 2002, Brad Hards wrote:
|>I applied the patches, and reported some issues.
|>http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
|>I see no signs that any of them have been addressed, although I haven't tried
|>a really recent set.
We did put your fixes in, if they don't work, let me know.
|>LKCD doesn't really seem to do anything for me - it wouldn't really worry me
|>if it went in (since I don't have to maintain it - it isn't near any of my
|>code), but I'd really prefer that having the _CONFIG option set to N didn't
|>make the kernel any bigger, or change any code paths.
|>
|>Is this unreasonable?
Absolutely not. I would expect most people to not use it, and I
would hope that most distributions would build it as a module but
not turn it on (unless they really wanted it on by default).
|>Brad
|>
|>BTW: I admit that I'd be pretty pissed if Linus said that my code was
|>"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for
|>a few walks, and worry about how to get it integrated after that.
It's neither here nor there anymore. I think if companies like
Red Hat don't want it turned on, that's fine, but they should at
least allow their customers to have it available to them for
use, if that's what they want.
Of course, I'm not going to go through all the reasons why there's
a major disconnect between Linux distributions and hardware vendors,
but suffice it to say that's the root of the problem here.
--Matt
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-11-01 19:18 Linus Torvalds
2002-11-01 20:22 ` [lkcd-devel] " Matt D. Robinson
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-11-01 19:18 UTC (permalink / raw)
To: Joel Becker
Cc: Alan Cox, Bill Davidsen, Chris Friesen, Matt D. Robinson,
Rusty Russell, Linux Kernel Mailing List, lkcd-general,
lkcd-devel
On Fri, 1 Nov 2002, Joel Becker wrote:
>
> I always liked the AIX dumper choices. You could either dump to
> the swap area (and startup detects the dump and moves it to the
> filesystem before swapon) or provide a dedicated dump partition. The
> latter was prefered.
> Either of these methods merely require the dumper to correctly
> write to one disk partition. This is about as simple as you are going
> to get in disk dumping.
Ehh.. That was on closed hardware that was largely designed with and for
the OS.
Alan isn't worried about the "which sector do I write" kind of thing.
That's the trivial part. Alan is worried about the fact that once you know
which sector to write, actually _doing_ so is a really hard thing. You
have bounce buffers, you have exceedingly complex drivers that work
differently in PIO and DMA modes and are more likely than not the _cause_
of a number of problems etc.
And you have a situation where interrupts are not likely to work well
(because you crashed with various locks held), so the regular driver
simply isn't likely to work all that well.
And you have a situation where there are hundreds of different kinds of
device drivers for the disk.
In other words, the AIX situation isn't even _remotely_ comparable. A
large portion of the complexity in the PC stability space is in device
drivers. It's the thing I worry most about for 2.6.x stabilization, by
_far_.
And if you get these things wrong, you're quite likely to stomp on your
disk. Hard. You may be tryign to write the swap partition, but if the
driver gets confused, you just overwrote all your important data. At which
point it doesn't matter if your filesystem is journaling or not, since you
just potentially overwrote it.
In other words: it's a huge risk to play with the disk when the system is
already known to be unstable. The disk drivers tend to be one of the main
issues even when everything else is _stable_, for chrissake!
To add insult to injury, you will not be able to actually _test_ any of
the real error paths in real life. Sure, you will be able to test forced
dumps on _your_ hardware, but while that is fine in the AIX model ("we
control the hardware, and charge the user five times what it is worth"),
again that doesn't mean _squat_ in the PC hardware space.
See?
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-01 19:18 Linus Torvalds
@ 2002-11-01 20:22 ` Matt D. Robinson
2002-11-02 13:02 ` Kai Henningsen
0 siblings, 1 reply; 72+ messages in thread
From: Matt D. Robinson @ 2002-11-01 20:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Joel Becker, Alan Cox, Bill Davidsen, Chris Friesen,
Rusty Russell, Linux Kernel Mailing List, lkcd-general,
lkcd-devel
On Fri, 1 Nov 2002, Linus Torvalds wrote:
|>Alan isn't worried about the "which sector do I write" kind of thing.
|>That's the trivial part. Alan is worried about the fact that once you know
|>which sector to write, actually _doing_ so is a really hard thing. You
|>have bounce buffers, you have exceedingly complex drivers that work
|>differently in PIO and DMA modes and are more likely than not the _cause_
|>of a number of problems etc.
[ preamble - this is only a technical discussion, I'm interested
in feedback on what we can improve upon ]
I agree with you. We'd prefer to have a better low-level driver
primitive sitting on top of two low-level disk drivers (IDE and
SCSI). Fundamentally, though, this is difficult to do:
0) There's a lot of early stuff you take risks with, such as the
partition size (assuming you can probe it), knowing that it
hasn't changed since boot, and pre-allocating buffers for disk
I/O operations. You always take the partition risk no matter
what.
1) You have to establish that the IDE or SCSI device can be reset
into an appropriate mode for seek/write mode -- if a DMA operation
fails to the drive, and you can't reset the drive, you may be stuck.
2) Once the hardware reports back success, it is a matter of how
you write the blocks. I once wrote the low-level IDE driver
below request structures, writing sequentially to the drive,
and ran into occasional drive lock-ups while writing during
interrupt crashes. This was more likely due to my inexperience
with the IDE driver than anything else.
|>And you have a situation where interrupts are not likely to work well
|>(because you crashed with various locks held), so the regular driver
|>simply isn't likely to work all that well.
This is simply an avoidance of certain code paths. We saw this
problem earlier in 2.2 using kiobufs and got around it for the
most part by doing our best to avoid the io_request_lock. That's
why we haven't seen the lock contention problems for 2.5.
|>And you have a situation where there are hundreds of different kinds of
|>device drivers for the disk.
This is the biggest problem, absolutely. Our idea moving forward
was to create a _dump() primitive with drivers that allows you to
determine, upon configuration of a disk dump device, whether or
not the low-level driver supported dumping or not. I suggested this
to Al Viro a long time ago on this list, but it didn't go anywhere.
That way the driver itself knows that it can support a low-level
page-write method. If it doesn't, you can't use disk dumping to
that device.
I'm willing to re-open this effort.
|>And if you get these things wrong, you're quite likely to stomp on your
|>disk. Hard. You may be tryign to write the swap partition, but if the
|>driver gets confused, you just overwrote all your important data. At which
|>point it doesn't matter if your filesystem is journaling or not, since you
|>just potentially overwrote it.
We haven't seen this before, but it is always a possibility for any
dump scenario. That's why you some choose netdump instead. :)
|>In other words: it's a huge risk to play with the disk when the system is
|>already known to be unstable. The disk drivers tend to be one of the main
|>issues even when everything else is _stable_, for chrissake!
|>
|>To add insult to injury, you will not be able to actually _test_ any of
|>the real error paths in real life. Sure, you will be able to test forced
|>dumps on _your_ hardware, but while that is fine in the AIX model ("we
|>control the hardware, and charge the user five times what it is worth"),
|>again that doesn't mean _squat_ in the PC hardware space.
We have actually done a lot of testing with injection of failures
into the middle of VM, network drivers, etc., in conjunction with
disk dumping. Certainly it doesn't cover all the cases, but nothing
ever will.
|> Linus
--Matt
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-01 20:22 ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-02 13:02 ` Kai Henningsen
0 siblings, 0 replies; 72+ messages in thread
From: Kai Henningsen @ 2002-11-02 13:02 UTC (permalink / raw)
To: linux-kernel
yakker@aparity.com (Matt D. Robinson) wrote on 01.11.02 in <Pine.LNX.4.44.0211011205330.26575-100000@nakedeye.aparity.com>:
> On Fri, 1 Nov 2002, Linus Torvalds wrote:
> |>And if you get these things wrong, you're quite likely to stomp on your
> |>disk. Hard. You may be tryign to write the swap partition, but if the
> |>driver gets confused, you just overwrote all your important data. At which
> |>point it doesn't matter if your filesystem is journaling or not, since you
> |>just potentially overwrote it.
>
> We haven't seen this before, but it is always a possibility for any
> dump scenario. That's why you some choose netdump instead. :)
*If* you want safe dumping to a partition, it seems wrong to me to try to
figure that out after the crash.
Instead,
* configure the crash space with a user-mode app or possibly a kernel
command line arg
* Whenever repartitioning, check if the crash dump partition is affected,
and if so, clear it until it is explicitely reconfigured
* Save a good checksum (say, md5 or sha1) of the crash partition config,
and only dump if that checksum checks out
You might want to checksum even more than that, of course :-)
But there's certainly a reason Netware liked to crash dump to a series of
floppies - too bad those are much too small for today's machines. When
floppy sizes stopped to be slightly larger than standard RAM sizes[*], the
computing public lost big time, and we haven't recovered from that.
[*] Apple ][+: 48 KB RAM, 140 KB floppy. IBM PC: 640 KB RAM, 1.2 MB
floppy. (Yes, I know there were other combinations as well.) Where's my
approximately-1-GB floppy that everyone and their aunt have installed
today? No, CD writers are *not* universal. And burn-once CDs aren't much
like floppies.
Of course, the same problem exists with general backup technology - tape
the size of modern disks is not really affordable anymore.
MfG Kai
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-11-01 6:36 Linus Torvalds
2002-11-01 7:00 ` [lkcd-devel] " Castor Fu
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-11-01 6:36 UTC (permalink / raw)
To: Bill Davidsen
Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
On Fri, 1 Nov 2002, Bill Davidsen wrote:
>
> If you really believed the stuff you say you'd put it in and promise to
> take it out if people didn't find it useful or there were inherent
> limitations.
This never works. Be honest. Nobody takes out features, they are stuck
once they get in. Which is exactly why my job is to say "no", and why
there is no "accepted unless proven bad".
> It would probably take 10-30% off the time to a stable release.
Talk is cheap.
I've not seen a _single_ bug-report with a fix that attributed the
existing LKCD patches. I might be more impressed if I had.
The basic issue is that we don't put patches in in the hope that they will
prove themselves later. Your argument is fundamentally flawed.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-01 6:36 Linus Torvalds
@ 2002-11-01 7:00 ` Castor Fu
0 siblings, 0 replies; 72+ messages in thread
From: Castor Fu @ 2002-11-01 7:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Bill Davidsen, Matt D. Robinson, Rusty Russell, linux-kernel,
lkcd-general, lkcd-devel
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Fri, 1 Nov 2002, Bill Davidsen wrote:
> >
> > If you really believed the stuff you say you'd put it in and promise to
> > take it out if people didn't find it useful or there were inherent
> > limitations.
>
> This never works. Be honest. Nobody takes out features, they are stuck
> once they get in. Which is exactly why my job is to say "no", and why
> there is no "accepted unless proven bad".
>
> > It would probably take 10-30% off the time to a stable release.
>
> Talk is cheap.
>
> I've not seen a _single_ bug-report with a fix that attributed the
> existing LKCD patches. I might be more impressed if I had.
Maybe people don't bother to spell out how they got there. Here's one.
-castor
:: Newsgroups: mlist.linux.kernel
:: Date: Mon, 17 Dec 2001 09:48:53 -0800 (PST)
:: From: Castor Fu <castor@3pardata.com>
:: X-To: <linux-kernel@vger.kernel.org>
:: Subject: i386 machine_restart unsafe in interrupt context
:: Message-ID: <linux.kernel.Pine.LNX.4.33.0112170935520.1623-100000@marais.SOMEWHERE>
:: MIME-Version: 1.0
:: Content-Type: TEXT/PLAIN; charset=US-ASCII
:: Approved: news@nntp-server.caltech.edu
:: Lines: 27
::
::
:: I have a problem where systems fail to reboot on panic(). I've resolved
:: it by changing smp_send_stop() to use an NMI (like the KDB patch does to
:: manage communication).
::
:: The source of the problem is that the panic path has the following:
::
:: panic()
:: machine_restart()
:: machine_real_restart()
:: smp_send_stop()
:: smp_call_function()
::
:: and smp_call_function() is not safe in an interrupt context.
::
:: I imagine people might want to handle this differently, but I'd be
:: happy to diffs if there's interest. It may be that there are enough
:: cases like this that smp_call_function might want a version that
:: uses an NMI. . .
::
:: -Castor Fu
:: castor@3par.com
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
@ 2002-10-31 22:47 Richard J Moore
2002-10-31 23:39 ` Werner Almesberger
0 siblings, 1 reply; 72+ messages in thread
From: Richard J Moore @ 2002-10-31 22:47 UTC (permalink / raw)
To: Werner Almesberger
Cc: Jeff Garzik, linux-kernel, lkcd-devel, lkcd-devel-admin,
lkcd-general, Rusty Russell, Linus Torvalds, Matt D. Robinson
> I'm not so convinced about this. I like the Mission Critical
> approach:
and so do many people. In fact netdump, mcode and lkcd are all
complementary parts of the same need. That's why we are working with
mcrit's blessing to merge mcore into lkcd. That's a big piece of work,
which we hope to make progress with during 2003 - Suparna's the expert :-)
Richard
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 22:47 Richard J Moore
@ 2002-10-31 23:39 ` Werner Almesberger
2002-11-05 12:45 ` Suparna Bhattacharya
0 siblings, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-10-31 23:39 UTC (permalink / raw)
To: Richard J Moore
Cc: Jeff Garzik, linux-kernel, lkcd-devel, lkcd-devel-admin,
lkcd-general, Rusty Russell, Linus Torvalds, Matt D. Robinson
Richard J Moore wrote:
> and so do many people. In fact netdump, mcode and lkcd are all
> complementary parts of the same need.
It's the "complementary" that worries me. Once you have mcore, what
good are direct dumps to the network or the disk for ? With mcore,
the whole issue of accessing stable storage is eliminated.
I don't know if the approach of having multiple quasi-equivalent
means of storing a dump is something that Linus dislikes about
LKCD, but I think it might be worth exploring if LKCD's chance of
acceptance could be improved by focusing on a single but general
mechanism.
I think it would be a pity if we ended up not having crash dumps
in 2.6 only because they're over-featured ...
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 23:39 ` Werner Almesberger
@ 2002-11-05 12:45 ` Suparna Bhattacharya
0 siblings, 0 replies; 72+ messages in thread
From: Suparna Bhattacharya @ 2002-11-05 12:45 UTC (permalink / raw)
To: Werner Almesberger
Cc: Richard J Moore, Jeff Garzik, linux-kernel, lkcd-devel,
lkcd-devel-admin, lkcd-general, Rusty Russell, Linus Torvalds,
Matt D. Robinson
On Thu, Oct 31, 2002 at 08:39:35PM -0300, Werner Almesberger wrote:
> Richard J Moore wrote:
> > and so do many people. In fact netdump, mcode and lkcd are all
> > complementary parts of the same need.
>
> It's the "complementary" that worries me. Once you have mcore, what
> good are direct dumps to the network or the disk for ? With mcore,
> the whole issue of accessing stable storage is eliminated.
>
> I don't know if the approach of having multiple quasi-equivalent
> means of storing a dump is something that Linus dislikes about
> LKCD, but I think it might be worth exploring if LKCD's chance of
> acceptance could be improved by focusing on a single but general
> mechanism.
The very question that's kept me up late some nights :)
And one of the reasons for spending so much time in integrating
mcore seamlessly into the lkcd framework rather than plug it in
as is at a high level. Precisely to avoid bloat while retaining
flexibility and to move from something that works today to
more improved schemes in the future.
The decision on what dump device implementations - block, net,
memory, and other special types to include could be a separate
one from the base dump system, and could change as time passes.
>
> I think it would be a pity if we ended up not having crash dumps
> in 2.6 only because they're over-featured ...
The dump driver interface is pretty simple, if you look at it
.. though it was meant to be powerful enough to do a lot of nice
things in the future.
Regards
Suparna
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by: Influence the future
> of Java(TM) technology. Join the Java Community
> Process(SM) (JCP(SM)) program now.
> http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0004en
> _______________________________________________
> lkcd-devel mailing list
> lkcd-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lkcd-devel
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-10-31 21:33 Rusty Russell
2002-11-01 1:19 ` [lkcd-devel] " Matt D. Robinson
0 siblings, 1 reply; 72+ messages in thread
From: Rusty Russell @ 2002-10-31 21:33 UTC (permalink / raw)
To: Chris Friesen
Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
lkcd-general, lkcd-devel
In message <3DC171FF.5000803@nortelnetworks.com> you write:
> Ideally I would like to see a dump framework that can have a number of
> possible dump targets. We should be able to dump to any combination of
> network, serial, disk, flash, unused ram that isn't wiped over restarts,
> etc...
Both the lkcd and ide mini-oopser have that (although the mini-oopser
has only x86-ide for now).
The mini-oopser has different aims than LCKD: they want to debug one
system, I want to make sure we're reaping OOPS reports from those 99%
of desktop users who run X and simply reboot when their machine
crashes once a month.
I did *not* put the mini-oopser on the Snowball list, because I don't
have time to polish it.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 21:33 Rusty Russell
@ 2002-11-01 1:19 ` Matt D. Robinson
2002-11-01 2:59 ` Rusty Russell
0 siblings, 1 reply; 72+ messages in thread
From: Matt D. Robinson @ 2002-11-01 1:19 UTC (permalink / raw)
To: Rusty Russell
Cc: Chris Friesen, Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel
On Fri, 1 Nov 2002, Rusty Russell wrote:
|>The mini-oopser has different aims than LCKD: they want to debug one
|>system, I want to make sure we're reaping OOPS reports from those 99%
|>of desktop users who run X and simply reboot when their machine
|>crashes once a month.
I'd like to incorporate the mini-oopser as an LKCD dump method.
I'll chat with you off-line about this. Shouldn't be that
difficult to do.
|>I did *not* put the mini-oopser on the Snowball list, because I don't
|>have time to polish it.
|>
|>Rusty.
Thanks,
--Matt
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-01 1:19 ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-01 2:59 ` Rusty Russell
0 siblings, 0 replies; 72+ messages in thread
From: Rusty Russell @ 2002-11-01 2:59 UTC (permalink / raw)
To: Matt D. Robinson
Cc: Chris Friesen, Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel
In message <Pine.LNX.4.44.0210311718140.23393-100000@nakedeye.aparity.com> you
write:
> On Fri, 1 Nov 2002, Rusty Russell wrote:
> |>The mini-oopser has different aims than LCKD: they want to debug one
> |>system, I want to make sure we're reaping OOPS reports from those 99%
> |>of desktop users who run X and simply reboot when their machine
> |>crashes once a month.
>
> I'd like to incorporate the mini-oopser as an LKCD dump method.
> I'll chat with you off-line about this. Shouldn't be that
> difficult to do.
That would defeat the "mini" part 8)
Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-10-31 20:59 Dave Anderson
2002-11-01 1:25 ` [lkcd-devel] " Matt D. Robinson
0 siblings, 1 reply; 72+ messages in thread
From: Dave Anderson @ 2002-10-31 20:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
On Thu, 31 Oct 2002, Linus Torvalds wrote:
> - included features kill off (potentially better) projects.
>
> There's a big "inertia" to features. It's often better to keep
> features _off_ the standard kernel if they may end up being
> further developed in totally new directions.
>
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
>
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).
It would be most unfortunate if the existance of netdump is used as a
reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> We want to see this in the kernel, frankly, because it's a pain
> in the butt keeping up with your kernel revisions and everything
> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
> (hopefully) Red Hat don't want to spend their time having to roll
> this stuff in each and every time you roll a new kernel.
While Red Hat advocates Ingo's netdump option, we have customer
requests that are requiring us to look at LKCD disk-based dumps as an
alternative, co-existing dump mechanism. Since the two methods are not mutually
exclusive, LKCD will never kill off netdump -- nor certainly vice-versa. We're
all just looking for a better means to be able to
provide support to our customers, not to mention its value as a
development aid.
Dave Anderson
Red Hat, Inc.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 20:59 Dave Anderson
@ 2002-11-01 1:25 ` Matt D. Robinson
0 siblings, 0 replies; 72+ messages in thread
From: Matt D. Robinson @ 2002-11-01 1:25 UTC (permalink / raw)
To: Dave Anderson
Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
|>On Thu, 31 Oct 2002, Matt D. Robinson wrote:
|>> We want to see this in the kernel, frankly, because it's a pain
|>> in the butt keeping up with your kernel revisions and everything
|>> else that goes in that changes. And I'm sure SuSE, UnitedLinux and
|>> (hopefully) Red Hat don't want to spend their time having to roll
|>> this stuff in each and every time you roll a new kernel.
|>
|>While Red Hat advocates Ingo's netdump option, we have customer
|>requests that are requiring us to look at LKCD disk-based dumps as an
|>alternative, co-existing dump mechanism. Since the two methods are
|>not mutually exclusive, LKCD will never kill off netdump -- nor
|>certainly vice-versa. We're all just looking for a better means
|>to be able to provide support to our customers, not to mention
|>its value as a development aid.
I think you and I are in agreement (as always has been in the
past), Dave. LKCD is meant to create a base for disk, network,
or any dump method. If Red Hat wants netdump to be the primary
dumping method, that's Red Hat's decision, and more power to
them. If SuSE wants disk dumps, that's SuSE's decision. But
for both of them to have to roll their own every single release
or kernel upgrade is unproductive.
What's most concerning about this entire discussion is that I
bet < 20% of the people discussing this have actually LOOKED at
the LKCD patches to see whether or not this is as invasive,
difficult, bloated, or anything negative. We've spent over a
month now posting them, getting comments, responding to all of
the comments, making sure feedback is accounted for and
responded to, only to get an "LKCD is stupid" type response.
--Matt
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [lkcd-devel] Re: What's left over.
@ 2002-10-31 18:17 Deepak Kumar Gupta, Noida
0 siblings, 0 replies; 72+ messages in thread
From: Deepak Kumar Gupta, Noida @ 2002-10-31 18:17 UTC (permalink / raw)
To: Chris Friesen, Linus Torvalds
Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
> Linus Torvalds wrote:
>
> > In particular when it comes to this project, I'm told about
> > "netdump", which doesn't try to dump to a disk, but
> over the net.
> > And quite frankly, my immediate reaction is to say "Hell, I
> > _never_ want the dump touching my disk, but over the network
> > sounds like a great idea".
> >
> > To me this says "LKCD is stupid". Which means that I'm not
> going to apply
> > it, and I'm going to need some real reason to do so - ie
> being proven
> > wrong in the field.
>
> How do you deal with netdump when your network driver is what
> caused the
> crash?
>
> Ideally I would like to see a dump framework that can have a
> number of
> possible dump targets. We should be able to dump to any
> combination of
> network, serial, disk, flash, unused ram that isn't wiped
> over restarts,
> etc...
This is what the LKCD with generic interface is .. LKCD with generic
interface has the capability to include various dump targets in a very clean
way. Originally the LKCD meant for saving dump only on the disks, but its
generic interface has provided the option to have a number of dump targets.
Regards
Deepak.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-10-31 17:25 Linus Torvalds
2002-10-31 21:02 ` Jeff Garzik
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-10-31 17:25 UTC (permalink / raw)
To: Matt D. Robinson; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
[ Ok, this is a really serious email. If you don't get it, don't bother
emailing me. Instead, think about it for an hour, and if you still don't
get it, ask somebody you know to explain it to you. ]
On Thu, 31 Oct 2002, Matt D. Robinson wrote:
>
> Sure, but why should they have to? What technical reason is there
> for not including it, Linus?
There are many:
- bloat kills:
My job is saying "NO!"
In other words: the question is never EVER "Why shouldn't it be
accepted?", but it is always "Why do we really not want to live
without this?"
- included features kill off (potentially better) projects.
There's a big "inertia" to features. It's often better to keep
features _off_ the standard kernel if they may end up being
further developed in totally new directions.
In particular when it comes to this project, I'm told about
"netdump", which doesn't try to dump to a disk, but over the net.
And quite frankly, my immediate reaction is to say "Hell, I
_never_ want the dump touching my disk, but over the network
sounds like a great idea".
To me this says "LKCD is stupid". Which means that I'm not going to apply
it, and I'm going to need some real reason to do so - ie being proven
wrong in the field.
(And don't get me wrong - I don't mind getting proven wrong. I change my
opinions the way some people change underwear. And I think that's ok).
> I completely don't understand your reasoning here.
Tough. That's YOUR problem.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
2002-10-31 17:25 Linus Torvalds
@ 2002-10-31 21:02 ` Jeff Garzik
2002-10-31 22:37 ` Werner Almesberger
0 siblings, 1 reply; 72+ messages in thread
From: Jeff Garzik @ 2002-10-31 21:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
Linus Torvalds wrote:
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
>
[yes, I realize the LKCD merge debate is over, bear with me :)]
I'm sort of in the middle on this issue: The existence of netdump does
not imply that disk dumps are a bad thing.
netdumps require a net dump server, and it is simply not realistic at
all to assume that users seeing crashes will always have a netdump
server set up in advance, or even have multiple machines to make that
possible. Disk dumps are valuable because their requirements are very
low, and because of all the user-support reasons that Andrew Morton
mentioned in this thread.
That said, I used to be an LKCD cheerleader until a couple people made
some good points to me: it is not nearly low-level enough to truly be
of use in crash situations. netdump can work if your interrupts are
hosed/screaming, and various mid-layers are dying. For LKCD to be of
any use, it needs to _skip_ the block layer and talk directly to
low-level drivers.
So, I think the stock kernel does need some form of disk dumping,
regardless of any presence/absence of netdump. But LKCD isn't there yet...
Jeff
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
2002-10-31 21:02 ` Jeff Garzik
@ 2002-10-31 22:37 ` Werner Almesberger
2002-11-05 11:42 ` [lkcd-devel] " Suparna Bhattacharya
0 siblings, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-10-31 22:37 UTC (permalink / raw)
To: Jeff Garzik
Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
lkcd-general, lkcd-devel
Jeff Garzik wrote:
> That said, I used to be an LKCD cheerleader until a couple people made
> some good points to me: it is not nearly low-level enough to truly be
> of use in crash situations.
I'm not so convinced about this. I like the Mission Critical
approach: save the dump to memory, then either boot through the
firmware or through bootimg (nowadays, that would be kexec),
then retrieve the dump from memory, and do whatever you like
with it.
The huge advantage here is that you don't need a ton of
specialized dump drivers and/or have much of the original kernel
infrastructure to be in a usable state. The rebooted system will
typically be stable enough to offer the full range of utilities,
including up to date drivers for all possible devices, so you
can safely write to disk, scp all the mess to your support
critter, or post an automatic flame to linux-kernel :-)
The weak points of the Mission Critical design are that early
memory allocation in the kernel needs to be tightly controlled,
that architectures that wipe CPU caches on reboot need to
commit them to memory before the firmware restart, and that
drivers need to be able to recover from an "unclean" hardware
state. (I think we'll see much of the latter happen as kexec
advances. The other two issues aren't really special.)
Actually, at the RAS BOF I thought that IBM were developing LKCD
in this direction, and had also eliminated a few not so elegant
choices of Mission Critical's original design. I haven't looked
at the LKCD code, but the descriptions sound as if all the
special-case cruft seems to be back again, which I would find a
little disappointing.
There might be a case for specialized low-overhead dump handlers
for small embedded systems and such, but they're probably better
maintained outside of the mainstream kernel. (They're more like
firmware anyway.)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 22:37 ` Werner Almesberger
@ 2002-11-05 11:42 ` Suparna Bhattacharya
2002-11-05 18:00 ` Werner Almesberger
0 siblings, 1 reply; 72+ messages in thread
From: Suparna Bhattacharya @ 2002-11-05 11:42 UTC (permalink / raw)
To: Werner Almesberger
Cc: Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
linux-kernel, lkcd-general, lkcd-devel
On Thu, Oct 31, 2002 at 07:37:05PM -0300, Werner Almesberger wrote:
> Jeff Garzik wrote:
> > That said, I used to be an LKCD cheerleader until a couple people made
> > some good points to me: it is not nearly low-level enough to truly be
> > of use in crash situations.
>
> I'm not so convinced about this. I like the Mission Critical
> approach: save the dump to memory, then either boot through the
> firmware or through bootimg (nowadays, that would be kexec),
> then retrieve the dump from memory, and do whatever you like
> with it.
>
> The huge advantage here is that you don't need a ton of
> specialized dump drivers and/or have much of the original kernel
> infrastructure to be in a usable state. The rebooted system will
> typically be stable enough to offer the full range of utilities,
> including up to date drivers for all possible devices, so you
> can safely write to disk, scp all the mess to your support
> critter, or post an automatic flame to linux-kernel :-)
>
> The weak points of the Mission Critical design are that early
> memory allocation in the kernel needs to be tightly controlled,
> that architectures that wipe CPU caches on reboot need to
> commit them to memory before the firmware restart, and that
> drivers need to be able to recover from an "unclean" hardware
> state. (I think we'll see much of the latter happen as kexec
> advances. The other two issues aren't really special.)
>
> Actually, at the RAS BOF I thought that IBM were developing LKCD
> in this direction, and had also eliminated a few not so elegant
> choices of Mission Critical's original design. I haven't looked
Yes, we are putting that in as one of the alternative dump targets
available. I have done quite a bit of work on that implementing the
ideas we talked about at OLS, and that's what I've been referring
to as the memory dump target. Its not quite ready yet and we
need something like kexec to be available which we can use on Intel
systems to achieve the softboot (the acceptance status of that still
doesn't seem to be clear), so I was looking at this as a
follow-on thing once the core infrastructure is there. More so
because we probably need to give it some time to stabilize and try
it on different environments and look at the issues you mention.
Why do we even consider the other options when we are doing
this already ? Well, as we discussed earlier there's non-disruptive dumps
for one, where this wouldn't work. The other is that before overwriting
memory we need to be able to stop all activity in the system for certain
(system may appear hung/locked up) and I'm not fully certain about
how to do this for all environments. Maybe an answer lies in
rethinking some parts of the algorithm a bit.
Also having the interface allows people to develop more specific/
reliable solutions for their environment. So we do not necessiate
code duplication, but if something exists, then the infrastructure
can use it.
The general feeling here is that a one solution fits all thing
may not work best right now ... and hence the focus on an interface
based approach that gives us the needed flexibility.
> at the LKCD code, but the descriptions sound as if all the
> special-case cruft seems to be back again, which I would find a
> little disappointing.
Hope that helps a bit.
Regards
Suparna
>
> There might be a case for specialized low-overhead dump handlers
> for small embedded systems and such, but they're probably better
> maintained outside of the mainstream kernel. (They're more like
> firmware anyway.)
>
> - Werner
>
> --
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by: Influence the future
> of Java(TM) technology. Join the Java Community
> Process(SM) (JCP(SM)) program now.
> http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0004en
> _______________________________________________
> lkcd-devel mailing list
> lkcd-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/lkcd-devel
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 11:42 ` [lkcd-devel] " Suparna Bhattacharya
@ 2002-11-05 18:00 ` Werner Almesberger
2002-11-05 18:36 ` Alan Cox
2002-11-09 21:21 ` Pavel Machek
0 siblings, 2 replies; 72+ messages in thread
From: Werner Almesberger @ 2002-11-05 18:00 UTC (permalink / raw)
To: Suparna Bhattacharya
Cc: Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
linux-kernel, lkcd-general, lkcd-devel
Suparna Bhattacharya wrote:
> Yes, we are putting [MCORE] in as one of the alternative dump targets
> available.
Great !
> Its not quite ready yet and we need something like kexec to be
> available which we can use on Intel systems to achieve the softboot
> (the acceptance status of that still doesn't seem to be clear),
Yes, I've just checked with Eric, and he hasn't received any
indication from Linus so far. I posted a reminder to linux-kernel.
I'd really hate to see kexec miss 2.6.
> Why do we even consider the other options when we are doing
> this already ? Well, as we discussed earlier there's non-disruptive
> dumps for one, where this wouldn't work.
But they're very different anyway, aren't they ? I mean, you could
even implement them (well, almost) from user space, with today's
kernels.
> The other is that before overwriting
> memory we need to be able to stop all activity in the system for certain
> (system may appear hung/locked up) and I'm not fully certain about
> how to do this for all environments. Maybe an answer lies in
> rethinking some parts of the algorithm a bit.
This is certainly the hairiest part, yes. I think we have about
four types of devices/elements to worry about:
- those that just sit there, and never talk unless spoken to
- those that may generate interrupts
- those that DMA if you ask them nicely
- those that DMA when they feel like it (e.g. copy an incoming
network packet to the next buffer in the free list)
The latter are the real problem. I see the following possibilities
for dealing with them:
- faith-based computing: pray that nothing bad will befall your
system :-)
- de-activate them individually. There should be a lot of work
that can be shared with power management. And that's one of
the reasons why I think the memory target should be available
early, or convergence will take forever.
- try to reset them, without necessarily knowing what they are
or what they do. I don't know is there is a useful way for
resetting the PCI bus by software, but if there is one, this
looks like the most promising strategy to me, even if it may
be somethat lacking in elegance.
- if all else fails, maybe introduce an "unsafe" memory type
for potential DMA targets of unpredictable devices, that will
not be re-used. I hope we won't need this, though. (But in case
such a memory type gets introduced by the memory-scrubbers, at
least you could blame _them_ :-)
> The general feeling here is that a one solution fits all thing
> may not work best right now ... and hence the focus on an interface
> based approach that gives us the needed flexibility.
Yes, this is certainly important. I just think that the "memory
target" concept is closer to a general solution than disk dumps.
But there are always those 5% with special needs, and it's good
if they can use the same framework.
Thanks,
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 18:00 ` Werner Almesberger
@ 2002-11-05 18:36 ` Alan Cox
2002-11-05 19:19 ` Werner Almesberger
` (2 more replies)
2002-11-09 21:21 ` Pavel Machek
1 sibling, 3 replies; 72+ messages in thread
From: Alan Cox @ 2002-11-05 18:36 UTC (permalink / raw)
To: Werner Almesberger
Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.
Let me ask the same dumb question - what does kexec need that a dumper
doesn't. In other words given reboot/trap hooks can kexec happily live
as a standalone module ?
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 18:36 ` Alan Cox
@ 2002-11-05 19:19 ` Werner Almesberger
2002-11-05 20:10 ` Alan Cox
2002-11-06 0:21 ` Andy Pfiffer
2002-11-06 2:48 ` Eric W. Biederman
2002-11-06 4:29 ` Eric W. Biederman
2 siblings, 2 replies; 72+ messages in thread
From: Werner Almesberger @ 2002-11-05 19:19 UTC (permalink / raw)
To: Alan Cox
Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
Alan Cox wrote:
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't.
kexec needs:
- a system call to set it up
- a way to silence devices (difference to dumper: kexec normally
operates under an intact system, so it's more similar to, say,
swsusp. But I expect that cleaning up device power management
would also clear the path for more reliable dumpers.)
- a bit of glue, e.g. to switch to "real mode", etc. AFAIK, none
of this touches other code, but there are of course some
assumptions here on what other codes provides or does.
- device drivers that can bring silent devices back to life
(normally, device drivers do this already, but kexec may
uncover dormant bugs in this area)
Since recent kernels already preserve memory areas with BIOS data,
kexec is actually quite a bit less intrusive than bootimg was.
> In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
"Module", as in "software package": yes, the main problem spot
would be the system call allocation, which is also inconvenient
for other developers. By the way, kexec does not tap into the
kernel's reboot process, so no such hooks are needed. If LKCD
wants to use kexec, it can do whatever it does to be invoked at
the time of a crash, and then call kexec directly.
"Module", as in "loadable kernel module": I think so, although
it's currently "bool", not "tristate". Also, you'd have the
system call issue again.
So not merging it is mainly inconvenient to use, adds the system
call allocation as a continuous annoyance, and makes it a little
harder to work on the related infrastructure. But then, despite
being somewhat obscure, bootimg and kexec have been in use for
years, the design is about as lean as it can get, and it's cool.
What more could you ask for ? :-)
What kexec needs now is more exposure, so that the BIOS
compatibility issues get noticed and fixed, it is ported to other
architectures, and that more people can start figuring out how to
use it, and how to build a boot environment. Then, maybe in a
year or two, we can send "Methuselah" LILO and "nice little OS"
GRUB off to their well-deserved retirement.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 19:19 ` Werner Almesberger
@ 2002-11-05 20:10 ` Alan Cox
2002-11-05 23:25 ` Werner Almesberger
2002-11-06 0:21 ` Andy Pfiffer
1 sibling, 1 reply; 72+ messages in thread
From: Alan Cox @ 2002-11-05 20:10 UTC (permalink / raw)
To: Werner Almesberger
Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Tue, 2002-11-05 at 19:19, Werner Almesberger wrote:
> kexec needs:
> - a system call to set it up
Device, file, insmod...
> - a way to silence devices (difference to dumper: kexec normally
> operates under an intact system, so it's more similar to, say,
> swsusp. But I expect that cleaning up device power management
> would also clear the path for more reliable dumpers.)
So you need to register with the power management as the last thing to
be suspended and do a suspend before kexec.
> So not merging it is mainly inconvenient to use, adds the system
> call allocation as a continuous annoyance, and makes it a little
> harder to work on the related infrastructure. But then, despite
> being somewhat obscure, bootimg and kexec have been in use for
> years, the design is about as lean as it can get, and it's cool.
> What more could you ask for ? :-)
I'm mostly worried about how to make these things fit the least
intrusively into the kernel.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 20:10 ` Alan Cox
@ 2002-11-05 23:25 ` Werner Almesberger
0 siblings, 0 replies; 72+ messages in thread
From: Werner Almesberger @ 2002-11-05 23:25 UTC (permalink / raw)
To: Alan Cox
Cc: ebiederm, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
Alan Cox wrote:
>> - a system call to set it up
>
> Device, file, insmod...
I don't know what Eric thinks about using something else than a
system call, but I think he made a quite reasonable choice.
The data structure isn't entirely trivial, so a misc device plus
ioctl would be a bit on the ugly side. I vaguely remember having
proposed something like this a while ago (may have been for
pivot_root), and everybody went "noooo!!" ;-)
insmod would be possible, although with a rather unusual parameter
passing scheme. Also, when using kexec from inside the kernel (e.g.
MCORE), the insmod solution would have to split kexec into the
interface and the kexec core.
But yes, there's always a means to avoid adding a new system
call. /dev/syscall with an ioctl
struct syscall_ioctl {
const char *symbol_name;
va_list ap;
};
anyone ? :-) (Implementing it might be a bit of a challenge :)
> So you need to register with the power management as the last thing to
> be suspended and do a suspend before kexec.
Well, kexec just calls device_shutdown. The problem isn't the
interface, it's that device_shutdown apparently doesn't work too
well (devices not supporting it, some semantics mixup, etc.).
But this is general infrastructure work, that should be done
with or without kexec.
> I'm mostly worried about how to make these things fit the least
> intrusively into the kernel.
Just look at Eric's kexec patch. It isn't particularly intrusive:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103604471723358&w=2
(For 2.5.45. The patch fails for 2.5.46, because new system calls
were added ...)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 19:19 ` Werner Almesberger
2002-11-05 20:10 ` Alan Cox
@ 2002-11-06 0:21 ` Andy Pfiffer
2002-11-06 1:10 ` Werner Almesberger
2002-11-10 18:35 ` Pavel Machek
1 sibling, 2 replies; 72+ messages in thread
From: Andy Pfiffer @ 2002-11-06 0:21 UTC (permalink / raw)
To: Werner Almesberger
Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Tue, 2002-11-05 at 11:19, Werner Almesberger wrote:
> Alan Cox wrote:
> > Let me ask the same dumb question - what does kexec need that a dumper
> > doesn't.
>
> kexec needs:
> - a system call to set it up
> - a way to silence devices <snip>
<snip>
> - a bit of glue <snip>
> - device drivers that can bring silent devices back to life
<snip>
> > In other words given reboot/trap hooks can kexec happily live
> > as a standalone module ?
You could probably skip the system call to set it up. Example: I could
imagine a bizarre set of pseudo-devices:
# insmod kexec
# cat bzImage > /proc/kexec/next-image
# echo "root=805" > /proc/kexec/next-cmndline
# echo 1 > /proc/kexec/reboot
and hide away that dirty little sequence with a nice kexec(3) library
routine.
The Two Kernel Monte trick (that rewrote when insmod'ed the kernel's
function pointers for sys_reboot) was also effective, but that
apparently isn't an option any longer.
> What kexec needs now is more exposure, so that the BIOS
> compatibility issues get noticed and fixed, it is ported to other
> architectures, and that more people can start figuring out how to
> use it, and how to build a boot environment.
I'll 2nd that sentiment, and add another big one: fixing (apparent)
problems with drivers and chipset-munging code, so that devices can be
reliably re-probed/re-inited/etc. after the reboot.
Long term, I think it would be advantageous to be able to avoid SCSI and
other time consuming device probes for the common and simple reboot case
of 1) the currently running kernel is being rebooted, and 2) no changes
to the device configuration have occured. Shouldn't we be able to "save
away" what is in sysfs, and then re-inject that state after a fast
reboot?
Andy
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 0:21 ` Andy Pfiffer
@ 2002-11-06 1:10 ` Werner Almesberger
2002-11-06 1:37 ` Alexander Viro
2002-11-10 18:35 ` Pavel Machek
1 sibling, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-11-06 1:10 UTC (permalink / raw)
To: Andy Pfiffer
Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
Andy Pfiffer wrote:
> You could probably skip the system call to set it up.
Yes, yes, there are many ways to do this. This isn't the issue. The
questions regarding this are:
- it kexec allowed to use a system call ?
- if yes, is a system call the technically right solution ?
- if yes, is it a practical solution ?
So far, it hasn't been considered inherently wrong to use system
calls, even for highly Linux-specific functions, and even if they
aren't performance-critical (just think of pivot_root). (*)
If this perception has changed, such a change of policy would also
affect kexec, but then we don't need to discuss kexec but the
policy change. (I don't know - is such a change in the air ?)
(*) By the way, I remember now where I brought up some hack for
avoiding to use a system call - it was for bootimg :-)
Now, if we assume that it's okay for kexec to use a system call,
the next question is whether kexec should indeed use it, i.e.
whether a system call makes sense for what it is trying to do.
Since there are no device files or network elements naturally
involved here (i.e. other major kernel function interfaces),
the answer seems to be "yes".
Last but not least, we need to decide whether using a system
call would be painful for Eric or for kexec users. This would be
the case if kexec isn't merged, and the kexec patch would need
frequent updates because system calls have changed.
I understand Alan's question as the "what if ... ?" type. If
kexec is indeed rejected for merging, it may make sense to change
the interface to something which may be technically less elegant,
but which makes patch maintenance easier to handle.
> I'll 2nd that sentiment, and add another big one: fixing (apparent)
> problems with drivers and chipset-munging code, so that devices can be
> reliably re-probed/re-inited/etc. after the reboot.
Yes, kexec is likely to turn up a few problems in this area, too.
Right now, we only hear about such issues if some BIOS lets
something slip through. With kexec, such problems should show up
sooner.
> Long term, I think it would be advantageous to be able to avoid SCSI and
> other time consuming device probes
Definitely. May I refer you to my booting paper, which discusses
all this in section 5 ? :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 1:10 ` Werner Almesberger
@ 2002-11-06 1:37 ` Alexander Viro
2002-11-06 2:05 ` Werner Almesberger
2002-11-06 4:07 ` Eric W. Biederman
0 siblings, 2 replies; 72+ messages in thread
From: Alexander Viro @ 2002-11-06 1:37 UTC (permalink / raw)
To: Werner Almesberger
Cc: Andy Pfiffer, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
On Tue, 5 Nov 2002, Werner Almesberger wrote:
> Now, if we assume that it's okay for kexec to use a system call,
> the next question is whether kexec should indeed use it, i.e.
> whether a system call makes sense for what it is trying to do.
> Since there are no device files or network elements naturally
> involved here (i.e. other major kernel function interfaces),
> the answer seems to be "yes".
That's not obvious. By the same logics, we would need syscalls for
turning off overcommit, etc., etc.
FWIW, I suspect that
open("/proc/image", O_EXCL|O_WRONLY);
bunch of lseek()/write()
close()
would be more natural - definitely easier to understand than arguments of
your sys_kexec(). It's easy to switch from your code to that - you
put initialization into ->open(), pulling segments from userland into
->write(), use default ->lseek() and do actual work on ->close() if
no errors had happened. file->private_data will point to intermediate
state you need.
After all, that's what happens - you form an image, writing to it user-supplied
data from given buffers at given offsets and when you are done with that you
commit the changes. IMO special syscall is less natural match for that
than sequence above - commit-on-close is not something unusual, so it matches
the semantics of all syscalls involved...
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 1:37 ` Alexander Viro
@ 2002-11-06 2:05 ` Werner Almesberger
2002-11-07 6:04 ` Eric W. Biederman
2002-11-06 4:07 ` Eric W. Biederman
1 sibling, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-11-06 2:05 UTC (permalink / raw)
To: Alexander Viro
Cc: Andy Pfiffer, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
Alexander Viro wrote:
> That's not obvious. By the same logics, we would need syscalls for
> turning off overcommit, etc., etc.
Okay okay, add file system specific ioctls and sysctl to my list
of alternative mechanisms :-)
> FWIW, I suspect that
> open("/proc/image", O_EXCL|O_WRONLY);
> bunch of lseek()/write()
> close()
Hmm, interesting. Yes, that should work. One would of course have
to retain the current interface for in-kernel use (e.g. MCORE), but
that's probably okay. Let's see what Eric thinks about it - it's
his code after all.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 2:05 ` Werner Almesberger
@ 2002-11-07 6:04 ` Eric W. Biederman
2002-11-07 12:17 ` Werner Almesberger
0 siblings, 1 reply; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-07 6:04 UTC (permalink / raw)
To: Werner Almesberger; +Cc: Alexander Viro, Linux Kernel Mailing List
Werner Almesberger <wa@almesberger.net> writes:
> Alexander Viro wrote:
> > That's not obvious. By the same logics, we would need syscalls for
> > turning off overcommit, etc., etc.
>
> Okay okay, add file system specific ioctls and sysctl to my list
> of alternative mechanisms :-)
>
> > FWIW, I suspect that
> > open("/proc/image", O_EXCL|O_WRONLY);
> > bunch of lseek()/write()
> > close()
>
> Hmm, interesting. Yes, that should work. One would of course have
> to retain the current interface for in-kernel use (e.g. MCORE), but
> that's probably okay. Let's see what Eric thinks about it - it's
> his code after all.
For the record my opinion is there is extra code bloat but it is ok
if it is built as kexecfs. Any other way of getting a magic file
to work with seems currently insane.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-07 6:04 ` Eric W. Biederman
@ 2002-11-07 12:17 ` Werner Almesberger
0 siblings, 0 replies; 72+ messages in thread
From: Werner Almesberger @ 2002-11-07 12:17 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Alexander Viro, Linux Kernel Mailing List
Eric W. Biederman wrote:
[ Al's FS-based kexec interface ]
> For the record my opinion is there is extra code bloat but it is ok
> if it is built as kexecfs. Any other way of getting a magic file
> to work with seems currently insane.
Yes, such an interface change would only make sense if you couldn't
get the system call, or if there would actually be a useful way for
setting up kexec using "third party" programs. But it seems unlikely
to me that somebody could get all the magic right just by using dd.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 1:37 ` Alexander Viro
2002-11-06 2:05 ` Werner Almesberger
@ 2002-11-06 4:07 ` Eric W. Biederman
2002-11-06 4:47 ` Eric W. Biederman
2002-11-06 19:24 ` Rob Landley
1 sibling, 2 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 4:07 UTC (permalink / raw)
To: Alexander Viro
Cc: Werner Almesberger, Andy Pfiffer, Alan Cox, Suparna Bhattacharya,
Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
Alexander Viro <viro@math.psu.edu> writes:
> On Tue, 5 Nov 2002, Werner Almesberger wrote:
>
> > Now, if we assume that it's okay for kexec to use a system call,
> > the next question is whether kexec should indeed use it, i.e.
> > whether a system call makes sense for what it is trying to do.
> > Since there are no device files or network elements naturally
> > involved here (i.e. other major kernel function interfaces),
> > the answer seems to be "yes".
>
> That's not obvious. By the same logics, we would need syscalls for
> turning off overcommit, etc., etc.
>
> FWIW, I suspect that
> open("/proc/image", O_EXCL|O_WRONLY);
> bunch of lseek()/write()
> close()
> would be more natural - definitely easier to understand than arguments of
> your sys_kexec(). It's easy to switch from your code to that - you
> put initialization into ->open(), pulling segments from userland into
> ->write(), use default ->lseek() and do actual work on ->close() if
> no errors had happened. file->private_data will point to intermediate
> state you need.
>
> After all, that's what happens - you form an image, writing to it user-supplied
> data from given buffers at given offsets and when you are done with that you
> commit the changes. IMO special syscall is less natural match for that
> than sequence above - commit-on-close is not something unusual, so it matches
> the semantics of all syscalls involved...
First take a look at a ELF header. There is a one to one mapping between
the arguments to kexec and the segments found there.
Second lseek()/write() pairs do not have the capacity to specify holes/bss
segments kexec does, so it would not be a 1 to 1 transform. But I can
live without holes.
Third I am not fully certain it makes sense to implement a function that will
boot into a user specified image remotely. If the export process has
too many capabilities we could be in trouble.
Are you arguing for more /proc files? Where does the magic file come
from? I cannot request the allocation of a device number because the
allocation was frozen before 2.4 started. Though char 1 minor 11
seems the obvious choice. Or should it be a magic file in sysfs
instead of procfs? All of the require the code to live someplace
where I need to allocate a place in the namespace. So there is no
inherent advantage over a system call. And unless someone exports the
hooks to properly shutdown the system to modules it is useless.
Given that this is a seldom used system function I agree that it does not
need to be optimized.
I do not have any problem with changing the interface to something
more palatable to other kernel developers. But I will only do it for
one of two reasons. My patch will never get accepted in any
reasonable time frame and it makes maintenance easier for me. Or
makes the interface palatable for acceptance, into the kernel.
Neither position currently appears apparent.
----------
Now to dig into the heart of the issue.
I could write the new kernel image into /dev/mem and just jump to
it. Because that is really all I want an interface to do. There
are several practical problems, with something quite that simple.
No kernel shutdown code is run, so I am left with processors flying
all over the place, devices doing all manner of things, after their
device drivers have walked away. Something needs to put the system in
a quiescent state. The fix I call the reboot notifiers, and
device_shutdown. (And then implement a bunch of ->shutdown() methods)
As we all know writing to /dev/mem is not safe because the memory is
being use for other things. So I need a way to safely use memory
during the transition, from one kernel to another.
Personally I would love to be able to allocate one big contiguous
buffer that the kernel is not using and neither is the image I will
eventually load. Then I could just memcpy from that buffer and I
would be done.
Alas memory management in the kernel is done in pages, and can be
fragmented after running for many moons. So I need to allocate all of
my memory in pages, and I need to let the kernel know where it will
all eventually live so I can correctly order the memcpy operations.
Once all the memory copying is sorted out I need to jump to the new
kernel (a kernel being anything that runs without an OS). Logically
all you should have to do is do a single jump instruction but in
practice there is much more that has to be done. The kernel when it
loads up looks around and enables all sorts of cpu optimizations so
the kernel runs as well as possible on the processor. The new kernel
image needs to be given a least common denominator interface so it can
enable what it is prepared to take advantage of. In addition to what
the normal shutdown path can accomplish on x86 this involves disabling
page, changing the gdt, and changing the idt, and possibly disabling
SMP. It should be possible to enhance device_shutdown so it can
properly disable SMP though if that will happen still remains in the
air.
-----------------------------------------
So kexec needs:
- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
for jumping to another kernel.
- Awareness of it's existence.
To some extent every piece of this is intimately tied to the kernel
implementation, from the ability to modify page tables, when jumping
to a new kernel, to the best algorithm for finding a safe memory
buffer, to the proper way to shutdown devices this week, and being
intimately tied to the kernel the code needs to find a home in the
kernel.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 4:07 ` Eric W. Biederman
@ 2002-11-06 4:47 ` Eric W. Biederman
2002-11-06 19:24 ` Rob Landley
1 sibling, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 4:47 UTC (permalink / raw)
To: Alexander Viro
Cc: Werner Almesberger, Andy Pfiffer, Alan Cox, Suparna Bhattacharya,
Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
And the question I was building up to, but forgot to ask.
Given that the kexec code is tied intimately to the kernel
implementation.
Given that there is no real advantage in an incremental write
model for kexec users (except not needing to allocate a syscall
number).
Do you see a better way to structure the kexec interface?
Another file in proc, not carefully placed is just a hair better than
an ioctl. Using /proc is not desirable because there are uses of
kexec that need a very small kernel, and /proc is a pig, is otherwise
useless size bloat.
For some uses including the one that drove me to write it CONFIG_KEXEC
and CONFIG_TINY will both be defined.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 4:07 ` Eric W. Biederman
2002-11-06 4:47 ` Eric W. Biederman
@ 2002-11-06 19:24 ` Rob Landley
1 sibling, 0 replies; 72+ messages in thread
From: Rob Landley @ 2002-11-06 19:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Rusty Russell,
Linux Kernel Mailing List
On Wednesday 06 November 2002 04:07, Eric W. Biederman wrote:
> Personally I would love to be able to allocate one big contiguous
> buffer that the kernel is not using and neither is the image I will
> eventually load. Then I could just memcpy from that buffer and I
> would be done.
>
> Alas memory management in the kernel is done in pages, and can be
> fragmented after running for many moons. So I need to allocate all of
> my memory in pages, and I need to let the kernel know where it will
> all eventually live so I can correctly order the memcpy operations.
Reverse Mappings are cool, and one reason tehy're cool is, in theory, you can
grab a page of physical memory away from something else. In theory code
could be written to ask the kernel "could you please swap this the heck out,
pin the page in memory, and give it to me instead now?" And it can refuse
("it's already pinned by something else, maybe it's a kernel page, go away"),
it can block a bit ("gotta flush it to disk, wait until DMA is done"), or it
could immediatley comply ("it was a clean buffer, have it, keep it, stuff it
and mount it on the wall for all I care...").
This means you can retroactively get contiguous areas of memory by shoving
stuff aside. If it's in use, it'll swap back in immediately. (An obvious
optimization occurs, but that's not necessary for minimal functionality.)
So the the whole problem of needing contiguous areas of memory could, in
theory, be addressed using RMAP.
--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 0:21 ` Andy Pfiffer
2002-11-06 1:10 ` Werner Almesberger
@ 2002-11-10 18:35 ` Pavel Machek
1 sibling, 0 replies; 72+ messages in thread
From: Pavel Machek @ 2002-11-10 18:35 UTC (permalink / raw)
To: Andy Pfiffer
Cc: Werner Almesberger, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
Hi!
> > > Let me ask the same dumb question - what does kexec need that a dumper
> > > doesn't.
> >
> > kexec needs:
> > - a system call to set it up
> > - a way to silence devices <snip>
> <snip>
> > - a bit of glue <snip>
> > - device drivers that can bring silent devices back to life
> <snip>
>
> > > In other words given reboot/trap hooks can kexec happily live
> > > as a standalone module ?
>
> You could probably skip the system call to set it up. Example: I could
> imagine a bizarre set of pseudo-devices:
>
> # insmod kexec
> # cat bzImage > /proc/kexec/next-image
> # echo "root=805" > /proc/kexec/next-cmndline
> # echo 1 > /proc/kexec/reboot
>
> and hide away that dirty little sequence with a nice kexec(3) library
> routine.
Actually, sys_reboot has void * parameter. Reusing it as "next-image"
char * seems okay to me.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 18:36 ` Alan Cox
2002-11-05 19:19 ` Werner Almesberger
@ 2002-11-06 2:48 ` Eric W. Biederman
2002-11-06 4:29 ` Eric W. Biederman
2 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 2:48 UTC (permalink / raw)
To: Alan Cox
Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
>
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
Kexec primarily needs the reboot/trap hooks in working order, and exported,
for it to live externally to the kernel.
Currently the reboot_notifier call chain is private to sys.c, and is not
exported even to other parts of the kernel.
Even together device_shutdown, and the reboot_notifier do not properly shutdown
the cpus on an SMP system.
Plus we are missing quite a ->shutdown methods at random in the kernel, and if
kexec is not easily available someone might not get around to writing
and debugging them.
Plus a system call seems the natural interface for something that
appears to be a reboot.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 18:36 ` Alan Cox
2002-11-05 19:19 ` Werner Almesberger
2002-11-06 2:48 ` Eric W. Biederman
@ 2002-11-06 4:29 ` Eric W. Biederman
2002-11-06 6:25 ` Linus Torvalds
2 siblings, 1 reply; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 4:29 UTC (permalink / raw)
To: Alan Cox
Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
>
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?
In replying to another post by Al Viro I managed to think this through.
kexec needs:
- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
for jumping to another kernel.
- Awareness of it's existence.
Most of this code is intimate with how the kernel currently behaves
and needs at least minor adjustments for things like living in PAE
mode.
The safe buffers a kernel can probably avoid.
I cannot see the core functionality of kexec every living happily as a
standalone module. The kexec code accomplishes nothing. If there is
something useful it does it can probably be moved elsewhere and the
line count reduced. The entire code base is basically obsessed with
getting safe temporary buffers for the new kernel to live in, and
given improvements to how the kernel allocates memory that are
theoretically possible with rmap I could remove that code as well.
The only thing that keeps kexec at all maintainable outside the kernel
is that big fundamental changes do not happen often. But the kernel
must be tracked, closely. I don't see that as a recipe for a
standalone module. I can barely even see making the code a module, or
what the point would be.
The reason kmonte fails in so many cases where kexec succeeds is
precisely because kmonte is a module.
If we include machine_kexec or something very similar to but more
generalized to the list of exported functions, perhaps kexec could
just have the buffer allocation code and live happily outside of the
kernel. But as it is, if we want to factor kexec into pieces so one
piece can live happily as a standalone module it will take some
serious design work, and a total rethink of the implementation. And
we will still have to add code to the kernel.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 4:29 ` Eric W. Biederman
@ 2002-11-06 6:25 ` Linus Torvalds
2002-11-06 6:38 ` Suparna Bhattacharya
` (3 more replies)
0 siblings, 4 replies; 72+ messages in thread
From: Linus Torvalds @ 2002-11-06 6:25 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On 5 Nov 2002, Eric W. Biederman wrote:
>
> In replying to another post by Al Viro I managed to think this through.
> kexec needs:
Note that kexec doesn't bother me at all, and I might find myself using it
myself.
>From a sanity standpoint, I think the thing already _has_ a system call,
though: clearly "sys_reboot()" is the place to add a case for "reboot into
this image". No? That's where we shut down devices anyway, and it's the
sane place to say "reboot into the kexec image"
Which still leaves you with a real sys_kexec() to actually _load_ the
image, or course. I think loading of the image should be a totally
separate event from the actual booting of the image, since we may want to
load the image early, then do various user-level shutdown (unmounting
etc), and then reboot.
Right now the kexec() stuff seems to mix up the loading and rebooting, but
I didn't take a very deep look, maybe I'm wrong.
Anyway, I don't really get why the kexec() system call would not just be
void *kexec_image = NULL;
unsigned long kexec_size;
int sys_kexec(void *uaddr, size_t len)
{
void *new;
if (!capable(CAP_ADMIN))
return -EPERM;
/* Get rid of old image if any.. */
if (kexec_image) {
vfree(kexec_image);
kexec_image = NULL;
}
/* Zero length just meant "get rid of it" */
if (!len)
return 0;
if (!access_ok(VERIFY_READ, uaddr, len))
return -EFAULT;
new = vmalloc(len);
if (!new)
return -ENOMEM;
if (memcpy_from_user(new, uaddr, len)) {
vfree(new);
return -EFAULT;
}
kexec_image = new;
kexec_size = len;
return 0;
}
and be done with it that way? Then the actual "reboot" (and that would be
in the existing "sys_reboot()") basically just does something like
memcpy(kernelbase, kexec_image, kexec_size);
at the very end (while obviously having to be careful about itself being
out of the way. It can avoid the page table issue by using the "page *"
array that vmalloc uses internally anyway: see "area->pages[]" in
vmalloc).
Note that the two-phase boot means that you can load the new kernel early,
which allows you to later on use it for oops handling (it's a bit late to
try to set up the kernel to be loaded at that time ;)
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 6:25 ` Linus Torvalds
@ 2002-11-06 6:38 ` Suparna Bhattacharya
2002-11-06 7:48 ` Eric W. Biederman
` (2 subsequent siblings)
3 siblings, 0 replies; 72+ messages in thread
From: Suparna Bhattacharya @ 2002-11-06 6:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric W. Biederman, Alan Cox, Werner Almesberger, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Tue, Nov 05, 2002 at 10:25:35PM -0800, Linus Torvalds wrote:
>
> On 5 Nov 2002, Eric W. Biederman wrote:
> >
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
>
> Note that kexec doesn't bother me at all, and I might find myself using it
> myself.
>
> >From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
>
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, or course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting
> etc), and then reboot.
>
> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.
>
> Anyway, I don't really get why the kexec() system call would not just be
>
> void *kexec_image = NULL;
> unsigned long kexec_size;
>
> int sys_kexec(void *uaddr, size_t len)
> {
> void *new;
>
> if (!capable(CAP_ADMIN))
> return -EPERM;
>
> /* Get rid of old image if any.. */
> if (kexec_image) {
> vfree(kexec_image);
> kexec_image = NULL;
> }
>
> /* Zero length just meant "get rid of it" */
> if (!len)
> return 0;
>
> if (!access_ok(VERIFY_READ, uaddr, len))
> return -EFAULT;
>
> new = vmalloc(len);
> if (!new)
> return -ENOMEM;
>
> if (memcpy_from_user(new, uaddr, len)) {
> vfree(new);
> return -EFAULT;
> }
>
> kexec_image = new;
> kexec_size = len;
> return 0;
> }
>
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
>
> memcpy(kernelbase, kexec_image, kexec_size);
>
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).
>
> Note that the two-phase boot means that you can load the new kernel early,
> which allows you to later on use it for oops handling (it's a bit late to
> try to set up the kernel to be loaded at that time ;)
Yes, that's exactly what we need to support a soft-boot based dump
mechanism, much like the Mission Critical folks split up the bootimg
syscall to do the early load on a sane system, and the actual soft-boot
at crash time. And it fits in naturally as you point out ..
Regards
Suparna
>
> Linus
>
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 6:25 ` Linus Torvalds
2002-11-06 6:38 ` Suparna Bhattacharya
@ 2002-11-06 7:48 ` Eric W. Biederman
2002-11-06 9:11 ` Suparna Bhattacharya
2002-11-06 22:05 ` Michal Jaegermann
2002-11-06 16:13 ` Eric W. Biederman
2002-11-07 8:50 ` Eric W. Biederman
3 siblings, 2 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 7:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
Linus Torvalds <torvalds@transmeta.com> writes:
> On 5 Nov 2002, Eric W. Biederman wrote:
> >
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
>
> Note that kexec doesn't bother me at all, and I might find myself using it
> myself.
Good. Just before I saw this message I sent you my patch ported to 2.5.46,
and from the feed back on this one it looks like people would
appreciate a tweak or two.
> >From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
>
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, or course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting
> etc), and then reboot.
That sounds reasonable to me. Especially as that lines up a little more
with what the mcore people want as well. Until today I hadn't realized
they were using a spare current to process oopses. For just booting
another kernel all of the staging can currently be done by reading the
new kernel into your process before calling the user-level shutdown code.
> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.
It currently happens all in one step because I had never gotten
feedback that people wanted it in two steps.
> Note that the two-phase boot means that you can load the new kernel early,
> which allows you to later on use it for oops handling (it's a bit late to
> try to set up the kernel to be loaded at that time ;)
Given that it is definitely a good idea to split the patch up into two
pieces. And a kernel for oops handling should work once all of other
problems are resolved.
> Anyway, I don't really get why the kexec() system call would not just be
>
> void *kexec_image = NULL;
> unsigned long kexec_size;
>
> int sys_kexec(void *uaddr, size_t len)
> {
> void *new;
>
> if (!capable(CAP_ADMIN))
> return -EPERM;
>
> /* Get rid of old image if any.. */
> if (kexec_image) {
> vfree(kexec_image);
> kexec_image = NULL;
> }
>
> /* Zero length just meant "get rid of it" */
> if (!len)
> return 0;
>
> if (!access_ok(VERIFY_READ, uaddr, len))
> return -EFAULT;
>
> new = vmalloc(len);
> if (!new)
> return -ENOMEM;
>
> if (memcpy_from_user(new, uaddr, len)) {
> vfree(new);
> return -EFAULT;
> }
>
> kexec_image = new;
> kexec_size = len;
> return 0;
> }
>
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
>
> memcpy(kernelbase, kexec_image, kexec_size);
>
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).
Using area->pages[] is an interesting idea.
>From my current interface this is missing the following pieces.
1) The address or addresses to load the new kernel at. (Think kernel + ramdisk)
2) The address to jump to start the new kernel.
3) My interesting buffer handling.
The question is how much of that do we need.
Thinking out loud, and hopefully answering your question.
- We need a working stack when the new kernel is jumped to so PIC code
can exist at the entry point.
- An oops processing kernel needs to load at an address other than 1MB,
or at the very least it's boot sequence needs to squirrel away the
old contents of the kernel text and data segments, which reside at
1MB, before it moves down to 1MB.
- When we transfer control to the trampoline in machine_kexec we need
to be able to refer to everything with physical addresses.
- I do not see a way out of running my buffer verifier algorithm.
The problem is that I do not want to put complex logic in the
assembly machine_kexec trampoline. So I want to be able to pass
it something it can just memcpy to it's final resting place. Which
means the buffer pages either need to be the final resting place of
the new kernel (ideal) or are not a page that of the final resting
place.
- I can dig up area->pages[] but I don't see vmalloc buying me
anything. Doing the copies and allocations a page at a time is not
hard. I have to sort the contents of the pages, and where they
are located so I need to undo the virtual mapping.
area ->pages is all by struct pages *, which is most inconvenient
when you are tearing down page tables, I would need to put the pages
into another data structure that just had the page frame number or
physical page address anyway.
- Once I am using my own data structure to track the pages, and I am
already vetting the pages for safe locations. Going the rest of the
way to my current interface is not a big step, and I have already
tested that code.
So either I have blinders on, or there is very little percentage in
changing how I load an image. But to make the oops processing easier
I will split it up into two parts.
Then I guess the reasonable thing to do is to modify sys_reboot to
call machine_kexec instead of machine_restart when a kexec_image is
present. Or should I add another magic number, and another case to
sys_reboot?
case LINUX_REBOOT_CMD_RESTART:
notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
system_running = 0;
device_shutdown();
printk(KERN_EMERG "Restarting system.\n");
+ if (kexec_image)
+ machine_kexec(kexec_image);
machine_restart(NULL);
break;
O.k. In the next couple of days I will split the loading, and
executing phase of my kexec code into parts, and resubmit the code.
The we can dig in on what it takes to make kexec run stably.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 7:48 ` Eric W. Biederman
@ 2002-11-06 9:11 ` Suparna Bhattacharya
2002-11-06 22:05 ` Michal Jaegermann
1 sibling, 0 replies; 72+ messages in thread
From: Suparna Bhattacharya @ 2002-11-06 9:11 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Alan Cox, Werner Almesberger, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
>
> > On 5 Nov 2002, Eric W. Biederman wrote:
> > >
> > > In replying to another post by Al Viro I managed to think this through.
> > > kexec needs:
> >
> > Note that kexec doesn't bother me at all, and I might find myself using it
> > myself.
>
> Good. Just before I saw this message I sent you my patch ported to 2.5.46,
> and from the feed back on this one it looks like people would
> appreciate a tweak or two.
>
>
> That sounds reasonable to me. Especially as that lines up a little more
> with what the mcore people want as well. Until today I hadn't realized
> they were using a spare current to process oopses. For just booting
> another kernel all of the staging can currently be done by reading the
> new kernel into your process before calling the user-level shutdown code.
>
> > Right now the kexec() stuff seems to mix up the loading and rebooting, but
> > I didn't take a very deep look, maybe I'm wrong.
>
> It currently happens all in one step because I had never gotten
> feedback that people wanted it in two steps.
I'd mentioned it a few times in the context of mcore, but probably
didn't explain myself clearly enough then.
>
> > Note that the two-phase boot means that you can load the new kernel early,
> > which allows you to later on use it for oops handling (it's a bit late to
> > try to set up the kernel to be loaded at that time ;)
>
> Given that it is definitely a good idea to split the patch up into two
> pieces. And a kernel for oops handling should work once all of other
> problems are resolved.
Yes, this fits the model we need.
>
> The question is how much of that do we need.
>
> Thinking out loud, and hopefully answering your question.
> - We need a working stack when the new kernel is jumped to so PIC code
> can exist at the entry point.
>
> - An oops processing kernel needs to load at an address other than 1MB,
> or at the very least it's boot sequence needs to squirrel away the
> old contents of the kernel text and data segments, which reside at
> 1MB, before it moves down to 1MB.
Yes, that bit of memory save logic exists in the mcore mechanism. These
pages are saved away in compressed form in memory and written out
later after dump.
Now to avoid these pages from being used by the new kernel until
the dump is safetly written out to disk, mcore patches some of
the initialization code to mark these pages (containing saved
dump) as reserved.
> O.k. In the next couple of days I will split the loading, and
> executing phase of my kexec code into parts, and resubmit the code.
Great !
> The we can dig in on what it takes to make kexec run stably.
>
Regards
Suparna
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 7:48 ` Eric W. Biederman
2002-11-06 9:11 ` Suparna Bhattacharya
@ 2002-11-06 22:05 ` Michal Jaegermann
1 sibling, 0 replies; 72+ messages in thread
From: Michal Jaegermann @ 2002-11-06 22:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
>
> Then I guess the reasonable thing to do is to modify sys_reboot to
> call machine_kexec instead of machine_restart when a kexec_image is
> present. Or should I add another magic number, and another case to
> sys_reboot?
Given that "bird-eye" description why not to make a "normal" restart
a particular case of kexec where you just have one kernel loaded
from an external storage? It does not seem to be that much
different although some issues are skipped or taken for granted. Or
I am talking nonsense?
Michal
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 6:25 ` Linus Torvalds
2002-11-06 6:38 ` Suparna Bhattacharya
2002-11-06 7:48 ` Eric W. Biederman
@ 2002-11-06 16:13 ` Eric W. Biederman
2002-11-07 8:50 ` Eric W. Biederman
3 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-06 16:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
lkcd-general, lkcd-devel
Linus Torvalds <torvalds@transmeta.com> writes:
> >From a sanity standpoint, I think the thing already _has_ a system call,
> though: clearly "sys_reboot()" is the place to add a case for "reboot into
> this image". No? That's where we shut down devices anyway, and it's the
> sane place to say "reboot into the kexec image"
When kexec is separated into two pieces I agree. As I had it
initially in one step it does not look at all like reboot. Now I
just need to think up a new magic number for sys_reboot.
[snip wonderful vision of the theoretical simplicity of sys_kexec].
In case I was not sufficiently clear last night. It could be as
simple as your example code if I replaced vmalloc by
__get_free_pages/alloc_pages, and allocated a large contiguous area of
ram. But MAX_ORDER limits me to 8MB images, and allocating an 8MB
chunk is unreliable, and even a 2MB chunk is dangerous.
So I must use some form of scatter/gather list of pages, like
area ->pages[] to make it work. Things get tricky because I gather
(via memcpy) the pages at a location that potentially overlaps the
source pages. So I must walk through the list of pages making certain
I when I gather (memcpy) the buffer pages into their final location I
will not stomp on a buffer page I have not come to yet. Correctly
doing that untangling is where the complexity in kernel/kexec.c comes
from.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-06 6:25 ` Linus Torvalds
` (2 preceding siblings ...)
2002-11-06 16:13 ` Eric W. Biederman
@ 2002-11-07 8:50 ` Eric W. Biederman
2002-11-07 15:44 ` Linus Torvalds
` (2 more replies)
3 siblings, 3 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-07 8:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
I am now officially grumpy. From a code perspective splitting kexec
into two phases load, and execute is a simple change to make. From a
semantics standpoint things get ugly, and messy. And that means I
can't just dash off another patch.
There are currently 2 cases that it would be nice to have work.
1) Load a new kernel and immediately execute it.
2) Load a new kernel and execute it on panic.
At first glance splitting the code into a load and execute phases allows
us to use one mechanism to accomplish both goals. In practice
that does not work. There are 2 problems.
panic does not call sys_reboot it rolls that functionality by hand.
And to a certain extent it makes sense for panic to take a different
path because we know something is badly wrong so we need to be extra
careful.
In staging the image we allocate a whole pile of pages, and keep them
locked in place. Waiting for years potentially until the machine
reboots or panics. This memory is not accounted for anywhere so no
one can see that we have it allocated, which makes debugging hard.
Additionally in locking up megabytes for a long period of time we
create unsolvable fragmentation issues for the mm layer to deal with.
In a unified design I can buffer the image in the anonymous pages of a
user space process just as well as I can in locked down kernel memory.
So factoring sys_kexec in to load and execute pieces only helps for
executing the new image on a kernel panic, and that case does not
actually work.
So currently factoring kexec looks like a pointless exercise, that
will just lead to more pain.
I am left with the following questions.
- How should the pages allocated to an early loaded image be accounted
for?
- How do we avoid making life hard for the mm system with an early
loaded image?
- Is it safe to call sys_reboot from panic?
- Can we simply factor out the sequence:
notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
system_running = 0;
device_shutdown();
And place it into it's own subroutine?
- What does the current mcore implementation do? And is that a good
model to follow to resolve some of these issues?
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-07 8:50 ` Eric W. Biederman
@ 2002-11-07 15:44 ` Linus Torvalds
2002-11-09 23:05 ` Eric W. Biederman
2002-11-07 15:48 ` Linus Torvalds
2002-11-08 18:01 ` Alan Cox
2 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-11-07 15:44 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
On 7 Nov 2002, Eric W. Biederman wrote:
>
> There are currently 2 cases that it would be nice to have work.
> 1) Load a new kernel and immediately execute it.
> 2) Load a new kernel and execute it on panic.
I really don't think (1) is _ever_ a valid thing to do.
The fact is, loading a new kernel wants filesystems and a fully working
system. While executing it wants the filesystems quiescent.
> panic does not call sys_reboot it rolls that functionality by hand.
Forget about panic for now. It's a design issue - it should be possible to
work, but somebody else can do it if the infrastructure is done right.
> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.
And in a unified design, I won't apply the patches. It's that simple.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-07 15:44 ` Linus Torvalds
@ 2002-11-09 23:05 ` Eric W. Biederman
2002-11-09 23:33 ` Linus Torvalds
` (3 more replies)
0 siblings, 4 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-09 23:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
There are two cases I am seeing users wanting.
1) Load a new kernel on panic.
- Extra care must be taken so what broke the first kernel does
not break this one, and so that the shards of the old kernel
do not break it.
- Care must be taken so that loading the second kernel does not
erase valuable data that is desirable to place in a crash dump.
- This kernel cannot live at the same address as the old one, (at
least not initially).
2) Load a new kernel under normal operating conditions.
And when you have a normal user space that boils down to:
- Acquire the kernel you are going to boot.
- Run the user space shutdown scripts, so the system is in
a clean state.
- Execute the new kernel.
- The normal case is that the newly loaded kernel will live at the
same physical location where the current kernel lives.
Currently my code handles starting a new kernel under normal operating
conditions. There are currently two ways I can implement a clean user
space shutdown with out needing locked buffers in the kernel until the
very last moment.
Method 1 (This works today with my sample user space):
- copy the kernel to /newkernel
- init 6
- if [ -r /newkernel ]; then
/sbin/kexec /newkernel
else
/sbin/reboot
fi
- /sbin/kexec reads in /newkernel
- /newkernel is parsed to figure out how it should be loaded
- sys_kexec is called to copy the kernel from user space anonymous
memory to temporary kernel buffers.
Method 2 (For people with read only roots):
- /sbin/delayed_kexec /path/to/new/kernel
- Read in the /path/to/new/kernel into anonymous pages
- Parse it and figure out how it should be loaded
- Run the shutdown scripts from /etc/rc6.d/
- Call sys_kexec, which will copy the data from user space anonymous
pages, to kernel space.
This is to just make it clear that I am not working from a
FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
maintenance. I am quite willing to make changes assuming I understand
what is gained with the change.
What I currently support is a moderately nice interface, that has the
kernel doing as much as it can without being bogged down in the
specific details in any one file format, or needing something besides
a 32bit entry point to jump to.
I model an image as a set of segments of physical memory. And I copy
the image loaded with sys_kexec to it's final location, before jumping
to the new image. There are two reasons for this. It takes 3
segments to load a bzImage (setup.S, vmlinux, and an initrd). And an
arbitrary number of segments maps cleanly to a static ELF binary.
There is only one difficult case. What happens when the buffers the
kernel allocates are physically in one of the segments of memory of
the new kernel image. Possible especially for the initrd which is
loaded at the end of memory.
I then use the following algorithm to sort the potential mess out
before I jump to the new code. And since this code depends on
swapping the contents of pages, knowing the physical location of
the pages, and is not limited to 128MB I am reluctant to look a
vmalloc variant. I can more get my pages from a slab if I need to
report I have them.
static int kimage_get_off_destination_pages(struct kimage *image)
{
kimage_entry_t *ptr, *cptr, entry;
unsigned long buffer, page;
unsigned long destination = 0;
/* Here we implement safe guards to insure that
* a source page is not copied to it's destination
* page before the data on the destination page is
* no longer useful.
*
* To make it work we actually wind up with a
* stronger condition. For every page considered
* it is either it's own destination page or it is
* not a destination page of any page considered.
*
* Invariants
* 1. buffer is not a destination of a previous page.
* 2. page is not a destination of a previous page.
* 3. destination is not a previous source page.
*
* Result: Either a source page and a destination page
* are the same or the page is not a destination page.
*
* These checks could be done when we allocate the pages,
* but doing it as a final pass allows us more freedom
* on how we allocate pages.
*
* Also while the checks are necessary, in practice nothing
* happens. The destination kernel wants to sit in the
* same physical addresses as the current kernel so we never
* actually allocate a destination page.
*
* BUGS: This is a O(N^2) algorithm.
*/
buffer = __get_free_page(GFP_KERNEL);
if (!buffer) {
return -ENOMEM;
}
buffer = virt_to_phys((void *)buffer);
for_each_kimage_entry(image, ptr, entry) {
/* Here we check to see if an allocated page */
kimage_entry_t *limit;
if (entry & IND_DESTINATION) {
destination = entry & PAGE_MASK;
}
else if (entry & IND_INDIRECTION) {
/* Indirection pages must include all of their
* contents in limit checking.
*/
limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
}
if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
continue;
}
page = entry & PAGE_MASK;
limit = ptr;
/* See if a previous page has the current page as it's
* destination.
* i.e. invariant 2
*/
cptr = kimage_dst_conflict(image, page, limit);
if (cptr) {
unsigned long cpage;
kimage_entry_t centry;
centry = *cptr;
cpage = centry & PAGE_MASK;
memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
*cptr = page | (centry & ~PAGE_MASK);
*ptr = buffer | (entry & ~PAGE_MASK);
buffer = cpage;
}
if (!(entry & IND_SOURCE)) {
continue;
}
/* See if a previous page is our destination page.
* If so claim it now.
* i.e. invariant 3
*/
cptr = kimage_src_conflict(image, destination, limit);
if (cptr) {
unsigned long cpage;
kimage_entry_t centry;
centry = *cptr;
cpage = centry & PAGE_MASK;
memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
*cptr = buffer | (centry & ~PAGE_MASK);
*ptr = cpage | ( entry & ~PAGE_MASK);
buffer = page;
}
/* If the buffer is my destination page do the copy now
* i.e. invariant 3 & 1
*/
if (buffer == destination) {
memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
*ptr = buffer | (entry & ~PAGE_MASK);
buffer = page;
}
}
free_page((unsigned long)phys_to_virt(buffer));
return 0;
}
static kimage_entry_t *kimage_dst_conflict(
struct kimage *image, unsigned long page, kimage_entry_t *limit)
{
kimage_entry_t *ptr, entry;
unsigned long destination = 0;
for_each_kimage_entry(image, ptr, entry) {
if (ptr == limit) {
return 0;
}
else if (entry & IND_DESTINATION) {
destination = entry & PAGE_MASK;
}
else if (entry & IND_SOURCE) {
if (page == destination) {
return ptr;
}
destination += PAGE_SIZE;
}
}
return 0;
}
static kimage_entry_t *kimage_src_conflict(
struct kimage *image, unsigned long destination, kimage_entry_t *limit)
{
kimage_entry_t *ptr, entry;
for_each_kimage_entry(image, ptr, entry) {
unsigned long page;
if (ptr == limit) {
return 0;
}
else if (entry & IND_DESTINATION) {
/* nop */
}
else if (entry & IND_DONE) {
/* nop */
}
else {
/* SOURCE & INDIRECTION */
page = entry & PAGE_MASK;
if (page == destination) {
return ptr;
}
}
}
return 0;
}
Having had time to digest the idea of starting a new kernel on panic
I can now make some observations and what I believe it would take to
make it as robust as possible.
- On panic because random pieces of the kernel may be broken we want
to use as little of the kernel as possible.
- Therefore machine_kexec should not allocate any memory, and as
quickly as possible it should transition to the new kernel
- So a new page table should be sitting around with the new kernel
already mapped, and likewise other important tables like the
gdt, and the idt, should be pre-allocated.
- Then machine_kexec can just switch stacks, page tables, and other
machine dependent tables and jump to the new kernel.
- The load stage needs to load everything at the physical location it
will initially run at. This would likely need support from rmap.
- The load stage needs to preallocate page tables and buffers.
- The load stage would likely work easiest by either requiring a mem=xxx
line, reserving some of physical memory for the new kernel. Or
alternatively using some rmap support to clear out a swath of
physical memory the new kernel can be loaded into.
- The new kernel loaded on panic must know about the previous kernel,
and have various restrictions because of that.
Supporting a kernel loaded from a normal environment is a rather
different problem.
- It cannot be loaded at it's run location (The current kernel is
sitting there).
- It should not need to know about the previously executing kernel.
- Work can be done in machine_kexec to allocate memory so everything
does not need to be pre allocated.
- I can safely use multiple calls to the page allocator instead of
needing a special mechanism to allocate memory.
And now I go back to the silly exercise of factoring my code so the
new kernel can be kept in locked kernel memory, instead of in a file
while the shutdown scripts are run.
Unless the linux kernel is modified to copy itself to the top of
physical memory when it loads I have trouble seeing how any of this
will help make the panic case easier to implement.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:05 ` Eric W. Biederman
@ 2002-11-09 23:33 ` Linus Torvalds
2002-11-10 1:37 ` Eric W. Biederman
2002-11-09 23:39 ` Randy.Dunlap
` (2 subsequent siblings)
3 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-11-09 23:33 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
On 9 Nov 2002, Eric W. Biederman wrote:
>
> Currently my code handles starting a new kernel under normal operating
> conditions. There are currently two ways I can implement a clean user
> space shutdown with out needing locked buffers in the kernel until the
> very last moment.
PLEASE tell me why you don't just use the 20-line "vmalloc()" function I
already wrote for you?
It works for all cases - and since you do need to load the kernel into
memory anyway, it's not using any more memory than your existing code. And
it's infinitely more flexible to have a clearly separated load-process,
than having to have some load happen at reboot time (whether by init or by
anything else).
And since the kernel is fully working at the load time, you can even do
things like swap out pages in order to make room for the kernel at the
right place. So you can even do something like this:
int alloc_kernel_pages(unsigned long *array, int nr_pages,
unsigned long min_address)
{
void *bad_page_list = NULL;
int i = 0, retval;
while (i < nr_pages) {
unsigned long page = __get_free_page(GFP_USER);
if (!page)
goto oom;
if (page < min_address) {
*(void **)page = bad_page_list;
bad_page_list = (void *)page;
continue;
}
array[i] = page;
i++;
}
retval = 0;
out:
while (bad_page_list) {
unsigned long page = (unsigned long) bad_page_list;
bad_page_list = *(void **)bad_page_list;
free_page(page);
}
return retval;
oom:
while (i > 0)
free_page(array[--i]);
retval = -ENOMEM;
goto out;
}
and now you are guaranteed that all the allocated pages are above a
certain mark (change the "min_address" to be a "validity callback" or
whatever if you want to be fancy and allow arbitrary rules, which is good
if you want to allow pages in the low 1M on x86, for example), which means
that your final reboot stage is _much_much_ simpler and you don't ever
have to worry about overlap.
Use one of the pages to allocate the memcpy() trampoline and the actual
data structures used for the copy, for example. Use the rest for the
actual kernel data.
Keep it simple.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:33 ` Linus Torvalds
@ 2002-11-10 1:37 ` Eric W. Biederman
2002-11-10 2:12 ` Alan Cox
2002-11-10 3:17 ` Linus Torvalds
0 siblings, 2 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 1:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Linus Torvalds <torvalds@transmeta.com> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > Currently my code handles starting a new kernel under normal operating
> > conditions. There are currently two ways I can implement a clean user
> > space shutdown with out needing locked buffers in the kernel until the
> > very last moment.
>
> PLEASE tell me why you don't just use the 20-line "vmalloc()" function I
> already wrote for you?
The reasons I don't jump on board:
- It does not handle multiple segments.
Without multiple segments I think I simply out essential complexity
of the problem. A bzImage has at least 2.
- vmalloc is artificially limited to 128MB.
- I still have to run code to prevent imperfect overlaps. A perfect
overlap being a source buffer living in it's destination address.
- I still have to run code to find the physical addresses of the
pages, and locate those in non-destination pages, and form a linked
list of pages for that.
> It works for all cases - and since you do need to load the kernel into
> memory anyway, it's not using any more memory than your existing code. And
> it's infinitely more flexible to have a clearly separated load-process,
> than having to have some load happen at reboot time (whether by init or by
> anything else).
I am trying to process it but I don't see why having the load happen
as a seperate syscall is clearer. Having it happen as a seperate
architecture independent function I understand.
asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
struct kexec_segment *segments)
{
/* Am I using to much stack space here? */
struct kimage image;
int result;
/* We only trust the superuser with rebooting the system. */
if (!capable(CAP_SYS_BOOT))
return -EPERM;
lock_kernel();
//// This chunk does the load and there is no kernel shutdown code
//// run yet.
kimage_init(&image);
result = do_kexec(entry, nr_segments, segments, &image);
if (result) {
kimage_free(&image);
unlock_kernel();
return result;
}
//// ----------- I can snip here for your two syscall version -----------
//// This part is the kernel shutdown
/* The point of no return is here... */
notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
system_running = 0;
device_shutdown();
printk(KERN_EMERG "Starting new kernel\n");
//// And here is where I start the new kernel.
machine_kexec(&image);
}
>
> And since the kernel is fully working at the load time, you can even do
> things like swap out pages in order to make room for the kernel at the
> right place. So you can even do something like this:
I have clearly separated load code, that runs before any of the kernel
starts to shutdown. Until it completes successfully I do not start
to shutdown the kernel. My user space is shut down but that is a
different story.
Swapping out pages is nice, but when user space is shutdown there
shouldn't be any extra pages in the kernel to swap out, and if you are
that tight on memory that you need to swap it won't work, anyway.
> int alloc_kernel_pages(unsigned long *array, int nr_pages,
> unsigned long min_address)
> {
> void *bad_page_list = NULL;
> int i = 0, retval;
>
> while (i < nr_pages) {
> unsigned long page = __get_free_page(GFP_USER);
>
> if (!page)
> goto oom;
>
> if (page < min_address) {
> *(void **)page = bad_page_list;
> bad_page_list = (void *)page;
> continue;
> }
> array[i] = page;
> i++;
> }
> retval = 0;
> out:
> while (bad_page_list) {
> unsigned long page = (unsigned long) bad_page_list;
> bad_page_list = *(void **)bad_page_list;
> free_page(page);
> }
> return retval;
> oom:
> while (i > 0)
> free_page(array[--i]);
> retval = -ENOMEM;
> goto out;
> }
Which is a good algorithm but it has the potential to allocate a lot
of extra pages, and I have implemented this it in the past. It's
worst case is just nasty.
My current code allocates at most 1 extra page and works gracefully if
it happens to allocates the pages it really wanted to use. It is just
a hair more complex, and it makes everything else very simple.
> and now you are guaranteed that all the allocated pages are above a
> certain mark (change the "min_address" to be a "validity callback" or
> whatever if you want to be fancy and allow arbitrary rules, which is good
> if you want to allow pages in the low 1M on x86, for example), which means
> that your final reboot stage is _much_much_ simpler and you don't ever
> have to worry about overlap.
Exactly and that is why I do it where I do it. In the C load code.
In the kernel so it has to be written only once.
> Use one of the pages to allocate the memcpy() trampoline and the actual
> data structures used for the copy, for example. Use the rest for the
> actual kernel data.
>
> Keep it simple.
Yep.
After loading everything I have a total of 243 lines of code.
100 lines of assembly doing the copies in the trampoline.
143 lines of C modifying the page tables, the gdt, and the idt,
copying the trampoline to the correct place, and going for it.
And despite my utter puzzlement on why you want the syscall cut in two.
I will now go cut along the dotted line. If that is all it takes to
have piece I can do that. A sore head from all of the scratching
trying to figure out why it needs to be cut in two, but I can cut
sys_kexec in two.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 1:37 ` Eric W. Biederman
@ 2002-11-10 2:12 ` Alan Cox
2002-11-10 2:16 ` Eric W. Biederman
2002-11-10 3:17 ` Linus Torvalds
1 sibling, 1 reply; 72+ messages in thread
From: Alan Cox @ 2002-11-10 2:12 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> The reasons I don't jump on board:
> - It does not handle multiple segments.
> Without multiple segments I think I simply out essential complexity
> of the problem. A bzImage has at least 2.
Thats a matter for user space and the unpacker
> - vmalloc is artificially limited to 128MB.
Just grabbing a load of pages and using kmap/scatter gather by hand is
not
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:12 ` Alan Cox
@ 2002-11-10 2:16 ` Eric W. Biederman
2002-11-10 3:03 ` Werner Almesberger
2002-11-10 14:30 ` Alan Cox
0 siblings, 2 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 2:16 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> > The reasons I don't jump on board:
> > - It does not handle multiple segments.
> > Without multiple segments I think I simply out essential complexity
> > of the problem. A bzImage has at least 2.
>
> Thats a matter for user space and the unpacker
>
> > - vmalloc is artificially limited to 128MB.
>
> Just grabbing a load of pages and using kmap/scatter gather by hand is
> not
To use kmapped memory I need to setup a page table to do the final copy.
And to setup a page table I need to know where the memory is going to be copied
to.
So my gut impression at least says an interface that ignores where
the image wants to live just adds complexity in other places, and
makes for an interface that is hard to maintain long term, because
you depend on a lot of kernel implementation details, which are likely
to change in arbitrary ways.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:16 ` Eric W. Biederman
@ 2002-11-10 3:03 ` Werner Almesberger
2002-11-10 3:23 ` Eric W. Biederman
2002-11-10 14:30 ` Alan Cox
1 sibling, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-11-10 3:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Linus Torvalds, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Eric W. Biederman wrote:
> So my gut impression at least says an interface that ignores where
> the image wants to live just adds complexity in other places,
Linus' alloc_kernel_pages function would actually be able to handle
this, provided that the "validity callback" checks if the allocated
page happens to be in one of the destination areas.
I'm not so sure if this implementation is really that much more
compact than your current conflict resolution, though. Also, it may
be hairy in scenarios where you actually expect to fill more than
50% of system memory. (But your concerns about a 128MB limit scare
me, too. I realize that people have taken initrds to extremes I
never quite imagined, but that still looks a little excessive :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:03 ` Werner Almesberger
@ 2002-11-10 3:23 ` Eric W. Biederman
0 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 3:23 UTC (permalink / raw)
To: Werner Almesberger
Cc: Alan Cox, Linus Torvalds, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Werner Almesberger <wa@almesberger.net> writes:
> Eric W. Biederman wrote:
> > So my gut impression at least says an interface that ignores where
> > the image wants to live just adds complexity in other places,
>
> Linus' alloc_kernel_pages function would actually be able to handle
> this, provided that the "validity callback" checks if the allocated
> page happens to be in one of the destination areas.
>
> I'm not so sure if this implementation is really that much more
> compact than your current conflict resolution, though. Also, it may
> be hairy in scenarios where you actually expect to fill more than
> 50% of system memory. (But your concerns about a 128MB limit scare
> me, too. I realize that people have taken initrds to extremes I
> never quite imagined, but that still looks a little excessive :-)
I have not heard of more than about 90MB. One of the things I would
not be surprised to see in the next couple of years as memory gets
cheaper is diskless systems that don't even bother doing NFS root and
just put everything in an initrd. But that is not the main concern.
Since there are more polite ways of allocating memory already
implemented. Sucking up a 16MB hunk of some ones vmalloc space is
quite rude. Currently the limit is pretty much 50% of system memory
or 1GB whichever is less because the code must be loaded into user
space first, and I don't touch high memory. Although I guess if it
was mmaped read only the limit may be higher.
I don't expect to come to close to using all of system memory
except on limited memory systems. But it is always nice to be
polite.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:16 ` Eric W. Biederman
2002-11-10 3:03 ` Werner Almesberger
@ 2002-11-10 14:30 ` Alan Cox
2002-11-10 16:56 ` Eric W. Biederman
1 sibling, 1 reply; 72+ messages in thread
From: Alan Cox @ 2002-11-10 14:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> To use kmapped memory I need to setup a page table to do the final copy.
> And to setup a page table I need to know where the memory is going to be copied
> to.
And ?
I find it hard to believe you can't drive an MMU if you can write code
that boots one Linux from another
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 14:30 ` Alan Cox
@ 2002-11-10 16:56 ` Eric W. Biederman
0 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 16:56 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> > To use kmapped memory I need to setup a page table to do the final copy.
> > And to setup a page table I need to know where the memory is going to be
> copied
>
> > to.
>
> And ?
>
> I find it hard to believe you can't drive an MMU if you can write code
> that boots one Linux from another
One of the simplifying things I do between OS's is turn of the MMU, or
at least give it a 1-1 trivial mapping with physical memory.
If all of that memory is hanging out there forever. It probably makes sense
to be high memory capable. But for the first rev of this I won't be.
Addresses > 4GB are a major pain to work with on x86.
But I do have a test machine that can reproduce that so I can test for
strange bugs. I added a BIOS option to put all but 512M out of 4GB
above the 4GB line.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 1:37 ` Eric W. Biederman
2002-11-10 2:12 ` Alan Cox
@ 2002-11-10 3:17 ` Linus Torvalds
2002-11-10 4:26 ` Eric W. Biederman
2002-11-11 18:03 ` Eric W. Biederman
1 sibling, 2 replies; 72+ messages in thread
From: Linus Torvalds @ 2002-11-10 3:17 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
On 9 Nov 2002, Eric W. Biederman wrote:
>
> And despite my utter puzzlement on why you want the syscall cut in two.
I'm amazed about your puzzlement, since everybody else seem to get my
arguments, but as long as you play along I don't much care.
I will explain once more why it needs to be cut into two, even if you're
apparently willing to do it even without understanding:
When you reboot, you often cannot load the image.
This is _trivially_ true for panics or things like
- I don't understand why you do not want to accept this. Even if
your code doesn't even _handle_ panics, it's so obvious that
this is true that I don't understand why you want a limitation
in your particular current implementation to be a fundamental
flaw of the whole idea.
But it is _also_ true for any standard setup where you don't have
a special "init" that knows about loading the kernel, and where to
load it from.
- Do you want to rewrite every "init" setup out there, adding
some way to tell init where to load the kernel from?
Or do you want to just split the thing in two, so that you can
load the kernel _before_ you ask init to shut down, and just
happily use bog-standard tools that everybody is already
familiar with..
The two-part loader can clearly handle both cases. And if _you_ don't want
a two-part loader, you can do exactly what you do now by just doing two
system calls.
As to vmalloc - I don't actually much care how the first and second parts
are implemented. I suggested a vmalloc()-like approach just because your
patch looks unnecessarily complicated to me. But while I am convinced that
the two-phase loading/exec is absolutely the way to do it, the actual
low-level implementation is just a detail.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:17 ` Linus Torvalds
@ 2002-11-10 4:26 ` Eric W. Biederman
2002-11-11 18:03 ` Eric W. Biederman
1 sibling, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 4:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Linus Torvalds <torvalds@transmeta.com> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > And despite my utter puzzlement on why you want the syscall cut in two.
>
> I'm amazed about your puzzlement, since everybody else seem to get my
> arguments, but as long as you play along I don't much care.
>
> I will explain once more why it needs to be cut into two, even if you're
> apparently willing to do it even without understanding:
>
> When you reboot, you often cannot load the image.
>
> This is _trivially_ true for panics or things like
That the load needs to be separate for handling panics is trivially
true. I simply have a very hard time believing that the load you want
for the normal case will be the load you want for a panic. I think
I want to be much more paranoid in preparing for the kernel to blow
up. And I want to move data around much more carefully. And that
care adds restrictions I want for the normal case.
So splitting it up to prepare for panic handling just looks like over
design.
> But it is _also_ true for any standard setup where you don't have
> a special "init" that knows about loading the kernel, and where to
> load it from.
>
> - Do you want to rewrite every "init" setup out there, adding
> some way to tell init where to load the kernel from?
>
> Or do you want to just split the thing in two, so that you can
> load the kernel _before_ you ask init to shut down, and just
> happily use bog-standard tools that everybody is already
> familiar with..
When you can change the init setup with just a couple of lines of
shell script seeing if file exists in magic location (say a special
ramfs or tmpfs), I guess it does not look hard to me.
> The two-part loader can clearly handle both cases. And if _you_ don't want
> a two-part loader, you can do exactly what you do now by just doing two
> system calls.
Right which is why I don't much care, so long as I don't have to test
reboot on panic any time soon...
I doubt we will see eye to eye on this one. So I will now finish up
the patch to split this all up.
> As to vmalloc - I don't actually much care how the first and second parts
> are implemented. I suggested a vmalloc()-like approach just because your
> patch looks unnecessarily complicated to me.
I'd love to make it simpler as well if I saw a clear opportunity that
wasn't just moving the complexity somewhere else. But when I really
look at it I think that I am just wordy.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:17 ` Linus Torvalds
2002-11-10 4:26 ` Eric W. Biederman
@ 2002-11-11 18:03 ` Eric W. Biederman
1 sibling, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-11 18:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh
Linus Torvalds <torvalds@transmeta.com> writes:
> On 9 Nov 2002, Eric W. Biederman wrote:
> >
> > And despite my utter puzzlement on why you want the syscall cut in two.
>
> I'm amazed about your puzzlement, since everybody else seem to get my
> arguments, but as long as you play along I don't much care.
I think this comes from being the guy down in the trenches implementing
the code. And it is sometimes hard to look up, far enough to have design
discussions.
I totally agree that having a load/exec split is the right
approach now that I can imagine an implementation where the code will
actually work for the panic case. Before it felt like lying. Doing
the split-up, promising that kexec on panic will work eventually,
when I could not even see it as a possibility was at the core of my
objections.
What brought me around is that I can add a flag field to kexec_load.
With that flag field I can tell the kernel please step extra carefully
this code will be used to handle kexec on panic. Without that I may
be up a creek without a paddle for figuring out how to debug that code.
To be able to support this at all I have had to be very creative in
inventing debugging code. Which is why I have the serial console
program kexec_test. It provides visibility into what is happening
when nothing else will. That and memtest86 which will occasionally
catch DMA's that have not been stopped, (memory errors on good ram) I
at least have a place to start rather than a blank screen when
guessing why the new kernel did not start up.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:05 ` Eric W. Biederman
2002-11-09 23:33 ` Linus Torvalds
@ 2002-11-09 23:39 ` Randy.Dunlap
2002-11-10 2:58 ` Eric W. Biederman
2002-11-10 1:31 ` Werner Almesberger
2002-11-10 2:08 ` Alan Cox
3 siblings, 1 reply; 72+ messages in thread
From: Randy.Dunlap @ 2002-11-09 23:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel
{warning: cc: list too large :}
On 9 Nov 2002, Eric W. Biederman wrote:
| There are two cases I am seeing users wanting.
| 1) Load a new kernel on panic.
| - Extra care must be taken so what broke the first kernel does
| not break this one, and so that the shards of the old kernel
| do not break it.
| - Care must be taken so that loading the second kernel does not
| erase valuable data that is desirable to place in a crash dump.
| - This kernel cannot live at the same address as the old one, (at
| least not initially).
Conceptually we would like a new kernel on panic, although
I doubt that it's normally safe to "load a new kernel on panic."
Or maybe it depends on the definition of "load."
What I'm trying to say is that I think the new kernel must
already be loaded when the panic happens.
Is that what you describe later (below)?
| 2) Load a new kernel under normal operating conditions.
| And when you have a normal user space that boils down to:
| - Acquire the kernel you are going to boot.
| - Run the user space shutdown scripts, so the system is in
| a clean state.
| - Execute the new kernel.
| - The normal case is that the newly loaded kernel will live at the
| same physical location where the current kernel lives.
|
|
| Currently my code handles starting a new kernel under normal operating
| conditions. There are currently two ways I can implement a clean user
| space shutdown with out needing locked buffers in the kernel until the
| very last moment.
|
| Method 1 (This works today with my sample user space):
| - copy the kernel to /newkernel
| - init 6
| - if [ -r /newkernel ]; then
| /sbin/kexec /newkernel
| else
| /sbin/reboot
| fi
| - /sbin/kexec reads in /newkernel
| - /newkernel is parsed to figure out how it should be loaded
| - sys_kexec is called to copy the kernel from user space anonymous
| memory to temporary kernel buffers.
|
| Method 2 (For people with read only roots):
| - /sbin/delayed_kexec /path/to/new/kernel
| - Read in the /path/to/new/kernel into anonymous pages
| - Parse it and figure out how it should be loaded
| - Run the shutdown scripts from /etc/rc6.d/
| - Call sys_kexec, which will copy the data from user space anonymous
| pages, to kernel space.
|
| This is to just make it clear that I am not working from a
| FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
| maintenance. I am quite willing to make changes assuming I understand
| what is gained with the change.
|
|
| What I currently support is a moderately nice interface, that has the
| kernel doing as much as it can without being bogged down in the
| specific details in any one file format, or needing something besides
| a 32bit entry point to jump to.
|
| I model an image as a set of segments of physical memory. And I copy
| the image loaded with sys_kexec to it's final location, before jumping
| to the new image. There are two reasons for this. It takes 3
| segments to load a bzImage (setup.S, vmlinux, and an initrd). And an
| arbitrary number of segments maps cleanly to a static ELF binary.
|
| There is only one difficult case. What happens when the buffers the
| kernel allocates are physically in one of the segments of memory of
| the new kernel image. Possible especially for the initrd which is
| loaded at the end of memory.
|
| I then use the following algorithm to sort the potential mess out
| before I jump to the new code. And since this code depends on
| swapping the contents of pages, knowing the physical location of
| the pages, and is not limited to 128MB I am reluctant to look a
| vmalloc variant. I can more get my pages from a slab if I need to
| report I have them.
|
[code deleted]
|
| Having had time to digest the idea of starting a new kernel on panic
| I can now make some observations and what I believe it would take to
| make it as robust as possible.
|
| - On panic because random pieces of the kernel may be broken we want
| to use as little of the kernel as possible.
|
| - Therefore machine_kexec should not allocate any memory, and as
| quickly as possible it should transition to the new kernel
|
| - So a new page table should be sitting around with the new kernel
| already mapped, and likewise other important tables like the
| gdt, and the idt, should be pre-allocated.
|
| - Then machine_kexec can just switch stacks, page tables, and other
| machine dependent tables and jump to the new kernel.
|
| - The load stage needs to load everything at the physical location it
| will initially run at. This would likely need support from rmap.
|
| - The load stage needs to preallocate page tables and buffers.
|
| - The load stage would likely work easiest by either requiring a mem=xxx
| line, reserving some of physical memory for the new kernel. Or
| alternatively using some rmap support to clear out a swath of
| physical memory the new kernel can be loaded into.
|
| - The new kernel loaded on panic must know about the previous kernel,
| and have various restrictions because of that.
|
|
| Supporting a kernel loaded from a normal environment is a rather
| different problem.
|
| - It cannot be loaded at it's run location (The current kernel is
| sitting there).
|
| - It should not need to know about the previously executing kernel.
|
| - Work can be done in machine_kexec to allocate memory so everything
| does not need to be pre allocated.
|
| - I can safely use multiple calls to the page allocator instead of
| needing a special mechanism to allocate memory.
|
|
| And now I go back to the silly exercise of factoring my code so the
| new kernel can be kept in locked kernel memory, instead of in a file
| while the shutdown scripts are run.
|
| Unless the linux kernel is modified to copy itself to the top of
| physical memory when it loads I have trouble seeing how any of this
| will help make the panic case easier to implement.
|
| Eric
| -
--
~Randy
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:39 ` Randy.Dunlap
@ 2002-11-10 2:58 ` Eric W. Biederman
2002-11-10 14:35 ` Alan Cox
0 siblings, 1 reply; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 2:58 UTC (permalink / raw)
To: Randy.Dunlap
Cc: Eric W. Biederman, Linus Torvalds, Alan Cox, Werner Almesberger,
Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel
"Randy.Dunlap" <rddunlap@osdl.org> writes:
> {warning: cc: list too large :}
>
> On 9 Nov 2002, Eric W. Biederman wrote:
>
> | There are two cases I am seeing users wanting.
> | 1) Load a new kernel on panic.
> | - Extra care must be taken so what broke the first kernel does
> | not break this one, and so that the shards of the old kernel
> | do not break it.
> | - Care must be taken so that loading the second kernel does not
> | erase valuable data that is desirable to place in a crash dump.
> | - This kernel cannot live at the same address as the old one, (at
> | least not initially).
>
> Conceptually we would like a new kernel on panic, although
> I doubt that it's normally safe to "load a new kernel on panic."
> Or maybe it depends on the definition of "load."
>
> What I'm trying to say is that I think the new kernel must
> already be loaded when the panic happens.
> Is that what you describe later (below)?
Yes that was my meaning. The new kernel must be preloaded.
And only started on panic.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:58 ` Eric W. Biederman
@ 2002-11-10 14:35 ` Alan Cox
2002-11-10 18:13 ` Eric W. Biederman
0 siblings, 1 reply; 72+ messages in thread
From: Alan Cox @ 2002-11-10 14:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Randy.Dunlap, Linus Torvalds, Werner Almesberger,
Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel
On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > What I'm trying to say is that I think the new kernel must
> > already be loaded when the panic happens.
> > Is that what you describe later (below)?
>
> Yes that was my meaning. The new kernel must be preloaded.
> And only started on panic.
Another question from the point of view of unifying things. What is
wrong with
insmod kexec
creates /dev/kexec (or kexecfs is you are Al Viro)
hooks the reboot and panic final notifiers
user copies file to /dev/kexec (which stuffs it into ram)
reboot
kexec module handler jumps to the first page of the
kexec data in a defined state assuming its PIC
At which point we have clearly reduced kexec/oops reporter/lkcd/netdump
to a single common tiny interface.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 14:35 ` Alan Cox
@ 2002-11-10 18:13 ` Eric W. Biederman
0 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 18:13 UTC (permalink / raw)
To: Alan Cox
Cc: Randy.Dunlap, Linus Torvalds, Werner Almesberger,
Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > > What I'm trying to say is that I think the new kernel must
> > > already be loaded when the panic happens.
> > > Is that what you describe later (below)?
> >
> > Yes that was my meaning. The new kernel must be preloaded.
> > And only started on panic.
>
> Another question from the point of view of unifying things. What is
> wrong with
>
> insmod kexec
> creates /dev/kexec (or kexecfs is you are Al Viro)
> hooks the reboot and panic final notifiers
> user copies file to /dev/kexec (which stuffs it into ram)
>
> reboot
> kexec module handler jumps to the first page of the
> kexec data in a defined state assuming its PIC
>
>
> At which point we have clearly reduced kexec/oops reporter/lkcd/netdump
> to a single common tiny interface.
It would take a special hook that ran after the notifiers, and
device_shutdown. At least in the normal case running what shutdown
code we can is fairly important. And hooking the notifier lists
would not give a guarantee of going last.
There is a long ways to go in working with device drivers to even get
the easy kexec case working stably, in non-special circumstances.
The kernel gets there great but it does not cope well with the APICs
activated and the legacy pic disabled during bootup.
The additional device shutdown code is useful even in the normal
reboot path. Most BIOS's don't care but it should fix a few problems
with BIOS that are not as paranoid about the state of the system as
they should be when reboot is called. Little things like always
shutting down on the bootstrap cpu are on my todo list.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:05 ` Eric W. Biederman
2002-11-09 23:33 ` Linus Torvalds
2002-11-09 23:39 ` Randy.Dunlap
@ 2002-11-10 1:31 ` Werner Almesberger
2002-11-10 3:10 ` Eric W. Biederman
2002-11-10 2:08 ` Alan Cox
3 siblings, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-11-10 1:31 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
Eric W. Biederman wrote:
> - Extra care must be taken so what broke the first kernel does
> not break this one, and so that the shards of the old kernel
> do not break it.
For this, you should checksum the data that you've pre-loaded, and
verify it before rebooting. If the pre-loaded kernel has been hit,
you just do a normal reboot. (In the case if a bzImage, you'd
probably fail uncompression anyway.)
Alternatively, you could also wire this into the uncompression
functions (i.e. reboot if bzImage or initrd don't uncompress
cleanly), but this would be more intrusive.
> - Care must be taken so that loading the second kernel does not
> erase valuable data that is desirable to place in a crash dump.
Or copy all "interesting" memory to a safe place before the kexec.
I don't quite like the idea of building a kernel that "knows" which
addresses it isn't supposed to touch, and I think being able to use
the same kernel binary for regular and panic use would be a
desirable feature.
Also, firmware may not give you the choice of preserving all memory,
so you need that "copy memory to a safe place" functionality anyway.
Furthermore, you most likely want to checksum that memory, too.
But ... I think you're designing too far ahead. The "load kernel on
panic" part isn't trivial, and I think it would be better to tackle
this in a second phase. For now, having a reasonably generic kexec
mechanism would be all that's needed in term of building blocks.
> Method 2 (For people with read only roots):
> - /sbin/delayed_kexec /path/to/new/kernel
> - Read in the /path/to/new/kernel into anonymous pages
There's no delayed_kexec in kexec-tools 1.4, so let me gues how
this would work: as far as I know, there's no way for regular
user space to create a persistent unreferenced memory object, so
you'd probably load the data, perhaps mlock the pages, and then
fork a process that keeps the data in memory. Then, this process
would probably call sys_kexec upon reception of a signal, or
such.
Unfortunately, init assumes that it can SIGKILL all non-init
processes (that is, all processes with PID != 1). Worse yet, this
assumption makes sense, because walking the process list and
killing each of them individually would be racy.
So you'd either have to add this race condition to init, add some
magic to make this type of killing atomic, teach the kernel that
your kexec memory keeper process is somehow magic too, or merge
kexec into init. Not nice.
> I then use the following algorithm to sort the potential mess out
> before I jump to the new code.
I like this approach. It gives you complete freedom of where to
load data. This also makes it future-proof. But I don't see the
reason why you couldn't do the same thing with vmalloc. Using
vmalloc may actually simplify your code a little.
> Having had time to digest the idea of starting a new kernel on panic
> I can now make some observations and what I believe it would take to
> make it as robust as possible.
That pretty much sums it up, yes. But as I've said, this isn't
really something that needs to be implemented at the same time
as the basic kexec functionality. A two-phase kexec with
unrestricted copying capabilities should be a good enough
building block that only minor changes, if any, would be needed
when adding kexec-on-panic.
> And now I go back to the silly exercise of factoring my code so the
> new kernel can be kept in locked kernel memory, instead of in a file
> while the shutdown scripts are run.
Not silly :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 1:31 ` Werner Almesberger
@ 2002-11-10 3:10 ` Eric W. Biederman
2002-11-10 3:30 ` Werner Almesberger
2002-11-10 3:49 ` Linus Torvalds
0 siblings, 2 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 3:10 UTC (permalink / raw)
To: Werner Almesberger
Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
Werner Almesberger <wa@almesberger.net> writes:
>
> But ... I think you're designing too far ahead. The "load kernel on
> panic" part isn't trivial, and I think it would be better to tackle
> this in a second phase. For now, having a reasonably generic kexec
> mechanism would be all that's needed in term of building blocks.
I'm not designing yet, just looking and what I see says that it
does not very much resemble the non panic case.
> > Method 2 (For people with read only roots):
> > - /sbin/delayed_kexec /path/to/new/kernel
> > - Read in the /path/to/new/kernel into anonymous pages
>
> There's no delayed_kexec in kexec-tools 1.4, so let me gues how
> this would work: as far as I know, there's no way for regular
> user space to create a persistent unreferenced memory object, so
> you'd probably load the data, perhaps mlock the pages, and then
> fork a process that keeps the data in memory. Then, this process
> would probably call sys_kexec upon reception of a signal, or
> such.
What I was thinking is that the process would for and exec
something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
And that script would do all of the user space shutdown.
No need to mlock any pages, or hack init, or special hacks.
Just user space cleanly shutting itself down.
>
> > I then use the following algorithm to sort the potential mess out
> > before I jump to the new code.
>
> I like this approach. It gives you complete freedom of where to
> load data. This also makes it future-proof. But I don't see the
> reason why you couldn't do the same thing with vmalloc. Using
> vmalloc may actually simplify your code a little.
Mostly it's a bird in the hand versus a bird in the bush. I simply
see nowhere that vmalloc makes my code simpler.
> > Having had time to digest the idea of starting a new kernel on panic
> > I can now make some observations and what I believe it would take to
> > make it as robust as possible.
>
> That pretty much sums it up, yes. But as I've said, this isn't
> really something that needs to be implemented at the same time
> as the basic kexec functionality. A two-phase kexec with
> unrestricted copying capabilities should be a good enough
> building block that only minor changes, if any, would be needed
> when adding kexec-on-panic.
My feel is that kexec-on-panic is a rather different problem. Which
is why I thought it all through, to see if they felt close. At the
very least you almost need to know that it is the same.
>
> > And now I go back to the silly exercise of factoring my code so the
> > new kernel can be kept in locked kernel memory, instead of in a file
> > while the shutdown scripts are run.
>
> Not silly :-)
Except for the part about getting Linus to accept it I don't see
the advantage. kexec-on-panic looks different enough that I don't
think it will help at all with that case.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:10 ` Eric W. Biederman
@ 2002-11-10 3:30 ` Werner Almesberger
2002-11-10 3:49 ` Eric W. Biederman
2002-11-10 3:49 ` Linus Torvalds
1 sibling, 1 reply; 72+ messages in thread
From: Werner Almesberger @ 2002-11-10 3:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
Eric W. Biederman wrote:
> What I was thinking is that the process would for and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.
Yes, but init also does a kill(-1,...) to get rid of all processes,
before the last steps of system shutdown. So you have to somehow
make your "page holding" process survive beyond this point.
> My feel is that kexec-on-panic is a rather different problem.
You make it a different problem by assuming that you'd have a
kernel that is specifically built for running at a "safe"
location. If you assume that you're just using your normal
kernel, the two problems converge again. There are still a
few things that are different, like the checksumming, but
they can safely be added at a later time.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:30 ` Werner Almesberger
@ 2002-11-10 3:49 ` Eric W. Biederman
0 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 3:49 UTC (permalink / raw)
To: Werner Almesberger
Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
Werner Almesberger <wa@almesberger.net> writes:
> Eric W. Biederman wrote:
> > What I was thinking is that the process would for and exec
> > something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> > And that script would do all of the user space shutdown.
>
> Yes, but init also does a kill(-1,...) to get rid of all processes,
> before the last steps of system shutdown. So you have to somehow
> make your "page holding" process survive beyond this point.
True. But it is just as easy to drop the file into something like
ramfs. Or a file on the read only file on the root filesystem. Now
that we can having shutdown do a pivot_root and totally unmounting
the root filesystem is probably a good idea.
> > My feel is that kexec-on-panic is a rather different problem.
>
> You make it a different problem by assuming that you'd have a
> kernel that is specifically built for running at a "safe"
> location.
Well at least the part cleans up after the running kernel. That is
what I think it takes to make it stable. Perhaps I am wrong, but
I think getting other architecture stable is very hard.
> If you assume that you're just using your normal
> kernel, the two problems converge again. There are still a
> few things that are different, like the checksumming, but
> they can safely be added at a later time.
I guess I can be proven wrong.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 3:10 ` Eric W. Biederman
2002-11-10 3:30 ` Werner Almesberger
@ 2002-11-10 3:49 ` Linus Torvalds
1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2002-11-10 3:49 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Werner Almesberger, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
On 9 Nov 2002, Eric W. Biederman wrote:
>
> What I was thinking is that the process would for and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.
>
> No need to mlock any pages, or hack init, or special hacks.
> Just user space cleanly shutting itself down.
Ehh.. You do realize that the above doesn't actually _work_?
First off, "all the user space shutdown" includes things like turning off
networking. Oh, and if you're on a NFS-root system, your process is now
officially _toast_.
Unless you do the "mlockall()" etc that you seem to think that you don't
need, that is.
In other words: oh, yes, you do need those mlock() calls.
And never mind the fact that everybody has a slightly different "init"
setup, so executing "/etc/rc 6" may not actually _do_ anything. So now you
need to learn about all the different initscripts, and get that right.
And btw, thanks to the mlockall() requirements, you actually end up
pinning more memory than the two-phase approach ever would have done while
you do all this.
You then need to do the pre-loading anyway for the "kexec-on-panic" case
that you think is so different (I don't understand why you think a reboot
is different from another reboot, but whatever). So now you not only
maintain something that knows about many different init scripts and uses
more memory, it also ends up doing the same reboot thing at least two
different ways.
-- meanwhile, in another universe --
With the two-way separation, you don't have any of these problems. Your
maintenance nightmare has now become _one_ added script:
/etc/rc.d/rc6.d/K00loadkernel
and people using other init script variants can trivially add a script to
match their setup. Done. No maintenance headache, no special init
binaries, no torn-out-hair.
And by adding a startup script, you can have a _different_ small "debug
dump" kernel loaded early, so that if the machine reboots without going
through the controlled sequence, it will automatically boot into that
debug kernel..
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 23:05 ` Eric W. Biederman
` (2 preceding siblings ...)
2002-11-10 1:31 ` Werner Almesberger
@ 2002-11-10 2:08 ` Alan Cox
2002-11-10 2:18 ` Eric W. Biederman
3 siblings, 1 reply; 72+ messages in thread
From: Alan Cox @ 2002-11-10 2:08 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> There are two cases I am seeing users wanting.
> 1) Load a new kernel on panic.
Load a new *something* on panic. That something might be a new kernel
but it might also be a kernel dump system like LKCD or a debugger front
end for something like kdb, or a network dump module, or ...
Alan
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:08 ` Alan Cox
@ 2002-11-10 2:18 ` Eric W. Biederman
2002-11-10 14:31 ` Alan Cox
0 siblings, 1 reply; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-10 2:18 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
> On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> > There are two cases I am seeing users wanting.
> > 1) Load a new kernel on panic.
>
> Load a new *something* on panic. That something might be a new kernel
> but it might also be a kernel dump system like LKCD or a debugger front
> end for something like kdb, or a network dump module, or ...
And if it isn't a kernel why not load it as a module? The code
has to come preloaded anyway.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-10 2:18 ` Eric W. Biederman
@ 2002-11-10 14:31 ` Alan Cox
0 siblings, 0 replies; 72+ messages in thread
From: Alan Cox @ 2002-11-10 14:31 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
lkcd-general, lkcd-devel
On Sun, 2002-11-10 at 02:18, Eric W. Biederman wrote:
> > Load a new *something* on panic. That something might be a new kernel
> > but it might also be a kernel dump system like LKCD or a debugger front
> > end for something like kdb, or a network dump module, or ...
>
> And if it isn't a kernel why not load it as a module? The code
> has to come preloaded anyway.
You may want to load it as a module or via syscall request. Doesn't
matter which really. But you do want all the intelligence in the loaded
code not in the reboot stub of the dying code.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-07 8:50 ` Eric W. Biederman
2002-11-07 15:44 ` Linus Torvalds
@ 2002-11-07 15:48 ` Linus Torvalds
2002-11-08 18:01 ` Alan Cox
2 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2002-11-07 15:48 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
On 7 Nov 2002, Eric W. Biederman wrote:
>
> In staging the image we allocate a whole pile of pages, and keep them
> locked in place. Waiting for years potentially until the machine
> reboots or panics. This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
So how about facing the fact that my "vmalloc()" approach actually solves
all these issues. The memory is visible to the rest of the system (few
things care about it right now, but it _is_ accounted for and things like
/dev/kmem will actually see it etc).
And the vmalloc() approach is even portable, so one of the two phases is
something that is totally generic (and the second phase is almost totally
architecture-dependent anyway).
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-07 8:50 ` Eric W. Biederman
2002-11-07 15:44 ` Linus Torvalds
2002-11-07 15:48 ` Linus Torvalds
@ 2002-11-08 18:01 ` Alan Cox
2 siblings, 0 replies; 72+ messages in thread
From: Alan Cox @ 2002-11-08 18:01 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
Linux Kernel Mailing List, lkcd-general, lkcd-devel
On Thu, 2002-11-07 at 08:50, Eric W. Biederman wrote:
> panic does not call sys_reboot it rolls that functionality by hand.
> And to a certain extent it makes sense for panic to take a different
> path because we know something is badly wrong so we need to be extra
> careful.
However both of them should use the same end point routines and the
hooks should go there
> reboots or panics. This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
> Additionally in locking up megabytes for a long period of time we
> create unsolvable fragmentation issues for the mm layer to deal with.
We have an MMU so if you just n thousand "get me a page" calls its quite
happy.
> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.
> So factoring sys_kexec in to load and execute pieces only helps for
> executing the new image on a kernel panic, and that case does not
> actually work.
What if your user space is swapped out - you can't page it back in
safely
> - How should the pages allocated to an early loaded image be accounted
> for?
Just get_free_page them - if you can handle it over 4Gb then specify
that high pages are fine and kmap them to copy into them - that makes
the VM on giant boxes way happier. For bonus points also adjust the
virtual memory accounting.
> - How do we avoid making life hard for the mm system with an early
> loaded image?
Not really, especially if you are allowing high pages
> - Is it safe to call sys_reboot from panic?
No but both can call sys_machine_restart or whatever
> - Can we simply factor out the sequence:
> notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> system_running = 0;
> device_shutdown();
> And place it into it's own subroutine?
Don't do that sequence on a panic IMHO (this is a standing issue, we
should not pass NULL but REBOOT/PANIC/KEXEC/... so the drivers can make
that decision - then we can do it right
Alan
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-05 18:00 ` Werner Almesberger
2002-11-05 18:36 ` Alan Cox
@ 2002-11-09 21:21 ` Pavel Machek
2002-11-11 16:27 ` Eric W. Biederman
1 sibling, 1 reply; 72+ messages in thread
From: Pavel Machek @ 2002-11-09 21:21 UTC (permalink / raw)
To: Werner Almesberger
Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general,
lkcd-devel
Hi!
> > Yes, we are putting [MCORE] in as one of the alternative dump targets
> > available.
>
> Great !
>
> > Its not quite ready yet and we need something like kexec to be
> > available which we can use on Intel systems to achieve the softboot
> > (the acceptance status of that still doesn't seem to be clear),
>
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.
>
> > Why do we even consider the other options when we are doing
> > this already ? Well, as we discussed earlier there's non-disruptive
> > dumps for one, where this wouldn't work.
>
> But they're very different anyway, aren't they ? I mean, you could
> even implement them (well, almost) from user space, with today's
> kernels.
>
> > The other is that before overwriting
> > memory we need to be able to stop all activity in the system for certain
> > (system may appear hung/locked up) and I'm not fully certain about
> > how to do this for all environments. Maybe an answer lies in
> > rethinking some parts of the algorithm a bit.
>
> This is certainly the hairiest part, yes. I think we have about
> four types of devices/elements to worry about:
>
> - those that just sit there, and never talk unless spoken to
> - those that may generate interrupts
> - those that DMA if you ask them nicely
> - those that DMA when they feel like it (e.g. copy an incoming
> network packet to the next buffer in the free list)
>
> The latter are the real problem. I see the following possibilities
> for dealing with them:
>
> - faith-based computing: pray that nothing bad will befall your
> system :-)
> - de-activate them individually. There should be a lot of work
> that can be shared with power management. And that's one of
> the reasons why I think the memory target should be available
> early, or convergence will take forever.
I have very similar problem in swsusp (need to deactivate DMA
devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
for that.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-11-09 21:21 ` Pavel Machek
@ 2002-11-11 16:27 ` Eric W. Biederman
0 siblings, 0 replies; 72+ messages in thread
From: Eric W. Biederman @ 2002-11-11 16:27 UTC (permalink / raw)
To: Pavel Machek
Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
lkcd-general, lkcd-devel
Pavel Machek <pavel@ucw.cz> writes:
> I have very similar problem in swsusp (need to deactivate DMA
> devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
> for that.
Yes. The problem and the solutions are very similar. Because you are
restoring the kernel code I don't think we can use the same functions,
but similar work needs to be done. The correct hook for reboots,
halts, kexec, and other cases where the kernel is going away is
device_shutdown which currently calls device->shutdown(). Since the
implementation has changed recently to avoid other problems no one
actually implements the shutdown method at the moment. Once that
happens we can probably kill the reboot notifiers. But there is a lot
of driver work to do on that score.
Eric
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: What's left over.
@ 2002-10-31 15:46 Linus Torvalds
2002-10-31 19:33 ` [lkcd-devel] " Castor Fu
0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2002-10-31 15:46 UTC (permalink / raw)
To: Matt D. Robinson; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> Linus Torvalds wrote:
> > > Crash Dumping (LKCD)
> >
> > This is definitely a vendor-driven thing. I don't believe it has any
> > relevance unless vendors actively support it.
>
> There are people within IBM in Germany, India and England, as well as
> a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> that are PAID to support this.
That's fine. And since they are paid to support it, they can apply the
patches.
What I'm saying by "vendor driven" is that it has no relevance for the
standard kernel, and since it has no relevance to that, then I have no
incentives to merge it. The crash dump is only useful with people who
actively look at the dumps, and I don't know _anybody_ outside of the
specialized vendors you mention who actually do that.
I will merge it when there are real users who want it - usually as a
result of having gotten used to it through a vendor who supports it. (And
by "support" I do not mean "maintain the patches", but "actively uses it"
to work out the users problems or whatever).
Horse before the cart and all that thing.
People have to realize that my kernel is not for random new features. The
stuff I consider important are things that people use on their own, or
stuff that is the base for other work. Quite often I want vendors to merge
patches _they_ care about long long before I will merge them (examples of
this are quite common, things like reiserfs and ext3 etc).
THAT is what I mean by vendor-driven. If vendors decide they really want
the patches, and I actually start seeing noises on linux-kernel or getting
requests for it being merged from _users_ rather than developers, then
that means that the vendor is on to something.
Linus
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [lkcd-devel] Re: What's left over.
2002-10-31 15:46 Linus Torvalds
@ 2002-10-31 19:33 ` Castor Fu
0 siblings, 0 replies; 72+ messages in thread
From: Castor Fu @ 2002-10-31 19:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel
On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
Add 3PAR and probably a number of other small companies given the traffic
on the lists. Anyone building a new product on Linux and mucking
around inside the kernel, and having more than a handful of developers
is going to want LKCD, or Mission Critical's mcore, or netdump, or
something like it.
It's a shame that right out of the gate they'll have to spend time
figuring out which of these solutions work for them. I spent at least
a month of my life just looking at what's out there, and trying to make
each of them work with our product. It'd be nice if that time were
spent on making new "cool stuff".
Since then, we've put significant amounts of work into making LKCD
reliable on our system, and it's been incredibly useful in our
development. It's going to prove invaluable supporting our stuff in
the field.
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
>
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
If you asked me if 3PAR is a "vendor" or a "user" I'd have to say "yes".
As a vendor we sell our system to customers. They could not care less
that LKCD is in the linux kernel distribution. All they care about is
that we fix their problems as fast as possible. They probably have
no idea that this is the underlying technology, so you will never
hear from them about us.
However, we also use linux for desktops, build servers, database servers, etc.
When we have problems with these systems, we'd LOVE to be able to use the
same expertise and technology which we've developed for our system, but
all too often we find that someone just grabbed a Redhat 7.x disk or
standard debian distro to build the system.
So as a "user" I'm asking the distribution vendors, please make it easy
for me to use the same damn tools everywhere by providing some sort
of common crash dump mechanism. It'll make it easier for me to consider new
hardware, new software, etc. One thing that's awesome is Dave Anderson's
"crash" tool. It works with LKCD dumps, netdump dumps, etc. It's an example
of a tool which has leveraged all the different dump communities.
As a "vendor" please put LKCD or something like it into the main line
kernel. LKCD works. It has an active developer community which has
been extending it to work over networks, onto disks, developing new
analysis tools, etc. If we can settle on one such tool, we'll get
more cool stuff like lock analyzers, etc. Until then, we WILL keep
re-inventing the wheel because this is one of the first steps to
collect significant amounts of real data.
-castor
^ permalink raw reply [flat|nested] 72+ messages in thread
end of thread, other threads:[~2002-11-11 17:59 UTC | newest]
Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-10-31 20:22 [lkcd-devel] Re: What's left over Andreas Herrmann
2002-10-31 20:40 ` Linus Torvalds
2002-10-31 20:54 ` Patrick Finnegan
2002-10-31 21:08 ` Benjamin LaHaise
2002-10-31 22:04 ` Bernhard Kaindl
2002-11-01 0:33 ` Werner Almesberger
-- strict thread matches above, loose matches on Subject: below --
2002-11-02 10:36 Brad Hards
2002-11-02 19:28 ` [lkcd-devel] " Matt D. Robinson
2002-11-01 19:18 Linus Torvalds
2002-11-01 20:22 ` [lkcd-devel] " Matt D. Robinson
2002-11-02 13:02 ` Kai Henningsen
2002-11-01 6:36 Linus Torvalds
2002-11-01 7:00 ` [lkcd-devel] " Castor Fu
2002-10-31 22:47 Richard J Moore
2002-10-31 23:39 ` Werner Almesberger
2002-11-05 12:45 ` Suparna Bhattacharya
2002-10-31 21:33 Rusty Russell
2002-11-01 1:19 ` [lkcd-devel] " Matt D. Robinson
2002-11-01 2:59 ` Rusty Russell
2002-10-31 20:59 Dave Anderson
2002-11-01 1:25 ` [lkcd-devel] " Matt D. Robinson
2002-10-31 18:17 Deepak Kumar Gupta, Noida
2002-10-31 17:25 Linus Torvalds
2002-10-31 21:02 ` Jeff Garzik
2002-10-31 22:37 ` Werner Almesberger
2002-11-05 11:42 ` [lkcd-devel] " Suparna Bhattacharya
2002-11-05 18:00 ` Werner Almesberger
2002-11-05 18:36 ` Alan Cox
2002-11-05 19:19 ` Werner Almesberger
2002-11-05 20:10 ` Alan Cox
2002-11-05 23:25 ` Werner Almesberger
2002-11-06 0:21 ` Andy Pfiffer
2002-11-06 1:10 ` Werner Almesberger
2002-11-06 1:37 ` Alexander Viro
2002-11-06 2:05 ` Werner Almesberger
2002-11-07 6:04 ` Eric W. Biederman
2002-11-07 12:17 ` Werner Almesberger
2002-11-06 4:07 ` Eric W. Biederman
2002-11-06 4:47 ` Eric W. Biederman
2002-11-06 19:24 ` Rob Landley
2002-11-10 18:35 ` Pavel Machek
2002-11-06 2:48 ` Eric W. Biederman
2002-11-06 4:29 ` Eric W. Biederman
2002-11-06 6:25 ` Linus Torvalds
2002-11-06 6:38 ` Suparna Bhattacharya
2002-11-06 7:48 ` Eric W. Biederman
2002-11-06 9:11 ` Suparna Bhattacharya
2002-11-06 22:05 ` Michal Jaegermann
2002-11-06 16:13 ` Eric W. Biederman
2002-11-07 8:50 ` Eric W. Biederman
2002-11-07 15:44 ` Linus Torvalds
2002-11-09 23:05 ` Eric W. Biederman
2002-11-09 23:33 ` Linus Torvalds
2002-11-10 1:37 ` Eric W. Biederman
2002-11-10 2:12 ` Alan Cox
2002-11-10 2:16 ` Eric W. Biederman
2002-11-10 3:03 ` Werner Almesberger
2002-11-10 3:23 ` Eric W. Biederman
2002-11-10 14:30 ` Alan Cox
2002-11-10 16:56 ` Eric W. Biederman
2002-11-10 3:17 ` Linus Torvalds
2002-11-10 4:26 ` Eric W. Biederman
2002-11-11 18:03 ` Eric W. Biederman
2002-11-09 23:39 ` Randy.Dunlap
2002-11-10 2:58 ` Eric W. Biederman
2002-11-10 14:35 ` Alan Cox
2002-11-10 18:13 ` Eric W. Biederman
2002-11-10 1:31 ` Werner Almesberger
2002-11-10 3:10 ` Eric W. Biederman
2002-11-10 3:30 ` Werner Almesberger
2002-11-10 3:49 ` Eric W. Biederman
2002-11-10 3:49 ` Linus Torvalds
2002-11-10 2:08 ` Alan Cox
2002-11-10 2:18 ` Eric W. Biederman
2002-11-10 14:31 ` Alan Cox
2002-11-07 15:48 ` Linus Torvalds
2002-11-08 18:01 ` Alan Cox
2002-11-09 21:21 ` Pavel Machek
2002-11-11 16:27 ` Eric W. Biederman
2002-10-31 15:46 Linus Torvalds
2002-10-31 19:33 ` [lkcd-devel] " Castor Fu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).