* What's left over.
@ 2002-10-31  2:07 Rusty Russell
  2002-10-31  2:31 ` Linus Torvalds
  2002-11-05 17:29 ` kexec (was: Re: What's left over.) Werner Almesberger
  0 siblings, 2 replies; 333+ messages in thread
From: Rusty Russell @ 2002-10-31  2:07 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

Hi Linus,

	Here is the list of features which are being actively
pushed, not NAK'ed, and are not in 2.5.45.  There are 13 of them, as
appropriate for Halloween.

	Most were submitted repeatedly *well* before the freeze.  It'd
be nice for you to give feedback, and decide which ones (if any) are
still up for review.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

From: http://www.kernel.org/pub/linux/kernel/people/rusty/2.6-not-in-yet/

Rusty's Remarkably Unreliable List of Pending 2.6 Features
[aka. Rusty's Snowball List]

A: Author
M: lkml posting describing patch
D: Download URL
S: Size of patch, number of files altered (source/config), number of new files.
X: Impact summary (only parts of patch which alter existing source files, not config/make files)
T: Diffstat of whole patch
N: Random notes

In rough order of invasiveness (number of altered source files):

In-kernel Module Loader and Unified parameter support
A: Rusty Russell
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/
S: 841 kbytes, 302/36 files altered, 22 new
T: Diffstat
X: Summary patch (598k)
N: Requires new modutils

Fbdev Rewrite
A: James Simmons
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.3/1267.html
D: http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz
S: 4852 kbytes, 168/29 files altered, 124 new
T: Diffstat
X: Summary patch (182k)

Linux Trace Toolkit (LTT)
A: Karim Yaghmour
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0204.1/0832.html
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103491640202541&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103423004321305&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103247532007850&w=2
D: http://opersys.com/ftp/pub/LTT/ExtraPatches/patch-ltt-linux-2.5.44-vanilla-021026-2.2.bz2
S: 257 kbytes, 67/4 files altered, 9 new
T: Diffstat
X: Summary patch (90k)

statfs64
A: Peter Chubb
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103490436228016&w=2
D: http://marc.theaimsgroup.com/?l=linux-kernel&m=103490436228016&w=2
S: 42 kbytes, 53/0 files altered, 1 new
T: Diffstat
X: Summary patch (32k)

ext2/ext3 ACLs and Extended Attributes
A: Ted Ts'o
M: http://lists.insecure.org/lists/linux-kernel/2002/Oct/6787.html
B: bk://extfs.bkbits.net/extfs-2.5-update
D: http://thunk.org/tytso/linux/extfs-2.5/
S: 497 kbytes, 96/34 files altered, 34 new
T: Diffstat
X: Summary patch (167k)

ucLinux Patch (MMU-less support)
A: Greg Ungerer
M: http://lwn.net/Articles/11016/
D: http://www.uclinux.org/pub/uClinux/uClinux-2.5.x/linux-2.5.44uc3.patch.gz
S: 2218 kbytes, 25/34 files altered, 429 new
T: Diffstat
X: Summary patch (40k)

Crash Dumping (LKCD)
A: Matt Robinson, LKCD team
M: http://lists.insecure.org/lists/linux-kernel/2002/Oct/8552.html
D: http://lkcd.sourceforge.net/download/latest/
S: 18479 kbytes, 18/10 files altered, 10 new
T: Diffstat
X: Summary patch (18k)

POSIX Timer API
A: George Anzinger
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103553654329827&w=2
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-posix-2.5.44-1.0.patch
S: 66 kbytes, 18/2 files altered, 4 new
T: Diffstat
X: Summary patch (21k)

Hotplug CPU Removal Support
A: Rusty Russell
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Hotcpu/hotcpu-cpudown.patch.gz
S: 32 kbytes, 16/0 files altered, 0 new
T: Diffstat
X: Summary patch (29k)

Hires Timers
A: George Anzinger
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103557676007653&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103557677207693&w=2
M: http://marc.theaimsgroup.com/?l=linux-kernel&m=103558349714128&w=2
D: http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-core-2.5.44-1.0.patch http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-i386-2.5.44-1.0.patch http://unc.dl.sourceforge.net/sourceforge/high-res-timers/hrtimers-hrposix-2.5.44-1.1.patch
S: 132 kbytes, 15/4 files altered, 10 new
T: Diffstat
X: Summary patch (44k)
N: Requires POSIX Timer API patch

EVMS
A: EVMS Team
M: http://www.uwsg.iu.edu/hypermail/linux/kernel/0208.0/0109.html
D: http://evms.sourceforge.net/patches/2.5.44/
S: 1101 kbytes, 7/10 files altered, 44 new
T: Diffstat
X: Summary patch (4k)

initramfs
A: Al Viro
M: http://www.cs.helsinki.fi/linux/linux-kernel/2001-30/0110.html
D: ftp://ftp.math.psu.edu/pub/viro/N0-initramfs-C21
S: 16 kbytes, 5/1 files altered, 2 new
T: Diffstat
X: Summary patch (5k)

Kernel Probes
A: Vamsi Krishna S
M: lists.insecure.org/linux-kernel/2002/Aug/1299.html
D: http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Misc/kprobes.patch.gz
S: 18 kbytes, 4/2 files altered, 4 new
T: Diffstat
X: Summary patch (5k)

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:07 What's left over Rusty Russell
@ 2002-10-31  2:31 ` Linus Torvalds
  2002-10-31  2:43   ` Alexander Viro
                     ` (17 more replies)
  2002-11-05 17:29 ` kexec (was: Re: What's left over.) Werner Almesberger
  1 sibling, 18 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31  2:31 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel


On Thu, 31 Oct 2002, Rusty Russell wrote:
> 
> 	Here is the list of features which are being actively
> pushed, not NAK'ed, and are not in 2.5.45.  There are 13 of them, as
> appropriate for Halloween.

I'm unlikely to be able to merge everything by tomorrow, so I will 
consider tomorrow a submission deadline to me, rather than a merge 
deadline. That said, I merged everything I'm sure I want to merge today, 
and the rest I simply haven't had time to look at very much.

> In-kernel Module Loader and Unified parameter support

This apparently breaks things like DRI, which I'm fairly unhappy about,
since I think 3D is important.

> Fbdev Rewrite

This one is just huge, and I have little personal judgement on it.

> Linux Trace Toolkit (LTT)

I don't know what this buys us.

> statfs64

I haven't even seen it.

> ext2/ext3 ACLs and Extended Attributes

I don't know why people still want ACL's. There were noises about them for 
samba, but I've not heard anything since. Are vendors using this?

> ucLinux Patch (MMU-less support)

I've seen this, it looks pretty ok.

> Crash Dumping (LKCD)

This is definitely a vendor-driven thing. I don't believe it has any 
relevance unless vendors actively support it.

> POSIX Timer API

I think I'll do at least the API, but there were some questions about the 
config options here, I think.

> Hotplug CPU Removal Support

No objections, but very little visibility into it either.

> Hires Timers

This one is likely another "vendor push" thing.

> EVMS

Not for the feature freeze, there are some noises that imply that SuSE may 
push it in their kernels. 

> initramfs

I want this.

> Kernel Probes

Probably.

		Linus



* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
@ 2002-10-31  2:43   ` Alexander Viro
  2002-10-31 16:36     ` Oliver Xymoron
  2002-10-31 22:57     ` Pavel Machek
  2002-10-31  3:00   ` Rusty Russell
                     ` (16 subsequent siblings)
  17 siblings, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-10-31  2:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel



On Wed, 30 Oct 2002, Linus Torvalds wrote:

> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?

Because People Are Stupid(tm).  Because it's cheaper to put "ACL support: yes"
in the feature list under "Security" than to make sure that userland can cope
with anything more complex than  "Me Og.  Og see directory.  Directory Og's.
Nobody change it".  C.f. snake oil, P.T.Barnum and esp. LSM users



* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
  2002-10-31  2:43   ` Alexander Viro
@ 2002-10-31  3:00   ` Rusty Russell
  2002-10-31  3:19     ` tridge
                       ` (3 more replies)
  2002-10-31  3:06   ` Rik van Riel
                     ` (15 subsequent siblings)
  17 siblings, 4 replies; 333+ messages in thread
From: Rusty Russell @ 2002-10-31  3:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Geert Uytterhoeven, Russell King, Peter Chubb,
	tridge, tytso

In message <Pine.LNX.4.44.0210301823120.1396-100000@home.transmeta.com> you write:
> 
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > 
> > 	Here is the list of features which are being actively
> > pushed, not NAK'ed, and are not in 2.5.45.  There are 13 of them, as
> > appropriate for Halloween.
> 
> I'm unlikely to be able to merge everything by tomorrow, so I will 
> consider tomorrow a submission deadline to me, rather than a merge 
> deadline. That said, I merged everything I'm sure I want to merge today, 
> and the rest I simply haven't had time to look at very much.
> 
> > In-kernel Module Loader and Unified parameter support
> 
> This apparently breaks things like DRI, which I'm fairly unhappy about,
> since I think 3D is important.

Yes, the patch stubs out inter_module_*, in favor of get_symbol() &
put_symbol().

This breaks the three users: one in drivers/mtd/ and two in
drivers/char/drm/.  I have a patch which fixes them (untested), or I
can simply put the inter_module_* code back in.

> > Fbdev Rewrite
> 
> This one is just huge, and I have little personal judgement on it.

It's been around for a while.  Geert, Russell?

> > Linux Trace Toolkit (LTT)
> 
> I don't know what this buys us.

Haven't looked at it.

> > statfs64
> 
> I haven't even seen it.

It's fairly old, but Peter Chubb said there was some vendor interest
for v. large devices.  Peter?

> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?

SAMBA needs them, which is why serious Samba boxes use XFS.  Tridge,
Ted?

> > Hotplug CPU Removal Support
> 
> No objections, but very little visibility into it either.

The controls are in driverfs etc, and that's always been in flux. 8(

The rest is v. small, basically extending ksoftirqd, workqueues and
migration threads to disable them.  Then it's all arch-specific.

> > Hires Timers
> 
> This one is likely another "vendor push" thing.
> 
> > EVMS
> 
> Not for the feature freeze, there are some noises that imply that SuSE may 
> push it in their kernels. 

They have, IIRC.  Interestingly, it was less invasive (existing source
touched) than the LVM2/DM patch you merged.

> > initramfs
> 
> I want this.

Good.  The big payoff is moving stuff out of the kernel, which can't
really be done in a stable series.

> > Kernel Probes
> 
> Probably.

Sent.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
  2002-10-31  2:43   ` Alexander Viro
  2002-10-31  3:00   ` Rusty Russell
@ 2002-10-31  3:06   ` Rik van Riel
  2002-10-31  3:19     ` Stephen Frost
                       ` (2 more replies)
  2002-10-31  3:14   ` Karim Yaghmour
                     ` (14 subsequent siblings)
  17 siblings, 3 replies; 333+ messages in thread
From: Rik van Riel @ 2002-10-31  3:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Wed, 30 Oct 2002, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Rusty Russell wrote:

> > ext2/ext3 ACLs and Extended Attributes
>
> I don't know why people still want ACL's. There were noises about them for
> samba, but I've not heard anything since. Are vendors using this?

Yes, people use it.  Not quite sure why though, I guess ACLs
buy some flexibility over the user/group/other model but if
the "unlimited groups" patch goes in (is in?) I'm happy ;)

Personally I do think either the unlimited groups patch or
ACLs are needed in order to sanely run a large anoncvs setup.
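
To make the comparison with the plain user/group/other model concrete,
here is a rough user-space sketch of a POSIX.1e-style access check. It is
only an illustration under stated assumptions: the type and function
names (acl_entry, file_acl, acl_permits) are hypothetical and are not
taken from the ext2/ext3 patch, and a missing mask entry is treated as
"rwx". The point is that named users and named groups get their own
entries, limited by the mask, before the "other" bits are consulted.

```c
#include <assert.h>
#include <stddef.h>

/* Permission bits, as in the classic rwx triplets. */
enum { PERM_R = 4, PERM_W = 2, PERM_X = 1 };

/* Entry kinds of a POSIX.1e-style ACL (names are illustrative only). */
enum acl_tag { TAG_USER_OBJ, TAG_USER, TAG_GROUP_OBJ, TAG_GROUP, TAG_MASK, TAG_OTHER };

struct acl_entry {
    enum acl_tag tag;
    unsigned id;     /* uid or gid for TAG_USER / TAG_GROUP, ignored otherwise */
    unsigned perms;  /* combination of PERM_R, PERM_W, PERM_X */
};

struct file_acl {
    unsigned owner_uid;
    unsigned owner_gid;
    const struct acl_entry *entries;
    size_t n;
};

/* Permissions of the first entry with the given tag, or fallback if absent. */
static unsigned first_perms(const struct file_acl *acl, enum acl_tag tag, unsigned fallback)
{
    for (size_t i = 0; i < acl->n; i++)
        if (acl->entries[i].tag == tag)
            return acl->entries[i].perms;
    return fallback;
}

static int in_groups(unsigned gid, const unsigned *groups, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++)
        if (groups[i] == gid)
            return 1;
    return 0;
}

/* Returns 1 if every bit in `want` is granted to the caller, else 0. */
int acl_permits(const struct file_acl *acl, unsigned uid,
                const unsigned *groups, size_t ngroups, unsigned want)
{
    unsigned mask;
    int group_matched = 0;

    assert(acl != NULL);
    mask = first_perms(acl, TAG_MASK, PERM_R | PERM_W | PERM_X);

    /* 1. The file owner uses the owner entry; the mask does not apply. */
    if (uid == acl->owner_uid)
        return (first_perms(acl, TAG_USER_OBJ, 0) & want) == want;

    /* 2. A named-user entry, limited by the mask. */
    for (size_t i = 0; i < acl->n; i++) {
        const struct acl_entry *e = &acl->entries[i];
        if (e->tag == TAG_USER && e->id == uid)
            return (e->perms & mask & want) == want;
    }

    /* 3. Owning group and named groups, each limited by the mask.
     *    If any group matched but none grants the request, access is denied. */
    for (size_t i = 0; i < acl->n; i++) {
        const struct acl_entry *e = &acl->entries[i];
        int match = (e->tag == TAG_GROUP_OBJ && in_groups(acl->owner_gid, groups, ngroups)) ||
                    (e->tag == TAG_GROUP && in_groups(e->id, groups, ngroups));
        if (!match)
            continue;
        group_matched = 1;
        if ((e->perms & mask & want) == want)
            return 1;
    }
    if (group_matched)
        return 0;

    /* 4. Everyone else. */
    return (first_perms(acl, TAG_OTHER, 0) & want) == want;
}
```

With only the owner, owning-group and other entries present, this
collapses to the traditional mode-bit check; the extra entries are what
the unlimited groups approach would otherwise have to emulate with one
group per combination of users.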

regards,

Rik
-- 
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://distro.conectiva.com/
Current spamtrap:  october@surriel.com



* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2002-10-31  3:06   ` Rik van Riel
@ 2002-10-31  3:14   ` Karim Yaghmour
  2002-10-31 16:00     ` LTT for inclusion into 2.5 bob
  2002-10-31  3:21   ` What's left over Stephen Lord
                     ` (13 subsequent siblings)
  17 siblings, 1 reply; 333+ messages in thread
From: Karim Yaghmour @ 2002-10-31  3:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, LTT-Dev


Linus Torvalds wrote:
> > Linux Trace Toolkit (LTT)
> 
> I don't know what this buys us.

How about being able to:
- Debug synchronization problems among processes (there is no other
tool to do this, not gdb, not strace, not printf, ...)
- Measure exact time spent waiting for the kernel and which other
processes a process had to wait for.
- Measure exact time it takes for an interrupt's effects to propagate
throughout the entire system.
- Understand exactly how the system behaves in response to input (what
is the exact sequence of processes that run when I press a key).
- Identify sporadic problems in very saturated systems. (thousands
of servers and one of them is doing weird stuff).
- etc.

Providing system tracing is a necessity for any sort of complex
application development and system monitoring. Some people simply
can't use Linux without this sort of tool and I am at pains to
explain to them why they actually have to patch their kernel to
be able to debug their inter-process synchronization problems.

Users don't have to patch their kernel to use gdb and I don't
see why they should need to patch their kernel to understand how
their various processes interact with the kernel and vice-versa.

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================


* Re: What's left over.
  2002-10-31  3:06   ` Rik van Riel
@ 2002-10-31  3:19     ` Stephen Frost
  2002-10-31 21:09       ` john stultz
  2002-10-31  6:22     ` Chris Wedgwood
  2002-10-31  9:44     ` Lech Szychowski
  2 siblings, 1 reply; 333+ messages in thread
From: Stephen Frost @ 2002-10-31  3:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

* Rik van Riel (riel@conectiva.com.br) wrote:
> On Wed, 30 Oct 2002, Linus Torvalds wrote:
> > On Thu, 31 Oct 2002, Rusty Russell wrote:
> 
> > > ext2/ext3 ACLs and Extended Attributes
> >
> > I don't know why people still want ACL's. There were noises about them for
> > samba, but I've not heard anything since. Are vendors using this?
> 
> Yes, people use it.  Not quite sure why though, I guess ACLs
> buy some flexibility over the user/group/other model but if
> the "unlimited groups" patch goes in (is in?) I'm happy ;)
> 
> Personally I do think either the unlimited groups patch or
> ACLs are needed in order to sanely run a large anoncvs setup.

The feeling I got on this was the ability to let users define their own
groups.  Perhaps I'm not following it closely enough but that was the
impression I got in terms of "what this does for us"; I'm probably
missing other things.  Just that ability would be nice in my view
though.  Isn't it something that's been in AFS for a long time too?
I've got a few friends who've played with AFS before (at CMU and the
like) and really enjoyed the ACLs there.

	Just my thoughts,

		Stephen


* Re: What's left over.
  2002-10-31  3:00   ` Rusty Russell
@ 2002-10-31  3:19     ` tridge
  2002-10-31  6:21       ` Chris Wedgwood
  2002-10-31  3:22     ` Christoph Hellwig
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 333+ messages in thread
From: tridge @ 2002-10-31  3:19 UTC (permalink / raw)
  To: torvalds; +Cc: rusty, linux-kernel, geert, rmk, peter, tytso

> > > ext2/ext3 ACLs and Extended Attributes
> > 
> > I don't know why people still want ACL's. There were noises about them for 
> > samba, but I've not heard anything since. Are vendors using this?
> 
> SAMBA needs them, which is why serious Samba boxes use XFS.  Tridge,
> Ted?

oh yes, all the Linux based storage appliances use ACLs. Posix ACLs
aren't ideal for Samba, but they are *much* better than having no ACLs
at all. The Posix ACL code has been in Samba for a long time (getting
close to 3 years now?). 

Eventually I'd like to see a combination of LSM with a new ACL system
give the ability to support full NT ACLs on Linux (which is also
needed for full nfsv4 support), but that is way too much to do for
the 2.6 kernel.

For the majority of windows users the mapping Samba does internally
between Posix ACLs and NT ACLs is sufficient for now. 

I think that it would be a very good thing for Posix ACLs to be
included in the 2.6 kernel, especially in ext3.

Extended attributes are also important as they give a place to store
all the extra DOS info that has no other logical place in a posix
filesystem. For example, we can put the 'read only', 'archive', 'hidden'
and 'system' attributes there. If we don't have extended attributes
then we need to use a nasty kludge where these map to various unix
permission bits, but the mapping is terrible and doesn't give the
correct semantics (especially for things like read only on
directories). 
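
As a rough illustration of the extended attribute approach, the sketch
below packs the four DOS flags into a small text value and parses it
back. Everything named here is an assumption made for illustration: the
helper names are hypothetical, the bit values are the usual DOS/Windows
attribute bits, and the value is meant to be stored with setxattr(2) and
read with getxattr(2) under some attribute name such as "user.dosattr"
(a made-up name, not something the patch or Samba defines here).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* The usual DOS attribute bits. */
enum {
    DOS_READONLY = 0x01,
    DOS_HIDDEN   = 0x02,
    DOS_SYSTEM   = 0x04,
    DOS_ARCHIVE  = 0x20,
};

#define DOS_ATTR_MASK (DOS_READONLY | DOS_HIDDEN | DOS_SYSTEM | DOS_ARCHIVE)

/* Encode the flags as short text such as "0x21".
 * Returns the number of characters written, or -1 if buf is too small. */
int dos_attrs_encode(unsigned attrs, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "0x%02x", attrs & DOS_ATTR_MASK);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Decode a value produced by dos_attrs_encode().
 * Returns 0 and stores the flags, or -1 if the value is malformed
 * or contains bits outside the four known flags. */
int dos_attrs_decode(const char *value, unsigned *attrs)
{
    char *end;
    unsigned long v;

    assert(value != NULL && attrs != NULL);
    v = strtoul(value, &end, 16);   /* base 16 accepts an optional "0x" prefix */
    if (end == value || *end != '\0' || (v & ~(unsigned long)DOS_ATTR_MASK))
        return -1;
    *attrs = (unsigned)v;
    return 0;
}
```

Unlike the permission-bit kludge, nothing here touches the unix mode, so
a 'read only' directory can be recorded without changing who may create
files in it.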

My main concern with using extended attributes in this way is
performance. My experience with XFS is that as soon as you start
adding extended attributes then the performance drops a lot, but I
haven't tested performance with the ext3 extended attributes so maybe
they don't have the same problem.

Cheers, Tridge

--
http://samba.org/~tridge/


* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (3 preceding siblings ...)
  2002-10-31  3:14   ` Karim Yaghmour
@ 2002-10-31  3:21   ` Stephen Lord
  2002-10-31  3:59   ` Andreas Dilger
                     ` (12 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Stephen Lord @ 2002-10-31  3:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, Linux Kernel Mailing List

On Wed, 2002-10-30 at 20:31, Linus Torvalds wrote:
> 
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > 
> > 	Here is the list of features which are being actively
> > pushed, not NAK'ed, and are not in 2.5.45.  There are 13 of them, as
> > appropriate for Halloween.
> 
> I'm unlikely to be able to merge everything by tomorrow, so I will 
> consider tomorrow a submission deadline to me, rather than a merge 
> deadline. That said, I merged everything I'm sure I want to merge today, 
> and the rest I simply haven't had time to look at very much.
> 

> 
> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?
> 

There are a fair number of NAS vendors who do linux boxes with Samba
and XFS because of the ACL support, Quantum being the one Tridge now
works for by the way. The reason they want it is so they can support
the features NT folks are used to having in their file servers.
Now, we could just let the NT folks use NT servers instead....

Even getting XFS ACLs running in 2.5 requires part of this patch set.

Steve




* Re: What's left over.
  2002-10-31  3:00   ` Rusty Russell
  2002-10-31  3:19     ` tridge
@ 2002-10-31  3:22     ` Christoph Hellwig
  2002-10-31  3:31       ` tridge
  2002-10-31 10:15     ` Joe Thornber
  2002-10-31 11:03     ` Geert Uytterhoeven
  3 siblings, 1 reply; 333+ messages in thread
From: Christoph Hellwig @ 2002-10-31  3:22 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, Geert Uytterhoeven, Russell King, Peter Chubb,
	tridge, tytso

On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > I don't know why people still want ACL's. There were noises about them for 
> > samba, but I've not heard anything since. Are vendors using this?
> 
> SAMBA needs them, which is why serious Samba boxes use XFS.  Tridge,
> Ted?

XFS doesn't have ACLs either in plain 2.5.

> > Not for the feature freeze, there are some noises that imply that SuSE may 
> > push it in their kernels. 
> 
> They have, IIRC.  Interestingly, it was less invasive (existing source
> touched) than the LVM2/DM patch you merged.

But that's only because dm added stuff to the generic code where we
told it. It's a lot more code than dm and it adds new discovery
code at the same time we start moving stuff _out_ of the kernel
to initramfs.

If "SuSE has merged it" counts, any IBM patch posted here should get
in; coming from Big Blue seems to be a basic merge criterion in
Nuernberg :)



* Re: What's left over.
  2002-10-31  3:22     ` Christoph Hellwig
@ 2002-10-31  3:31       ` tridge
  0 siblings, 0 replies; 333+ messages in thread
From: tridge @ 2002-10-31  3:31 UTC (permalink / raw)
  To: hch; +Cc: rusty, linux-kernel, geert, rmk, peter, tytso

> XFS doesn't have ACLs either in plain 2.5.

The existing NAS boxes that use Linux and XFS tend to base their
kernels on the 2.4-xfs tree from cvs on sgi.com. It works well and the
SGI guys have been very good about fixing problems when they crop up.

I think that the biggest beneficiary of adding extended attributes and
ACLs into ext3 for 2.6 would be more casual users (home, small office
etc) as they will then be able to use ACLs in Samba without the pain
of switching to a different kernel.

Cheers, Tridge

--
http://samba.org/~tridge/


* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (4 preceding siblings ...)
  2002-10-31  3:21   ` What's left over Stephen Lord
@ 2002-10-31  3:59   ` Andreas Dilger
  2002-10-31  4:20   ` Patrick Finnegan
                     ` (11 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Andreas Dilger @ 2002-10-31  3:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Oct 30, 2002  18:31 -0800, Linus Torvalds wrote:
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?

I don't really care about ACLs so much one way or the other, but we
DEFINITELY use EAs with Lustre, so at the minimum if we could have
that part of the changes I'd be happy.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (5 preceding siblings ...)
  2002-10-31  3:59   ` Andreas Dilger
@ 2002-10-31  4:20   ` Patrick Finnegan
  2002-10-31  4:25     ` Christoph Hellwig
  2002-10-31  5:13   ` Dax Kelson
                     ` (10 subsequent siblings)
  17 siblings, 1 reply; 333+ messages in thread
From: Patrick Finnegan @ 2002-10-31  4:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell

I'm kind of new here, but I'll present my case in hope that someone
listens to me.



On Wed, 30 Oct 2002, Linus Torvalds wrote:

> On Thu, 31 Oct 2002, Rusty Russell wrote:
>
> > Crash Dumping (LKCD)
>
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.

This is something that we're just starting to use in my department at
Purdue - we work with clustering, and LKCD will let us determine why our
nodes decide to kernel panic, since it's generally not worthwhile to
connect a head to each machine.

I see LKCD as having a big impact by allowing kernels to be debugged after
they have panicked (and thus don't send out a message to syslog).  It can
be especially useful in compute farms, or other scenarios where it's
difficult or cost-prohibitive to connect a console (or console server) to
each individual machine.

> > EVMS
>
> Not for the feature freeze, there are some noises that imply that SuSE may
> push it in their kernels.

I think that the integration between RAID and LVM is a good thing, and
EVMS's 'plug-in module' architecture will help tremendously to bring
interoperation with other systems' volume management subsystems.
Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be
helpful for people trying to migrate their servers over from those OS's to
GNU/Linux.

-- Pat

Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu




* Re: What's left over.
  2002-10-31  4:20   ` Patrick Finnegan
@ 2002-10-31  4:25     ` Christoph Hellwig
  2002-10-31  4:31       ` Patrick Finnegan
  0 siblings, 1 reply; 333+ messages in thread
From: Christoph Hellwig @ 2002-10-31  4:25 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel, Rusty Russell

On Wed, Oct 30, 2002 at 11:20:42PM -0500, Patrick Finnegan wrote:
> Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be

JFS has no lvm, it just sits on any block device.  The support for Windows
dynamic disks actually layers on top of the MD driver.



* Re: What's left over.
  2002-10-31  4:25     ` Christoph Hellwig
@ 2002-10-31  4:31       ` Patrick Finnegan
  0 siblings, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-10-31  4:31 UTC (permalink / raw)
  To: linux-kernel

On Thu, 31 Oct 2002, Christoph Hellwig wrote:

> On Wed, Oct 30, 2002 at 11:20:42PM -0500, Patrick Finnegan wrote:
> > Specifically, the interoperation with IBM's JFS LVM and MS's LVM will be
>
> JFS has no lvm, it just sits on any block device.  The support for Windows
> dynamic disks actually layers on top of the MD driver.

To be more specific, I'm talking about AIX's JFS, not linux's JFS...

--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif





* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (6 preceding siblings ...)
  2002-10-31  4:20   ` Patrick Finnegan
@ 2002-10-31  5:13   ` Dax Kelson
  2002-10-31  6:07   ` [PATCH] kexec for 2.5.45 Eric W. Biederman
                     ` (9 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Dax Kelson @ 2002-10-31  5:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Wed, 2002-10-30 at 19:31, Linus Torvalds wrote:
> 
> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?
> 

I teach Linux classes to corporate IT guys (~300 or so this year) and
many of them are migrating from Solaris or deploying Linux along side
Solaris.

Solaris has had ACLs since 2.5.1 (1996), and EAs since 2.9 (May 2002).

Having ACL in Linux is a VERY COMMON REQUEST that I hear from the
students.

FWIW.

Dax Kelson
Guru Labs



* [PATCH] kexec for 2.5.45
  2002-10-31  2:31 ` Linus Torvalds
                     ` (7 preceding siblings ...)
  2002-10-31  5:13   ` Dax Kelson
@ 2002-10-31  6:07   ` Eric W. Biederman
  2002-10-31  6:25   ` What's left over Matt D. Robinson
                     ` (8 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-10-31  6:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel


Once again, here is my kexec patch, updated to work with 2.5.45.

sys_kexec is a system call that allows linux to act as a bootloader for
another arbitrary kernel.   

What the code does:
- It copies data from user space into buffers in kernel space.
- The buffers in kernel space are rearranged so that later I can use
  a simple memcpy to put the data in the page at its final destination.
- device_shutdown and the reboot notifier are called.
  This ensures the hardware devices are in a quiescent state,
  so I do not have to worry about them messing up the transfer of control.
- The final copy routine is copied to a buffer that won't get stomped.
- The machine is placed into 32bit protected mode with paging disabled.
- The final copy routine copies the buffers to their final destination
  (which is normally very similar to where the kernel is running).
- The final copy routine jumps to the newly loaded kernel image.
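
The staging and final copy steps above can be pictured with a small
user-space sketch. This is not the code from the patch: the struct and
function names are hypothetical, "target" is just a byte array standing
in for the memory the new kernel will occupy, and the device shutdown,
mode switch and final jump are left out entirely. It only shows why
staging each chunk together with its destination lets the last step be
a plain memcpy loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor: one staged chunk and where it must finally land. */
struct staged_segment {
    unsigned char *staging;  /* private copy of the caller's data */
    size_t dest_offset;      /* final offset inside the target memory */
    size_t len;
};

/* Copy a caller-supplied buffer into a freshly allocated staging buffer
 * and remember its destination. Returns 0 on success, -1 on allocation failure. */
int stage_segment(struct staged_segment *seg, const void *user_buf,
                  size_t len, size_t dest_offset)
{
    assert(seg != NULL && user_buf != NULL);
    seg->staging = malloc(len ? len : 1);
    if (!seg->staging)
        return -1;
    memcpy(seg->staging, user_buf, len);
    seg->dest_offset = dest_offset;
    seg->len = len;
    return 0;
}

/* Final copy: move every staged chunk to its destination with plain memcpy.
 * Returns 0 on success, or -1 if a chunk would fall outside the target. */
int final_copy(unsigned char *target, size_t target_len,
               const struct staged_segment *segs, size_t nsegs)
{
    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].dest_offset > target_len ||
            segs[i].len > target_len - segs[i].dest_offset)
            return -1;
        memcpy(target + segs[i].dest_offset, segs[i].staging, segs[i].len);
    }
    return 0;
}

/* Free the staging buffers once they are no longer needed. */
void release_segments(struct staged_segment *segs, size_t nsegs)
{
    for (size_t i = 0; i < nsegs; i++) {
        free(segs[i].staging);
        segs[i].staging = NULL;
    }
}
```

In the real thing the copy loop itself has to live somewhere the copy
will not overwrite, which is why the final copy routine is first moved
to a buffer that won't get stomped.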

At this point the interface is fixed.  Anything additional that needs
to happen, can be done in user space by adding a stub routine that
gets called before the loaded kernel is called.  In particular I can 
directly execute a bzImage which has a 16bit real mode interface.

There is kernel work left to get the device drivers to tell their
devices to shut up. (device_shutdown).  But device_shutdown already
exists, I just have a good test case for it.

Except for the final copy which is very machine specific the rest of
the code is generic and has actually been tested on alpha.  Eventually
I am hoping for ports to other platforms but I am concentrating on x86
so I can do a quality job. 

There has been testing and review on the Linux kernel mailing list,
starting with a review of the syscall interface about six months ago,
and people have been testing to be certain they can use the code.  While
not all of the bugs are worked out in the user space code, the system
call is solid.

Everything is configurable so there should be no footprint increase
for people who do not want this functionality.

Eric


 MAINTAINERS                        |    7 
 arch/i386/Kconfig                  |   17 +
 arch/i386/kernel/Makefile          |    1 
 arch/i386/kernel/entry.S           |    1 
 arch/i386/kernel/machine_kexec.c   |  142 +++++++++
 arch/i386/kernel/relocate_kernel.S |   99 ++++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    1 
 include/linux/kexec.h              |   48 +++
 kernel/Makefile                    |    1 
 kernel/kexec.c                     |  577 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   61 +++
 12 files changed, 980 insertions

diff -uNr linux-2.5.45/MAINTAINERS linux-2.5.45.x86kexec/MAINTAINERS
--- linux-2.5.45/MAINTAINERS	Wed Oct 30 19:58:03 2002
+++ linux-2.5.45.x86kexec/MAINTAINERS	Wed Oct 30 21:05:37 2002
@@ -934,6 +934,13 @@
 W:	http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S:	Maintained
 
+KEXEC
+P:	Eric Biederman
+M:	ebiederm@xmission.com
+M:	ebiederman@lnxi.com
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+
 LANMEDIA WAN CARD DRIVER
 P:	Andrew Stanley-Jones
 M:	asj@lanmedia.com
diff -uNr linux-2.5.45/arch/i386/Kconfig linux-2.5.45.x86kexec/arch/i386/Kconfig
--- linux-2.5.45/arch/i386/Kconfig	Wed Oct 30 19:58:04 2002
+++ linux-2.5.45.x86kexec/arch/i386/Kconfig	Wed Oct 30 21:40:22 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shut down your
+	  current kernel, and to start another kernel.  It is like a reboot
+	  but it is independent of the system firmware.  And like a reboot
+	  you can start any kernel with it, not just Linux.
+	
+	  The name comes from the similarity to the exec system call.
+	
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shut down, so do not be surprised if this code does not
+	  initially work for you.  It may help to enable device hotplugging
+	  support.  As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
 
diff -uNr linux-2.5.45/arch/i386/kernel/Makefile linux-2.5.45.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.45/arch/i386/kernel/Makefile	Sat Oct 19 00:57:56 2002
+++ linux-2.5.45.x86kexec/arch/i386/kernel/Makefile	Wed Oct 30 21:05:43 2002
@@ -25,6 +25,7 @@
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND)	+= suspend.o
 obj-$(CONFIG_X86_NUMAQ)		+= numaq.o
 obj-$(CONFIG_PROFILING)		+= profile.o
diff -uNr linux-2.5.45/arch/i386/kernel/entry.S linux-2.5.45.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.45/arch/i386/kernel/entry.S	Wed Oct 30 19:58:04 2002
+++ linux-2.5.45.x86kexec/arch/i386/kernel/entry.S	Wed Oct 30 21:06:39 2002
@@ -740,6 +740,7 @@
 	.long sys_epoll_create
 	.long sys_epoll_ctl	/* 255 */
 	.long sys_epoll_wait
+	.long sys_kexec
 
 
 	.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.45/arch/i386/kernel/machine_kexec.c linux-2.5.45.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.45/arch/i386/kernel/machine_kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/arch/i386/kernel/machine_kexec.c	Wed Oct 30 21:05:43 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+	unsigned char curidt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curidt)) = limit;
+	(*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+	__asm__ __volatile__ (
+		"lidt %0\n"
+		: : "m" (curidt)
+		);
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+	unsigned char curgdt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curgdt)) = limit;
+	(*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+	__asm__ __volatile__ (
+		"lgdt %0\n"
+		: : "m" (curgdt)
+		);
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+	__asm__ __volatile__ (
+		"\tljmp $"STR(__KERNEL_CS)",$1f\n"
+		"\t1:\n"
+		"\tmovl $"STR(__KERNEL_DS)",%eax\n"
+		"\tmovl %eax,%ds\n"
+		"\tmovl %eax,%es\n"
+		"\tmovl %eax,%fs\n"
+		"\tmovl %eax,%gs\n"
+		"\tmovl %eax,%ss\n"
+		);
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+	/* This code is x86 specific...
+	 * general purpose code must be more careful
+	 * of caches and tlbs...
+	 */
+	pgd_t *pgd;
+	pmd_t *pmd;
+	struct mm_struct *mm = current->mm;
+	spin_lock(&mm->page_table_lock);
+	
+	pgd = pgd_offset(mm, address);
+	pmd = pmd_alloc(mm, pgd, address);
+
+	if (pmd) {
+		pte_t *pte = pte_alloc_map(mm, pmd, address);
+		if (pte) {
+			set_pte(pte, 
+				mk_pte(virt_to_page(phys_to_virt(address)), 
+					PAGE_SHARED));
+			__flush_tlb_one(address);
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+	unsigned long indirection_page, unsigned long reboot_code_buffer,
+	unsigned long start_address);
+
+extern const unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+extern const unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+	unsigned long *indirection_page;
+	void *reboot_code_buffer;
+	relocate_new_kernel_t rnk;
+
+	/* Interrupts aren't acceptable while we reboot */
+	local_irq_disable();
+	reboot_code_buffer = image->reboot_code_buffer;
+	indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+	identity_map_page(virt_to_phys(reboot_code_buffer));
+
+	/* copy it out */
+	memcpy(reboot_code_buffer, relocate_new_kernel, 
+		relocate_new_kernel_size);
+
+	/* The segment registers are funny things: they are
+	 * automatically loaded from a table in memory whenever you
+	 * set them to a specific selector, but this table is never
+	 * accessed again until you set the segment to a different selector.
+	 *
+	 * The more common model is a cache, where the behind-the-scenes
+	 * work is done, but the result is also dropped at arbitrary
+	 * times.
+	 *
+	 * I take advantage of this here by force loading the
+	 * segments, before I zap the gdt with an invalid value.
+	 */
+	load_segments();
+	/* The gdt & idt are now invalid.
+	 * If you want to load them you must set up your own idt & gdt.
+	 */
+	set_gdt(phys_to_virt(0),0);
+	set_idt(phys_to_virt(0),0);
+
+	/* now call it */
+	rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+	(*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), 
+		image->start);
+}
+
diff -uNr linux-2.5.45/arch/i386/kernel/relocate_kernel.S linux-2.5.45.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.45/arch/i386/kernel/relocate_kernel.S	Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/arch/i386/kernel/relocate_kernel.S	Wed Oct 30 21:05:43 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+	/* Must be relocatable PIC code callable as a C function, that once
+	 * it starts can not use the previous process's stack.
+	 *
+	 */
+	.globl relocate_new_kernel
+relocate_new_kernel:
+	/* read the arguments and say goodbye to the stack */
+	movl  4(%esp), %ebx /* indirection_page */
+	movl  8(%esp), %ebp /* reboot_code_buffer */
+	movl  12(%esp), %edx /* start address */
+
+	/* zero out flags, and disable interrupts */
+	pushl $0
+	popfl
+
+	/* set a new stack at the bottom of our page... */
+	lea   4096(%ebp), %esp
+
+	/* store the parameters back on the stack */
+	pushl   %edx /* store the start address */
+
+	/* Set cr0 to a known state:
+	 * 31 0 == Paging disabled
+	 * 18 0 == Alignment check disabled
+	 * 16 0 == Write protect disabled
+	 * 3  0 == No task switch
+	 * 2  0 == Don't do FP software emulation.
+	 * 0  1 == Protected mode enabled
+	 */
+	movl	%cr0, %eax
+	andl	$~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+	orl	$(1<<0), %eax
+	movl	%eax, %cr0
+	jmp 1f
+1:	
+
+	/* Flush the TLB (needed?) */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* Do the copies */
+	cld
+0:	/* top, read another word from the indirection page */
+	movl    %ebx, %ecx
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+	testl	$0x1,   %ecx  /* is it a destination page */
+	jz	1f
+	movl	%ecx,	%edi
+	andl	$0xfffff000, %edi
+	jmp     0b
+1:
+	testl	$0x2,	%ecx  /* is it an indirection page */
+	jz	1f
+	movl	%ecx,	%ebx
+	andl	$0xfffff000, %ebx
+	jmp     0b
+1:
+	testl   $0x4,   %ecx /* is it the done indicator */
+	jz      1f
+	jmp     2f
+1:
+	testl   $0x8,   %ecx /* is it the source indicator */
+	jz      0b	     /* Ignore it otherwise */
+	movl    %ecx,   %esi /* For every source page do a copy */
+	andl    $0xfffff000, %esi
+
+	movl    $1024, %ecx
+	rep ; movsl
+	jmp     0b
+
+2:
+
+	/* To be certain of avoiding problems with self modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+	
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+	
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+	ret
+relocate_new_kernel_end:
+
+	.globl relocate_new_kernel_size
+relocate_new_kernel_size:	
+	.long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.45/include/asm-i386/kexec.h linux-2.5.45.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.45/include/asm-i386/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/include/asm-i386/kexec.h	Wed Oct 30 21:05:43 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return.
+ * I.e. the maximum page that is mapped directly into kernel memory,
+ * and for which kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET) 
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE	4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.45/include/asm-i386/unistd.h linux-2.5.45.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.45/include/asm-i386/unistd.h	Wed Oct 30 19:58:25 2002
+++ linux-2.5.45.x86kexec/include/asm-i386/unistd.h	Wed Oct 30 21:07:27 2002
@@ -261,6 +261,7 @@
 #define __NR_sys_epoll_create	254
 #define __NR_sys_epoll_ctl	255
 #define __NR_sys_epoll_wait	256
+#define __NR_sys_kexec		257
   
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.45/include/linux/kexec.h linux-2.5.45.x86kexec/include/linux/kexec.h
--- linux-2.5.45/include/linux/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/include/linux/kexec.h	Wed Oct 30 21:05:43 2002
@@ -0,0 +1,48 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#ifdef CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/* 
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+/* kexec helper functions */
+void kimage_init(struct kimage *image);
+void kimage_free(struct kimage *image);
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern int do_kexec(unsigned long entry, long nr_segments, 
+	struct kexec_segment *segments, struct kimage *image);
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.45/kernel/Makefile linux-2.5.45.x86kexec/kernel/Makefile
--- linux-2.5.45/kernel/Makefile	Fri Oct 18 11:59:29 2002
+++ linux-2.5.45.x86kexec/kernel/Makefile	Wed Oct 30 21:05:43 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.45/kernel/kexec.c linux-2.5.45.x86kexec/kernel/kexec.c
--- linux-2.5.45/kernel/kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.45.x86kexec/kernel/kexec.c	Wed Oct 30 21:31:20 2002
@@ -0,0 +1,577 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access.  Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory.  And this page must be identity
+ * mapped.  Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ * 
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set 
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ * 
+ */
+
+void kimage_init(struct kimage *image)
+{
+	memset(image, 0, sizeof(*image));
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (image->offset != 0) {
+		image->entry++;
+	}
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		ind_page = (void *)__get_free_page(GFP_KERNEL);
+		if (!ind_page) {
+			return -ENOMEM;
+		}
+		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = 
+			ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	image->offset = 0;
+	return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+	int result;
+	
+	/* Assume the page is bad unless we pass the checks */
+	result = -EADDRNOTAVAIL;
+
+	if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+		goto out;
+	}
+
+	/* NOTE: The caller is responsible for making certain we
+	 * don't attempt to load the new image into invalid or
+	 * reserved areas of RAM.
+	 */
+	result =  0;
+out:
+	return result;
+}
+
+static int kimage_set_destination(
+	struct kimage *image, unsigned long destination) 
+{
+	int result;
+	destination &= PAGE_MASK;
+	result = kimage_verify_destination(destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0) {
+		image->destination = destination;
+	}
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+	page &= PAGE_MASK;
+	result = kimage_verify_destination(image->destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0) {
+		image->destination += PAGE_SIZE;
+	}
+	return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+	int result;
+	result = kimage_add_entry(image, IND_DONE);
+	if (result == 0) {
+		/* Point at the terminating element */
+		image->entry--;
+	}
+	return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION) {
+				free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+			}
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE) {
+			free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+		}
+	}
+}
+
+static int kimage_is_destination_page(
+	struct kimage *image, unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination;
+	destination = 0;
+	page &= PAGE_MASK;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return 1;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_unused_area(
+	struct kimage *image, unsigned long size, unsigned long align,
+	unsigned long *area)
+{
+	/* Walk through mem_map and find the first chunk of
+	 * unused memory that is at least size bytes long.
+	 */
+	/* Since the kernel plays with PG_reserved, mem_map is less
+	 * than ideal for this purpose, but it will give us a correct
+	 * conservative estimate of what we need to do. 
+	 */
+	/* For now we take advantage of the fact that all kernel pages
+	 * are marked with PG_reserved to allocate a large
+	 * contiguous area for the reboot code buffer.
+	 */
+	unsigned long addr;
+	unsigned long start, end;
+	unsigned long mask;
+	mask = ((1 << align) -1);
+	start = end = PAGE_SIZE;
+	for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long aligned_start;
+		page = virt_to_page(phys_to_virt(addr));
+		if (PageReserved(page) ||
+			kimage_is_destination_page(image, addr)) {
+			/* The current page is reserved so the start &
+			 * end of the next area must be at least at the
+			 * next page.
+			 */
+			start = end = addr + PAGE_SIZE;
+		}
+		else {
+			/* O.k.  The current page isn't reserved
+			 * so push up the end of the area.
+			 */
+			end = addr;
+		}
+		aligned_start = (start + mask) & ~mask;
+		if (aligned_start > start) {
+			continue;
+		}
+		if (aligned_start > end) {
+			continue;
+		}
+		if (end - aligned_start >= size) {
+			*area = aligned_start;
+			return 0;
+		}
+	}
+	*area = 0;
+	return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+	struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return ptr;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	for_each_kimage_entry(image, ptr, entry) {
+		unsigned long page;
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			/* nop */
+		}
+		else if (entry & IND_DONE) {
+			/* nop */
+		}
+		else {
+			/* SOURCE & INDIRECTION */
+			page = entry & PAGE_MASK;
+			if (page == destination) {
+				return ptr;
+			}
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+	kimage_entry_t *ptr, *cptr, entry;
+	unsigned long buffer, page;
+	unsigned long destination = 0;
+
+	/* Here we implement safeguards to ensure that
+	 * a source page is not copied to its destination
+	 * page before the data on the destination page is
+	 * no longer useful.
+	 *
+	 * To make it work we actually wind up with a 
+	 * stronger condition.  For every page considered
+	 * it is either its own destination page or it is
+	 * not a destination page of any page considered.
+	 *
+	 * Invariants 
+	 * 1. buffer is not a destination of a previous page.
+	 * 2. page is not a destination of a previous page.
+	 * 3. destination is not a previous source page.
+	 *
+	 * Result: Either a source page and a destination page 
+	 * are the same or the page is not a destination page.
+	 *
+	 * These checks could be done when we allocate the pages,
+	 * but doing it as a final pass allows us more freedom
+	 * on how we allocate pages.
+	 * 
+	 * Also while the checks are necessary, in practice nothing
+	 * happens.  The destination kernel wants to sit in the
+	 * same physical addresses as the current kernel so we never
+	 * actually allocate a destination page.
+	 *
+	 * BUGS: This is an O(N^2) algorithm.
+	 */
+
+	
+	buffer = __get_free_page(GFP_KERNEL);
+	if (!buffer) {
+		return -ENOMEM;
+	}
+	buffer = virt_to_phys((void *)buffer);
+	for_each_kimage_entry(image, ptr, entry) {
+		/* Here we check to see if an allocated page conflicts with a destination page */
+		kimage_entry_t *limit;
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_INDIRECTION) {
+			/* Indirection pages must include all of their
+			 * contents in limit checking.
+			 */
+			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+		}
+		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+			continue;
+		}
+		page = entry & PAGE_MASK;
+		limit = ptr;
+
+		/* See if a previous page has the current page as its 
+		 * destination.
+		 * i.e. invariant 2
+		 */
+		cptr = kimage_dst_conflict(image, page, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+			*cptr = page | (centry & ~PAGE_MASK);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = cpage;
+		}
+		if (!(entry & IND_SOURCE)) {
+			continue;
+		}
+
+		/* See if a previous page is our destination page.
+		 * If so claim it now.
+		 * i.e. invariant 3
+		 */
+		cptr = kimage_src_conflict(image, destination, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+			*cptr = buffer | (centry & ~PAGE_MASK);
+			*ptr = cpage | ( entry & ~PAGE_MASK);
+			buffer = page;
+		}
+		/* If the buffer is my destination page do the copy now 
+		 * i.e. invariant 3 & 1
+		 */
+		if (buffer == destination) {
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = page;
+		}
+	}
+	free_page((unsigned long)phys_to_virt(buffer));
+	return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+	unsigned long len)
+{
+	unsigned long pos;
+	int result;
+	for(pos = 0; pos < len; pos += PAGE_SIZE) {
+		char *page;
+		result = -ENOMEM;
+		page = (void *)__get_free_page(GFP_KERNEL);
+		if (!page) {
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result) {
+			goto out;
+		}
+	}
+	result = 0;
+ out:
+	return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+	struct kexec_segment *segment)
+{	
+	unsigned long mstart;
+	int result;
+	unsigned long offset;
+	unsigned long offset_end;
+	unsigned char *buf;
+
+	result = 0;
+	buf = segment->buf;
+	mstart = (unsigned long)segment->mem;
+
+	offset_end = segment->memsz;
+
+	result = kimage_set_destination(image, mstart);
+	if (result < 0) {
+		goto out;
+	}
+	for(offset = 0;  offset < segment->memsz; offset += PAGE_SIZE) {
+		char *page;
+		size_t size, leader;
+		page = (char *)__get_free_page(GFP_KERNEL);
+		if (page == 0) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result < 0) {
+			goto out;
+		}
+		if (segment->bufsz < offset) {
+			/* We are past the end zero the whole page */
+			memset(page, 0, PAGE_SIZE);
+			continue;
+		}
+		size = PAGE_SIZE;
+		leader = 0;
+		if ((offset == 0)) {
+			leader = mstart & ~PAGE_MASK;
+		}
+		if (leader) {
+			/* We are on the first page zero the unused portion */
+			memset(page, 0, leader);
+			size -= leader;
+			page += leader;
+		}
+		if (size > (segment->bufsz - offset)) {
+			size = segment->bufsz - offset;
+		}
+		result = copy_from_user(page, buf + offset, size);
+		if (result) {
+			result = -EFAULT;
+			goto out;
+		}
+		if (size < (PAGE_SIZE - leader)) {
+			/* zero the trailing part of the page */
+			memset(page + size, 0, (PAGE_SIZE - leader) - size);
+		}
+	}
+ out:
+	return result;
+}
+
+
+/* do_kexec executes a new kernel 
+ */
+int do_kexec(unsigned long start, long nr_segments,
+	struct kexec_segment *arg_segments, struct kimage *image)
+{
+	struct kexec_segment *segments;
+	size_t segment_bytes;
+	int i;
+
+	int result; 
+	unsigned long reboot_code_buffer;
+	kimage_entry_t *end;
+
+	/* Initialize variables */
+	segments = 0;
+
+	/* Reject an empty or negative segment count. */
+	if (nr_segments <= 0) {
+		result = -EINVAL;
+		goto out;
+	}
+	segment_bytes = nr_segments * sizeof(*segments);
+	segments = kmalloc(segment_bytes, GFP_KERNEL);
+	if (segments == 0) {
+		result = -ENOMEM;
+		goto out;
+	}
+	if (copy_from_user(segments, arg_segments, segment_bytes)) {
+		result = -EFAULT;
+		goto out;
+	}
+
+	/* Read in the data from user space */
+	image->start = start;
+	for(i = 0; i < nr_segments; i++) {
+		result = kimage_load_segment(image, &segments[i]);
+		if (result) {
+			goto out;
+		}
+	}
+	
+	/* Terminate early so I can get a place holder. */
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+	end = image->entry;
+
+	/* Usage of the reboot code buffer is subtle.  We first
+	 * find a contiguous area of ram, that is not one
+	 * of our destination pages.  We do not allocate the ram.
+	 *
+	 * The algorithm to make certain we do not have address
+	 * conflicts requires each destination region to have some
+	 * backing store so we allocate arbitrary source pages.
+	 *
+	 * Later in machine_kexec when we copy data to the
+	 * reboot_code_buffer it still may be allocated for other
+	 * purposes, but we do know there are no source or destination
+	 * pages in that area.  And since the rest of the kernel
+	 * is already shutdown those pages are free for use,
+	 * regardless of their page->count values.
+	 *
+	 * The kernel mapping of the reboot code buffer is passed to
+	 * the machine dependent code.  If it needs something else
+	 * it is free to set that up.
+	 */
+	result = kimage_get_unused_area(
+		image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+		&reboot_code_buffer);
+	if (result) 
+		goto out;
+
+	/* Allocating pages we should never need  is silly but the
+	 * code won't work correctly unless we have dummy pages to
+	 * work with. 
+	 */
+	result = kimage_set_destination(image, reboot_code_buffer);
+	if (result) 
+		goto out;
+	result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+	if (result)
+		goto out;
+	image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = kimage_get_off_destination_pages(image);
+	if (result)
+		goto out;
+
+	/* Now hide the extra source pages for the reboot code buffer.
+	 */
+	image->entry = end;
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = 0;
+ out:
+	/* cleanup and exit */
+	if (segments)	kfree(segments);
+	return result;
+}
+
diff -uNr linux-2.5.45/kernel/sys.c linux-2.5.45.x86kexec/kernel/sys.c
--- linux-2.5.45/kernel/sys.c	Fri Oct 18 11:59:29 2002
+++ linux-2.5.45.x86kexec/kernel/sys.c	Wed Oct 30 21:45:37 2002
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/workqueue.h>
 #include <linux/device.h>
 #include <linux/times.h>
@@ -430,6 +431,66 @@
 	unlock_kernel();
 	return 0;
 }
+
+#ifdef CONFIG_KEXEC
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ * 
+ * This call breaks up into three pieces.  
+ * - A generic part which loads the new kernel from the current
+ *   address space, and very carefully places the data in the
+ *   allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ *   the devices to shut down.  Preventing on-going dmas, and placing
+ *   the devices in a consistent state so a later kernel can
+ *   reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ *   and then copies the image to its final destination, and
+ *   jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments, 
+	struct kexec_segment *segments)
+{
+	/* Am I using too much stack space here? */
+	struct kimage image;
+	int result;
+		
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_BOOT))
+		return -EPERM;
+
+	lock_kernel();
+	kimage_init(&image);
+	result = do_kexec(entry, nr_segments, segments, &image);
+	if (result) {
+		kimage_free(&image);
+		unlock_kernel();
+		return result;
+	}
+	
+	/* The point of no return is here... */
+	notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+	system_running = 0;
+	device_shutdown();
+	printk(KERN_EMERG "Starting new kernel\n");
+	machine_kexec(&image);
+	/* We never get here but... */
+	kimage_free(&image);
+	unlock_kernel();
+	return -EINVAL; 
+}
+#else
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+	struct kexec_segment *segments)
+{
+	return -ENOSYS;
+}
+#endif /* CONFIG_KEXEC */
 
 static void deferred_cad(void *dummy)
 {
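
For readers following the patch above: the entry list that kimage_add_entry
builds and relocate_new_kernel walks can be modelled in user space.  What
follows is only a rough sketch under stated assumptions, not the kernel
code: "physical memory" is a small static array, addresses are offsets
into it, and the names mem, read_entry, write_entry, walk_and_copy and
run_demo are made up for illustration.  Only the IND_* flag values and
the order in which they are tested come from the patch.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE        4096UL
#define PAGE_MASK        (~(PAGE_SIZE - 1))
#define IND_DESTINATION  0x1UL	/* flag values as in include/linux/kexec.h */
#define IND_INDIRECTION  0x2UL
#define IND_DONE         0x4UL
#define IND_SOURCE       0x8UL

#define NPAGES 8UL
/* Hypothetical stand-in for physical memory; an address is an offset. */
static unsigned char mem[NPAGES * PAGE_SIZE];

static unsigned long read_entry(unsigned long addr)
{
	unsigned long entry;
	memcpy(&entry, mem + addr, sizeof(entry));
	return entry;
}

static void write_entry(unsigned long addr, unsigned long entry)
{
	assert(addr + sizeof(entry) <= sizeof(mem));
	memcpy(mem + addr, &entry, sizeof(entry));
}

/* Walk the entry list starting at ind, testing the flags in the same
 * order as the assembly loop.  Returns the number of pages copied, or
 * -1 if the walk leaves the simulated memory.
 */
static int walk_and_copy(unsigned long ind)
{
	unsigned long dest = 0;
	int copied = 0;

	while (ind + sizeof(unsigned long) <= sizeof(mem)) {
		unsigned long entry = read_entry(ind);
		ind += sizeof(entry);
		if (entry & IND_DESTINATION) {
			dest = entry & PAGE_MASK;	/* new copy target */
		} else if (entry & IND_INDIRECTION) {
			ind = entry & PAGE_MASK;	/* continue in next list page */
		} else if (entry & IND_DONE) {
			return copied;
		} else if (entry & IND_SOURCE) {
			unsigned long src = entry & PAGE_MASK;
			if (src + PAGE_SIZE > sizeof(mem) ||
			    dest + PAGE_SIZE > sizeof(mem))
				return -1;
			memmove(mem + dest, mem + src, PAGE_SIZE);
			dest += PAGE_SIZE;	/* destinations are consecutive */
			copied++;
		}
		/* any other entry is ignored, as in the assembly */
	}
	return -1;
}

/* Build a two-page list: pages 3 and 4 are copied to pages 5 and 6. */
static int run_demo(void)
{
	const unsigned long w = sizeof(unsigned long);
	unsigned long ind1 = 1 * PAGE_SIZE, ind2 = 2 * PAGE_SIZE;

	memset(mem, 0, sizeof(mem));
	memset(mem + 3 * PAGE_SIZE, 'A', PAGE_SIZE);
	memset(mem + 4 * PAGE_SIZE, 'B', PAGE_SIZE);

	write_entry(ind1 + 0 * w, (5 * PAGE_SIZE) | IND_DESTINATION);
	write_entry(ind1 + 1 * w, (3 * PAGE_SIZE) | IND_SOURCE);
	write_entry(ind1 + 2 * w, ind2 | IND_INDIRECTION);
	write_entry(ind2 + 0 * w, (4 * PAGE_SIZE) | IND_SOURCE);
	write_entry(ind2 + 1 * w, IND_DONE);

	return walk_and_copy(ind1);
}
```

Calling run_demo() from a small main() exercises the walk.  The bounds
checks exist only because this is a simulation; the real loop runs with
paging disabled and trusts the list built by do_kexec.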

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:19     ` tridge
@ 2002-10-31  6:21       ` Chris Wedgwood
  2002-11-05  3:38         ` Andreas Gruenbacher
  0 siblings, 1 reply; 333+ messages in thread
From: Chris Wedgwood @ 2002-10-31  6:21 UTC (permalink / raw)
  To: tridge; +Cc: torvalds, rusty, linux-kernel, geert, rmk, peter, tytso

On Wed, Oct 30, 2002 at 10:19:54PM -0500, tridge@samba.org wrote:

> Eventually I'd like to see a combination of LSM with a new ACL
> system give the ability to support full NT ACLs on Linux (which is
> also needed for full nfsv4 support), but that is way too much to do
> for the 2.6 kernel.

Add bloat to make windows clients happy?

> Extended attributes are also important as they give a place to store
> all the extra DOS info that has no other logical place in a posix
> filesystem. For example, we can put the 'read only', 'archive',
> 'hidden' and 'system' attributes there. If we don't have extended
> attributes then we need to use a nasty kludge where these map to
> various unix permission bits, but the mapping is terrible and
> doesn't give the correct semantics (especially for things like read
> only on directories).

More bloat that doesn't really solve Linux problems... sounds like nasty
hacks to make winduhs hacks work better.

Don't get me wrong, I'm not against sane ACLs (POSIX ACLs are not) or
EAs, but justification of "it makes windows clients easier" is pretty
horrendous IMO.

I would at some point like to see decent ACLs, but I don't want to
see 'windows ACLs' and all the SID nonsense.



  --cw

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:06   ` Rik van Riel
  2002-10-31  3:19     ` Stephen Frost
@ 2002-10-31  6:22     ` Chris Wedgwood
  2002-10-31  6:48       ` Dax Kelson
  2002-10-31  9:44     ` Lech Szychowski
  2 siblings, 1 reply; 333+ messages in thread
From: Chris Wedgwood @ 2002-10-31  6:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

On Thu, Oct 31, 2002 at 01:06:54AM -0200, Rik van Riel wrote:

> Personally I do think either the unlimited groups patch or ACLs are
> needed in order to sanely run a large anoncvs setup.

Processes need to be a member of 20+ groups to make anoncvs work?
Sounds like anoncvs is broken then.


  --cw

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (8 preceding siblings ...)
  2002-10-31  6:07   ` [PATCH] kexec for 2.5.45 Eric W. Biederman
@ 2002-10-31  6:25   ` Matt D. Robinson
  2002-10-31 15:46     ` Linus Torvalds
  2002-10-31  7:46   ` Ville Herva
                     ` (7 subsequent siblings)
  17 siblings, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-10-31  6:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

Linus Torvalds wrote:
> > Crash Dumping (LKCD)
> 
> This is definitely a vendor-driven thing. I don't believe it has any
> relevance unless vendors actively support it.

There are people within IBM in Germany, India and England, as well as
a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
that are PAID to support this.  In addition, Global Services at IBM
uses this as a front-line method for resolving customer problems.
If you're looking for names of people to sign up to support it
(both vendors and non-vendors), I can make that list up for you.

There are a number of us (developers, support staff, and other
interested parties) who bend over backwards, day in and day out
to make sure this stuff works and helps people, even if it isn't
kernel developers (directly -- indirectly, you get bug reports that
are sane and useful).

It's not sexy kernel stuff, but it is very important, and if you'd
like, I can have representatives from at least 10 major corporations
(Fortune 500 companies) contact you to request that this go in.

We're generating 2.5.45 patches now, and we ask that you include
the patches when they are posted.

I don't know what else to say except that people really want this
stuff and all of us in the LKCD community work really hard together
to make this project useful for everyone.

Please include this in your next snapshot.

--Matt

P.S.  Copying some of the users and developers.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:22     ` Chris Wedgwood
@ 2002-10-31  6:48       ` Dax Kelson
  2002-10-31  6:56         ` Chris Wedgwood
  2002-10-31  7:10         ` Alexander Viro
  0 siblings, 2 replies; 333+ messages in thread
From: Dax Kelson @ 2002-10-31  6:48 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: Rik van Riel, Linus Torvalds, Rusty Russell, linux-kernel

On Wed, 2002-10-30 at 23:22, Chris Wedgwood wrote:
> On Thu, Oct 31, 2002 at 01:06:54AM -0200, Rik van Riel wrote:
> 
> > Personally I do think either the unlimited groups patch or ACLs are
> > needed in order to sanely run a large anoncvs setup.
> 
> Processes need to be a member of 20+ groups to make anoncvs work?
> Sounds like anoncvs is broken then.

Technically speaking you can achieve ACL like permissions/behavior using
the historical UNIX security model by creating a group EACH time you run
into a unique case permission scenario.

Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
create another group with just those three people in.  Over time, of
course, this leads to massive group proliferation.  Without Tim Hockin's
patch, 32 groups is the maximum number of groups a user can be a member of.
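
How close a given process is to that limit can be checked from userspace.
The following is only a minimal sketch: getgroups() and
sysconf(_SC_NGROUPS_MAX) are standard POSIX calls, the helper names are made
up for illustration, and the numbers reported depend on the kernel and libc
actually running.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Number of supplementary groups held by the calling process, or -1 on
 * error.  With a size of 0, getgroups() only reports the count.
 */
int supplementary_group_count(void)
{
        return getgroups(0, NULL);
}

/*
 * Per-process limit on supplementary groups as reported by sysconf(),
 * or -1 if the system does not state a limit.
 */
long supplementary_group_limit(void)
{
        return sysconf(_SC_NGROUPS_MAX);
}

/*
 * Copy up to `capacity` group IDs into `list`.  Returns how many were
 * stored, or -1 if the list is too small or another error occurred.
 */
int load_supplementary_groups(gid_t *list, int capacity)
{
        assert(list != NULL && capacity > 0);
        return getgroups(capacity, list);
}

/* Print current membership against the limit. */
void report_group_usage(void)
{
        printf("member of %d supplementary groups, limit %ld\n",
               supplementary_group_count(), supplementary_group_limit());
}
```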

Dax


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:48       ` Dax Kelson
@ 2002-10-31  6:56         ` Chris Wedgwood
  2002-10-31 14:31           ` Jeff Garzik
  2002-10-31 18:28           ` Nicholas Wourms
  2002-10-31  7:10         ` Alexander Viro
  1 sibling, 2 replies; 333+ messages in thread
From: Chris Wedgwood @ 2002-10-31  6:56 UTC (permalink / raw)
  To: Dax Kelson; +Cc: Rik van Riel, Linus Torvalds, Rusty Russell, linux-kernel

On Wed, Oct 30, 2002 at 11:48:23PM -0700, Dax Kelson wrote:

> Technically speaking you can achieve ACL like permissions/behavior
> using the historical UNIX security model by creating a group EACH
> time you run into a unique case permission scenario.

I'm not arguing against this... I'm claiming POSIX ACLs are mostly
brain-dead and almost worthless (broken by committee pressure and too
many people making stupid concessions).

If we must have ACLs, why not do it right?

> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir,
> just create another group with just those three people in.  Over
> time, of course, this leads to massive group proliferation.  Without
> Tim Hockin's patch, 32 groups is maximum number of groups a user can
> be a member of.

How many people actually need this level of complexity?

Why are we adding all this shit and bloat because of perceived
problems most people don't have?  What next, some kind of misdesigned
in-kernel CryptoAPI?



  --cw

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:48       ` Dax Kelson
  2002-10-31  6:56         ` Chris Wedgwood
@ 2002-10-31  7:10         ` Alexander Viro
  2002-10-31  7:21           ` Dax Kelson
  2002-10-31 22:53           ` Pavel Machek
  1 sibling, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-10-31  7:10 UTC (permalink / raw)
  To: Dax Kelson
  Cc: Chris Wedgwood, Rik van Riel, Linus Torvalds, Rusty Russell,
	linux-kernel



On 30 Oct 2002, Dax Kelson wrote:

> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> create another group with just those three people in.  Over time, of

If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
are using existing userland (any OS I'd seen), then Sally can easily
fuck them into the next month and not in a good way.

_That_ is the real problem.  Until that is solved (i.e. until all
userland is written up to the standards allegedly followed in writing
suid-root programs wrt hostile filesystem modifications) NO mechanism
will help you.  ACLs, huge groups, whatever - setups with that sort
of access allowed are NOT SUSTAINABLE with the current userland(s).


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  7:10         ` Alexander Viro
@ 2002-10-31  7:21           ` Dax Kelson
  2002-10-31  7:42             ` Alexander Viro
  2002-10-31 22:53           ` Pavel Machek
  1 sibling, 1 reply; 333+ messages in thread
From: Dax Kelson @ 2002-10-31  7:21 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Chris Wedgwood, Rik van Riel, Linus Torvalds, Rusty Russell,
	linux-kernel

On Thu, 2002-10-31 at 00:10, Alexander Viro wrote:
> 
> 
> On 30 Oct 2002, Dax Kelson wrote:
> 
> > Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> > create another group with just those three people in.  Over time, of
> 
> If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
> are using existing userland (any OS I'd seen), then Sally can easily
> fuck them into the next month and not in a good way.

I think the normal intent is to let Sally, Joe, and Bill have their own
private directory protected from THE REST OF THE USERS.

If a member of your trusted circle goes rogue, then, yup you are screwed
for the moment. It shouldn't last a whole month though.

That is what backups, and employment termination is for.

Dax


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  7:21           ` Dax Kelson
@ 2002-10-31  7:42             ` Alexander Viro
  2002-10-31 16:24               ` Stephen Wille Padnos
  2002-11-02 17:35               ` LA Walsh
  0 siblings, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-10-31  7:42 UTC (permalink / raw)
  To: Dax Kelson
  Cc: Chris Wedgwood, Rik van Riel, Linus Torvalds, Rusty Russell,
	linux-kernel



On 31 Oct 2002, Dax Kelson wrote:

> I think the normal intent is to let Sally, Joe, and Bill have their own
> private directory protected from THE REST OF THE USERS.
> 
> If a member of your trusted circle goes rogue, then, yup you are screwed
> for the moment. It shouldn't last a whole month though.
> 
> That is what backups, and employment termination is for.

Then give them all the same account and be done with that.  Effect will
be the same.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (9 preceding siblings ...)
  2002-10-31  6:25   ` What's left over Matt D. Robinson
@ 2002-10-31  7:46   ` Ville Herva
  2002-10-31  9:23     ` Geert Uytterhoeven
  2002-10-31 10:16   ` Trever L. Adams
                     ` (6 subsequent siblings)
  17 siblings, 1 reply; 333+ messages in thread
From: Ville Herva @ 2002-10-31  7:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Wed, Oct 30, 2002 at 06:31:36PM -0800, you [Linus Torvalds] wrote:
> 
> > Crash Dumping (LKCD)
> 
> This is definitely a vendor-driven thing. I don't believe it has any 
> relevance unless vendors actively support it.

I don't think this is just a vendor thing. Currently, linux doesn't have any
way of saving the crash dump when the box crashes. So if it crashes, the
user needs to write the oops down by hand (error prone, the interesting part
has often scrolled off screen), or attach a serial console (then he needs to
reproduce it - not always possible, and actually majority of people (home
users) don't have second box and the cable. Nor the motivation.)

So, imho some way of semi-automatically saving the dumps is needed. If
vendors even support it - great - but it has value to mainline kernel as
well, as people can submit more accurate error reports. Besides, if it goes
in mainline, I believe vendors are likely to support it. (Why wouldn't they?
Currently there just isn't a standard way of doing this.)

There are a bunch of patches for this sort of thing (Willy Tarreau's
kmsgdump for dumping to floppy, Ingo's netconsole, Rusty's oopser for
dumping to ide device...), but lkcd is a more general framework, and can
support different ways of dumping.

I know you are not keen on kernel debuggers, but I can't see what's
fundamentally wrong with being able to save the crucial info when a crash
happens...


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  7:46   ` Ville Herva
@ 2002-10-31  9:23     ` Geert Uytterhoeven
  2002-10-31  9:39       ` Ville Herva
  0 siblings, 1 reply; 333+ messages in thread
From: Geert Uytterhoeven @ 2002-10-31  9:23 UTC (permalink / raw)
  To: Ville Herva; +Cc: Linus Torvalds, Linux Kernel Development

On Thu, 31 Oct 2002, Ville Herva wrote:
> On Wed, Oct 30, 2002 at 06:31:36PM -0800, you [Linus Torvalds] wrote:
> > > Crash Dumping (LKCD)
> > 
> > This is definitely a vendor-driven thing. I don't believe it has any 
> > relevance unless vendors actively support it.
> 
> I don't think this is just a vendor thing. Currently, linux doesn't have any
> way of saving the crash dump when the box crashes. So if it crashes, the
> user needs to write the oops down by hand (error prone, the interesting part
> has often scrolled off screen), or attach a serial console (then he needs to
> reproduce it - not always possible, and actually majority of people (home
> users) don't have second box and the cable. Nor the motivation.)

Except on m68k, where we've had a feature to store all kernel messages in an
unused portion of memory (e.g. some Chip RAM on Amiga) and recover them after
reboot since ages.
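
The general idea can be sketched in plain C. This is not the m68k code: the
structure, the magic value and the function names are all hypothetical, and a
static struct stands in for the reserved memory that survives a reboot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAVED_LOG_MAGIC 0x4c4f4742u     /* arbitrary marker, hypothetical */
#define SAVED_LOG_SIZE  4096u

/* Layout of the reserved region; it must not be cleared on reboot. */
struct saved_log {
        uint32_t magic;                 /* region is valid only if this matches */
        uint64_t head;                  /* total number of bytes ever written */
        char     data[SAVED_LOG_SIZE];  /* ring of the most recent bytes */
};

/* Append one message, overwriting the oldest bytes once the ring is full. */
void saved_log_write(struct saved_log *log, const char *msg)
{
        size_t len = strlen(msg);

        for (size_t i = 0; i < len; i++)
                log->data[log->head++ % SAVED_LOG_SIZE] = msg[i];
}

/*
 * Called early at boot.  If the magic is intact, copy the newest bytes left
 * by the previous boot into `out` (NUL-terminated), then reset the region.
 * Returns the number of bytes recovered.
 */
size_t saved_log_recover(struct saved_log *log, char *out, size_t out_size)
{
        size_t n = 0;

        assert(out != NULL && out_size > 0);
        if (log->magic == SAVED_LOG_MAGIC) {
                uint64_t avail = log->head < SAVED_LOG_SIZE ? log->head
                                                            : SAVED_LOG_SIZE;
                uint64_t start = log->head - avail;

                if (avail > out_size - 1) {
                        start += avail - (out_size - 1);  /* keep newest bytes */
                        avail = out_size - 1;
                }
                for (n = 0; n < avail; n++)
                        out[n] = log->data[(start + n) % SAVED_LOG_SIZE];
        }
        out[n] = '\0';
        log->magic = SAVED_LOG_MAGIC;
        log->head = 0;
        return n;
}
```

A real implementation would also want a checksum, since a matching magic
alone does not prove the contents survived intact.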

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  9:23     ` Geert Uytterhoeven
@ 2002-10-31  9:39       ` Ville Herva
  0 siblings, 0 replies; 333+ messages in thread
From: Ville Herva @ 2002-10-31  9:39 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linux Kernel Development

On Thu, Oct 31, 2002 at 10:23:32AM +0100, you [Geert Uytterhoeven] wrote:
> 
> Except on m68k, where we've had a feature to store all kernel messages in an
> unused portion of memory (e.g. some Chip RAM on Amiga) and recover them after
> reboot since ages.

There was a similar thing for x86 as well:

http://www.tux.org/hypermail/linux-kernel/1999week27/0782.html

Of course it never went to mainline (and I don't know how well it worked.)
From what I understand, lkcd can support such a method easily.


-- v --

v@iki.fi

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:06   ` Rik van Riel
  2002-10-31  3:19     ` Stephen Frost
  2002-10-31  6:22     ` Chris Wedgwood
@ 2002-10-31  9:44     ` Lech Szychowski
  2 siblings, 0 replies; 333+ messages in thread
From: Lech Szychowski @ 2002-10-31  9:44 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

> Yes, people use it.  Not quite sure why though, I guess ACLs
> buy some flexibility over the user/group/other model but if
> the "unlimited groups" patch goes in (is in?) I'm happy ;)

Correct me if I'm wrong but I believe a process has to be
restarted to have its group membership list changed? 

That's a huge difference from ACL behavior which allows for changes to
file access rights without the need to restart the accessing process.

-- 
	Leszek.

-- lech7@pse.pl 2:480/33.7          -- REAL programmers use INTEGERS --
-- speaking just for myself...

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:00   ` Rusty Russell
  2002-10-31  3:19     ` tridge
  2002-10-31  3:22     ` Christoph Hellwig
@ 2002-10-31 10:15     ` Joe Thornber
  2002-10-31 14:26       ` Jeff Garzik
  2002-10-31 21:14       ` Rusty Russell
  2002-10-31 11:03     ` Geert Uytterhoeven
  3 siblings, 2 replies; 333+ messages in thread
From: Joe Thornber @ 2002-10-31 10:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, linux-kernel, Geert Uytterhoeven, Russell King,
	Peter Chubb, tridge, tytso

On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > > EVMS
> > 
> > Not for the feature freeze, there are some noises that imply that SuSE may 
> > push it in their kernels. 
> 
> They have, IIRC.  Interestingly, it was less invasive (existing source
> touched) than the LVM2/DM patch you merged.

FUD.  I added to three areas of existing code:

i) Every man and his dog uses mempools in conjunction with slabs, so
   rather than having everyone redefining their own alloc/free
   functions I added the following huge functions to mempool.c.  In no
   way were they mandatory.

    /*
     * A commonly used alloc and free fn.
     */
    void *mempool_alloc_slab(int gfp_mask, void *pool_data)
    {
            kmem_cache_t *mem = (kmem_cache_t *) pool_data;
            return kmem_cache_alloc(mem, gfp_mask);
    }

    void mempool_free_slab(void *element, void *pool_data)
    {
            kmem_cache_t *mem = (kmem_cache_t *) pool_data;
            kmem_cache_free(mem, element);
    }

ii) vcalloc, this *didn't* get merged, and will probably end up getting
    moved into dm.h.

iii) ioctl32 support: people have argued against an ioctl interface,
     and I'm inclined to agree with them, which is why I'm going to
     publish an fs interface shortly.  However, given that we are
     currently using an ioctl interface how do we avoid adding support for
     32bit userland/64 kernel space ?  If EVMS isn't touching these
     files does that mean they're not supporting these architectures ?

        arch/mips64/kernel/ioctl32.c
        arch/ppc64/kernel/ioctl32.c
        arch/s390x/kernel/ioctl32.c
        arch/sparc64/kernel/ioctl32.c


So given that (ii) didn't get merged, which of (i) and (iii) were you
objecting to ?
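
How an alloc/free pair like the one in (i) plugs into a pool can be modelled
in userspace. The sketch below is not the kernel's mempool code: struct
toy_pool, struct toy_cache and every toy_* function are hypothetical, malloc()
stands in for the slab allocator, and only the two callback signatures follow
the functions quoted above.

```c
#include <assert.h>
#include <stdlib.h>

/* Callback types mirroring the signatures of the quoted helpers. */
typedef void *(*pool_alloc_fn)(int gfp_mask, void *pool_data);
typedef void (*pool_free_fn)(void *element, void *pool_data);

/* Hypothetical stand-in for a slab cache: it only records an object size. */
struct toy_cache {
        size_t object_size;
};

/* Hypothetical pool that keeps min_nr elements in reserve. */
struct toy_pool {
        int min_nr;
        int curr_nr;
        void **elements;
        pool_alloc_fn alloc_fn;
        pool_free_fn free_fn;
        void *pool_data;
};

void *toy_alloc_from_cache(int gfp_mask, void *pool_data)
{
        struct toy_cache *cache = pool_data;

        (void)gfp_mask;         /* allocation flags are ignored in this model */
        return malloc(cache->object_size);
}

void toy_free_to_cache(void *element, void *pool_data)
{
        (void)pool_data;
        free(element);
}

void toy_pool_destroy(struct toy_pool *pool)
{
        while (pool->curr_nr > 0)
                pool->free_fn(pool->elements[--pool->curr_nr], pool->pool_data);
        free(pool->elements);
        free(pool);
}

struct toy_pool *toy_pool_create(int min_nr, pool_alloc_fn alloc_fn,
                                 pool_free_fn free_fn, void *pool_data)
{
        struct toy_pool *pool;

        assert(min_nr > 0);
        pool = calloc(1, sizeof(*pool));
        if (!pool)
                return NULL;
        pool->elements = calloc((size_t)min_nr, sizeof(void *));
        if (!pool->elements) {
                free(pool);
                return NULL;
        }
        pool->min_nr = min_nr;
        pool->alloc_fn = alloc_fn;
        pool->free_fn = free_fn;
        pool->pool_data = pool_data;

        /* Fill the reserve up front so later allocations can fall back on it. */
        while (pool->curr_nr < min_nr) {
                void *element = alloc_fn(0, pool_data);

                if (!element) {
                        toy_pool_destroy(pool);
                        return NULL;
                }
                pool->elements[pool->curr_nr++] = element;
        }
        return pool;
}

/* Try the allocator first; dip into the reserve only if it fails. */
void *toy_pool_alloc(struct toy_pool *pool, int gfp_mask)
{
        void *element = pool->alloc_fn(gfp_mask, pool->pool_data);

        if (!element && pool->curr_nr > 0)
                element = pool->elements[--pool->curr_nr];
        return element;
}

/* Refill the reserve before handing memory back to the allocator. */
void toy_pool_free(struct toy_pool *pool, void *element)
{
        if (pool->curr_nr < pool->min_nr)
                pool->elements[pool->curr_nr++] = element;
        else
                pool->free_fn(element, pool->pool_data);
}
```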

- Joe

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (10 preceding siblings ...)
  2002-10-31  7:46   ` Ville Herva
@ 2002-10-31 10:16   ` Trever L. Adams
  2002-10-31 18:08     ` Nicholas Wourms
  2002-10-31 13:36   ` mbs
                     ` (5 subsequent siblings)
  17 siblings, 1 reply; 333+ messages in thread
From: Trever L. Adams @ 2002-10-31 10:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, Linux Kernel Mailing List

On Wed, 2002-10-30 at 21:31, Linus Torvalds wrote:

> > ext2/ext3 ACLs and Extended Attributes
> 
> I don't know why people still want ACL's. There were noises about them for 
> samba, but I've not heard anything since. Are vendors using this?
> 

I am sure I don't count (not being a vendor), but Intermezzo offers
support for this (they are waiting on feature freeze to redo it to 2.5
according to an email I have).  I want this stuff.  Yes, u+g+w is nice,
but good ACLs are even better.  Please, if this is technically correct
in implementation, do put it in.

Thank you,
Trever


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:00   ` Rusty Russell
                       ` (2 preceding siblings ...)
  2002-10-31 10:15     ` Joe Thornber
@ 2002-10-31 11:03     ` Geert Uytterhoeven
  2002-10-31 21:17       ` James Simmons
  3 siblings, 1 reply; 333+ messages in thread
From: Geert Uytterhoeven @ 2002-10-31 11:03 UTC (permalink / raw)
  To: Rusty Russell, James Simmons
  Cc: Linus Torvalds, Linux Kernel Development, Russell King,
	Peter Chubb, tridge, Theodore Ts'o

On Thu, 31 Oct 2002, Rusty Russell wrote:
> In message <Pine.LNX.4.44.0210301823120.1396-100000@home.transmeta.com> you write:
> > On Thu, 31 Oct 2002, Rusty Russell wrote:
> > > Fbdev Rewrite
> > 
> > This one is just huge, and I have little personal judgement on it.
> 
> It's been around for a while.  Geert, Russell?

It's huge because it moves a lot of files around:
  1. drivers/char/agp/ -> drivers/video/agp/
  2. drivers/char/drm/ -> drivers/video/drm/
  3. console related files in drivers/video/ -> drivers/video/console/

(1) and (2) should be reverted, but apparently they aren't reverted in the
patch at http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz yet. The patch
also seems to remove some drivers. Haven't checked the bk repo yet.

James, can you please fix that (and the .Config files)?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (11 preceding siblings ...)
  2002-10-31 10:16   ` Trever L. Adams
@ 2002-10-31 13:36   ` mbs
  2002-10-31 14:21   ` Chris Friesen
                     ` (4 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: mbs @ 2002-10-31 13:36 UTC (permalink / raw)
  To: Linus Torvalds, Rusty Russell; +Cc: linux-kernel

> > POSIX Timer API
>
> I think I'll do at least the API, but there were some questions about the
> config options here, I think.

I think george just posted a config optionless patch.

WOOHOO!  Thanks!

>
> > Hires Timers
>
> This one is likely another "vendor push" thing.
>

I work for a vendor who really wants this.  

we have customers who demand it.

I am sure we are not alone (mvista? concurrent? any embedded space people for 
whom 10msec is not good enough and the extra overhead of a higher frequency 
fixed interval timer is unacceptable please speak up, if we don't get it in 
now, we probably won't get it for 2 years.)

-- 
/**************************************************
**   Mark Salisbury       ||      mbs@mc.com     **
**************************************************/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (12 preceding siblings ...)
  2002-10-31 13:36   ` mbs
@ 2002-10-31 14:21   ` Chris Friesen
  2002-10-31 14:52   ` Suparna Bhattacharya
                     ` (3 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Chris Friesen @ 2002-10-31 14:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:
>>Linux Trace Toolkit (LTT)
> I don't know what this buys us.

I'd like to add a request for this to be in mainstream.  The benefits 
have already been stated in this thread, and it has been used here to 
good effect.

>>Crash Dumping (LKCD)
> This is definitely a vendor-driven thing. I don't believe it has any 
> relevance unless vendors actively support it.

I'd like to see this too.  The more debug information the better as far 
as I'm concerned.


>>Hires Timers
> This one is likely another "vendor push" thing.

It doesn't hurt performance when turned off, and allows for 
finer-grained timing when turned on.  What's not to like?  I can't 
comment on the actual code, but I really like the idea.


Chris


-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 10:15     ` Joe Thornber
@ 2002-10-31 14:26       ` Jeff Garzik
  2002-10-31 14:55         ` Alan Cox
  2002-10-31 21:14       ` Rusty Russell
  1 sibling, 1 reply; 333+ messages in thread
From: Jeff Garzik @ 2002-10-31 14:26 UTC (permalink / raw)
  To: Joe Thornber
  Cc: Rusty Russell, Linus Torvalds, linux-kernel, Geert Uytterhoeven,
	Russell King, Peter Chubb, tridge, tytso

Joe Thornber wrote:

>ii) vcalloc, this *didn't* get merged, and will probably end up getting
>    moved into dm.h.
>

Yeah, historically we have avoided things like this.

kcalloc gets proposed every year or so too.

>iii) ioctl32 support: people have argued against an ioctl interface,
>     and I'm inclined to agree with them, which is why I'm going to
>     publish an fs interface shortly.  However, given that we are
>     currently using an ioctl interface how do we avoid adding support for
>     32bit userland/64 kernel space ?  If EVMS isn't touching these
>     files does that mean they're not supporting these architectures ?
>
>        arch/mips64/kernel/ioctl32.c
>        arch/ppc64/kernel/ioctl32.c
>        arch/s390x/kernel/ioctl32.c
>        arch/sparc64/kernel/ioctl32.c
>  
>

Well, I'll note that ALSA compartmentalizes their ioctl32 handling 
within their own subsystem, which seems like a decent solution.

That said, [maybe I'm biased <g>], using an fs interface allows one to 
completely eliminate an ioctl32 interface.  That would be the direction 
I would greatly prefer by the time 2.5.x hits the code freeze.

Best regards, and congrats for getting it merged,

    Jeff





^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:56         ` Chris Wedgwood
@ 2002-10-31 14:31           ` Jeff Garzik
  2002-10-31 18:12             ` Chris Wedgwood
  2002-10-31 18:28           ` Nicholas Wourms
  1 sibling, 1 reply; 333+ messages in thread
From: Jeff Garzik @ 2002-10-31 14:31 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Dax Kelson, Rik van Riel, Linus Torvalds, Rusty Russell, linux-kernel

Chris Wedgwood wrote:

>problems most people don't have?  What next, some kind of misdesigned
>in-kernel CryptoAPI?
>  
>


Ok, I'll allow myself to be trolled.

What's wrong with our current 2.5.45 crypto api?



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over
  2002-10-31  2:31 ` Linus Torvalds
                     ` (13 preceding siblings ...)
  2002-10-31 14:21   ` Chris Friesen
@ 2002-10-31 14:52   ` Suparna Bhattacharya
  2002-10-31 16:37   ` Henning P. Schmiedehausen
                     ` (2 subsequent siblings)
  17 siblings, 0 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-10-31 14:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, lkcd-devel, lkcd-general

On Thu, Oct 31, 2002 at 02:39:23AM +0000, Linus Torvalds wrote:
> 
> On Thu, 31 Oct 2002, Rusty Russell wrote:
> > 
> > 	Here is the list of features which have are being actively
> > pushed, not NAK'ed, and are not in 2.5.45.  There are 13 of them, as
> > appropriate for Halloween.
> 
> I'm unlikely to be able to merge everything by tomorrow, so I will 
> consider tomorrow a submission deadline to me, rather than a merge 
> deadline. That said, I merged everything I'm sure I want to merge today, 
> and the rest I simply haven't had time to look at very much.
> 
> 
> > Crash Dumping (LKCD)
> 
> This is definitely a vendor-driven thing. I don't believe it has any 
> relevance unless vendors actively support it.
> 

Linus,

I wish you could have made it to the OLS RAS BOF and seen this for
yourself - the vendor support, the need and the drive towards a 
unified and flexible dumping framework. 

The problem with dump has not been lack of vendor interest. There
wouldn't have been multiple dump type implementations floating around 
if there wasn't a need  --  LKCD, Mission Critical dump, Ingo's
network dump, kmsgdump, Rusty's oops dumper to cite some. The difficulty
has been technical and hence the diversity of approaches that different
projects came up with to tackle the problem (arising from slightly
different priorities and environments in each case). The second has
been related to preferences in the kind of user level analysis tools.

And the LKCD project has been evolving to address these very 
problems to bring the best of these worlds together and also allow
flexibility on the choice of analysis tools !

Mission critical Linux project code base for example is now being 
maintained as part of the LKCD project. Either lcrash or mission 
critical linux crash can be used for analysing LKCD dumps. 

And on the kernel side of things:

(a) The dump driver interface in LKCD has been specifically 
    designed to enable different kinds of dumping mechanisms and 
    targets to be supported -- generic block, network dump , 
    polled-IDE (Rusty style) etc, even alternate dump targets failover 
    and multiple dump devices in the future if required. We are also 
    experimenting with a memory dump driver to save dump to memory 
    and dump after a memory preserving soft-boot, reusing the mission 
    critical mcore technique.
(b) Selective dumping, for different levels of dump data - one
    option that was added recently would dump all kernel pages
    and is likely to be commonly used (gzip compressed dump). It's
    pretty easy to extend to more selectivity or different levels
    and the dump also occurs in passes from more critical data to 
    less critical.
    (The page in use flag was added to help with this)
(c) The core pieces which touch the kernel as such just add basic 
    infrastructure that is needed in the kernel for any dumping 
    facility. Includes:
	- Enabling IPI to collect CPU state on all processors in the
	  system right when dump is triggered (may not be a normal
	  situation, so NMIs where supported are the best option)
	- Ability to quiesce (silence) the system before dumping 
	  (and if in non-disruptive mode, then restore it back)
	- Calls into dump from kernel paths (panic, oops, sysrq
	  etc). 
	- Exports of symbols to help with physical memory 
	  traversal and verification
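
To make (a) and (b) a little more concrete, here is a rough userspace sketch
of a pluggable dump target combined with pass-ordered dumping. It is not
LKCD's actual interface: every type, enum value and function name below is
hypothetical, and an in-memory buffer stands in for a real block or network
target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical selectivity levels, most critical first. */
enum dump_level { DUMP_HEADER = 0, DUMP_KERNEL = 1, DUMP_USER = 2 };

struct dump_page {
        enum dump_level level;
        const void *data;
        size_t len;
};

/* A target only has to know how to accept a buffer. */
struct dump_target {
        const char *name;
        int (*write)(struct dump_target *t, const void *buf, size_t len);
        void *private_data;
};

/*
 * Write pages in passes, from the most critical level up to max_level.
 * Returns the number of pages written, or -1 on the first target error.
 */
int run_dump(struct dump_target *t, const struct dump_page *pages,
             size_t nr_pages, enum dump_level max_level)
{
        int written = 0;

        assert(t != NULL && t->write != NULL);
        for (int pass = DUMP_HEADER; pass <= (int)max_level; pass++) {
                for (size_t i = 0; i < nr_pages; i++) {
                        if ((int)pages[i].level != pass)
                                continue;
                        if (t->write(t, pages[i].data, pages[i].len) != 0)
                                return -1;
                        written++;
                }
        }
        return written;
}

/* Example target: append everything to a fixed in-memory buffer. */
struct mem_sink {
        char buf[256];
        size_t used;
};

int mem_target_write(struct dump_target *t, const void *buf, size_t len)
{
        struct mem_sink *sink = t->private_data;

        if (len > sizeof(sink->buf) - sink->used)
                return -1;              /* target is full */
        memcpy(sink->buf + sink->used, buf, len);
        sink->used += len;
        return 0;
}
```

Because the pass loop is outermost, a dump that gets cut short still holds
the most critical data, whichever target is plugged in.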

As Matt has said there is an active development community behind 
LKCD and a lot of the drive for that has come from companies who use it 
and are really hoping hard that it becomes part of the mainline.

BTW, the code has also been scrutinised and reviewed over
lkml as well and undergone iterations of releases following 
that. Anything else there that you think needs to be fixed please
do let us know.

Regards
Suparna


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 14:26       ` Jeff Garzik
@ 2002-10-31 14:55         ` Alan Cox
  0 siblings, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-10-31 14:55 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Joe Thornber, Rusty Russell, Linus Torvalds,
	Linux Kernel Mailing List, Geert Uytterhoeven, Russell King,
	Peter Chubb, tridge, tytso

On Thu, 2002-10-31 at 14:26, Jeff Garzik wrote:
> Yeah, historically we have avoided things like this.
> kcalloc gets proposed every year or so too.

I would like to see both of these in because tons of kernel fixing that
has been done through audits has been about


	get_user(a, ...)
	kmalloc(a * sizeof(b), ..)

We end up with loads of ugly  > MAXINT/sizeof(foo) if checks in the code
that ought to be in one place
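
The kind of helper being asked for can be sketched in userspace C. This is an
illustration only: checked_array_alloc() is a hypothetical name, malloc()
stands in for kmalloc(), and SIZE_MAX plays the role of the MAXINT bound
mentioned above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of the given size.
 * Returns NULL if n * size would overflow size_t or if malloc() fails,
 * so callers no longer need their own overflow check.
 */
void *checked_array_alloc(size_t n, size_t size)
{
        void *p;

        if (size != 0 && n > SIZE_MAX / size)
                return NULL;            /* n * size would wrap around */

        p = malloc(n * size);
        if (p)
                memset(p, 0, n * size);
        return p;
}
```

With this in place, a count read from userspace can be passed straight in:
an oversized value yields NULL instead of a short buffer.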



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:25   ` What's left over Matt D. Robinson
@ 2002-10-31 15:46     ` Linus Torvalds
  2002-10-31 17:10       ` Patrick Finnegan
                         ` (4 more replies)
  0 siblings, 5 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 15:46 UTC (permalink / raw)
  To: Matt D. Robinson; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


On Wed, 30 Oct 2002, Matt D. Robinson wrote:

> Linus Torvalds wrote:
> > > Crash Dumping (LKCD)
> > 
> > This is definitely a vendor-driven thing. I don't believe it has any
> > relevance unless vendors actively support it.
> 
> There are people within IBM in Germany, India and England, as well as
> a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> that are PAID to support this.

That's fine. And since they are paid to support it, they can apply the 
patches.  

What I'm saying by "vendor driven" is that it has no relevance for the 
standard kernel, and since it has no relevance to that, then I have no 
incentives to merge it. The crash dump is only useful with people who 
actively look at the dumps, and I don't know _anybody_ outside of the 
specialized vendors you mention who actually do that.

I will merge it when there are real users who want it - usually as a
result of having gotten used to it through a vendor who supports it. (And
by "support" I do not mean "maintain the patches", but "actively uses it"
to work out the users problems or whatever).

Horse before the cart and all that thing.

People have to realize that my kernel is not for random new features. The
stuff I consider important are things that people use on their own, or
stuff that is the base for other work. Quite often I want vendors to merge
patches _they_ care about long long before I will merge them (examples of
this are quite common, things like reiserfs and ext3 etc).

THAT is what I mean by vendor-driven. If vendors decide they really want
the patches, and I actually start seeing noises on linux-kernel or getting
requests for it being merged from _users_ rather than developers, then
that means that the vendor is on to something.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* LTT for inclusion into 2.5
  2002-10-31  3:14   ` Karim Yaghmour
@ 2002-10-31 16:00     ` bob
  2002-10-31 16:19       ` Is your idea good? [was: Re: LTT for inclusion into 2.5] Larry McVoy
  0 siblings, 1 reply; 333+ messages in thread
From: bob @ 2002-10-31 16:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: karim, Rusty Russell, linux-kernel, okrieg, okrieg, frankeh, LTT-Dev

Linus,

     LTT is one step in allowing Linux to continue to move towards being a
viable alternative for more than just hackers.  It is part of a larger
effort to provide reliability and serviceability.  Concretely it allows
application/subsystem programmers to understand the performance of their
applications and the system.  I should note, it also allows people to
improve kernel behavior as well.  As we have communicated in the past, the
ability to gather and analyze this data is vital.  From my correspondence
with Ingo:

"If you care about performance you will want to trace.  On two previous
kernels I have worked on I've heard this comment ["we don't need tracing"].
Once the infrastructure was in it was used and appreciated."  There were
world-class programmers involved in these projects that did not see the
value of such infrastructure until they were able to use it.

I think Karim provided a list of possible uses, there are countless
applications of this - I'll list some more: 
 seeing where unexplained idle time is occurring
 understanding where interrupt processing time is going
 understanding interactions between applications - which is running when
 etc etc etc

If you look around the kernel, subsystems, and applications, you will find
growing numbers of one-off-ways of gathering this information.  Providing a
unified way for different developers to communicate about performance will
significantly improve the ability to performance debug different
applications, drivers, system/application interaction, etc.

LTT has existed for a long time now and recent additions have been well
motivated: For a while now I have been working with the RAS team at IBM and
with Karim Yaghmour to streamline LTT and make it perform well on MPs.  We
have addressed all the concerns raised by yourself, Ingo, and others from
previous postings.  If there remains concern, it is also possible for one
to disable tracing.  Some of the features we put into LTT came from ideas
we prototyped in K42 (www.research.ibm.com/K42) which in turn was developed
based on my experience writing a tracing infrastructure for IRIX while
working for SGI, and others' experiences with AIX's tracing facilities.

LTT is a valuable aspect in allowing developers using Linux to understand
their application's and the system's behavior.  It serves to strengthen
Linux's RAS capabilities, and it would be great to get it included into 2.5.

Thank you.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
bob@watson.ibm.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Is your idea good?  [was: Re: LTT for inclusion into 2.5]
  2002-10-31 16:00     ` LTT for inclusion into 2.5 bob
@ 2002-10-31 16:19       ` Larry McVoy
  2002-10-31 16:38         ` Cort Dougan
                           ` (2 more replies)
  0 siblings, 3 replies; 333+ messages in thread
From: Larry McVoy @ 2002-10-31 16:19 UTC (permalink / raw)
  To: bob
  Cc: Linus Torvalds, karim, Rusty Russell, linux-kernel, okrieg,
	okrieg, frankeh, LTT-Dev

I don't mean to pick on LTT, I haven't used it, it may be the best thing
since sliced bread.

I can tell you how to present this and any other feature similar to this
in a way which would make me a lot more willing to accept it, which
presupposes I'm doing Linus' job which of course I am not.  However,
it's likely that Linus has similar views but he gets to chime in and
speak for himself.

All of these tools/features/whatever add some cost.  The cost can be 
measured in lots of different ways:

    - lines of code
    - lines of code which can't be configed out
    - call depth increases
    - stack size increases
    - cache foot print increases
    - parallelism (think preempt)
    - interface changes

I suspect there are other metrics and it would be very cool if others would
chime in with their pet peeves.

What would be cool is if there was some way to quantify as much as possible 
of the accepted set of costs so that that could be balanced against the 
value of the change, right?

The one that always gets me is

    "I've added feature XYZ, I benchmarked it with <whatever, usually
    LMbench> and it didn't make a difference"

That is almost certainly misleading.  The real thing you want to do
is quantify the actual costs because there can be non-zero costs that
do not show up in benchmarks.  For example, suppose that the benchmark
neatly fits in the onchip caches and it only uses 1/2 of those caches.
Your change could increase the cache foot print to just fill the caches,
the benchmark says no difference, you declare success and move on.
The problem is that almost all changes are good enough that they match
this description.  Measuring them in isolation doesn't tell us enough.
If I combine two changes, both of which use up 1/2 the cache, there is
no longer any room for anything else in the cache.
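
A rough way to see the effect is a toy model, sketched here in Python.
Everything in it is an assumption for illustration: a fully associative
LRU cache of 8 lines, a benchmark that sweeps 4 of them, and two
hypothetical changes that each add 4 lines of footprint.  Real caches
are set associative and real access patterns are not cyclic sweeps, so
treat the numbers as qualitative only.

```python
from collections import OrderedDict

def last_pass_misses(cache_lines, working_set_lines, passes=4):
    """Misses in the final pass of a cyclic sweep over the working set.

    Toy model: fully associative LRU cache holding `cache_lines` lines.
    """
    cache = OrderedDict()
    misses = 0
    for p in range(passes):
        for line in range(working_set_lines):
            if line in cache:
                cache.move_to_end(line)       # refresh LRU position
                continue
            if p == passes - 1:
                misses += 1                   # count only the warmed-up pass
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)     # evict least recently used
    return misses

CACHE = 8      # hypothetical cache size, in lines
BENCH = 4      # benchmark touches half of the cache
CHANGE_A = 4   # hypothetical footprint added by one change
CHANGE_B = 4   # hypothetical footprint added by another change

print("baseline:    ", last_pass_misses(CACHE, BENCH))
print("with A only: ", last_pass_misses(CACHE, BENCH + CHANGE_A))
print("with B only: ", last_pass_misses(CACHE, BENCH + CHANGE_B))
print("with A and B:", last_pass_misses(CACHE, BENCH + CHANGE_A + CHANGE_B))
```

In this model each change measured alone still fits and shows no
steady-state misses, while the two together overflow the cache and
every access in the sweep misses.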

I'd love to see a trend where patch requests for any non-trivial patch
included before/after data for the above metrics (and any others that 
people see as useful).  I'd love to see some people taking just one of 
the above and making a tool which measures that metric.  Then we combine
the tools into a "patch measurement suite" and start prefixing patches
with

    Code changes:
	+1234 -5678 = -4444	(all code)
	+123 -567 = -444	(all code subject to CONFIG_XYZ)

    Call depth:
	+2 for read()
	+2 for write()
	no change for all other system calls

    Stack size:
	+2099 bytes for read()/write() path

    Cache misses:
	No change for benchmark1, 2, 3
	12,000 data read misses for lat_ctx ....
    
    Etc.
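
The "Code changes" line is the easiest of these to automate.  A minimal
sketch in Python, assuming the patch is a unified diff; the function
names and the sample patch are made up for illustration, and the
CONFIG_XYZ breakdown is not attempted here since it would need
preprocessor awareness.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,7 +10,8 @@".
HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")

def count_changes(diff_text):
    """Return (added, removed) line counts for a unified diff."""
    added = removed = 0
    old_left = new_left = 0          # lines still expected in this hunk
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            if line.startswith("+"):
                added += 1
                new_left -= 1
            elif line.startswith("-"):
                removed += 1
                old_left -= 1
            elif line.startswith("\\"):
                pass                 # "\ No newline at end of file"
            else:
                old_left -= 1        # context line counts on both sides
                new_left -= 1
            continue
        m = HUNK.match(line)
        if m:
            old_left = int(m.group(1)) if m.group(1) is not None else 1
            new_left = int(m.group(2)) if m.group(2) is not None else 1
    return added, removed

def summary(diff_text):
    """Format the counts in the '+A -R = net' style used above."""
    added, removed = count_changes(diff_text)
    return f"+{added} -{removed} = {added - removed:+d}"

SAMPLE = "\n".join([
    "--- a/example.c",
    "+++ b/example.c",
    "@@ -1,3 +1,4 @@",
    " int x;",
    "-int y;",
    "+int y = 0;",
    "+int z;",
    " int w;",
])

print(summary(SAMPLE))
```

Counting inside hunks only, rather than matching any line that starts
with "+" or "-", keeps the "---" and "+++" file headers out of the
totals.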

What does the list think of this?
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  7:42             ` Alexander Viro
@ 2002-10-31 16:24               ` Stephen Wille Padnos
  2002-10-31 16:44                 ` Alexander Viro
  2002-11-02 17:35               ` LA Walsh
  1 sibling, 1 reply; 333+ messages in thread
From: Stephen Wille Padnos @ 2002-10-31 16:24 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Dax Kelson, Chris Wedgwood, Rik van Riel, Linus Torvalds,
	Rusty Russell, linux-kernel



Alexander Viro wrote:

>On 31 Oct 2002, Dax Kelson wrote:
>
>>I think the normal intent is to let Sally, Joe, and Bill have their own
>>private directory protected from THE REST OF THE USERS.
>>
>>If a member of your trusted circle goes rogue, then, yup you are screwed
>>for the moment. It shouldn't last a whole month though.
>>
>>That is what backups, and employment termination is for.
>>    
>>
>
>Then give them all the same account and be done with that.  Effect will
>be the same.
>  
>

Unless I'm missing something, that only works if all the users need 
*exactly* the same permissions to all files, which isn't a good assumption.

Example:  Sally is an accountant, Joe and Bill are engineers.

Bill and Joe are working on a project, and Sally is cost control for 
that project - they all need access to the project files.  Bill and Joe 
do not need access to officer salary data, but Sally does.  Bill and Joe 
need access to other projects (not necessarily the same ones), but Sally 
doesn't.  Oops.
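
To make the example concrete, here is a small Python sketch.  The
resource names and the table are hypothetical; it models access as
plain per-user sets rather than any real ACL implementation.

```python
# Hypothetical access table: resource -> users who need it.
NEEDS = {
    "project_files":    {"sally", "joe", "bill"},
    "officer_salaries": {"sally"},
    "other_project_1":  {"joe"},
    "other_project_2":  {"bill"},
}

def rights_of(user, needs):
    """Resources a single user legitimately needs."""
    return {res for res, users in needs.items() if user in users}

def shared_account_rights(users, needs):
    """Rights a single shared account must hold to serve all users."""
    rights = set()
    for user in users:
        rights |= rights_of(user, needs)
    return rights

def excess_rights(user, users, needs):
    """What a user gains, beyond their needs, from the shared account."""
    return shared_account_rights(users, needs) - rights_of(user, needs)

TEAM = ["sally", "joe", "bill"]
for member in TEAM:
    print(member, "gains:", sorted(excess_rights(member, TEAM, NEEDS)))
```

With one shared account, Joe and Bill both end up able to read the
salary data, and Sally gets every other project.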

- Steve



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:43   ` Alexander Viro
@ 2002-10-31 16:36     ` Oliver Xymoron
  2002-10-31 17:04       ` Stephen Frost
  2002-10-31 17:38       ` Linus Torvalds
  2002-10-31 22:57     ` Pavel Machek
  1 sibling, 2 replies; 333+ messages in thread
From: Oliver Xymoron @ 2002-10-31 16:36 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

On Wed, Oct 30, 2002 at 09:43:29PM -0500, Alexander Viro wrote:
> 
> 
> On Wed, 30 Oct 2002, Linus Torvalds wrote:
> 
> > > ext2/ext3 ACLs and Extended Attributes
> > 
> > I don't know why people still want ACL's. There were noises about them for 
> > samba, but I've not heard anything since. Are vendors using this?
> 
> Because People Are Stupid(tm).  Because it's cheaper to put "ACL support: yes"
> in the feature list under "Security" than to make sure that userland can cope
> with anything more complex than  "Me Og.  Og see directory.  Directory Og's.
> Nobody change it".  C.f. snake oil, P.T.Barnum and esp. LSM users

It's nearly useless in a Unix-only context, true, however there's a rather
serious impedance mismatch for serving files to Windows that this
addresses. Emulating ACLs on the fly with groups to fit into the
Windows model is mostly doable but ain't pretty. 

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (14 preceding siblings ...)
  2002-10-31 14:52   ` Suparna Bhattacharya
@ 2002-10-31 16:37   ` Henning P. Schmiedehausen
  2002-11-01  0:52   ` James Simmons
  2002-11-01 10:24   ` What's left over. (Fbdev rewrite) Helge Hafting
  17 siblings, 0 replies; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-10-31 16:37 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@transmeta.com> writes:

>> ext2/ext3 ACLs and Extended Attributes

>I don't know why people still want ACL's. There were noises about them for 
>samba, but I've not heard anything since. Are vendors using this?

CIFS/SMB. Replacing Windows Fileservers. Supporting the required Windows
semantics. World domination.

That's one patch I personally consider really important. Getting the API in
place and a couple of FSses supporting it. The rest is up to user space.

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: Is your idea good?  [was: Re: LTT for inclusion into 2.5]
  2002-10-31 16:19       ` Is your idea good? [was: Re: LTT for inclusion into 2.5] Larry McVoy
@ 2002-10-31 16:38         ` Cort Dougan
  2002-10-31 16:47         ` bob
  2002-10-31 17:35         ` Karim Yaghmour
  2 siblings, 0 replies; 333+ messages in thread
From: Cort Dougan @ 2002-10-31 16:38 UTC (permalink / raw)
  To: Larry McVoy, bob, Linus Torvalds, karim, Rusty Russell,
	linux-kernel, okrieg, okrieg, frankeh, LTT-Dev

An excellent engineering practice but extremely difficult to do.  This is
the holy-grail of software design and I don't think it would work for an
extremely loosely connected set of developers.

There is no central control of the system (or chain of accountability) and
that knocks down the practicality of this plan.  It would work extremely
well in another project, though.

} What does the list think of this?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 16:24               ` Stephen Wille Padnos
@ 2002-10-31 16:44                 ` Alexander Viro
  2002-10-31 17:11                   ` Stephen Frost
  2002-10-31 17:36                   ` Richard Gooch
  0 siblings, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-10-31 16:44 UTC (permalink / raw)
  To: Stephen Wille Padnos
  Cc: Dax Kelson, Chris Wedgwood, Rik van Riel, Linus Torvalds,
	Rusty Russell, linux-kernel



On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:

> >Then give them all the same account and be done with that.  Effect will
> >be the same.
> >  
> >
> 
> Unless I'm missing something, that only works if all the users need 
> *exactly* the same permissions to all files, which isn't a good assumption.

That's the point.  In practice shared writable access to a directory can be
easily elevated to full control of each others' accounts, since most of
userland code is written in implicit assumption that nothing bad happens with
directory structure under it.  And there is nothing kernel can do about that -
attacker does action you had explicitly allowed and your program goes bonkers
since it can't cope with that.  Mechanism used to allow that action doesn't
enter the picture - be it ACLs, groups or something else.
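
One illustration of the kind of userland assumption at issue, as a
Python sketch.  The file names are made up, and this shows only one
such trick (a symlink planted in a shared directory) and one partial
defence (O_NOFOLLOW, which covers the last path component only).  It is
not meant as a complete fix.

```python
import os
import tempfile

def open_no_follow(path):
    """Open for append, refusing to follow a symlink at the final component."""
    return os.open(path, os.O_WRONLY | os.O_APPEND | os.O_NOFOLLOW)

def demo():
    with tempfile.TemporaryDirectory() as shared:
        # Stand-in for a file the victim owns elsewhere.
        victim = os.path.join(shared, "victim_private_file")
        with open(victim, "w") as f:
            f.write("secret\n")

        # Anyone with write access to the directory can plant this link.
        trap = os.path.join(shared, "report.txt")
        os.symlink(victim, trap)

        # Naive program: trusts the name and writes through the link.
        with open(trap, "a") as f:
            f.write("clobbered\n")
        with open(victim) as f:
            after_naive = f.read()

        # Slightly more careful program: the open fails instead.
        try:
            os.close(open_no_follow(trap))
            refused = False
        except OSError:
            refused = True
        return after_naive, refused

after_naive, refused = demo()
print(repr(after_naive), refused)
```

The permission mechanism never enters into it: the directory write was
allowed, and the naive program did the rest.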


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: Is your idea good?  [was: Re: LTT for inclusion into 2.5]
  2002-10-31 16:19       ` Is your idea good? [was: Re: LTT for inclusion into 2.5] Larry McVoy
  2002-10-31 16:38         ` Cort Dougan
@ 2002-10-31 16:47         ` bob
  2002-10-31 17:35         ` Karim Yaghmour
  2 siblings, 0 replies; 333+ messages in thread
From: bob @ 2002-10-31 16:47 UTC (permalink / raw)
  To: Larry McVoy
  Cc: bob, Linus Torvalds, karim, Rusty Russell, linux-kernel, okrieg,
	okrieg, frankeh, LTT-Dev

Larry McVoy writes:
 > I don't mean to pick on LTT, I haven't used it, it may be the best thing
 > since sliced bread.
...
 > The one that always gets me is
 > 
 >     "I've added feature XYZ, I benchmarked it with <whatever, usually
 >     LMbench> and it didn't make a difference"

Larry,
     You're right - whoever wrote that useless LMbench anyway :-)

I agree it would be great to have a tool that allows us to gather
information on some of what you suggest below - but it's hard - people in
software engineering have been working on such things for a long time.
Further, what you mention below does not make sense in isolation.  For
example a package could add 1000 lines of code and have almost no impact,
while another 10 lines of code could make a huge difference.  So while the
below metrics are fine, without arguing about the expected impact they're
not necessarily helpful.

That's why benchmarks are still helpful as they are indicative of what
expected performance might be.  If you're trying to get at maintainability
then I might (being a K42 convert) argue for a different strategy
altogether.

So what about LTT then.  Well sure enough we did run LMbench as well as some other
tests.  We ran a kernel compile, a tar, and LMbench - and posted results to
lkml.  While this hardly represents all possibilities, showing little
performance impact on these is a positive statement about impact on other
applications.

To address some of the list below: 
 lines of code: a lot - almost all can be configed out, 
 call depth increase: we can analyze - complicated since while it is a
                      couple levels - other calls in the code may be to
 cache footprint: how? - simulate?  this is tough - qualitatively I think for
                  ltt is small because the same code is used across all trace
                  events.  And less frequent trace events won't interfere
 parallelism: not quite sure what you mean here - we now have a non-blocking
              lockless scheme to address what I think the concern is here
 interface changes: I argue very very positive - as in my letter to Linus
                   getting various developers to talk about performance
                   with a common mechanism would be a big win


I'm sure this doesn't fully address your concerns - but if others feel some 
of the below numbers are really important we can certainly go about getting 
more accurate results than my above off-the-cuff info.

Robert Wisniewski
The K42 MP OS Project
Advanced Operating Systems
Scalable Parallel Systems
IBM T.J. Watson Research Center
914-945-3181
http://www.research.ibm.com/K42/
bob@watson.ibm.com

----

Larry McVoy writes:
 > I don't mean to pick on LTT, I haven't used it, it may be the best thing
 > since sliced bread.
 > 
 > I can tell you how to present this and any other feature similar to this
 > in a way which would make me a lot more willing to accept it, which
 > presupposes I'm doing Linus' job which of course I am not.  However,
 > it's likely that Linus has similar views but he gets to chime in and
 > speak for himself.
 > 
 > All of these tools/features/whatever add some cost.  The cost can be 
 > measured in lots of different ways:
 > 
 >     - lines of code
 >     - lines of code which can't be configed out
 >     - call depth increases
 >     - stack size increases
 >     - cache foot print increases
 >     - parallelism (think preempt)
 >     - interface changes
 > 
 > I suspect there are other metrics and it would be very cool if others would
 > chime in with their pet peeves.
 > 
 > What would be cool is if there was some way to quantify as much as possible 
 > of the accepted set of costs so that that could be balanced against the 
 > value of the change, right?
 > 
 > The one that always gets me is
 > 
 >     "I've added feature XYZ, I benchmarked it with <whatever, usually
 >     LMbench> and it didn't make a difference"
 > 
 > That is almost certainly misleading.  The real thing you want to do
 > is quantify the actual costs because there can be non-zero costs that
 > do not show up in benchmarks.  For example, suppose that the benchmark
 > neatly fits in the onchip caches and it only uses 1/2 of those caches.
 > Your change could increase the cache foot print to just fill the caches,
 > the benchmark says no difference, you declare success and move on.
 > The problem is that almost all changes are good enough that they match
 > this description.  Measuring them in isolation doesn't tell us enough.
 > If I combine two changes, both of which use up 1/2 the cache, there is
 > no longer any room for anything else in the cache.
 > 
 > I'd love to see a trend where patch requests for any non-trivial patch
 > included before/after data for the above metrics (and any others that 
 > people see as useful).  I'd love to see some people taking just one of 
 > the above and making a tool which measures that metric.  Then we combine
 > the tools into a "patch measurement suite" and start prefixing patches
 > with
 > 
 >     Code changes:
 > 	+1234 -5678 = -4444	(all code)
 > 	+123 -567 = -444	(all code subject to CONFIG_XYZ)
 > 
 >     Call depth:
 > 	+2 for read()
 > 	+2 for write()
 > 	no change for all other system calls
 > 
 >     Stack size:
 > 	+2099 bytes for read()/write() path
 > 
 >     Cache misses:
 > 	No change for benchmark1, 2, 3
 > 	12,000 data read misses for lat_ctx ....
 >     
 >     Etc.
 > 
 > What does the list think of this?
 > -- 
 > ---
 > Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 16:36     ` Oliver Xymoron
@ 2002-10-31 17:04       ` Stephen Frost
  2002-10-31 17:38       ` Linus Torvalds
  1 sibling, 0 replies; 333+ messages in thread
From: Stephen Frost @ 2002-10-31 17:04 UTC (permalink / raw)
  To: Oliver Xymoron
  Cc: Alexander Viro, Linus Torvalds, Rusty Russell, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1240 bytes --]

* Oliver Xymoron (oxymoron@waste.org) wrote:
> On Wed, Oct 30, 2002 at 09:43:29PM -0500, Alexander Viro wrote:
> > Because People Are Stupid(tm).  Because it's cheaper to put "ACL support: yes"
> > in the feature list under "Security" than to make sure that userland can cope
> > with anything more complex than  "Me Og.  Og see directory.  Directory Og's.
> > Nobody change it".  C.f. snake oil, P.T.Barnum and esp. LSM users
> 
> It's nearly useless in a Unix-only context, true, however there's a rather
> serious impedance mismatch for serving files to Windows that this
> addresses. Emulating ACLs on the fly with groups to fit into the
> Windows model is mostly doable but ain't pretty. 

It's only nearly useless if you have some desire as an admin to
constantly be creating groups and changing group lists for users.  This
is not a feature which is useful only when serving files to Windows
machines, not even nearly.  AFS, Solaris, Irix etc have support for ACLs
and have a great deal of people who use them.  The simple yet common
situation of one user who wants to give even just read access to
another specific user for a given file is a pain in the ass to deal with
given the current structure.

	Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 15:46     ` Linus Torvalds
@ 2002-10-31 17:10       ` Patrick Finnegan
  2002-10-31 17:13       ` Michael Shuey
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-10-31 17:10 UTC (permalink / raw)
  To: linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.

To add to that list, here at Purdue University, we actively look at crash
dumps on other architectures, such as IBM AIX, and are starting to do the
same on Linux machines, after discovery of LKCD.

> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.

This has much relevance for the standard kernel, as much relevance as gdb
has for people using applications.  While a majority of non-techno-geek
end-users probably don't care about the patch, I'm certain that there are
plenty of organizations out there like Purdue that WANT lkcd to become a
standard part of the Linux kernel.   Until then, we're forced to do our
own kernel patching every time we push out a new kernel.

> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).

We actively use it.

> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).

LKCD isn't a 'random new feature'.  It's something that is present in
nearly every other "Unix" on the market. (Yes I know Unix != Linux).  It's
a feature that should have been integrated by now IMHO.

> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.

Again, we're the end-user, not the vendor, and we're trying to drive to
have it included.  I've talked with other sys admins in my department
here at Purdue, and have gotten a unanimous response that "It would be a
good and useful feature to have."

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif





^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 16:44                 ` Alexander Viro
@ 2002-10-31 17:11                   ` Stephen Frost
  2002-10-31 17:30                     ` Alexander Viro
  2002-10-31 17:36                   ` Richard Gooch
  1 sibling, 1 reply; 333+ messages in thread
From: Stephen Frost @ 2002-10-31 17:11 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen Wille Padnos, Dax Kelson, Chris Wedgwood, Rik van Riel,
	Linus Torvalds, Rusty Russell, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1280 bytes --]

* Alexander Viro (viro@math.psu.edu) wrote:
> On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:
> > Unless I'm missing something, that only works if all the users need 
> > *exactly* the same permissions to all files, which isn't a good assumption.
> 
> That's the point.  In practice shared writable access to a directory can be
> easily elevated to full control of each others' accounts, since most of
> userland code is written in implicit assumption that nothing bad happens with
> directory structure under it.  And there is nothing kernel can do about that -
> attacker does action you had explicitly allowed and your program goes bonkers
> since it can't cope with that.  Mechanism used to allow that action doesn't
> enter the picture - be it ACLs, groups or something else.

So you're not really arguing against ACLs, you're complaining that
userspace is broken when there's shared write access.  That's fine,
userspace should be fixed, inclusion of ACLs into the kernel shouldn't
be denied because of this.  ACLs should be optional, of course, and if
you want, they can come with some really noisy warnings about the
problems of shared writeable areas with current userspace tools.  Of
course, that same warning should probably be included in 'groupadd'.

	Stephen

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 15:46     ` Linus Torvalds
  2002-10-31 17:10       ` Patrick Finnegan
@ 2002-10-31 17:13       ` Michael Shuey
  2002-10-31 19:04         ` Alan Cox
  2002-10-31 17:18       ` Matt D. Robinson
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 333+ messages in thread
From: Michael Shuey @ 2002-10-31 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

I'm a user, and I request that LKCD get merged into the kernel. :-)

On Thu, Oct 31, 2002 at 07:46:08AM -0800, Linus Torvalds wrote:
> What I'm saying by "vendor driven" is that it has no relevance for the 
> standard kernel, and since it has no relevance to that, then I have no 
> incentives to merge it. The crash dump is only useful with people who 
> actively look at the dumps, and I don't know _anybody_ outside of the 
> specialized vendors you mention who actually do that.

I actively look at LKCD dumps.  I have no affiliation with SGI, IBM, or any
of the previously mentioned companies.  I'm not aware of any vendors providing
pre-patched kernels with LKCD; right now my only option for reasonable crash
data is to patch and build my own kernel.

> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).

Here at Purdue University we're building several Linux clusters.  LKCD is
most useful to help find in-kernel problems.  Most of the time our crashes
are due to a flakey stick of RAM or a dying disk (or controller), but LKCD
dumps are still useful.  With a crash dump I can analyze the cause of the
crash after the fact, but without a dump my only option to get _any_ crash
data is to leave a console plugged into each node of my clusters.

Do you feel like donating a 700-port console server?  Right, so it's LKCD
for me then.

> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).
> 
> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.

I understand that Linux can't have random new features (especially going into
a feature-freeze).  However, any additions that provide better debugging info
are (in my opinion, at any rate) worth it.  Every other UNIX I've used (with
the possible exception of an early Ultrix) has some facility to inspect the
kernel - all have _at_least_ dumps that get written to a swap disk on a crash
and many have an in-core debugger.  Running gdb on a live kernel from a
remote machine isn't unheard of, at least with other OSes.  Unfortunately,
the only aid you'll get in debugging a Linux kernel is the source code.  Sure,
you can add a mess of printk's all over suspect code, and yes, the console
gets a register dump on a panic, but that really isn't enough.  Sometimes
it's nice to be able to walk through the kernel's data structures and figure
out just what was going on when things died.  I get this with LKCD.

To that end, it'd be nice if the trace toolkit and SGI's kernel debugger were
added.  No, I haven't used them, but then I don't do much kernel development
either.  I'd bet that LTT and the kernel debugger would be very useful to
those who do, though.

-- 
Mike Shuey

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 15:46     ` Linus Torvalds
  2002-10-31 17:10       ` Patrick Finnegan
  2002-10-31 17:13       ` Michael Shuey
@ 2002-10-31 17:18       ` Matt D. Robinson
  2002-10-31 17:25         ` Linus Torvalds
  2002-10-31 22:20         ` Shawn
  2002-10-31 17:55       ` [lkcd-general] " Dave Craft
  2002-10-31 19:33       ` [lkcd-devel] " Castor Fu
  4 siblings, 2 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-10-31 17:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
|>That's fine. And since they are paid to support it, they can apply the 
|>patches.  

Sure, but why should they have to?  What technical reason is there
for not including it, Linus?

I completely don't understand your reasoning here.  I use it for my
home, not for work, and that's important for me.  And not everyone
can spend their evenings rolling up the next set of patches for
a distribution.  Yes, vendors want it, they need it, but there are
plenty of people like me that want this in too!

We want to see this in the kernel, frankly, because it's a pain
in the butt keeping up with your kernel revisions and everything
else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
(hopefully) Red Hat don't want to spend their time having to roll
this stuff in each and every time you roll a new kernel.

I mean, PLEASE, Linus, what do we have to do?  There are so many
interests in this stuff, and I really, truly don't get what's wrong
with putting this in the kernel?

Have you looked at it?  Have you looked at how it is now structured
to be non-invasive?  How it will allow other kernel developers to
generate their own dumping methods?  I mean, we sent you E-mails
weeks ago, and you didn't respond to any of them with even a word
of acknowledgement of receipt.

|>What I'm saying by "vendor driven" is that it has no relevance for the 
|>standard kernel, and since it has no relevance to that, then I have no 
|>incentives to merge it. The crash dump is only useful with people who 
|>actively look at the dumps, and I don't know _anybody_ outside of the 
|>specialized vendors you mention who actually do that.

I do.  Others like myself do.  And not just for development
purposes.  I don't like to see my system crash after installing one
of your new kernels and not be able to figure out what's wrong.
The nice thing is that LKCD is there, it works, and I can just look
at the crash report instead of wishing that my console buffer
didn't just scroll off.  Oh, I know, I'll just wait for it to
happen again ... yeah, like that's real intelligent.

|>I will merge it when there are real users who want it - usually as a
|>result of having gotten used to it through a vendor who supports it. (And
|>by "support" I do not mean "maintain the patches", but "actively uses it"
|>to work out the users problems or whatever).
|>
|>Horse before the cart and all that thing.
|>
|>People have to realize that my kernel is not for random new features. The
|>stuff I consider important are things that people use on their own, or
|>stuff that is the base for other work. Quite often I want vendors to merge
|>patches _they_ care about long long before I will merge them (examples of
|>this are quite common, things like reiserfs and ext3 etc).

Other vendors have merged LKCD a long time ago and use it, and
expect it to be there.  And users like myself find it valuable on
their desktops, their servers, etc.  I mean, there's someone using
this at Purdue that's responded to you, just another kernel user
that likes to have this stuff there automatically.

|>THAT is what I mean by vendor-driven. If vendors decide they really want
|>the patches, and I actually start seeing noises on linux-kernel or getting
|>requests for it being merged from _users_ rather than developers, then
|>that means that the vendor is on to something.

TurboLinux, MontaVista, Veritas, SuSE, and UnitedLinux have LKCD.
With the most recent changes, I think Red Hat can put LKCD in now
such that it isn't invasive to their distribution.

I think SuSE has already expressed a desire to have this in.  If
you want to hear from others, I'll ask them to respond to you.

|>		Linus

--Matt



* Re: What's left over.
  2002-10-31 17:18       ` Matt D. Robinson
@ 2002-10-31 17:25         ` Linus Torvalds
  2002-10-31 17:54           ` Matt D. Robinson
                             ` (6 more replies)
  2002-10-31 22:20         ` Shawn
  1 sibling, 7 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 17:25 UTC (permalink / raw)
  To: Matt D. Robinson; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


[ Ok, this is a really serious email. If you don't get it, don't bother 
  emailing me. Instead, think about it for an hour, and if you still don't 
  get it, ask somebody you know to explain it to you. ]

On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> 
> Sure, but why should they have to?  What technical reason is there
> for not including it, Linus?

There are many:

 - bloat kills:

	My job is saying "NO!"

	In other words: the question is never EVER "Why shouldn't it be
	accepted?", but it is always "Why do we really not want to live 
	without this?"

 - included features kill off (potentially better) projects.

	There's a big "inertia" to features. It's often better to keep 
	features _off_ the standard kernel if they may end up being
	further developed in totally new directions.

	In particular when it comes to this project, I'm told about
	"netdump", which doesn't try to dump to a disk, but over the net.
	And quite frankly, my immediate reaction is to say "Hell, I
	_never_ want the dump touching my disk, but over the network
	sounds like a great idea".

To me this says "LKCD is stupid". Which means that I'm not going to apply 
it, and I'm going to need some real reason to do so - ie being proven 
wrong in the field.

(And don't get me wrong - I don't mind getting proven wrong. I change my 
opinions the way some people change underwear. And I think that's ok).

> I completely don't understand your reasoning here.

Tough. That's YOUR problem.

		Linus



* Re: What's left over.
  2002-10-31 17:11                   ` Stephen Frost
@ 2002-10-31 17:30                     ` Alexander Viro
  2002-10-31 17:39                       ` Linus Torvalds
  0 siblings, 1 reply; 333+ messages in thread
From: Alexander Viro @ 2002-10-31 17:30 UTC (permalink / raw)
  To: Stephen Frost
  Cc: Stephen Wille Padnos, Dax Kelson, Chris Wedgwood, Rik van Riel,
	Linus Torvalds, Rusty Russell, linux-kernel



On Thu, 31 Oct 2002, Stephen Frost wrote:

> So you're not really arguing against ACLs, you're complaining that
> userspace is broken when there's shared write access.  That's fine,
> userspace should be fixed, inclusion of ACLs into the kernel shouldn't
> be denied because of this.  ACLs should be optional, of course, and if
> you want them some really noisy warnings about the problems of shared
> writeable area with current userspace tools.  Of course, that same
> warning should probably be included in 'groupadd'.

	No.  I'm saying that ACLs do not have a point until at least basic
userland gets ready for setups people want ACLs for.  Adding features that
can't be used until $BIG_WORK is done is idiocy in the best case and
danger in the worst.  Especially since $BIG_WORK does not depend on these
features.



* Re: Is your idea good?  [was: Re: LTT for inclusion into 2.5]
  2002-10-31 16:19       ` Is your idea good? [was: Re: LTT for inclusion into 2.5] Larry McVoy
  2002-10-31 16:38         ` Cort Dougan
  2002-10-31 16:47         ` bob
@ 2002-10-31 17:35         ` Karim Yaghmour
  2 siblings, 0 replies; 333+ messages in thread
From: Karim Yaghmour @ 2002-10-31 17:35 UTC (permalink / raw)
  To: Larry McVoy
  Cc: bob, Linus Torvalds, Rusty Russell, linux-kernel, okrieg, okrieg,
	frankeh, LTT-Dev


Hello Larry,

First, thanks for your feedback.

I understand and share your concern about the use of micro-benchmarks
to qualify/quantify the impact of additional code on the kernel. This is
precisely the reason why I chose not to use micro-benchmarks in the
Usenix article I presented about LTT at the 2000 annual technical
conference. I was surprised to see some of the selection committee
members actually come up to me and say: "I'm so glad to see a paper
that doesn't use micro-benchmarks."

That's why we elected to create 2 separate sets of benchmarks, one
using real-life applications (kernel build, bzip2, etc.) and one
using LMbench. Personally, I would have been satisfied with just the
real-life applications, but I know that many folks on the LKML want
to see LMbench numbers, so we included those too. That said, I find
it very positive that you keep a healthy dose of self-criticism towards
your own tool, this is exactly the kind of stuff that makes LMbench so
good. So too is it with LTT. I've always been on the lookout for
reducing costs here and there while achieving maximal functionality.

Fortunately, repeated testing and analysis on LTT by many parties
using many tools have confirmed that the current LTT has very low
impact on many fronts, including static code modifications.

So, for example, we had one example run of LMbench where we ran kernel
compiles in the background (i.e. a script restarted the kernel
compile every time it ended). To make it as simple as possible, here's
the elapsed time taken to run LMbench on a 4x SMP system in the various
configurations:
---------------------------------------------------------------------
vanilla                         14:27
vanilla+ltt+ltt off             14:26
vanilla+ltt+ltt on              14:31
vanilla+ltt+ltt on+daemon on    14:32

vanilla+ltt+ltt on+kernel compile               15:03
vanilla+ltt+ltt on+kernel compiles+daemon on    15:13
---------------------------------------------------------------------

As you can see, the differences in percentages are all within the 2%
range we mentioned earlier.
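
The percentages can be checked from the elapsed times in the table above.
The following is a minimal sketch, not part of the original measurements:
it assumes the times are minutes:seconds, takes the plain vanilla run as
the baseline for the unloaded runs and, since the table has no vanilla run
under background kernel compiles, compares the two loaded runs only against
each other. The helper names are made up for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an elapsed time written as 'mm:ss' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def overhead_percent(baseline: str, measured: str) -> float:
    """Relative difference of 'measured' against 'baseline', in percent."""
    base = to_seconds(baseline)
    return (to_seconds(measured) - base) / base * 100.0


# Elapsed LMbench times copied from the table above (unloaded runs).
UNLOADED = {
    "vanilla": "14:27",
    "vanilla+ltt+ltt off": "14:26",
    "vanilla+ltt+ltt on": "14:31",
    "vanilla+ltt+ltt on+daemon on": "14:32",
}

# Runs with kernel compiles in the background (no vanilla baseline given).
LOADED_DAEMON_OFF = "15:03"
LOADED_DAEMON_ON = "15:13"

for name, elapsed in UNLOADED.items():
    delta = overhead_percent(UNLOADED["vanilla"], elapsed)
    print(f"{name:32s} {delta:+.2f}%")

loaded_delta = overhead_percent(LOADED_DAEMON_OFF, LOADED_DAEMON_ON)
print(f"{'daemon on vs. off, under load':32s} {loaded_delta:+.2f}%")
```

Computed this way, the unloaded runs stay within about 0.6% of vanilla, and
turning the daemon on under load costs about 1.1%.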

To address the specific metrics you mentioned:

>     Code changes:
We've posted diffstats with every patch we published on the LKML.

>     Call depth:
We're talking 3 for syscalls and 2 for all other events in order to
reach the core tracing function proper (this could easily be reduced
by 1 if it's really a problem). Add 1 for the locking scheme and 3 for
the non-locking scheme. I'm not counting the calls we make to kernel
services, which somewhat goes to show that this is a flawed measure
because I've never seen any thorough analysis of call depths for
kernel services. Can't say that it wouldn't be an interesting
research project to see someone do that for the entire kernel, we
may find some interesting results.

>     Stack size:
This really depends on the quantity of data being passed to the tracer,
which varies greatly from one event to the other. I can say this, however:
in all the testing I've seen done on LTT in the past, there has never
been a stack problem. This isn't an invitation for being reckless. I am
aware of stack issues and have been on the lookout for any related
problem.

>     Cache misses:
Bob has said it best. I think the best that we can do about this is
to follow the known-to-be-good guidelines about cache interference.
The discussion Ingo and Bob had on this issue in relation to LTT,
for example, shows that we've thought this through.

Beyond everything I've said above, I'd invite you to download LTT and
try it out. I'm sure you'll see why this is important for Linux users.

BTW, while I'm on the subject of LMbench, I've been trying to find a
way to run it on an embedded system. The problem is that this thing
needs a compiler and that would mean having to cross-compile gcc itself
and so on, which creates storage problems etc. Are there any plans to
make a mini-LMbench?

Thanks again,

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================


* Re: What's left over.
  2002-10-31 16:44                 ` Alexander Viro
  2002-10-31 17:11                   ` Stephen Frost
@ 2002-10-31 17:36                   ` Richard Gooch
  1 sibling, 0 replies; 333+ messages in thread
From: Richard Gooch @ 2002-10-31 17:36 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen Wille Padnos, Dax Kelson, Chris Wedgwood, Rik van Riel,
	Linus Torvalds, Rusty Russell, linux-kernel

Alexander Viro writes:
> On Thu, 31 Oct 2002, Stephen Wille Padnos wrote:
> 
> > >Then give them all the same account and be done with that.  Effect will
> > >be the same.
> > 
> > Unless I'm missing something, that only works if all the users need 
> > *exactly* the same permissions to all files, which isn't a good assumption.
> 
> That's the point.  In practice shared writable access to a directory
> can be easily elevated to full control of each others' accounts,
         ^^^^^^
While that may be true in theory, in practice it's not necessarily the
case. Many people don't have the expertise to make use of such
exploits. And before you say that they can download a pre-cooked
exploit kit, let me tell you that there are plenty of people who don't
have the time or inclination to do that.

I've seen you talk about these kinds of things before, and you always
seem to be talking about the typical nightmarish undergrad CS lab
where the kids spend all their time trying to crack each other and the
system. And I'm not saying that these don't exist: I've seen it.

But there are other environments (say a research lab with grad
students, post-docs and faculty) where the inhabitants either don't
have the skills or don't have the interest in cracking accounts.
Everyone is too busy doing their own research. Cracking the mysteries
of the universe seems to be more interesting.

So group write access and ACL's *can* lead to wanton cracking, but for
many environments it's not an issue. For many, the dangers lie outside
the firewall, not inside.

Note that I'm not specifically advocating ACL's, I'm just letting you
know that the problem you're concerned about is, for good reason, not
a problem for everyone.

I will note that one appealing aspect of ACL's is that they do not
require administrator intervention. That's good for a user who just
wants to set something up without having to wait for the sysadmin.
It's also good for the sysadmin (excepting control freaks) who doesn't
want to do things that the users can (or should) actually be doing by
themselves.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca


* Re: What's left over.
  2002-10-31 16:36     ` Oliver Xymoron
  2002-10-31 17:04       ` Stephen Frost
@ 2002-10-31 17:38       ` Linus Torvalds
  2002-10-31 18:00         ` Oliver Xymoron
  1 sibling, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 17:38 UTC (permalink / raw)
  To: Oliver Xymoron; +Cc: Alexander Viro, Rusty Russell, linux-kernel


Note that as far as ACL's go, enough people have convinced me that we want 
them, with clear real-life issues. So don't worry about them, I'll merge 
it.

		Linus



* Re: What's left over.
  2002-10-31 17:30                     ` Alexander Viro
@ 2002-10-31 17:39                       ` Linus Torvalds
  0 siblings, 0 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 17:39 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Stephen Frost, Stephen Wille Padnos, Dax Kelson, Chris Wedgwood,
	Rik van Riel, Rusty Russell, linux-kernel


On Thu, 31 Oct 2002, Alexander Viro wrote:
> 
> 	No.  I'm saying that ACLs do not have a point until at least basic
> userland gets ready for setups people want ACLs for.  Adding features that
> can't be used until $BIG_WORK is done is idiocy in the best case and
> danger in the worst.  Especially since $BIG_WORK does not depend on these
> features.

I think samba alone counts as enough user-land usage. 

And if it turns out nobody else ever wants to use them, that's fine too.

		Linus



* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
@ 2002-10-31 17:54           ` Matt D. Robinson
  2002-10-31 17:54             ` Linus Torvalds
  2002-11-02 23:44             ` Horst von Brand
  2002-10-31 18:10           ` Chris Friesen
                             ` (5 subsequent siblings)
  6 siblings, 2 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-10-31 17:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>[ Ok, this is a really serious email. If you don't get it, don't bother 
|>  emailing me. Instead, think about it for an hour, and if you still don't 
|>  get it, ask somebody you know to explain it to you. ]

Thanks for the response.  I don't think I need an hour.  This is
pretty simple.

|>On Thu, 31 Oct 2002, Matt D. Robinson wrote:
|>> 
|>> Sure, but why should they have to?  What technical reason is there
|>> for not including it, Linus?
|>
|>There are many:
|>
|> - bloat kills:
|>
|>	My job is saying "NO!"
|>
|>	In other words: the question is never EVER "Why shouldn't it be
|>	accepted?", but it is always "Why do we really not want to live 
|>	without this?"

This isn't bloat.  If you want, it can be built as a module, and
not as part of your kernel.  How can that be bloat?  People who
build kernels can optionally build it in, but we're not asking
that it be turned on by default, rather, built as a module so
people can load it if they want to.  We made it into a module
because 18 months ago you complained about it being bloat.  We
addressed your concerns.

Some people, particularly large SSI configurations, can't live
without this.  You shouldn't crash once.  Crashing twice, or
more often, is inexcusable.

|> - included features kill off (potentially better) projects.
|>
|>	There's a big "inertia" to features. It's often better to keep 
|>	features _off_ the standard kernel if they may end up being
|>	further developed in totally new directions.

I can't argue against this ... to do so would mean that you don't
accept any new features for 2.5, and there are a lot of projects
like mine that need to go in, although I do understand your concerns.

|>	In particular when it comes to this project, I'm told about
|>	"netdump", which doesn't try to dump to a disk, but over the net.
|>	And quite frankly, my immediate reaction is to say "Hell, I
|>	_never_ want the dump touching my disk, but over the network
|>	sounds like a great idea".

We've integrated the "netdump" capabilities as a dump method
for LKCD.  It's an option for dumping, just like all the other
dump methods available to people.  Want to dump to disk?  Use
LKCD.  Want to dump on the network?  USE LKCD.  What's wrong
with that?

We've created a net dump method that allows you to dump across the
network from Mohammed Abbas (modified from Ingo's netconsole dump).
It integrates into LKCD beautifully.  If you want that patch with
the rest of our LKCD patches, we can include it, no problem.

|>To me this says "LKCD is stupid". Which means that I'm not going to apply 
|>it, and I'm going to need some real reason to do so - ie being proven 
|>wrong in the field.

Hopefully some of this changes your mind.

|>(And don't get me wrong - I don't mind getting proven wrong. I change my 
|>opinions the way some people change underwear. And I think that's ok).
|>
|>> I completely don't understand your reasoning here.
|>
|>Tough. That's YOUR problem.

It is.  I lose sleep because this is my problem.  I lose time on
the weekends because this is my problem.

If you've _reviewed_ the LKCD patches and still have the opinions
you've mentioned above, then I'll consider this your position and
be done with it.  Otherwise, please accept the code.

We'll keep doing our best to keep up with your kernels in the
meantime.

|>		Linus

--Matt



* Re: What's left over.
  2002-10-31 17:54           ` Matt D. Robinson
@ 2002-10-31 17:54             ` Linus Torvalds
  2002-10-31 18:21               ` Patrick Finnegan
  2002-10-31 18:31               ` John Alvord
  2002-11-02 23:44             ` Horst von Brand
  1 sibling, 2 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 17:54 UTC (permalink / raw)
  To: Matt D. Robinson; +Cc: Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> 
> This isn't bloat.  If you want, it can be built as a module, and
> not as part of your kernel.  How can that be bloat? 

I don't care one _whit_ about the size of the binary. I don't maintain 
binaries, and the binary can be gigabytes for all I care.

The only thing I care about is source code. So the "build it as a module 
and it is not bloat" argument is a total nonsense thing as far as I'm 
concerned. 

Anyway, new code is always bloat to me, unless I see people using them.

Guys, why do you even bother trying to convince me? If you are right, you 
will be able to convince other people, and that's the whole point of open 
source.

Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
personally convinced. I'm only one person.

		Linus



* Re: [lkcd-general] Re: What's left over.
  2002-10-31 15:46     ` Linus Torvalds
                         ` (2 preceding siblings ...)
  2002-10-31 17:18       ` Matt D. Robinson
@ 2002-10-31 17:55       ` Dave Craft
  2002-10-31 18:45         ` Patrick Mochel
  2002-10-31 19:33       ` [lkcd-devel] " Castor Fu
  4 siblings, 1 reply; 333+ messages in thread
From: Dave Craft @ 2002-10-31 17:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.

  Unfortunately the vast majority of the customers I deal with
  buy a distribution and then put a kernel from kernel.org
  on.  I believe this comes about because they need either fixes
  or functionality found in later kernels that has not yet made
  it into the distribution kernels.

  Even if the distribution included LKCD in their kernel,
  I lose lots of debug ability once customers switch over to
  kernel.org and no longer have the LKCD patch.

  Thus we are currently left with having to maintain LKCD patches for
  many arbitrary kernel.org kernels and convince customers to apply
  it BEFORE they start encountering problems that we'll have to look at.
  Application of patches that aren't automatically included in kernel.org
  rarely happens with our customer set (before problems occur),
  no matter how much we flag the issue to them up front.

  I realize that while my current capacity makes me fall into
  the 'vendor' support you speak of, I believe I am actually
  advocating its inclusion on behalf of real live customers.

  Vendors can and do actually help linux development, by screening,
  researching fixes, and/or directly fixing lots of customer
  problems that you never have to deal with.  To do that, LKCD
  is the debug weapon of choice.

  I request you reconsider the inclusion of LKCD.

  Regards, Dave

	Mail : dave@austin.ibm.com	Phone : 512-838-8248



* Re: What's left over.
  2002-10-31 17:38       ` Linus Torvalds
@ 2002-10-31 18:00         ` Oliver Xymoron
  2002-11-06 20:52           ` Florian Weimer
  0 siblings, 1 reply; 333+ messages in thread
From: Oliver Xymoron @ 2002-10-31 18:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Alexander Viro, Rusty Russell, linux-kernel

On Thu, Oct 31, 2002 at 09:38:41AM -0800, Linus Torvalds wrote:
> 
> Note that as far as ACL's go, enough people have convinced me that we want 
> them, with clear real-life issues. So don't worry about them, I'll merge 
> it.

Ok, so now let's work on a Documentation/filesystems patch pointing
out a few of the common pitfalls, as I definitely agree they invite
some grave mistakes and are best avoided in most scenarios.

- /tmp-style symlink issues on shared directories
- vast majority of software (including security tools) ACL-unaware
- much harder to check for correctness
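
The first pitfall can be illustrated with a small user-space sketch. In a
directory that several users can write to, one user can plant a symlink
where another user's program expects to create a file, so a plain
open-for-write follows the link and overwrites the link's target. Creating
the file with O_CREAT and O_EXCL refuses to reuse any existing name,
symlinks included. The helper name `create_exclusive` and the file names
are hypothetical; this is not taken from any ACL patch.

```python
import os
import tempfile


def create_exclusive(path: str, data: bytes) -> None:
    """Create 'path' as a new private file; fail if the name already exists.

    With O_CREAT | O_EXCL the open fails even when 'path' is a symlink,
    so a planted link is never followed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    shared = tempfile.mkdtemp()            # stands in for a shared directory
    victim = os.path.join(shared, "victim")
    with open(victim, "w") as f:
        f.write("keep")

    planted = os.path.join(shared, "report")
    os.symlink(victim, planted)            # the planted link

    try:
        create_exclusive(planted, b"overwritten")
    except FileExistsError:
        print("refused to follow planted symlink")

    with open(victim) as f:
        print("victim still contains:", f.read())
```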

Al, I'm sure you have more..

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 


* Re: What's left over.
  2002-10-31 10:16   ` Trever L. Adams
@ 2002-10-31 18:08     ` Nicholas Wourms
  0 siblings, 0 replies; 333+ messages in thread
From: Nicholas Wourms @ 2002-10-31 18:08 UTC (permalink / raw)
  To: linux-kernel

Trever L. Adams wrote:

> On Wed, 2002-10-30 at 21:31, Linus Torvalds wrote:
> 
>> > ext2/ext3 ACLs and Extended Attributes
>> 
>> I don't know why people still want ACL's. There were noises about them
>> for samba, but I've not heard anything since. Are vendors using this?
>> 
> 
> I am sure I don't count (not being a vendor), but Intermezzo offers
> support for this (they are waiting on feature freeze to redo it to 2.5
> according to an email I have).  I want this stuff.  Yes, u+g+w is nice,
> but good ACLs are even better.  Please, if this is technically correct
> in implementation, do put it in.
> 

I agree, having them is far better than the standard u+g+w that's been 
around for ages.  I think it gives the "finer" grain of control over your 
system that a lot of users may desire.  Not to mention the fact that ACL's 
are well supported by the recently merged XFS.  If I'm not mistaken, AFS 
uses them as well.  I *really* don't see the overhead cost here in terms of 
compiled kernel size when they are turned off.  As for the size of the 
source tarball, who cares?  People should quit whining about the size of 
the sources and get over it!  Storage is cheap and broadband is in 
widespread use.

Cheers,
Nicholas




* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
  2002-10-31 17:54           ` Matt D. Robinson
@ 2002-10-31 18:10           ` Chris Friesen
  2002-10-31 18:22             ` Linus Torvalds
                               ` (2 more replies)
  2002-10-31 18:15           ` Andrew Morton
                             ` (4 subsequent siblings)
  6 siblings, 3 replies; 333+ messages in thread
From: Chris Friesen @ 2002-10-31 18:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

Linus Torvalds wrote:

> 	In particular when it comes to this project, I'm told about
> 	"netdump", which doesn't try to dump to a disk, but over the net.
> 	And quite frankly, my immediate reaction is to say "Hell, I
> 	_never_ want the dump touching my disk, but over the network
> 	sounds like a great idea".
> 
> To me this says "LKCD is stupid". Which means that I'm not going to apply 
> it, and I'm going to need some real reason to do so - ie being proven 
> wrong in the field.

How do you deal with netdump when your network driver is what caused the 
crash?

Ideally I would like to see a dump framework that can have a number of 
possible dump targets.  We should be able to dump to any combination of 
network, serial, disk, flash, unused ram that isn't wiped over restarts, 
etc...
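
One way to picture such a framework is a list of dump targets tried in
turn, so that a failing target (for example, a wedged network driver) does
not prevent the dump from reaching another one. This is only an
illustrative user-space sketch: the names `dump_to_all`, `broken_network`
and `reserved_ram` are hypothetical and are not LKCD's or netdump's
interfaces.

```python
from typing import Callable, List, Tuple

# A dump target is a (name, writer) pair; the writer raises OSError on failure.
DumpTarget = Tuple[str, Callable[[bytes], None]]


def dump_to_all(image: bytes, targets: List[DumpTarget]) -> List[str]:
    """Write the image to every target and return the names that succeeded."""
    succeeded = []
    for name, write in targets:
        try:
            write(image)
        except OSError:
            # One broken target must not stop the remaining ones.
            continue
        succeeded.append(name)
    return succeeded


def broken_network(image: bytes) -> None:
    # Hypothetical stand-in for a network path whose driver has crashed.
    raise OSError("network driver unavailable")


ram_copy = bytearray()


def reserved_ram(image: bytes) -> None:
    # Hypothetical stand-in for RAM that is preserved across restarts.
    ram_copy.extend(image)


if __name__ == "__main__":
    targets = [("net", broken_network), ("ram", reserved_ram)]
    print(dump_to_all(b"crash image", targets))
```

Trying every target rather than stopping at the first success matches the
"any combination" wording above; a first-success policy would be an equally
small change.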

Chris

-- 
Chris Friesen                    | MailStop: 043/33/F10
Nortel Networks                  | work: (613) 765-0557
3500 Carling Avenue              | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada        | email: cfriesen@nortelnetworks.com



* Re: What's left over.
  2002-10-31 14:31           ` Jeff Garzik
@ 2002-10-31 18:12             ` Chris Wedgwood
  2002-10-31 18:49               ` Linus Torvalds
  0 siblings, 1 reply; 333+ messages in thread
From: Chris Wedgwood @ 2002-10-31 18:12 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Dax Kelson, Rik van Riel, Linus Torvalds, Rusty Russell, linux-kernel

On Thu, Oct 31, 2002 at 09:31:09AM -0500, Jeff Garzik wrote:

> What's wrong with our current 2.5.45 crypto api?

It's synchronous and assumes everything is synchronous.  Lots of
hardware (most) doesn't work that way.
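
The difference can be sketched in a few lines. A synchronous call returns
the result before the caller continues; hardware offload usually means
submitting a request and being told about completion later. The sketch
below is user-space Python with hypothetical names (`digest_sync`,
`AsyncEngine`); it is not the 2.5.45 crypto API, and the "hardware" is
simulated by a queue that is drained when `poll()` is called.

```python
import hashlib
from collections import deque
from typing import Callable, Deque, Tuple


def digest_sync(data: bytes) -> str:
    """Synchronous style: the result is ready when the call returns."""
    return hashlib.sha256(data).hexdigest()


class AsyncEngine:
    """Asynchronous style: submit now, receive the result via a callback."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[bytes, Callable[[str], None]]] = deque()

    def submit(self, data: bytes, on_done: Callable[[str], None]) -> None:
        # Only queues the request; no result exists yet.
        self._pending.append((data, on_done))

    def poll(self) -> int:
        """Complete all queued requests (stands in for a device interrupt)."""
        completed = 0
        while self._pending:
            data, on_done = self._pending.popleft()
            on_done(hashlib.sha256(data).hexdigest())
            completed += 1
        return completed


if __name__ == "__main__":
    results = []
    engine = AsyncEngine()
    engine.submit(b"abc", results.append)
    print("before completion:", results)
    engine.poll()
    print("matches sync result:", results == [digest_sync(b"abc")])
```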


  --cw



* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
  2002-10-31 17:54           ` Matt D. Robinson
  2002-10-31 18:10           ` Chris Friesen
@ 2002-10-31 18:15           ` Andrew Morton
  2002-10-31 19:58             ` Bernhard Kaindl
  2002-11-02  0:49             ` What's left over. - Dave's crash code supports a gdb interface for LKCD crash dumps Piet Delaney
  2002-10-31 18:16           ` What's left over Oliver Xymoron
                             ` (3 subsequent siblings)
  6 siblings, 2 replies; 333+ messages in thread
From: Andrew Morton @ 2002-10-31 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

Linus Torvalds wrote:
> 
> [ lkcd ]
>

We'll be spending the next six months stabilising and hardening
the used-to-be-2.5 kernel.  If grunts like me can get hold of a
copy of the other person's kernel image from time-of-crash, that
has a ton of value.

(Disclaimer: I've never used lkcd.  I'm assuming that it's
possible to gdb around in a dump)

>         In particular when it comes to this project, I'm told about
>         "netdump", which doesn't try to dump to a disk, but over the net.

It could help.  But like serial console, the random person whose
kernel just died often can't be bothered setting it up, or simply
doesn't have the gear, or the crash is not repeatable.


So.  _If_ lkcd gives me gdb-able images from time-of-crash, I'd
like it please.  And I'm the grunt who spent nearly two years
doing not much else apart from working 2.3/2.4 oops reports.


Oh, and as Rusty has pointed out, we lose a *lot* of oops reports
because users are in X and the backtrace doesn't make it to the
logs.  Rusty has a little app which dumps just the oops report to
disk somewhere.    Want that too.


* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
                             ` (2 preceding siblings ...)
  2002-10-31 18:15           ` Andrew Morton
@ 2002-10-31 18:16           ` Oliver Xymoron
  2002-10-31 18:26             ` Linus Torvalds
  2002-10-31 18:49           ` Rik van Riel
                             ` (2 subsequent siblings)
  6 siblings, 1 reply; 333+ messages in thread
From: Oliver Xymoron @ 2002-10-31 18:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, Oct 31, 2002 at 09:25:21AM -0800, Linus Torvalds wrote:
> (And don't get me wrong - I don't mind getting proven wrong. I change my 
> opinions the way some people change underwear. And I think that's ok).

As in 'sometimes not even when hundreds of people start haranguing me
about it in public forums'? 

Perhaps not the best analogy.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 


* Re: What's left over.
  2002-10-31 17:54             ` Linus Torvalds
@ 2002-10-31 18:21               ` Patrick Finnegan
  2002-10-31 18:31               ` John Alvord
  1 sibling, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-10-31 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

>
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > This isn't bloat.  If you want, it can be built as a module, and
> > not as part of your kernel.  How can that be bloat?
>
> I don't care one _whit_ about the size of the binary. I don't maintain
> binaries, and the binary can be gigabytes for all I care.
>
> The only thing I care about is source code. So the "build it as a module
> and it is not bloat" argument is a total nonsense thing as far as I'm
> concerned.

So, you don't like bloat, such as having 22 different file systems (only
including the ones that can be placed on disk, not things like devfs or
smbfs...). That's more filesystems than I have dollars in my wallet at
the moment.   For the amount of utility that this code provides, it's
definitely not 'bloat'.

> Anyway, new code is always bloat to me, unless I see people using them.

HEY!!! WE'RE USING IT!!!

> Guys, why do you even bother trying to convince me? If you are right, you
> will be able to convince other people, and that's the whole point of open
> source.

Now this sounds more like something I'd hear from Sun trying to get a fix
for a version of Solaris without having to buy a new one.  I thought the
whole point of Free Software was sharing with the community, and doing
what's best for the community.

> Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
> personally convinced. I'm only one person.

That's the same as claiming that George W. Bush is just one person....

So I'll plead yet again, please add LKCD!

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif





* Re: What's left over.
  2002-10-31 18:10           ` Chris Friesen
@ 2002-10-31 18:22             ` Linus Torvalds
  2002-10-31 20:59               ` Dave Anderson
  2002-11-01  6:34               ` Bill Davidsen
  2002-10-31 18:50             ` Alan Cox
  2002-10-31 21:33             ` Rusty Russell
  2 siblings, 2 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 18:22 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


On Thu, 31 Oct 2002, Chris Friesen wrote:
> 
> How do you deal with netdump when your network driver is what caused the 
> crash?

Actually, from a driver perspective, _the_ most likely driver to crash is 
the disk driver. 

That's from years of experience. The network drivers are a lot simpler,
the hardware is simpler and more standardized, and doesn't do as many
things. It's just plain _easier_ to write a network driver than a disk
driver.

Ask anybody who has done both.

But that's not the real issue. The real issue is that I have no personal
incentives to try to merge the thing, and as a result I think I'm the
wrong person to do so. I've told people over and over again that I think
this is a "vendor merge", and I'm fed up with people not _getting_ it.

Don't bother to ask me to merge the thing, that only makes me get even
more fed up with the whole discussion. This is open source, guys. Anybody 
can merge it. Because I don't particularly believe in it doesn't mean that 
it cannot be used. It only means that I want to see users flock to it and 
show my beliefs wrong.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:16           ` What's left over Oliver Xymoron
@ 2002-10-31 18:26             ` Linus Torvalds
  0 siblings, 0 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 18:26 UTC (permalink / raw)
  To: Oliver Xymoron; +Cc: linux-kernel


On Thu, 31 Oct 2002, Oliver Xymoron wrote:
> 
> Perhaps not the best analogy.

Heh. I like my analogies bad. The best analogies should make you go 
"huh!" - kind of like a pink poodle in a tutu.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:56         ` Chris Wedgwood
  2002-10-31 14:31           ` Jeff Garzik
@ 2002-10-31 18:28           ` Nicholas Wourms
  2002-10-31 18:58             ` Alexander Viro
  2002-10-31 19:20             ` Alan Cox
  1 sibling, 2 replies; 333+ messages in thread
From: Nicholas Wourms @ 2002-10-31 18:28 UTC (permalink / raw)
  To: linux-kernel

Chris Wedgwood wrote:

> On Wed, Oct 30, 2002 at 11:48:23PM -0700, Dax Kelson wrote:
> 
>> Technically speaking you can achieve ACL like permissions/behavior
>> using the historical UNIX security model by creating a group EACH
>> time you run into a unique case permission scenario.
> 
> I'm not arguing against this... I'm claiming POSIX ACLs are mostly
> brain-dead and almost worthless (broken by committee pressure and too
> many people making stupid concessions).
> 
> If we must have ACLs, why not do it right?
> 
>> Without ACLs, if Sally, Joe and Bill need rw access to a file/dir,
>> just create another group with just those three people in.  Over
>> time, of course, this leads to massive group proliferation.  Without
> >> Tim Hockin's patch, 32 groups is the maximum number of groups a user can
>> be a member of.
> 
> How many people actually need this level of complexity?
> 
> Why are we adding all this shit and bloat because of perceived
> problems most people don't have?  What next, some kind of misdesigned
> in-kernel CryptoAPI?

Get over it!  If you haven't noticed, CryptoAPI is merged already.  The only 
bloat ACLs cause is the size of the source tarball.  If your connection is 
slow or you are out of diskspace, too bad!  I'm sure I'm not the only one 
who is tired of hearing people whine about "bloat" wrt the sources and 
demanding that features they don't use be ignored.  No one (non-core) 
feature will be useful to everyone, that is a given fact.  The point is 
that while you see no use for it, there are many others out there who do.  
ACLs are something which have existed in the Solaris/BSD world for a long 
time now, and people who have administered these boxen find ACLs to be quite
useful.

Cheers,
Nicholas



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:54             ` Linus Torvalds
  2002-10-31 18:21               ` Patrick Finnegan
@ 2002-10-31 18:31               ` John Alvord
  1 sibling, 0 replies; 333+ messages in thread
From: John Alvord @ 2002-10-31 18:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002 09:54:54 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> wrote:


>Guys, why do you even bother trying to convince me? If you are right, you 
>will be able to convince other people, and that's the whole point of open 
>source.
>
>Being "vendor-driven" is _not_ a bad thing. It only means that _I_ am not
>personally convinced. I'm only one person.

It sounds to me like there needs to be L-K traffic when problems are
solved using LKCD.

Personally I love crash dumps... in 33 years of computing I have spent
a total of 1-2 years doing nothing but enhancing and developing
post-processing facilities. The true benefit is not just the "crashed
here, add a null check" nonsense. It is the ability to examine the
whole system state. With an inboard trace table, you can even go back
in time. You can look at call stacks, locks held, state of allocated
memory, etc etc. If you save callstacks and time with allocated
memory, you can track down storage growth problems. I have spent weeks
winkling problems out of crash dumps, solving problems the developers
didn't even know existed.

With the right facility you can take crash dump snapshots and keep on
running. It is a great tool for understanding a system.

But until there is a flow of results - good quality fixes - resulting
from such analysis, I can see exactly why LT is doubtful. 

john alvord
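
As a rough illustration of the idea above of saving a call site and a time alongside each allocation, here is a minimal user-space sketch in C. It is not LKCD code: `tracked_alloc`, `tracked_free`, `live_bytes_at` and the fixed-size table are hypothetical names invented for this example, and a caller-supplied string tag stands in for a real call stack.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TRACK_MAX 1024  /* arbitrary table size chosen for the sketch */

struct alloc_record {
    void       *ptr;   /* NULL marks a free slot */
    size_t      size;
    const char *site;  /* caller-supplied tag standing in for a call stack */
    time_t      when;  /* wall-clock time of the allocation */
};

static struct alloc_record track_table[TRACK_MAX];

/* Allocate memory and remember who asked for it and when.
 * If the table is full the allocation still succeeds but goes untracked. */
void *tracked_alloc(size_t size, const char *site)
{
    void *p = malloc(size);

    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr == NULL) {
            track_table[i].ptr = p;
            track_table[i].size = size;
            track_table[i].site = site;
            track_table[i].when = time(NULL);
            break;
        }
    }
    return p;
}

/* Release memory and drop its record, if one was kept. */
void tracked_free(void *p)
{
    if (p == NULL)
        return;
    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr == p) {
            track_table[i].ptr = NULL;
            break;
        }
    }
    free(p);
}

/* Sum the bytes still live that were allocated under a given tag.
 * A tag whose total only ever grows is a candidate for a leak. */
size_t live_bytes_at(const char *site)
{
    size_t total = 0;

    for (size_t i = 0; i < TRACK_MAX; i++) {
        if (track_table[i].ptr != NULL &&
            strcmp(track_table[i].site, site) == 0)
            total += track_table[i].size;
    }
    return total;
}
```

In a dump, a table like this could be read back after the fact; the `when` field is what would let an analyst separate long-lived allocations from recent ones.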

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-10-31 17:55       ` [lkcd-general] " Dave Craft
@ 2002-10-31 18:45         ` Patrick Mochel
  2002-10-31 19:16           ` Stephen Hemminger
  0 siblings, 1 reply; 333+ messages in thread
From: Patrick Mochel @ 2002-10-31 18:45 UTC (permalink / raw)
  To: Dave Craft
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel


On Thu, 31 Oct 2002, Dave Craft wrote:

> On Thu, 31 Oct 2002, Linus Torvalds wrote:
> 
> > What I'm saying by "vendor driven" is that it has no relevance for the
> > standard kernel, and since it has no relevance to that, then I have no
> > incentives to merge it. The crash dump is only useful with people who
> > actively look at the dumps, and I don't know _anybody_ outside of the
> > specialized vendors you mention who actually do that.
> 
>   Unfortunately the vast majority of the customers I deal with
>   buy a distribution and then put a kernel from kernel.org
>   on.  I believe this comes about because of either needing fixes
>   or function that appear in later kernels that have not made
>   it to the distributions kernels yet.
> 
>   Even if the distribution included LKCD in their kernel,
>   I lose lots of debug ability once customers switch over to
>   kernel.org and no longer have the LKCD patch.
> 
>   Thus we are currently left with having to maintain LKCD patches for
>   many arbitrary kernel.org kernels and convince customers to apply
>   it BEFORE they start encountering problems that we'll have to look at.
>   Application of patches that aren't automatically included in kernel.org
>   rarely happens with our customer set (before problems occur),
>   no matter how much we flag the issue to them up front.


So, this is precisely where something like OSDL's Carrier Grade and Data 
Center working groups can come into play, amazingly enough. 

By now, nearly everyone has heard about the working groups and nearly
every developer that has, despises them. Even I resist association with
them. But, they can have some real value to the vendors and the OEMs in 
exactly the way you describe. 

Take for example DCL. It's a kernel tree with several base patches 
intended to make Linux better in the data center. The base is not fancy, 
and includes things like LKCD and kdb (I think). It's actively maintained 
and updated more often than Linus makes a release (by virtue of 
bitkeeper).

The intent is to later have multiple child trees that implement features
for a specific application space (e.g. databases), while maintaining the
same base set of features. People wishing to use the most recent kernel 
with those features can use the DCL tree directly. Or an OEM FAE can use 
the tree to build something for the vendor, or add extra features.

Note that it's not a distribution. We don't even make real releases, since 
we don't create tarballs or patches (it's only in BK, which actually kinda 
sucks). It's merely a means to have these features actively maintained and 
kept in synch. 

And really, that's what everyone wants. Linus doesn't want the features,
and neither do other developers, regardless of the Buzzword or Coolness factors.
Some vendors and users do want them. The developers of the features and
distributors of features don't want to deal with the tedium and pain of
updating patches each and every release.

In the end, it comes down to the fact that Linus's tree is Linus's tree. 
Other people can have their trees. I'm not going to tell you go off and 
make your own if you want those features so bad, because I know what a 
pain in the ass it is, and I know having someone else do it is a lot 
easier.

DCL and CGL have their trees, for purposes probably very very similar to 
what your customers need. I encourage you to check them out and work with 
them (or talk to people in your company that are). Try and make it work, 
and everyone can be happy (relatively). And, if DCL and CGL aren't
satisfying the space that you need, please speak up to OSDL and the 
working groups. People are listening, and willing to take your suggestions 
into consideration. 

Relevant URLs:

http://osdl.org/projects/cgl/
http://osdl.org/projects/dcl/

	-pat "kissing serious butt" mochel


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:12             ` Chris Wedgwood
@ 2002-10-31 18:49               ` Linus Torvalds
  2002-10-31 19:43                 ` Chris Wedgwood
  0 siblings, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 18:49 UTC (permalink / raw)
  To: Chris Wedgwood
  Cc: Jeff Garzik, Dax Kelson, Rik van Riel, Rusty Russell, linux-kernel


On Thu, 31 Oct 2002, Chris Wedgwood wrote:
> 
> It's synchronous and assumes everything is synchronous.  Lots of
> hardware (most) doesn't work that way.

Think of it another way: many users will likely _require_ atomic
encryption / decryption (done in softirq contexts etc), and thus a 
synchronous interface. Also, it simplifies the code and makes it more 
efficient.

Any hardware that needs to go off and think about how to encrypt something
sounds like it's so slow as to be unusable. I suspect that anything that
is over the PCI bus is already so slow (even if it adds no extra cycles of
its own) that you're better off using the CPU for the encryption rather
than some external hardware.

In short, from what I can tell, there is no huge actual reason to ever
allow an asynchronous interface. Such interfaces are likely fine for things
like network cards that can do encryption on their own on outgoing or
incoming packets, but that is not a general-purpose encryption engine, and
would not merit being part of an encryption library anyway.

[ Such a card is just a way to _avoid_ using the encryption library - the
  same way we can avoid using the checksumming stuff for network cards 
  that can do their own checksums ]

We'll see. I'd rather have a simpler interface that works for all relevant
cases today, and then if external crypto chips end up being common and
sufficiently efficient, we can always re-consider. Are the DMA-over-PCI
roundtrip (and resulting cache invalidations) overheads really worth the
extra hardware?

		Linus
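
For readers following the synchronous versus asynchronous argument, the two interface shapes can be sketched as below. This is not the kernel CryptoAPI: every name here (`toy_encrypt_sync`, `toy_request`, `toy_encrypt_submit`, `toy_encrypt_complete`) is hypothetical, and a one-byte XOR stands in for a real cipher.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy transform: XOR every byte with a one-byte key. */
static void toy_xor(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/* Synchronous shape: the buffer is transformed by the time the call
 * returns, so the caller never has to wait for a later notification. */
void toy_encrypt_sync(uint8_t *buf, size_t len, uint8_t key)
{
    toy_xor(buf, len, key);
}

/* Asynchronous shape: the caller queues a request and is told later,
 * through a callback, that the work has finished. */
typedef void (*toy_done_fn)(uint8_t *buf, size_t len, void *ctx);

struct toy_request {
    uint8_t    *buf;
    size_t      len;
    uint8_t     key;
    toy_done_fn done;     /* may be NULL if nobody needs notifying */
    void       *ctx;
    int         pending;  /* 1 between submit and complete */
};

/* Submitting only marks the request as queued; no data changes yet. */
void toy_encrypt_submit(struct toy_request *req)
{
    req->pending = 1;
}

/* Stand-in for the device finishing: do the work, then notify. */
void toy_encrypt_complete(struct toy_request *req)
{
    if (!req->pending)
        return;
    toy_xor(req->buf, req->len, req->key);
    req->pending = 0;
    if (req->done != NULL)
        req->done(req->buf, req->len, req->ctx);
}
```

The synchronous caller needs one call and no extra state, which is the simplicity being argued for above; the asynchronous caller has to keep the request alive and handle completion separately, which is the cost hardware offload would have to repay.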


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
                             ` (3 preceding siblings ...)
  2002-10-31 18:16           ` What's left over Oliver Xymoron
@ 2002-10-31 18:49           ` Rik van Riel
  2002-10-31 21:02           ` Jeff Garzik
  2002-11-01  6:27           ` Bill Davidsen
  6 siblings, 0 replies; 333+ messages in thread
From: Rik van Riel @ 2002-10-31 18:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> 	In particular when it comes to this project, I'm told about
> 	"netdump", which doesn't try to dump to a disk, but over the net.

And guess what ?   Netdump is one of various LKCD dump methods ...

regards,

Rik
-- 
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/		http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:10           ` Chris Friesen
  2002-10-31 18:22             ` Linus Torvalds
@ 2002-10-31 18:50             ` Alan Cox
  2002-10-31 21:33             ` Rusty Russell
  2 siblings, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-10-31 18:50 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 18:10, Chris Friesen wrote:
> > To me this says "LKCD is stupid". Which means that I'm not going to apply 
> > it, and I'm going to need some real reason to do so - ie being proven 
> > wrong in the field.
> 
> How do you deal with netdump when your network driver is what caused the 
> crash?

Netdump drives the system itself. Any dump driver has to, as it can't
assume the system is in a remotely sane state.



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:28           ` Nicholas Wourms
@ 2002-10-31 18:58             ` Alexander Viro
  2002-10-31 19:14               ` Nicholas Wourms
  2002-10-31 19:20             ` Alan Cox
  1 sibling, 1 reply; 333+ messages in thread
From: Alexander Viro @ 2002-10-31 18:58 UTC (permalink / raw)
  To: Nicholas Wourms; +Cc: linux-kernel



On Thu, 31 Oct 2002, Nicholas Wourms wrote:

> slow or you are out of diskspace, too bad!  I'm sure I'm not the only one 
> who is tired of hearing people whine about "bloat" wrt the sources and 
> demanding that features they don't use be ignored.  No one (non-core) 

One look at the From:
understanding has blossomed
.procmailrc grows


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:13       ` Michael Shuey
@ 2002-10-31 19:04         ` Alan Cox
  2002-10-31 19:42           ` Michael Shuey
  2002-11-01 22:25           ` Pavel Machek
  0 siblings, 2 replies; 333+ messages in thread
From: Alan Cox @ 2002-10-31 19:04 UTC (permalink / raw)
  To: shuey
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 17:13, Michael Shuey wrote:
> I'm a user, and I request that LKCD get merged into the kernel. :-)
> Do you feel like donating a 700-port console server?  Right, so it's LKCD
> for me then.

Wouldn't you rather they neatly tftp'd dumps to a nominated central
server which noticed the arrival, did the initial processing with a perl
script and mailed you a summary ?



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:58             ` Alexander Viro
@ 2002-10-31 19:14               ` Nicholas Wourms
  0 siblings, 0 replies; 333+ messages in thread
From: Nicholas Wourms @ 2002-10-31 19:14 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

Alexander Viro wrote:
> 
> On Thu, 31 Oct 2002, Nicholas Wourms wrote:
> 
> 
>>slow or you are out of diskspace, too bad!  I'm sure I'm not the only one 
>>who is tired of hearing people whine about "bloat" wrt the sources and 
>>demanding that features they don't use be ignored.  No one (non-core) 
> 
> 
> One look at the From:
> understanding has blossomed
> .procmailrc grows
> 

Your point is?


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-10-31 18:45         ` Patrick Mochel
@ 2002-10-31 19:16           ` Stephen Hemminger
  2002-10-31 19:57             ` george anzinger
  0 siblings, 1 reply; 333+ messages in thread
From: Stephen Hemminger @ 2002-10-31 19:16 UTC (permalink / raw)
  To: Patrick Mochel
  Cc: Dave Craft, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Kernel List, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 10:45, Patrick Mochel wrote:
> 
> So, this is precisely where something like OSDL's Carrier Grade and Data 
> Center working groups can come into play, amazingly enough. 
> 
> By now, nearly everyone has heard about the working groups and nearly
> every developer that has, despises them. Even I resist association with
> them. But, they can have some real value to the vendors and the OEMs in 
> exactly the way you describe. 
>
> Take for example DCL. It's a kernel tree with several base patches 
> intended to make Linux better in the data center. The base is not fancy, 
> and includes things like LKCD and kdb (I think). It's actively maintained 
> and updated more often than Linus makes a release (by virtue of 
> bitkeeper).

LKCD is in and I try to keep it up to date with the patch stream.
KDB is not in yet, because the current posted patches are not up to date
to apply cleanly against 2.5.44 or 2.5.45.

> The intent is to later have multiple child trees that implement features
> for a specific application space (e.g. databases), while maintaining the
> same base set of features. People wishing to use the most recent kernel 
> with those features can use the DCL tree directly. Or an OEM FAE can use 
> the tree to build something for the vendor, or add extra features.

CGL hasn't decided what they want to change to.
DCL is going to have one tree focused on databases.

> Note that it's not a distribution. We don't even make real releases, since 
> we don't create tarballs or patches (it's only in BK, which actually kinda 
> sucks). It's merely a means to have these features actively maintained and 
> kept in synch. 

For DCL there is both a bitkeeper tree bk://bk.osdl.org/dcl-2.5 and
regular snapshots available on sourceforge
http://osdldcl.sourceforge.net
 
> And really, that's what everyone wants. Linus doesn't want the features,
> and neither do other developers, regardless of the Buzzword or Coolness factors.
> Some vendors and users do want them. The developers of the features and
> distributors of features don't want to deal with the tedium and pain of
> updating patches each and every release.
> 
> In the end, it comes down to the fact that Linus's tree is Linus's tree. 
> Other people can have their trees. I'm not going to tell you go off and 
> make your own if you want those features so bad, because I know what a 
> pain in the ass it is, and I know having someone else do it is a lot 
> easier.
> 

FYI the criteria I apply for what goes into DCL is:
* Applies to large systems and databases
* Vendor support
* Conforms to Linux standard style
* Active project and maintainer that accepts feedback
* Community rejection has been mostly positive.


> DCL and CGL have their trees, for purposes probably very very similar to 
> what your customers need. I encourage you to check them out and work with 
> them (or talk to people in your company that are). Try and make it work, 
> and everyone can be happy (relatively). And, if DCL and CGL aren't
> satisfying the space that you need, please speak up to OSDL and the 
> working groups. People are listening, and willing to take your suggestions 
> into consideration. 
> 
> Relevant URLs:
> 
> http://osdl.org/projects/cgl/
> http://osdl.org/projects/dcl/

Stephen Hemminger
Data Center Linux (DCL) Maintainer/Coordinator



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 19:20             ` Alan Cox
@ 2002-10-31 19:17               ` Nicholas Wourms
  2002-10-31 20:45               ` Jeff Garzik
  2002-11-01  6:00               ` James Morris
  2 siblings, 0 replies; 333+ messages in thread
From: Nicholas Wourms @ 2002-10-31 19:17 UTC (permalink / raw)
  To: Linux Kernel Mailing List

Alan Cox wrote:
> On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
> 
>>>problems most people don't have?  What next, some kind of misdesigned
>>>in-kernel CryptoAPI?
>>
>>Get over it!  If you haven't noticed, CryptoAPI is merged already.  The only 
> 
> 
> Chris is right that crypto api is misdesigned if we want to use hardware
> cryptocards
> 

Alan,

Thanks for setting me straight, your assertion is correct, 
of course.  I was under the impression that the CryptoAPI 
code was merged initially for IPSEC support and would be 
revamped and expanded at a later date to support a wide 
variety of interfaces?

Cheers,
Nicholas


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:28           ` Nicholas Wourms
  2002-10-31 18:58             ` Alexander Viro
@ 2002-10-31 19:20             ` Alan Cox
  2002-10-31 19:17               ` Nicholas Wourms
                                 ` (2 more replies)
  1 sibling, 3 replies; 333+ messages in thread
From: Alan Cox @ 2002-10-31 19:20 UTC (permalink / raw)
  To: nwourms; +Cc: Linux Kernel Mailing List

On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
> > problems most people don't have?  What next, some kind of misdesigned
> > in-kernel CryptoAPI?
> 
> Get over it!  If you haven't noticed, CryptoAPI is merged already.  The only 

Chris is right that crypto api is misdesigned if we want to use hardware
cryptocards


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 15:46     ` Linus Torvalds
                         ` (3 preceding siblings ...)
  2002-10-31 17:55       ` [lkcd-general] " Dave Craft
@ 2002-10-31 19:33       ` Castor Fu
  4 siblings, 0 replies; 333+ messages in thread
From: Castor Fu @ 2002-10-31 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.

Add 3PAR and probably a number of other small companies given the traffic
on the lists.    Anyone building a new product on Linux and mucking
around inside the kernel, and having more than a handful of developers
is going to want LKCD, or Mission Critical's mcore,  or netdump, or
something like it.

It's a shame that right out of the gate they'll have to spend time
figuring out which of these solutions work for them.  I spent at least
a month of my life just looking at what's out there, and trying to make
each of them work with our product.  It'd be nice if that time were
spent on making new "cool stuff".

Since then, we've put significant amounts of work into making LKCD
reliable on our system, and it's been incredibly useful in our
development.  It's going to prove invaluable supporting our stuff in
the field.

> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
>
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).

If you asked me if 3PAR is a "vendor" or a "user" I'd have to say "yes".
As a vendor we sell our system to customers.  They could not care less
that LKCD is in the linux kernel distribution.  All they care about is
that we fix their problems as fast as possible.  They probably have
no idea that this is the underlying technology, so you will never
hear from them about us.

However, we also use linux for desktops, build servers, database servers, etc.
When we have problems with these systems, we'd LOVE to be able to use the
same expertise and technology which we've developed for our system, but
all too often we find that someone just grabbed a Redhat 7.x disk or
standard debian distro to build the system.

So as a "user" I'm asking the distribution vendors, please make it easy
for me to use the same damn tools everywhere by providing some sort
of common crash dump mechanism.  It'll make it easier for me to consider new
hardware, new software, etc.  One thing that's awesome is Dave Anderson's
"crash" tool.  It works with LKCD dumps, netdump dumps, etc.  It's an example
of a tool which has leveraged all the different dump communities.

As a "vendor" please put LKCD or something like it into the main line
kernel.   LKCD works.  It has an active developer community which has
been extending it to work over networks, onto disks, developing new
analysis tools, etc.  If we can settle on one such tool, we'll get
more cool stuff like lock analyzers, etc.  Until then, we WILL keep
re-inventing the wheel because this is one of the first steps to
collect significant amounts of real data.

    -castor


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 19:04         ` Alan Cox
@ 2002-10-31 19:42           ` Michael Shuey
  2002-11-01 22:25           ` Pavel Machek
  1 sibling, 0 replies; 333+ messages in thread
From: Michael Shuey @ 2002-10-31 19:42 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Thu, Oct 31, 2002 at 07:04:31PM +0000, Alan Cox wrote:
> On Thu, 2002-10-31 at 17:13, Michael Shuey wrote:
> > I'm a user, and I request that LKCD get merged into the kernel. :-)
> > Do you feel like donating a 700-port console server?  Right, so it's LKCD
> > for me then.
> 
> Wouldn't you rather they neatly tftp'd dumps to a nominated central
> server which noticed the arrival, did the initial processing with a perl
> script and mailed you a summary ?

Generally speaking, no.

A tftp server doesn't provide enough security (specifically authentication).
It would need to be accessible from clusters in multiple buildings and on
multiple networks (some of which must be public).

I've seen more network adapter issues than drive controller issues.  In
particular, some vendors (Compaq, listen up) can't implement an eepro100 to
save their asses, especially on older hardware.

From time to time bandwidth issues and/or network splits can prevent dumps
from being reliably delivered.

Right now we use the presence of a local dump to indicate that a machine
should not join the PBS pool (and begin to run more jobs) on a reboot.  I'd
rather not have the nodes check a central server to see if it's okay to run
jobs.  And no, I don't want machines to stay down after a crash - many nodes
are in distant corners of campus and it's cold outside. :-)  If I can fix the
problem through software I'd prefer that the problematic host be up, rather
than having to walk over to it just to hit reset and load a new kernel.

That said, it would be really nice if LKCD would log dumps to both the swap
device and to a remote server.  That way if the machine crashed because of
disk failure I'd still have an uncorrupted dump image (and could then notice
all the little errors coming back out of the swap device).  A tool to
automatically analyze a dump and email back summaries would be much more
useful, though.  If someone were to write such a widget, that'd be swell. :-)

Right now I'm less concerned with getting dumps to exactly the right place
and a bit more concerned with getting dumps in the main kernel at all.

-- 
Mike Shuey
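
The boot-time check described above (a local dump on disk means the node stays out of the job pool) might look roughly like the following C sketch. It assumes dumps land as files in a single directory whose path is passed in by the caller; `local_dump_present` and `node_may_join_pool` are hypothetical names, not part of LKCD or PBS.

```c
#include <dirent.h>
#include <string.h>

/* Return 1 if dump_dir holds any entry other than "." and "..",
 * and 0 if it is empty or cannot be opened. */
int local_dump_present(const char *dump_dir)
{
    DIR *d = opendir(dump_dir);
    struct dirent *e;
    int found = 0;

    if (d == NULL)
        return 0;  /* assumption: a missing directory means no dump */
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0) {
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}

/* A node rejoins the pool only when no dump is waiting to be examined. */
int node_may_join_pool(const char *dump_dir)
{
    return !local_dump_present(dump_dir);
}
```

Because the decision depends only on local disk state, the node does not need to reach a central server before deciding whether to accept jobs.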

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:49               ` Linus Torvalds
@ 2002-10-31 19:43                 ` Chris Wedgwood
  2002-11-01 15:25                   ` Linus Torvalds
  0 siblings, 1 reply; 333+ messages in thread
From: Chris Wedgwood @ 2002-10-31 19:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Dax Kelson, Rik van Riel, Rusty Russell, linux-kernel

On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:

> Any hardware that needs to go off and think about how to encrypt
> something sounds like it's so slow as to be unusable. I suspect that
> anything that is over the PCI bus is already so slow (even if it
> adds no extra cycles of its own) that you're better off using the
> CPU for the encryption rather than some external hardware.

Except almost all hardware out there that does this stuff is async to
some extent...

I'm just speaking as someone who has (sadly) done this a couple of
times already for commercial real-world products.  I'm no expert, I
don't claim to be and admit there is still plenty to learn...

... that said, having access to lots of hardware, both our own and
other people's, almost all of it needs to be driven asynchronously to
get good performance (or by a large number of threads).



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-10-31 19:16           ` Stephen Hemminger
@ 2002-10-31 19:57             ` george anzinger
  2002-10-31 20:48               ` Stephen Hemminger
  0 siblings, 1 reply; 333+ messages in thread
From: george anzinger @ 2002-10-31 19:57 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Patrick Mochel, Dave Craft, Linus Torvalds, Matt D. Robinson,
	Rusty Russell, Kernel List, lkcd-general, lkcd-devel

Stephen Hemminger wrote:
> FYI the criteria I apply for what goes into DCL is:
> * Applys to large systems and databases
> * Vendor support
> * Conforms to Linux standard style
> * Active project and maintainer that accepts feedback
> * Community rejection has been mostly positive.
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Could you decode this :)

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:15           ` Andrew Morton
@ 2002-10-31 19:58             ` Bernhard Kaindl
  2002-11-02  0:49             ` What's left over. - Dave's crash code supports a gdb interface for LKCD crash dumps Piet Delaney
  1 sibling, 0 replies; 333+ messages in thread
From: Bernhard Kaindl @ 2002-10-31 19:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, 31 Oct 2002, Andrew Morton wrote:
>
> We'll be spending the next six months stabilising and hardening
> the used-to-be-2.5 kernel.  If grunts like me can get hold of a
> copy of the other person's kernel image from time-of-crash, that
> has a ton of value.

Exactly, sometimes you don't even need the dump itself. The user
who has the dump just types lcrash and report -w file.txt and
lcrash writes a consolidated report with the most interesting
information from the dump to file.txt. He can then send it
to you, and you have much of the information that is often missing
from problem reports.

The report consists of: time when the dump was created, time
when the report was created, the architecture, the hostname,
kernel version and compile time, the kernel (dmesg) buffer
with all the oopses logged into it, a short task list with
process address, IDs, state, flags, CPU and process name,
and finally a full CPU dump of every CPU with all registers,
current process and function and a symbolic stack backtrace
of the CPU.

Sometimes this is all you need to know. If you also need, e.g.,
the stack backtrace of a process that was not running at the
time of the dump, you can ask the user to simply run
trace <process address>, and lcrash gives you the backtrace
of this process:

lcrash> t[race] 0x1408000
================================================================
STACK TRACE FOR TASK: 0x1408000 (kjournald)

 STACK:
 0 schedule+894 [0x3164e]
 1 interruptible_sleep_on+174 [0x31eae]
 2 journal_revoke+<ERROR> [0x10889c0c]
 3 kernel_thread+70 [0x18c1e]

Showing the full task struct, a sub-struct or a field is also simple:

p[rint] ((struct task_struct *)0x1408000)->pending
struct sigpending {
        head = (nil)
        tail = 0x1598700
        signal = sigset_t {
                sig = {
                        [0] 0
                }
        }
}

"feels" a bit like gdb

> (Disclaimer: I've never used lkcd.  I'm assuming that it's
> possible to gdb around in a dump)

I don't know if there is an lkcd->ELF core converter yet, but
it should be doable. However, lcrash is quite powerful: it comes
with sial, an integrated C interpreter that permits easy access to the
symbol and type information, so it allows you to write code like this:

        void
        showprocs()
        {
                struct proc* p;

                for (p = *(struct proc**)procs; p; p = p->p_next) {
                        /* do something... */
                }
        }

Looks nice... :-)

I also don't know if (k)gdb knows about tasks. lcrash at least
knows about them, and this may help when you look into a specific
task (I'm not an expert).

Of course lcrash can also do disassembly and find symbols.

> So.  _If_ lkcd gives me gdb-able images from time-of-crash, I'd
> like it please.  And I'm the grunt who spent nearly two years
> doing not much else apart from working 2.3/2.4 oops reports.

Maybe the lkcd people can do so, but I think they can also give
a hands-on workshop on lcrash.

You can also use lcrash on a running system to browse around,
learn, and save dumps from it without interrupting it.  You
just need lcrash, the System.map and the Kerntypes file from
the kernel to use it.

> Oh, and as Rusty has pointed out, we lose a *lot* of oops reports
> because users are in X and the backtrace doesn't make it to the
> logs.

Yep, I think it would be good even if Linus just accepted the
infrastructure patch of lkcd, which needs to be in the kernel.
The favourite dump method module can then be downloaded, compiled,
installed and configured without much pain.  I think that people
could then start using it more widely without the hassle of
patching and booting a special kernel.

Bernd

PS: lcrash is only one of the many frontends, as I've read in
this thread, there is also Dave Anderson's "crash" tool which
works with LKCD dumps, netdump dumps, etc. There is also qlcrash,
a Qt frontend for lcrash, for people who like to click!


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 19:20             ` Alan Cox
  2002-10-31 19:17               ` Nicholas Wourms
@ 2002-10-31 20:45               ` Jeff Garzik
  2002-11-01  6:00               ` James Morris
  2 siblings, 0 replies; 333+ messages in thread
From: Jeff Garzik @ 2002-10-31 20:45 UTC (permalink / raw)
  To: Alan Cox; +Cc: nwourms, Linux Kernel Mailing List

Alan Cox wrote:

> On Thu, 2002-10-31 at 18:28, Nicholas Wourms wrote:
> > > problems most people don't have?  What next, some kind of misdesigned
> > > in-kernel CryptoAPI?
> >
> > Get over it!  If you haven't noticed, CryptoAPI is merged already.  The only
>
> Chris is right that crypto api is misdesigned if we want to use hardware
> cryptocards

I'll reserve judgement until we actually get access to some decent [made 
in the past few years] hardware crypto cards, and take a hard look at 
their PCI bus utilization... until then it is mostly vague handwaving...

[vendors - any takers?]



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-10-31 19:57             ` george anzinger
@ 2002-10-31 20:48               ` Stephen Hemminger
  0 siblings, 0 replies; 333+ messages in thread
From: Stephen Hemminger @ 2002-10-31 20:48 UTC (permalink / raw)
  To: george anzinger
  Cc: Patrick Mochel, Dave Craft, Linus Torvalds, Matt D. Robinson,
	Rusty Russell, Kernel List, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 11:57, george anzinger wrote:
> Stephen Hemminger wrote:
> > FYI the criteria I apply for what goes into DCL is:
> > * Applies to large systems and databases
> > * Vendor support
> > * Conforms to Linux standard style
> > * Active project and maintainer that accepts feedback
> > * Community rejection has been mostly positive.
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Could you decode this :)
s/rejection/reaction/


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:22             ` Linus Torvalds
@ 2002-10-31 20:59               ` Dave Anderson
  2002-10-31 21:49                 ` Oliver Xymoron
  2002-11-01  1:25                 ` [lkcd-devel] " Matt D. Robinson
  2002-11-01  6:34               ` Bill Davidsen
  1 sibling, 2 replies; 333+ messages in thread
From: Dave Anderson @ 2002-10-31 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


On Thu, 31 Oct 2002, Linus Torvalds wrote:

>  - included features kill off (potentially better) projects.
>
>         There's a big "inertia" to features. It's often better to keep
>         features _off_ the standard kernel if they may end up being
>         further developed in totally new directions.
>
>         In particular when it comes to this project, I'm told about
>         "netdump", which doesn't try to dump to a disk, but over the net.
>         And quite frankly, my immediate reaction is to say "Hell, I
>         _never_ want the dump touching my disk, but over the network
>         sounds like a great idea".
>
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
>
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).

It would be most unfortunate if the existence of netdump is used as a
reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.

On Thu, 31 Oct 2002, Matt D. Robinson wrote:

> We want to see this in the kernel, frankly, because it's a pain
> in the butt keeping up with your kernel revisions and everything
> else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
> (hopefully) Red Hat don't want to spend their time having to roll
> this stuff in each and every time you roll a new kernel.

While Red Hat advocates Ingo's netdump option, we have customer
requests that are requiring us to look at LKCD disk-based dumps as an
alternative, co-existing dump mechanism.  Since the two methods are
not mutually exclusive, LKCD will never kill off netdump -- nor
certainly vice-versa.  We're all just looking for a better means to be
able to provide support to our customers, not to mention its value as
a development aid.

Dave Anderson
Red Hat, Inc.




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
                             ` (4 preceding siblings ...)
  2002-10-31 18:49           ` Rik van Riel
@ 2002-10-31 21:02           ` Jeff Garzik
  2002-10-31 22:37             ` Werner Almesberger
                               ` (2 more replies)
  2002-11-01  6:27           ` Bill Davidsen
  6 siblings, 3 replies; 333+ messages in thread
From: Jeff Garzik @ 2002-10-31 21:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

Linus Torvalds wrote:

>	In particular when it comes to this project, I'm told about
>	"netdump", which doesn't try to dump to a disk, but over the net.
>	And quite frankly, my immediate reaction is to say "Hell, I
>	_never_ want the dump touching my disk, but over the network
>	sounds like a great idea".
>  
>

[yes, I realize the LKCD merge debate is over, bear with me :)]

I'm sort of in the middle on this issue:  The existence of netdump does 
not imply that disk dumps are a bad thing.

netdumps require a net dump server, and it is simply not realistic at 
all to assume that users seeing crashes will always have a netdump 
server set up in advance, or even have multiple machines to make that 
possible.  Disk dumps are valuable because their requirements are very 
low, and because of all the user-support reasons that Andrew Morton 
mentioned in this thread.

That said, I used to be an LKCD cheerleader until a couple people made 
some good points to me:  it is not nearly low-level enough to truly be 
of use in crash situations.  netdump can work if your interrupts are 
hosed/screaming, and various mid-layers are dying.  For LKCD to be of 
any use, it needs to _skip_ the block layer and talk directly to 
low-level drivers.

So, I think the stock kernel does need some form of disk dumping, 
regardless of any presence/absence of netdump.  But LKCD isn't there yet...

    Jeff




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  3:19     ` Stephen Frost
@ 2002-10-31 21:09       ` john stultz
  2002-10-31 21:49         ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: john stultz @ 2002-10-31 21:09 UTC (permalink / raw)
  To: Stephen Frost; +Cc: Rik van Riel, Linus Torvalds, Rusty Russell, lkml

On Wed, 2002-10-30 at 19:28, Stephen Frost wrote:
> The feeling I got on this was the ability to let users define their own
> groups.  Perhaps I'm not following it closely enough but that was the
> impression I got in terms of "what this does for us"; I'm probably
> missing other things.  Just that ability would be nice in my view
> though.  Isn't it something that's been in AFS for a long time too?
> I've got a few friends who've played with AFS before (at CMU and the
> like) and really enjoyed the ACLs there.

Yea, I haven't looked at the submitted implementation, but at CMU ACLs
were critical to be able to selectively share data between a very large
set of users w/o bugging an administrator. Given multiple classes per
semester which had multiple group projects, where you may have different
groups for each project, I have no clue how anyone would be able to
handle the (unix)group management required. ACLs let the users
themselves manage what people got what access to their data.

How else can I fix my partner's bugs (or vice-versa), give the clumsy TA
read only access, and let the cheat across the hall figure it out for
himself? (There may very well be a good solution to this w/o ACLs but
I've not seen it in use.)

So yea, I'd love to see a common ACLs API.
-john


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 10:15     ` Joe Thornber
  2002-10-31 14:26       ` Jeff Garzik
@ 2002-10-31 21:14       ` Rusty Russell
  2002-11-01  8:20         ` Joe Thornber
  1 sibling, 1 reply; 333+ messages in thread
From: Rusty Russell @ 2002-10-31 21:14 UTC (permalink / raw)
  To: Joe Thornber; +Cc: linux-kernel

In message <20021031101558.GB7487@fib011235813.fsnet.co.uk> you write:
> On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > They have, IIRC.  Interestingly, it was less invasive (existing source
> > touched) than the LVM2/DM patch you merged.
> 
> FUD.  I added to three areas of existing code:

[ 40-line detailed explanation snipped ]

Woah!  War's over dude!  We won!

I used Rusty's Unreliable Intrusiveness-o-meter (number of existing
non-config files touched), as I said.
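
The metric just described can be approximated as below. This is a rough
sketch in Python, not the script actually used for the Snowball list:
the function name `intrusiveness` and the rule for what counts as a
config/make file are assumptions made for illustration. It counts files
that already exist (old side is not /dev/null) and that a unified diff
alters, ignoring Makefile/Config/Kconfig-style files.

```python
import os


def intrusiveness(diff_text):
    """Count existing, non-config files that a unified diff alters."""
    touched = set()
    old_path = None
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            # Header of the old file; drop any tab-separated timestamp.
            old_path = line[4:].split("\t")[0].strip()
        elif line.startswith("+++ ") and old_path is not None:
            if old_path != "/dev/null":  # /dev/null means a brand-new file
                name = os.path.basename(old_path)
                # Assumed definition of "config/make file" (hypothetical).
                is_config = name.startswith(("Makefile", "Config", "Kconfig"))
                if not is_config:
                    touched.add(old_path)
            old_path = None
    return len(touched)


sample = """\
--- a/kernel/module.c\t2002-10-30
+++ b/kernel/module.c\t2002-10-31
@@ -1 +1 @@
-old
+new
--- /dev/null
+++ b/kernel/params.c
@@ -0,0 +1 @@
+new file
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -1 +1 @@
-obj-y = a.o
+obj-y = a.o b.o
"""

print(intrusiveness(sample))
```

In the sample, only kernel/module.c counts: params.c is new and the
Makefile is treated as a config file. Like the original meter, this
says nothing about how deep the changes in each file go.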

I didn't read code or anything so unscientific or accurate.  But both
DM and EVMS were way down on the "intrusiveness" list.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 11:03     ` Geert Uytterhoeven
@ 2002-10-31 21:17       ` James Simmons
  0 siblings, 0 replies; 333+ messages in thread
From: James Simmons @ 2002-10-31 21:17 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Rusty Russell, Linus Torvalds, Linux Kernel Development,
	Russell King, Peter Chubb, tridge, Theodore Ts'o


> > > On Thu, 31 Oct 2002, Rusty Russell wrote:
> > > > Fbdev Rewrite
> > >
> > > This one is just huge, and I have little personal judgement on it.
> >
> > It's been around for a while.  Geert, Russell?
>
> It's huge because it moves a lot of files around:
>   1. drivers/char/agp/ -> drivers/video/agp/
>   2. drivers/char/drm/ -> drivers/video/drm/
>   3. console related files in drivers/video/ -> drivers/video/console/
>
> (1) and (2) should be reverted, but apparently they aren't reverted in the
> patch at http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz yet. The patch
> also seems to remove some drivers. Haven't checked the bk repo yet.
>
> James, can you please fix that (and the .Config files)?

Done. I have a new version of that patch at the same place. It is against
2.5.45.

http://phoenix.infradead.org/~jsimmons/fbdev.diff.gz

It's still pretty big. We can save the moving of the agp code for post
halloween.





^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:10           ` Chris Friesen
  2002-10-31 18:22             ` Linus Torvalds
  2002-10-31 18:50             ` Alan Cox
@ 2002-10-31 21:33             ` Rusty Russell
  2002-11-01  1:19               ` [lkcd-devel] " Matt D. Robinson
  2 siblings, 1 reply; 333+ messages in thread
From: Rusty Russell @ 2002-10-31 21:33 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

In message <3DC171FF.5000803@nortelnetworks.com> you write:
> Ideally I would like to see a dump framework that can have a number of 
> possible dump targets.  We should be able to dump to any combination of 
> network, serial, disk, flash, unused ram that isn't wiped over restarts, 
> etc...

Both the lkcd and ide mini-oopser have that (although the mini-oopser
has only x86-ide for now).

The mini-oopser has different aims than LKCD: they want to debug one
system, I want to make sure we're reaping OOPS reports from those 99%
of desktop users who run X and simply reboot when their machine
crashes once a month.

I did *not* put the mini-oopser on the Snowball list, because I don't
have time to polish it.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 21:09       ` john stultz
@ 2002-10-31 21:49         ` Werner Almesberger
  2002-10-31 22:32           ` john stultz
  0 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-10-31 21:49 UTC (permalink / raw)
  To: john stultz; +Cc: lkml

[ Cc: trimmed ]

john stultz wrote:
> groups for each project, I have no clue how anyone would be able to
> handle the (unix)group management required. ACLs let the users
> themselves manage what people got what access to their data.

Note that POSIX ACLs don't seem to solve this either: they only
let you control access in terms of existing users or groups.

IMHO, this is one of the standard pitfalls of ACLs: if they don't
let you aggregate information, you quickly end up with huge ACLs
hanging off every file, and each of those ACLs wants to be
carefully maintained. I've seen a lot of this in my VMS days.
(Unix is a bit better, because you can control access at a
directory level, while VMS needs the ACL on each file, because
you can open files directly by VMS' equivalent to an inode
number, without traversing the directory hierarchy. Of course,
many users didn't know that :-)

To make ACLs truly scalable, it would be nice to be able to
express permissions in terms of access to other filesystem
objects. E.g. "everybody who can read file ~me/acls/my_friends
can write the directory on which this ACE hangs". This should
work like a symlink, i.e. if I add new friends to my_friends, I
don't have to update all my ACLs.
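
A minimal sketch of that indirection, in Python. It is a toy in-memory
model, not POSIX ACLs or any kernel interface; the names `Node`,
`can_read` and `can_write` and the example paths other than
~me/acls/my_friends are hypothetical. Write access to an object is
granted to anyone who can read a referenced "group" file, so editing
that one file changes every ACL that points at it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy filesystem object (hypothetical model, not a real API)."""
    owner: str
    readers: set = field(default_factory=set)      # users listed directly
    write_via: list = field(default_factory=list)  # paths of "group" files


def can_read(fs, path, user):
    """Owner and directly listed readers may read."""
    node = fs[path]
    return user == node.owner or user in node.readers


def can_write(fs, path, user):
    """Owner may write; so may anyone who can read a referenced file."""
    node = fs[path]
    if user == node.owner:
        return True
    return any(can_read(fs, ref, user) for ref in node.write_via)


fs = {
    "~me/acls/my_friends": Node(owner="me", readers={"alice"}),
    "~me/project": Node(owner="me", write_via=["~me/acls/my_friends"]),
    "~me/notes": Node(owner="me", write_via=["~me/acls/my_friends"]),
}

# Adding a friend in one place updates every ACL that references the file.
fs["~me/acls/my_friends"].readers.add("bob")
```

After the last line, bob can write both ~me/project and ~me/notes
without either of their ACLs having been touched.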

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 20:59               ` Dave Anderson
@ 2002-10-31 21:49                 ` Oliver Xymoron
  2002-11-01  1:25                 ` [lkcd-devel] " Matt D. Robinson
  1 sibling, 0 replies; 333+ messages in thread
From: Oliver Xymoron @ 2002-10-31 21:49 UTC (permalink / raw)
  To: Dave Anderson
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

On Thu, Oct 31, 2002 at 03:59:34PM -0500, Dave Anderson wrote:
>
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
> >
> > (And don't get me wrong - I don't mind getting proven wrong. I change my
> > opinions the way some people change underwear. And I think that's ok).
> 
> It would be most unfortunate if the existence of netdump is used as a
> reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.

What he really wants is for Andrew or Alan or someone else he trusts
to merge it, get actual field results, and declare it useful. If
people start visibly passing around crash dump results on l-k and
solving problems with them, that'll help too. Until then all he has is
his gut feel to go on.

-- 
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.." 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:18       ` Matt D. Robinson
  2002-10-31 17:25         ` Linus Torvalds
@ 2002-10-31 22:20         ` Shawn
  2002-10-31 23:14           ` [lkcd-general] " Bernhard Kaindl
  2002-11-01  2:01           ` Matt D. Robinson
  1 sibling, 2 replies; 333+ messages in thread
From: Shawn @ 2002-10-31 22:20 UTC (permalink / raw)
  To: Matt D. Robinson
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On 10/31, Matt D. Robinson said something like:
> On Thu, 31 Oct 2002, Linus Torvalds wrote:
> |>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> |>That's fine. And since they are paid to support it, they can apply the 
> |>patches.  
> 
> We want to see this in the kernel, frankly, because it's a pain
> in the butt keeping up with your kernel revisions and everything
> else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
> (hopefully) Red Hat don't want to spend their time having to roll
> this stuff in each and every time you roll a new kernel.

I share some of your sentiment, but honestly, think about it.

Linus has to "keep up" with all the changes coming into his inbox as
well, and the more features, the more breakage that can happen when
Linus accepts a patch.

Really, Linus wants to push some of his maintenance overhead to distros,
who get paid to do it, but also to provide sexy bullet point items for
users, so they buy "Linux" stuff.

You try to find a better balance.

--
Shawn Leas
core@enodev.com

I installed a skylight in my apartment...
The people who live above me are furious!
						-- Stephen Wright

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 22:57     ` Pavel Machek
@ 2002-10-31 22:28       ` Xavier Bestel
  2002-10-31 23:08         ` Pavel Machek
  2002-11-01  9:55         ` Miquel van Smoorenburg
  0 siblings, 2 replies; 333+ messages in thread
From: Xavier Bestel @ 2002-10-31 22:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alexander Viro, Linus Torvalds, Rusty Russell, Linux Kernel Mailing List

Le jeu 31/10/2002 à 23:57, Pavel Machek a écrit :

> This seems like a pretty common situation to me, and current solutions
> are not nice. [I guess ~/bin/ with --x and
> ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
> the problem, but...!]

Can't even this be spied from /proc/*/fd ?



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 21:49         ` Werner Almesberger
@ 2002-10-31 22:32           ` john stultz
  2002-10-31 22:54             ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: john stultz @ 2002-10-31 22:32 UTC (permalink / raw)
  To: Werner Almesberger; +Cc: lkml

On Thu, 2002-10-31 at 13:49, Werner Almesberger wrote:
> john stultz wrote:
> > groups for each project, I have no clue how anyone would be able to
> > handle the (unix)group management required. ACLs let the users
> > themselves manage what people got what access to their data.
> 
> Note that POSIX ACLs don't seem to solve this either: they only
> let you control access in terms of existing users or groups.

I've never looked at the POSIX ACL spec, so forgive my ignorance.
 
> IMHO, this is one of the standard pitfalls of ACLs: if they don't
> let you aggregate information, you quickly end up with huge ACLs
> hanging off every file, and each of those ACLs wants to be
> carefully maintained. I've seen a lot of this in my VMS days.
> (Unix is a bit better, because you can control access at a
> directory level, while VMS needs the ACL on each file, because
> you can open files directly by VMS' equivalent to an inode
> number, without traversing the directory hierarchy. Of course,
> many users didn't know that :-)

While it would be nice to have user-definable ACL groups ("my friends"
or "History 255 TAs") in addition to existing users and groups, I still
don't find this to be critical. Sure, it adds (possibly quite a bit of)
extra data to every file, but it gives you the granularity you need for
the situation I described.  It seems like user-definable ACL groups
would be a nice extra feature on top of existing users or groups, but
not a necessity.

> To make ACLs truly scalable, it would be nice to be able to
> express permissions in terms of access to other filesystem
> objects. E.g. "everybody who can read file ~me/acls/my_friends
> can write the directory on which this ACE hangs". This should
> work like a symlink, i.e. if I add new friends to my_friends, I
> don't have to update all my ACLs.

Ugh, that seems dangerous. Too many forgotten ACL links and then I could
accidentally give a vague acquaintance access to all my data meant for
close friends. 

Regardless, while ACLs do result in extra data per file being used, it
is my understanding that ACLs allow you to solve problems that currently
aren't solvable w/o administrator intervention. In my experience using
them w/ AFS, they have been extremely useful. 


-john



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 21:02           ` Jeff Garzik
@ 2002-10-31 22:37             ` Werner Almesberger
  2002-11-05 11:42               ` [lkcd-devel] " Suparna Bhattacharya
  2002-11-01  1:35             ` Matt D. Robinson
  2002-11-01 13:30             ` Alan Cox
  2 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-10-31 22:37 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

Jeff Garzik wrote:
> That said, I used to be an LKCD cheerleader until a couple people made 
> some good points to me:  it is not nearly low-level enough to truly be 
> of use in crash situations.

I'm not so convinced about this. I like the Mission Critical
approach: save the dump to memory, then either boot through the
firmware or through bootimg (nowadays, that would be kexec),
then retrieve the dump from memory, and do whatever you like
with it.

The huge advantage here is that you don't need a ton of
specialized dump drivers and/or have much of the original kernel
infrastructure to be in a usable state. The rebooted system will
typically be stable enough to offer the full range of utilities,
including up to date drivers for all possible devices, so you
can safely write to disk, scp all the mess to your support
critter, or post an automatic flame to linux-kernel :-)

The weak points of the Mission Critical design are that early
memory allocation in the kernel needs to be tightly controlled,
that architectures that wipe CPU caches on reboot need to
commit them to memory before the firmware restart, and that
drivers need to be able to recover from an "unclean" hardware
state. (I think we'll see much of the latter happen as kexec
advances. The other two issues aren't really special.)

Actually, at the RAS BOF I thought that IBM were developing LKCD
in this direction, and had also eliminated a few not so elegant
choices of Mission Critical's original design. I haven't looked
at the LKCD code, but the descriptions sound as if all the
special-case cruft seems to be back again, which I would find a
little disappointing.

There might be a case for specialized low-overhead dump handlers
for small embedded systems and such, but they're probably better
maintained outside of the mainstream kernel. (They're more like
firmware anyway.)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  7:10         ` Alexander Viro
  2002-10-31  7:21           ` Dax Kelson
@ 2002-10-31 22:53           ` Pavel Machek
  1 sibling, 0 replies; 333+ messages in thread
From: Pavel Machek @ 2002-10-31 22:53 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Dax Kelson, Chris Wedgwood, Rik van Riel, Linus Torvalds,
	Rusty Russell, linux-kernel

Hi!

> > Without ACLs, if Sally, Joe and Bill need rw access to a file/dir, just
> > create another group with just those three people in.  Over time, of
> 
> If Sally, Joe and Bill need rw access to a directory, and Joe and Bill
> are using existing userland (any OS I'd seen), then Sally can easily
> fuck them into the next month and not in a good way.

Do you mean symlink attack?

> _That_ is the real problem.  Until that is solved (i.e. until all
> userland is written up to the standards allegedly followed in writing
> suid-root programs wrt hostile filesystem modifications) NO mechanism
> will help you.  ACLs, huge groups, whatever - setups with that sort
> of access allowed are NOT SUSTAINABLE with the current userland(s).

So userland needs to be improved. It already needs those modifications
because of /tmp. Is there any new issue there?
								Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 22:32           ` john stultz
@ 2002-10-31 22:54             ` Werner Almesberger
  2002-11-01  0:54               ` john stultz
  0 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-10-31 22:54 UTC (permalink / raw)
  To: john stultz; +Cc: lkml

john stultz wrote:
> Ugh, that seems dangerous. Too many forgotten ACL links and then I could
> accidentally give a vague acquaintance access to all my data meant for
> close friends. 

The idea is that you'd typically have (a) (small number of) specific
location(s) where you keep your files representing groups, e.g.
$HOME/acls/ for your personal lists, maybe ~project/acls/ for
projects, etc.

If you already think this is dangerous, then you should be
terrified by regular, non-aggregateable ACLs ;-)

I'm not saying that ACLs aren't useful, only that the lack of
aggregateability makes them hard to maintain, so that people
frequently fall back to setup scripts that simply re-create
their ACL configuration. Once you're at this point, ACLs have
lost much of their usefulness, and you might as well use some
suid program that creates groups for you.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  2:43   ` Alexander Viro
  2002-10-31 16:36     ` Oliver Xymoron
@ 2002-10-31 22:57     ` Pavel Machek
  2002-10-31 22:28       ` Xavier Bestel
  1 sibling, 1 reply; 333+ messages in thread
From: Pavel Machek @ 2002-10-31 22:57 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

Hi!

> > > ext2/ext3 ACLs and Extended Attributes
> > 
> > I don't know why people still want ACL's. There were noises about them for 
> > samba, but I've not heard anything since. Are vendors using this?
> 
> Because People Are Stupid(tm).  Because it's cheaper to put "ACL support: yes"
> in the feature list under "Security" than to make sure that userland can cope
> with anything more complex than  "Me Og.  Og see directory.  Directory Og's.
> Nobody change it".  C.f. snake oil, P.T.Barnum and esp. LSM users

Okay... I have ~/bin/phonebook and I'd like it to be rw- to me, r-- to
jarka and mj, and --- to everyone else. How do I do that without ACLs?
Adding a group is a root-only operation.
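
To make the request concrete, the wanted permissions can be modelled as
below. This is a simplified sketch of an ACL check in Python, assuming
only an owner entry, named-user entries and an "other" entry; it leaves
out the group entries and the mask that real POSIX ACLs have, and the
function name `allowed` and the owner name "me" are hypothetical.

```python
def allowed(acl, user, want):
    """Return True if `user` has every permission letter in `want`.

    `acl` is a dict with keys "owner" (name, perms), "users"
    (name -> perms) and "other" (perms); perms are strings like "rw-".
    Simplified model: no groups and no mask entry.
    """
    owner_name, owner_perms = acl["owner"]
    if user == owner_name:
        perms = owner_perms
    elif user in acl["users"]:
        perms = acl["users"][user]
    else:
        perms = acl["other"]
    return all(letter in perms for letter in want)


phonebook = {
    "owner": ("me", "rw-"),
    "users": {"jarka": "r--", "mj": "r--"},
    "other": "---",
}
```

With this table, `allowed(phonebook, "jarka", "r")` is true while
`allowed(phonebook, "jarka", "w")` and any request from an unlisted
user are false. The point of the sketch is that the "users" entries
are set by the file's owner, with no root-only group creation involved.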

This seems like a pretty common situation to me, and current solutions
are not nice. [I guess ~/bin/ with --x and
~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
the problem, but...!]
								Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 22:28       ` Xavier Bestel
@ 2002-10-31 23:08         ` Pavel Machek
  2002-11-01  9:55         ` Miquel van Smoorenburg
  1 sibling, 0 replies; 333+ messages in thread
From: Pavel Machek @ 2002-10-31 23:08 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Alexander Viro, Linus Torvalds, Rusty Russell, Linux Kernel Mailing List

Hi!

> > This seems like a pretty common situation to me, and current solutions
> > are not nice. [I guess ~/bin/ with --x and
> > ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
> > the problem, but...!]
> 
> Can't even this be spied from /proc/*/fd ?

Not sure... It's true that if users are not careful (i.e. do

cat ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook

it can be seen on ps -aux ;-).
							Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-10-31 22:20         ` Shawn
@ 2002-10-31 23:14           ` Bernhard Kaindl
  2002-11-01  2:01           ` Matt D. Robinson
  1 sibling, 0 replies; 333+ messages in thread
From: Bernhard Kaindl @ 2002-10-31 23:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: lkcd-general

On Thu, 31 Oct 2002, Shawn wrote:
>
> Linus has to "keep up" with all the changes coming into his inbox as
> well, and the more features, the more breakage that can happen when
> Linus accepts a patch.

Yes, but lkcd differs from the other changes because it can make life
easier even for people who don't need the patch themselves: it helps
quality and shortens the time needed to fix bugs.

If someone triggers a problem, they can take a free partition or set
up a network dump server and keep running.  If the problem happens
again, there is a good chance that everything needed to fix it is in
the dump, the System.map and the Kerntypes file from the kernel, which
can quite easily be consolidated into a report with symbolic stack
traces of the CPUs and tasks.

Original source, patches and configuration options are good for
analysing but not required if the Kerntypes file is there. The
config options could even be read from the dump if people would
like that feature. :-)

> Really, Linus wants to push some of his maintenance overhead to distros,
> who get paid to do it, but also to provide sexy bullet point items for
> users, so they buy "Linux" stuff.

Sure, but the work of the distros could be even better if the base
kernel had lkcd, LTT and dprobes (you don't have to enable them if
you don't need them) because then they would have more resources
to do other, even more useful things. But it's up to whoever
merges the stuff.

Bernd




* Re: What's left over.
  2002-10-31  2:31 ` Linus Torvalds
                     ` (15 preceding siblings ...)
  2002-10-31 16:37   ` Henning P. Schmiedehausen
@ 2002-11-01  0:52   ` James Simmons
  2002-11-01 10:24   ` What's left over. (Fbdev rewrite) Helge Hafting
  17 siblings, 0 replies; 333+ messages in thread
From: James Simmons @ 2002-11-01  0:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel


> > Fbdev Rewrite
>
> This one is just huge, and I have little personal judgement on it.

The size has been cut in half now that the issue of AGP being initialized
too late is on hold. We can discuss that move post-halloween. All that is
in the fbdev tree are fbdev changes. So it is safe to pull it.



* Re: What's left over.
  2002-10-31 22:54             ` Werner Almesberger
@ 2002-11-01  0:54               ` john stultz
  2002-11-01  1:31                 ` Werner Almesberger
  2002-11-05  3:58                 ` Andreas Gruenbacher
  0 siblings, 2 replies; 333+ messages in thread
From: john stultz @ 2002-11-01  0:54 UTC (permalink / raw)
  To: Werner Almesberger; +Cc: lkml

On Thu, 2002-10-31 at 14:54, Werner Almesberger wrote:
> john stultz wrote:
> > Ugh, that seems dangerous. Too many forgotten ACL links and then I could
> > accidentally give a vague acquaintance access to all my data meant for
> > close friends. 
> 
> The idea is that you'd typically have (a) (small number of) specific
> location(s) where you keep your files representing groups, e.g.
> $HOME/acls/ for your personal lists, maybe ~project/acls/ for
> projects, etc.

Oh! Ok, that's exactly like the user-definable ACL groups I was
describing. My mistake, I thought you were suggesting some crazy ACL
symlink like: "Make file foo's ACL be the same as file blah's ACL" and
if I then go and add some untrusted user to blah's ACL it would then
automatically change foo's ACL. That just seemed a bit out there, but it
was just my mis-interpretation. Sorry :)

> If you think already this is dangerous, then you should be
> terrified by regular, non-aggregateable ACLs ;-)

Eh, as long as the ACLs are per-file, I can't ever accidentally give
access to a file I didn't mean to. The corner cases of "remove my
ex-friend from all my files" could be annoying, but could be done w/ the
equiv of chgrp -R

> I'm not saying that ACLs aren't useful, only that the lack of
> aggregateability makes them hard to maintain, so that people
> frequently fall back to setup scripts that simply re-create
> their ACL configuration. Once you're at this point, ACLs have
> lost much of their usefulness, and you might as well use some
> suid program that creates groups for you.

Hmmm. I'm way out of my realm of competency here. I just know ACLs were
*really* useful w/ AFS. 

I probably should just go read the specs. Anyone have a pointer, or care
to explain what the differences are between AFS's ACLs and POSIX ACLs?

thanks
-john






* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 21:33             ` Rusty Russell
@ 2002-11-01  1:19               ` Matt D. Robinson
  2002-11-01  2:59                 ` Rusty Russell
  0 siblings, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01  1:19 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Chris Friesen, Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Rusty Russell wrote:
|>The mini-oopser has different aims than LKCD: they want to debug one
|>system, I want to make sure we're reaping OOPS reports from those 99%
|>of desktop users who run X and simply reboot when their machine
|>crashes once a month.

I'd like to incorporate the mini-oopser as an LKCD dump method.
I'll chat with you off-line about this.  Shouldn't be that
difficult to do.

|>I did *not* put the mini-oopser on the Snowball list, because I don't
|>have time to polish it.
|>
|>Rusty.

Thanks,

--Matt



* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 20:59               ` Dave Anderson
  2002-10-31 21:49                 ` Oliver Xymoron
@ 2002-11-01  1:25                 ` Matt D. Robinson
  1 sibling, 0 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01  1:25 UTC (permalink / raw)
  To: Dave Anderson
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

|>On Thu, 31 Oct 2002, Matt D. Robinson wrote:
|>> We want to see this in the kernel, frankly, because it's a pain
|>> in the butt keeping up with your kernel revisions and everything
|>> else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
|>> (hopefully) Red Hat don't want to spend their time having to roll
|>> this stuff in each and every time you roll a new kernel.
|>
|>While Red Hat advocates Ingo's netdump option, we have customer
|>requests that are requiring us to look at LKCD disk-based dumps as an
|>alternative, co-existing dump mechanism.  Since the two methods are
|>not mutually exclusive, LKCD will never kill off netdump -- nor
|>certainly vice-versa.  We're all just looking for a better means
|>to be able to provide support to our customers, not to mention
|>its value as a development aid.

I think you and I are in agreement (as has always been the case in
the past), Dave.  LKCD is meant to create a base for disk, network,
or any dump method.  If Red Hat wants netdump to be the primary
dumping method, that's Red Hat's decision, and more power to
them.  If SuSE wants disk dumps, that's SuSE's decision.  But
for both of them to have to roll their own every single release
or kernel upgrade is unproductive.

What's most concerning about this entire discussion is that I
bet < 20% of the people discussing this have actually LOOKED at
the LKCD patches to see whether or not this is as invasive,
difficult, bloated, or anything negative.  We've spent over a
month now posting them, getting comments, responding to all of
the comments, making sure feedback is accounted for and
responded to, only to get an "LKCD is stupid" type response.

--Matt



* Re: What's left over.
  2002-11-01  0:54               ` john stultz
@ 2002-11-01  1:31                 ` Werner Almesberger
  2002-11-05  3:58                 ` Andreas Gruenbacher
  1 sibling, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-01  1:31 UTC (permalink / raw)
  To: john stultz; +Cc: lkml

john stultz wrote:
> I thought you were suggesting some crazy ACL
> symlink like: "Make file foo's ACL be the same as file blah's ACL" and
> if I then go and add some untrusted user to blah's ACL it would then
> automatically change foo's ACL.

Well, with "foo" getting the ACL from "bar", changing the ACL of
"bar" would change "foo", but not vice versa. Of course, the idea
is that you're careful when changing "bar", just like you'd be
careful with your SSH keys.

> Eh, as long as the ACLs are per-file, I can't ever accidentally give
> access to a file I didn't mean to. The corner cases of "remove my
> ex-friend from all my files" could be annoying, but could be done w/ the
> equiv of chgrp -R

chgrp -R gets nasty if you have files which are stored off-line.
On the other hand, using the concept that ACEs add rights, but
never take them away, even an off-line "ACL link target" would
fail on the safe side, by not adding more rights.
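
[A minimal sketch of that fail-safe property, assuming a purely
hypothetical in-memory model in which an ACL maps a user name to a set
of rights. The names Acl and effective_rights are illustrative only and
are not taken from any POSIX ACL implementation. Because entries can
only add rights, an unreadable link target leaves just the file's own
entries in force.]

```python
from typing import Dict, Optional, Set

# Hypothetical model: an ACL maps a user name to the set of rights granted.
Acl = Dict[str, Set[str]]


def effective_rights(user: str, own_acl: Acl, linked_acl: Optional[Acl]) -> Set[str]:
    """Union of the rights from a file's own ACL and from its link target.

    `linked_acl` is None when the link target cannot be read (e.g. off-line).
    """
    rights = set(own_acl.get(user, set()))
    if linked_acl is not None:
        # Entries only ever add rights; nothing here can revoke one.
        rights |= linked_acl.get(user, set())
    return rights
```

[With the target off-line, a user listed only in the linked ACL gets no
access at all, which is the safe direction to fail in.]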

> I probably should just go read the specs. Anyone have a pointer, or care
> to explain what the differences are between AFS's ACLs and POSIX ACLs?

I've forgotten most things I knew about AFS ACLs (I used them at
IBM about eight years ago), but http://acl.bestbits.at/ and in
particular http://acl.bestbits.at/cgi-man/acl.5 seem to have
everything about POSIX ACLs. They're not very complicated.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: What's left over.
  2002-10-31 21:02           ` Jeff Garzik
  2002-10-31 22:37             ` Werner Almesberger
@ 2002-11-01  1:35             ` Matt D. Robinson
  2002-11-01  2:06               ` Jeff Garzik
  2002-11-01 13:30             ` Alan Cox
  2 siblings, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01  1:35 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Jeff Garzik wrote:
|>Linus Torvalds wrote:
|>[yes, I realize the LKCD merge debate is over, bear with me :)]

For Linus, it is.

|>That said, I used to be an LKCD cheerleader until a couple people made 
|>some good points to me:  it is not nearly low-level enough to truly be 
|>of use in crash situations.  netdump can work if your interrupts are 
|>hosed/screaming, and various mid-layers are dying.  For LKCD to be of 
|>any use, it needs to _skip_ the block layer and talk directly to 
|>low-level drivers.

Just to clarify, LKCD is NOT block based dumping, OR net based
dumping, or anything.  It's an infrastructure for dumping that
lets you, the user, the distributor, the customer, whatever,
make the decision for what's right for you.  Yes, we provide
disk based dumping now, and are including the net dump code
very soon, as well as some of these other smaller dump methods.

Has ANYONE other than Christoph and Stephen H. done a full review of
the LKCD patch set before commenting?  Or are people just making
this stuff up as they go along?  A ton of things have changed
over the past year just because people complained about only doing
disk dumping.  And then to hear this ...

|>So, I think the stock kernel does need some form of disk dumping, 
|>regardless of any presence/absence of netdump.  But LKCD isn't
|>there yet...

Please read the patches and decide again.  If you want the latest
net dump patch, let me know.

|>    Jeff

--Matt



* Re: What's left over.
  2002-10-31 22:20         ` Shawn
  2002-10-31 23:14           ` [lkcd-general] " Bernhard Kaindl
@ 2002-11-01  2:01           ` Matt D. Robinson
  2002-11-02 10:36             ` Brad Hards
  1 sibling, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01  2:01 UTC (permalink / raw)
  To: Shawn
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Shawn wrote:
|>On 10/31, Matt D. Robinson said something like:
|>> On Thu, 31 Oct 2002, Linus Torvalds wrote:
|>> |>On Wed, 30 Oct 2002, Matt D. Robinson wrote:
|>> |>That's fine. And since they are paid to support it, they can apply the 
|>> |>patches.  
|>> 
|>> We want to see this in the kernel, frankly, because it's a pain
|>> in the butt keeping up with your kernel revisions and everything
|>> else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
|>> (hopefully) Red Hat don't want to spend their time having to roll
|>> this stuff in each and every time you roll a new kernel.
|>
|>I share some of your sentiment, but honestly, think about it.
|>
|>Linus has to "keep up" with all the changes coming into his inbox as
|>well, and the more features, the more breakage that can happen when
|>Linus accepts a patch.

Uh ... have you read the patches?  Do you see how few the
changes are to non-dump code?  Do you know that most of those
changes only get triggered in a crash situation anyway?

Breakage occurs when people change code areas that are used
all the time, like VM, network, block layer, etc.

Look at the patches and tell me where we are causing overhead
and serious potential breakage.  If you find problems,
then tell us, don't just comment on breakage scenarios.

|>Really, Linus wants to push some of his maintenance overhead to distros,
|>who get paid to do it, but also to provide sexy bullet point items for
|>users, so they buy "Linux" stuff.

Sure, then remove all of the extra filesystems, sound drivers,
etc., that are bulking up the kernel distribution now and give
them to the distributors to include.

|>You try to find a better balance.

If I could think of a better balance to ease his load, I would.
He's already made his mind up.  It doesn't mean it won't end up
merged by someone else (or everyone else for that matter).

--Matt



* Re: What's left over.
  2002-11-01  1:35             ` Matt D. Robinson
@ 2002-11-01  2:06               ` Jeff Garzik
  2002-11-01  3:46                 ` Matt D. Robinson
  0 siblings, 1 reply; 333+ messages in thread
From: Jeff Garzik @ 2002-11-01  2:06 UTC (permalink / raw)
  To: Matt D. Robinson
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

Matt D. Robinson wrote:

>On Thu, 31 Oct 2002, Jeff Garzik wrote:
>|>Linus Torvalds wrote:
>|>[yes, I realize the LKCD merge debate is over, bear with me :)]
>
>For Linus, it is.
>
>|>That said, I used to be an LKCD cheerleader until a couple people made 
>|>some good points to me:  it is not nearly low-level enough to truly be 
>|>of use in crash situations.  netdump can work if your interrupts are 
>|>hosed/screaming, and various mid-layers are dying.  For LKCD to be of 
>|>any use, it needs to _skip_ the block layer and talk directly to 
>|>low-level drivers.
>
>Just to clarify, LKCD is NOT block based dumping, OR net based
>dumping, or anything.  It's an infrastructure for dumping that
>lets you, the user, the distributor, the customer, whatever,
>make the decision for what's right for you.  Yes, we provide
>disk based dumping now, and are including the net dump code
>very soon, as well as some of these other smaller dump methods.
>
>Has ANYONE other than Christoph and Stephen H. done a full review of
>the LKCD patch set before commenting?  Or are people just making
>this stuff up as they go along?  A ton of things have changed
>over the past year just because people complained about only doing
>disk dumping.  And then to hear this ...
>  
>
You are confusing review with perspective.  I've read 
http://lkcd.sourceforge.net/download/latest/ before, and just checked it 
again tonight before posting.

My view is:  LKCD becomes useful to merge when the average user can do 
"safe" disk dumps.  netdumps are better for corporate customers, but for 
average users, disk dumps are _the_ method which is easiest, most 
accessible, and thus most helpful to kernel hackers debugging their 
problems.  LKCD has a dump block dev driver, but it's not even close to 
being low-level enough to be "safe".

Re-read my other post(s) -- I have said repeatedly that LKCD's 
infrastructure is decent.  But it's completely pointless to merge a 
decent infrastructure unless the users are up to snuff.  It's much 
smarter to keep the infrastructure out of the kernel until the low-level 
dump drivers are hammered out and stable, because that gives you more 
freedom to change the API.


>|>So, I think the stock kernel does need some form of disk dumping, 
>|>regardless of any presence/absence of netdump.  But LKCD isn't
>|>there yet...
>
>Please read the patches and decide again.  If you want the latest
>net dump patch, let me know.
>  
>

I have.  Nothing has changed.  Stable, polling, low-level disk dumps are 
not in the LKCD patches.

IMO, net dump is what corporate customers and network admins want.  And 
overall, net dumps are probably easier and much safer than disk dumps, 
from an implementor's perspective.  However, disk dumps are what the 
average kernel hacker will find most useful, because it is the easiest 
for end users, and thus will generate a higher number of quality bug 
reports.

    Jeff





* Re: [lkcd-devel] Re: What's left over.
  2002-11-01  1:19               ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-01  2:59                 ` Rusty Russell
  0 siblings, 0 replies; 333+ messages in thread
From: Rusty Russell @ 2002-11-01  2:59 UTC (permalink / raw)
  To: Matt D. Robinson
  Cc: Chris Friesen, Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel

In message <Pine.LNX.4.44.0210311718140.23393-100000@nakedeye.aparity.com> you 
write:
> On Fri, 1 Nov 2002, Rusty Russell wrote:
> |>The mini-oopser has different aims than LKCD: they want to debug one
> |>system, I want to make sure we're reaping OOPS reports from those 99%
> |>of desktop users who run X and simply reboot when their machine
> |>crashes once a month.
> 
> I'd like to incorporate the mini-oopser as an LKCD dump method.
> I'll chat with you off-line about this.  Shouldn't be that
> difficult to do.

That would defeat the "mini" part 8)

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


* Re: What's left over.
  2002-11-01  2:06               ` Jeff Garzik
@ 2002-11-01  3:46                 ` Matt D. Robinson
  2002-11-01  4:45                   ` Linus Torvalds
  0 siblings, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01  3:46 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Jeff Garzik wrote:
|>Re-read my other post(s) -- I have said repeatedly that LKCD's 
|>infrastructure is decent.  But it's completely pointless to merge a 
|>decent infrastructure unless the users are up to snuff.  It's much 
|>smarter to keep the infrastructure out of the kernel until the low-level 
|>dump drivers are hammered out and stable, because that gives you more 
|>freedom to change the API.

This is where we disagree.  Without the base infrastructure, this
becomes an even larger and larger patch which needs testing and
verification with a massive number of configurations for each new
kernel release.  Do you know how much testing we go through for each
new kernel release?  Do you know that we actually try this stuff
out with panic(), die(), interrupt and sysrq() dumps before we send
it off?  Do you know we try this for SMP and UP?

If Linus would at least take the infrastructure patches and leave
out the drivers/dump code, that might be a good start.  Just take
the base code.  Just take the patches for panic.c, dump_ipi(), or
the rest of the other base kernel components.  But no.  Instead,
Linus just says "LKCD is stupid".

I also think you have completely misrepresented the LKCD user base,
but I'm sure our opinion on who those LKCD users are is different
and it's pointless to argue one person's experiences over another's.

I hate Linus' ego, I hate this whole damn discussion, and I find
it very irritating that I have to go through this process after
many people have created, enhanced and used LKCD for three years,
and this is where we're at.

To spend the last month and a half finalizing things for Linus,
sending this to him on multiple occasions, asking for his comments
and inclusion, asking for his feedback (as well as others), and
not hearing _one damn word_ from Linus all that time, and for him
to wait until now to just say "LKCD is stupid" is insulting.

--Matt



* Re: What's left over.
  2002-11-01  3:46                 ` Matt D. Robinson
@ 2002-11-01  4:45                   ` Linus Torvalds
  2002-11-01  4:57                     ` Patrick Finnegan
  0 siblings, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-11-01  4:45 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.44.0210311923460.24182-100000@nakedeye.aparity.com>,
Matt D. Robinson <yakker@aparity.com> wrote:
>
>To spend the last month and a half finalizing things for Linus,
>sending this to him on multiple occasions, asking for his comments
>and inclusion, asking for his feedback (as well as others), and
>not hearing _one damn word_ from Linus all that time, and for him
>to wait until now to just say "LKCD is stupid" is insulting.

You got to hear my comment now, several times: convince somebody _else_.

But no, it wasn't the answer you wanted.  So you refuse to listen.  And
yes, I get irritated too.  So right now I won't touch LKCD with a
ten-foot pole, if only because I've been mail-bombed by people who argue
for it when I have better things to do than to explain myself over and
over again. 

What's so hard to understand about the "vendor-driven" thing, and why do
people continue to argue about it? 

			Linus


* Re: What's left over.
  2002-11-01  4:45                   ` Linus Torvalds
@ 2002-11-01  4:57                     ` Patrick Finnegan
  2002-11-01  9:18                       ` Henning P. Schmiedehausen
  0 siblings, 1 reply; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-01  4:57 UTC (permalink / raw)
  To: linux-kernel

On Fri, 1 Nov 2002, Linus Torvalds wrote:

> But no, it wasn't the answer you wanted.  So you refuse to listen.  And
> yes, I get irritated too.  So right now I won't touch LKCD with a
> ten-foot pole, if only because I've been mail-bombed by people who argue
> for it when I have better things to do than to explain myself over and
> over again.

Maybe it's because users are wanting it in the mainline kernel...  Notice
I said 'users' not 'vendors' or 'the code's maintainers'.

> What's so hard to understand about the "vendor-driven" thing, and why do
> people continue to argue about it?

Because I'm not a vendor, and I want it.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif





* Re: What's left over.
  2002-10-31 19:20             ` Alan Cox
  2002-10-31 19:17               ` Nicholas Wourms
  2002-10-31 20:45               ` Jeff Garzik
@ 2002-11-01  6:00               ` James Morris
  2 siblings, 0 replies; 333+ messages in thread
From: James Morris @ 2002-11-01  6:00 UTC (permalink / raw)
  To: Alan Cox; +Cc: Linux Kernel Mailing List, David S. Miller

On 31 Oct 2002, Alan Cox wrote:

> Chris is right that crypto api is misdesigned if we want to use hardware
> cryptocards

Hardware support was not an initial goal, as the requirements are not yet 
fully known.

From Documentation/crypto/api-intro.txt:

  An asynchronous scheduling interface is in planning but not yet
  implemented, as we need to further analyze the requirements of all of
  the possible hardware scenarios (e.g. IPsec NIC offload).

Hardware accelerators are generally a known issue, with already proven 
solutions (e.g. the OpenBSD crypto queue).  We don't know much about IPSec 
NIC offload yet, however.


- James
-- 
James Morris
<jmorris@intercode.com.au>




* Re: What's left over.
  2002-10-31 17:25         ` Linus Torvalds
                             ` (5 preceding siblings ...)
  2002-10-31 21:02           ` Jeff Garzik
@ 2002-11-01  6:27           ` Bill Davidsen
  2002-11-01  6:36             ` Linus Torvalds
                               ` (2 more replies)
  6 siblings, 3 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-01  6:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> 
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> 
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > > 
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> > 
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
> 
> That's fine. And since they are paid to support it, they can apply the 
> patches.  
> 
> What I'm saying by "vendor driven" is that it has no relevance for the 
> standard kernel, and since it has no relevance to that, then I have no 
> incentives to merge it. The crash dump is only useful with people who 
> actively look at the dumps, and I don't know _anybody_ outside of the 
> specialized vendors you mention who actually do that.

  You're not listening! Screw the vendors! The users want this enough to
be patching it into their kernels now.

> 
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users' problems or whatever).

  Did you not read the input from the developers? From the people who have
headless clusters?

  I have Linux systems in fifteen locations, six states, four timezones.
They oops from time to time, and I can't get any clue why, because (a)
they have no console, (b) most are in secure locations like locked wiring
closets with no one to read a console, and (c) the systems are thousands
of miles away. I don't need a debugger, I'd love to just have ksymoops
output! And given the reality of using the network, I don't make kcore
world readable, I'm not about to send that information over a few
thousand miles of open net to save writing it to disk.

  I also have Solaris and AIX servers, and if they go down I send a crash
dump to the vendor who can then provide support. Big difference. Visible
even to management, who see a real support issue.

> 
> Horse before the cart and all that thing.
> 
> People have to realize that my kernel is not for random new features.

  Supportability is not a "random new feature," it's something which was
developed because users had a need (not by a vendor looking for a feature
to advertise), and if you would read the mail it's mostly coming from
people who want to use the feature. This is a whole new kernel series, it
will be stable a hell of a lot sooner if people can find problems!

  Notice that developers want it, vendors want to provide it, and end
users want to be able to get support. In fact, other than one person who
had doubts about the implementation being optimal, your voice is the only
one I hear against it. That should tell you something.

  Sometimes the best way to lead is to look at where everyone is going on
their own, jump in front, and yell "Follow me!" a few times. If you put
half the energy into improving the implementation that you put into
telling us we're all wrong it would be a better kernel.



On Thu, 31 Oct 2002, Linus Torvalds wrote:

> 
> [ Ok, this is a really serious email. If you don't get it, don't bother 
>   emailing me. Instead, think about it for an hour, and if you still don't 
>   get it, ask somebody you know to explain it to you. ]
> 
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> > 
> > Sure, but why should they have to?  What technical reason is there
> > for not including it, Linus?
> 
> There are many:
> 
>  - bloat kills:
> 
> 	My job is saying "NO!"
> 
> 	In other words: the question is never EVER "Why shouldn't it be
> 	accepted?", but it is always "Why do we really not want to live 
> 	without this?"

  I suspect that you have not had to make any significant part of your
living administering systems, certainly not recently. Lack of this tool is
a one-to-one mapping to "no clue" if you can't get information from the
console.
 
>  - included features kill off (potentially better) projects.
> 
> 	There's a big "inertia" to features. It's often better to keep 
> 	features _off_ the standard kernel if they may end up being
> 	further developed in totally new directions.

  Yes, you can clearly see how that worked with ext2 stifling development
of... wait a minute, rethink that argument. This feature is years old, and
seems to be ready to add new destinations for the data, disk, net, high
memory, what else is there? Once the data is saved people will be able to
develop any additional tools they want to read the raw data.
 
> 	In particular when it comes to this project, I'm told about
> 	"netdump", which doesn't try to dump to a disk, but over the net.
> 	And quite frankly, my immediate reaction is to say "Hell, I
> 	_never_ want the dump touching my disk, but over the network
> 	sounds like a great idea".

  You have this idea that the dump will go over a high reliability path,
and that's an option, but not in all cases true.

> To me this says "LKCD is stupid". Which means that I'm not going to apply 
> it, and I'm going to need some real reason to do so - ie being proven 
> wrong in the field.

  You've been proven wrong, you just don't want to look at the proof! You
can't say it doesn't work, it does. You can't say the {users, vendors,
developers} don't want it, because they do. You can't say it's untested,
it's been in use for several years, and you seem willing to take reiser4,
which isn't even finished yet!

> (And don't get me wrong - I don't mind getting proven wrong. I change my 
> opinions the way some people change underwear. And I think that's ok).

  If you really believed the stuff you say you'd put it in and promise to
take it out if people didn't find it useful or there were inherent
limitations. It would probably take 10-30% off the time to a stable
release.
 
> > I completely don't understand your reasoning here.
> 
> Tough. That's YOUR problem.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: What's left over.
  2002-10-31 18:22             ` Linus Torvalds
  2002-10-31 20:59               ` Dave Anderson
@ 2002-11-01  6:34               ` Bill Davidsen
  2002-11-01 13:26                 ` Alan Cox
  1 sibling, 1 reply; 333+ messages in thread
From: Bill Davidsen @ 2002-11-01  6:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Friesen, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> 
> On Thu, 31 Oct 2002, Chris Friesen wrote:
> > 
> > How do you deal with netdump when your network driver is what caused the 
> > crash?
> 
> Actually, from a driver perspective, _the_ most likely driver to crash is 
> the disk driver. 
> 
> That's from years of experience. The network drivers are a lot simpler,
> the hardware is simpler and more standardized, and doesn't do as many
> things. It's just plain _easier_ to write a network driver than a disk
> driver.
> 
> Ask anybody who has done both.

  From the standpoint of just the driver that's true. However, the remote
machine and all the network bits between them are a string of single
points of failure. Isn't it good that both disk and network can be
supported?

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:27           ` Bill Davidsen
@ 2002-11-01  6:36             ` Linus Torvalds
  2002-11-01  7:00               ` [lkcd-devel] " Castor Fu
                                 ` (2 more replies)
  2002-11-01  9:20             ` Henning P. Schmiedehausen
  2002-11-01 13:29             ` Alan Cox
  2 siblings, 3 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-01  6:36 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel


On Fri, 1 Nov 2002, Bill Davidsen wrote:
> 
>   If you really believed the stuff you say you'd put it in and promise to
> take it out if people didn't find it useful or there were inherent
> limitations.

This never works. Be honest. Nobody takes out features, they are stuck 
once they get in. Which is exactly why my job is to say "no", and why 
there is no "accepted unless proven bad". 

> It would probably take 10-30% off the time to a stable release.

Talk is cheap.

I've not seen a _single_ bug-report with a fix that attributed the
existing LKCD patches. I might be more impressed if I had. 

The basic issue is that we don't put patches in in the hope that they will
prove themselves later. Your argument is fundamentally flawed.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-01  6:36             ` Linus Torvalds
@ 2002-11-01  7:00               ` Castor Fu
  2002-11-01  8:23               ` Craig I. Hagan
  2002-11-01 13:28               ` Alan Cox
  2 siblings, 0 replies; 333+ messages in thread
From: Castor Fu @ 2002-11-01  7:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bill Davidsen, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

On Thu, 31 Oct 2002, Linus Torvalds wrote:

>
> On Fri, 1 Nov 2002, Bill Davidsen wrote:
> >
> >   If you really believed the stuff you say you'd put it in and promise to
> > take it out if people didn't find it useful or there were inherent
> > limitations.
>
> This never works. Be honest. Nobody takes out features, they are stuck
> once they get in. Which is exactly why my job is to say "no", and why
> there is no "accepted unless proven bad".
>
> > It would probably take 10-30% off the time to a stable release.
>
> Talk is cheap.
>
> I've not seen a _single_ bug-report with a fix that attributed the
> existing LKCD patches. I might be more impressed if I had.

Maybe people don't bother to spell out how they got there.  Here's one.

    -castor

:: Newsgroups: mlist.linux.kernel
:: Date:   Mon, 17 Dec 2001 09:48:53 -0800 (PST)
:: From: Castor Fu <castor@3pardata.com>
:: X-To: <linux-kernel@vger.kernel.org>
:: Subject: i386 machine_restart unsafe in interrupt context
:: Message-ID: <linux.kernel.Pine.LNX.4.33.0112170935520.1623-100000@marais.SOMEWHERE>
:: MIME-Version: 1.0
:: Content-Type: TEXT/PLAIN; charset=US-ASCII
:: Approved: news@nntp-server.caltech.edu
:: Lines: 27
::
::
:: I have a problem where systems fail to reboot on panic().  I've resolved
:: it by changing smp_send_stop() to use an NMI (like the KDB patch does to
:: manage communication).
::
:: The source of the problem is that the panic path has the following:
::
::     panic()
::         machine_restart()
::             machine_real_restart()
::                 smp_send_stop()
::                     smp_call_function()
::
:: and smp_call_function() is not safe in an interrupt context.
::
:: I imagine people might want to handle this differently, but I'd be
:: happy to send diffs if there's interest.  It may be that there are enough
:: cases like this that smp_call_function might want a version that
:: uses an NMI. . .
::
::     -Castor Fu
::     castor@3par.com
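
The call chain in the report above can be modelled as a small lookup, as a rough sketch only. The function names come from the report; the `NOT_IRQ_SAFE` set and the `first_unsafe_call` helper are hypothetical names introduced here, and the only safety fact encoded is the one the report itself states (that smp_call_function() is not safe in interrupt context).

```python
# Hypothetical model of the panic path quoted in the report above.
# Only one safety fact is encoded, and it comes from the report itself:
# smp_call_function() must not be called from interrupt context.
CALL_CHAIN = [
    "panic",
    "machine_restart",
    "machine_real_restart",
    "smp_send_stop",
    "smp_call_function",
]
NOT_IRQ_SAFE = {"smp_call_function"}


def first_unsafe_call(chain, in_interrupt):
    """Return the first function in `chain` that is unsafe in interrupt
    context, or None if the chain is fine for the given context."""
    if not in_interrupt:
        return None
    for name in chain:
        if name in NOT_IRQ_SAFE:
            return name
    return None


# A panic() raised from an interrupt handler reaches the unsafe call.
print(first_unsafe_call(CALL_CHAIN, in_interrupt=True))
```

This only illustrates why the context of the caller matters; it says nothing about how the proposed NMI-based fix would be implemented.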


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 21:14       ` Rusty Russell
@ 2002-11-01  8:20         ` Joe Thornber
  0 siblings, 0 replies; 333+ messages in thread
From: Joe Thornber @ 2002-11-01  8:20 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel

On Fri, Nov 01, 2002 at 08:14:16AM +1100, Rusty Russell wrote:
> In message <20021031101558.GB7487@fib011235813.fsnet.co.uk> you write:
> > On Thu, Oct 31, 2002 at 02:00:31PM +1100, Rusty Russell wrote:
> > > They have, IIRC.  Interestingly, it was less invasive (existing source
> > > touched) than the LVM2/DM patch you merged.
> > 
> > FUD.  I added to three areas of existing code:
> 
> [ 40-line detailed explanation snipped ]
> 
> Woah!  War's over dude!  We won!

:)

Sorry, it wasn't meant to be an aggressive email.  However, comments
like this do get picked up out of context and passed around until they
become the accepted truth.  I'm still trying to work out where the 'dm
can't handle mirroring or raid' rumour came from.

- Joe

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:36             ` Linus Torvalds
  2002-11-01  7:00               ` [lkcd-devel] " Castor Fu
@ 2002-11-01  8:23               ` Craig I. Hagan
  2002-11-01 14:03                 ` Patrick Finnegan
  2002-11-02  4:57                 ` Bill Davidsen
  2002-11-01 13:28               ` Alan Cox
  2 siblings, 2 replies; 333+ messages in thread
From: Craig I. Hagan @ 2002-11-01  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bill Davidsen, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

> Talk is cheap.
> 
> I've not seen a _single_ bug-report with a fix that attributed the
> existing LKCD patches. I might be more impressed if I had. 
> 
> The basic issue is that we don't put patches in in the hope that they will
> prove themselves later. Your argument is fundamentally flawed.

comment from userspace:

I'm going to have to side with Linus here despite my desire to see LKCD merged.
However, we need to show him the money. This means:

	* making sure that the patches are kept up to date

	* keep the LKCD patches in the list/community spotlight in a positive
		manner ("please test this!", or  "please use this when
		looking for help debugging a system problem"). Perhaps
		a 2.5.x-lkcd bk tree or something like that.

	* make documentation/HOWTO's available for folks so that
		they'll know how to generate a crashdump
		and run some utilities against it to generate
		a synopsis which can be submitted for debugging

	* most important: squash a whole lot of bugs with
		said dumps!

If it becomes apparent through empirical data that crash dumps are a useful
tool, I'm sure that Linus will become far more amenable. Until then, let's let
him handle all of his other work which needs to get done.

-- craig



	  .-    ... . -.-. .-. . -    -- . ... ... .- --. .

			    Craig I. Hagan
			   hagan(at)cih.com




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  4:57                     ` Patrick Finnegan
@ 2002-11-01  9:18                       ` Henning P. Schmiedehausen
  2002-11-01 14:55                         ` Patrick Finnegan
  0 siblings, 1 reply; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-11-01  9:18 UTC (permalink / raw)
  To: linux-kernel

Patrick Finnegan <pat@purdueriots.com> writes:

>> What's so hard to understand about the "vendor-driven" thing, and why do
>> people continue to argue about it?

>Because I'm not a vendor, and I want it.

So get your vendor to integrate it. 

You don't have a vendor, but roll your own kernels? Tough, so you
are a "vendor". Surprise, surprise.

Replace "vendor" with "people who roll up and distribute kernels". So
one vendor (Linus) refuses to integrate LKCD. Tough. Use another
one. Think USP here. Think diversity. Think competition. Maybe "that
vendor" (Linus) will catch up one day. Maybe not. Maybe "competition"
is not on his agenda. So what?

Get SuSE. They will integrate everything and their grandmother in
their kernels.

Gee, most people seem to think that "vendor" means "big evil
corporation in Redmond, WA".

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:27           ` Bill Davidsen
  2002-11-01  6:36             ` Linus Torvalds
@ 2002-11-01  9:20             ` Henning P. Schmiedehausen
  2002-11-01 13:29             ` Alan Cox
  2 siblings, 0 replies; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-11-01  9:20 UTC (permalink / raw)
  To: linux-kernel

Bill Davidsen <davidsen@tmr.com> writes:

>  You're not listening! Screw the vendors! The users want this enough to
                         ^^^^^^^^^^^^^^^^^^
>be patching it into their kernels now.

[...]

>  I also have Solaris and AIX servers, and if they go down I send a crash
>dump to the vendor who can then provide support. Big difference. Visible
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

q.e.d. End of Discussion.

	Regards
		Henning


-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 22:28       ` Xavier Bestel
  2002-10-31 23:08         ` Pavel Machek
@ 2002-11-01  9:55         ` Miquel van Smoorenburg
  1 sibling, 0 replies; 333+ messages in thread
From: Miquel van Smoorenburg @ 2002-11-01  9:55 UTC (permalink / raw)
  To: linux-kernel

In article <1036103335.25512.40.camel@bip>,
Xavier Bestel  <xavier.bestel@free.fr> wrote:
>Le jeu 31/10/2002 à 23:57, Pavel Machek a écrit :
>
>> This seems like a pretty common situation to me, and current solutions
>> are not nice. [I guess ~/bin/ with --x and
>> ~/bin/my-secret-password-only-jarka-and-mj-knows/phonebook would solve
>> the problem, but...!]
>
>Can't even this be spied from /proc/*/fd ?

Or ptrace, /proc/pid/mem, etc. If you can execute a binary, it
has to be loaded into memory in a process running as you, so
you can read it.

Mike.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over. (Fbdev rewrite)
  2002-10-31  2:31 ` Linus Torvalds
                     ` (16 preceding siblings ...)
  2002-11-01  0:52   ` James Simmons
@ 2002-11-01 10:24   ` Helge Hafting
  17 siblings, 0 replies; 333+ messages in thread
From: Helge Hafting @ 2002-11-01 10:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds wrote:

> > Fbdev Rewrite
> 
> This one is just huge, and I have little personal judgement on it.

This lets me use a multihead console, which lets me
run two X workstations in one pc. Two (or more)
keyboards, mice and screens.  But only one
expensive mainboard, only one space-consuming case.

Great for home use, possibly at work too.

Those who care about clean code can enjoy
the separation of framebuffer and console too.

Helge Hafting

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:34               ` Bill Davidsen
@ 2002-11-01 13:26                 ` Alan Cox
  2002-11-01 19:00                   ` Joel Becker
  2002-11-03 13:48                   ` Bill Davidsen
  0 siblings, 2 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-01 13:26 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Linus Torvalds, Chris Friesen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
>   From the standpoint of just the driver that's true. However, the remote
> machine and all the network bits between them are a string of single
> points of failure. Isn't it good that both disk and network can be
> supported?

My concerns are solely with things like the correctness of the disk
dumper. It's obviously a good way to do a lot more damage if it isn't done
carefully. Quite clearly your dump system wants to support multiple dump
targets so you can dump to PCI battery-backed RAM, down the parallel
port to an analysing box, etc.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:36             ` Linus Torvalds
  2002-11-01  7:00               ` [lkcd-devel] " Castor Fu
  2002-11-01  8:23               ` Craig I. Hagan
@ 2002-11-01 13:28               ` Alan Cox
  2002-11-02  5:00                 ` Bill Davidsen
  2 siblings, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-01 13:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bill Davidsen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> This never works. Be honest. Nobody takes out features, they are stuck 
> once they get in. 

Linus, I've asked a couple of times about killing sound/oss off now that ALSA
is integrated 8) While you are on the rant, how about that ;)


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  6:27           ` Bill Davidsen
  2002-11-01  6:36             ` Linus Torvalds
  2002-11-01  9:20             ` Henning P. Schmiedehausen
@ 2002-11-01 13:29             ` Alan Cox
  2 siblings, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-01 13:29 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Fri, 2002-11-01 at 06:27, Bill Davidsen wrote:
>   You're not listening! Screw the vendors! The users want this enough to
> be patching it into their kernels now.

Welcome to free software. If you can make a case for it go sell people
suitable kernels, build an "LKCD kernel site" whatever.



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 21:02           ` Jeff Garzik
  2002-10-31 22:37             ` Werner Almesberger
  2002-11-01  1:35             ` Matt D. Robinson
@ 2002-11-01 13:30             ` Alan Cox
  2002-11-01 22:28               ` Rusty Russell
  2 siblings, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-01 13:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 21:02, Jeff Garzik wrote:
> hosed/screaming, and various mid-layers are dying.  For LKCD to be of 
> any use, it needs to _skip_ the block layer and talk directly to 
> low-level drivers.

Rusty wrote a polled IDE driver that should handle some subset of that


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  8:23               ` Craig I. Hagan
@ 2002-11-01 14:03                 ` Patrick Finnegan
  2002-11-02  4:57                 ` Bill Davidsen
  1 sibling, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-01 14:03 UTC (permalink / raw)
  To: linux-kernel, lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Craig I. Hagan wrote:

> > Talk is cheap.
> >
> > I've not seen a _single_ bug-report with a fix that attributed the
> > existing LKCD patches. I might be more impressed if I had.
> >
> > The basic issue is that we don't put patches in in the hope that they will
> > prove themselves later. Your argument is fundamentally flawed.
>
> comment from userspace:
>
> I'm going to have to side with Linus here despite my desire to see LKCD
> merged.

I'll have to disagree with what you're saying, because:

> However, we need to show him the money. This means:
>
> 	* making sure that the patches are kept up to date

They are being kept up to date, and apparently have been for some time.

> 	* keep the LKCD patches in the list/community spotlight in a positive
> 		manner ("please test this!", or  "please use this when
> 		looking for help debugging a system problem"). Perhaps
> 		a 2.5.x-lkcd bk tree or something like that.

Umm, and the difference between maintaining a set of patches per kernel
version and something using bitkeeper (or heaven forbid, CVS)?  Even
Linus didn't start using source code management until somewhat
recently.

> 	* make documentation/HOWTO's available for folks so that
> 		they'll know how to generate a crashdump
> 		and run some utilities against it to generate
> 		a synopsis which can be submitted for debugging

Have you seen http://lkcd.sf.net ?  They have that there.   I've
successfully walked through their well-written tutorials and produced
crashdumps from machines that have failed.

> 	* most important: squash a whole lot of bugs with
> 		said dumps!

Perhaps people are but they're not posting the bugs to the list...

> If it becomes apparent through empirical data that crash dumps are a useful
> tool, I'm sure that Linus will become far more amenable. Until then, let's let
> him handle all of his other work which needs to get done.

The data is there, perhaps not for Linux, but for other Unixes -
including ones like the BSDs.  Crashdumps are an invaluable resource for
finding bugs that involve things like hardware that doesn't conform
exactly to specs, or deadlocks, or...

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  9:18                       ` Henning P. Schmiedehausen
@ 2002-11-01 14:55                         ` Patrick Finnegan
  2002-11-01 15:16                           ` Alexander Viro
  2002-11-01 15:32                           ` Richard B. Johnson
  0 siblings, 2 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-01 14:55 UTC (permalink / raw)
  To: linux-kernel

On Fri, 1 Nov 2002, Henning P. Schmiedehausen wrote:

> Patrick Finnegan <pat@purdueriots.com> writes:
>
> >Because I'm not a vendor, and I want it.
>
> You don't have a vendor, but roll your own kernels? Tough, so you
> are a "vendor". Surprise, surprise.
>
> Replace "vendor" with "people who roll up and distribute kernels". So
> one vendor (Linus) refuses to integrate LKCD. Tough. Use another

I'm confused, you just said (1) I'm a vendor and then (2) Linus is my
vendor.  And besides, we don't distribute the kernels - we install them on
our own machines, and say 'done'.  The lack of distribution (at least IMO)
should make us not be a vendor.

> one. Think USP here. Think diversity. Think competition. Maybe "that
> vendor" (Linus) will catch up one day. Maybe not. Maybe "competition"
> is not on his agenda. So what?

This isn't about competition.  It's about integrating a core useful
feature that has been shown to be empirically useful by every other person
who writes an OS kernel.

> Get SuSE. They will integrate everything and their grandmother in
> their kernels.

That's not really an option at the moment.  We have a distro vendor
(RedHat) and were dissatisfied with its kernels so we are trying to use
*the*official* kernel (Linus's kernel).

> Gee, most people seem to think that "vendor" means "big evil
> corporation in Redmond, WA".

No, vendor == people who sold or gave us the software.  Right now, Linus is
acting like he's a big evil corporation that won't add the change no
matter what we say:

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > Sure, but why should they have to?  What technical reason is there
> > for not including it, Linus?
<snipped reasons that are imho incorrect>
> To me this says "LKCD is stupid". Which means that I'm not going to
> apply it

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> Don't bother to ask me to merge the thing, that only makes me get even
> more fed up with the whole discussion.

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.

On Fri, 1 Nov 2002, Linus Torvalds wrote:

> You got to hear my comment now, several times: convince somebody _else_.
<snip>
> What's so hard to understand about the "vendor-driven" thing, and why do
> people continue to argue about it?

You know, considering the volume of people on this list that have been
saying "I want it, Linus, please integrate it"  and:

On Thu, 31 Oct 2002, Matt D. Robinson wrote:

> I hate Linus' ego, I hate this whole damn discussion, and I find
> it very irritating that I have to go through this process after
> many people have created, enhanced and used LKCD for three years,
> and this is where we're at.
>
> To spend the last month and a half finalizing things for Linus,
> sending this to him on multiple occasions, asking for his comments
> and inclusion, asking for his feedback (as well as others), and
> not hearing _one damn word_ from Linus all that time, and for him
> to wait until now to just say "LKCD is stupid" is insulting.

You know, pissing off core developers of projects that have been shown to
be (1) desired (2) potentially useful in Linux, even as an aid to other
Linux subsystem developers and (3) empirically shown to be useful for other
Free *nixes such as the BSDs, is not what I would be doing as a project
maintainer.  Of course, I'm not Linus...

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif









^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 14:55                         ` Patrick Finnegan
@ 2002-11-01 15:16                           ` Alexander Viro
  2002-11-01 15:27                             ` Patrick Finnegan
  2002-11-01 16:16                             ` Patrick Finnegan
  2002-11-01 15:32                           ` Richard B. Johnson
  1 sibling, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-11-01 15:16 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel



On Fri, 1 Nov 2002, Patrick Finnegan wrote:

> No, vendor == people who sold or gave us the software.  Right now, Linus is
> acting like he's a big evil corporation that won't add the change no
> matter what we say:

... to his tree.  Geez, why could that be?  Maybe because you don't have
any right to decide what patches anybody else applies to their trees?

It's not a fscking public service.  Linus has full control over his
tree.  You have equally full control over your tree.  Linus can't
tell you what patches to apply in your tree.  You can't tell Linus
what patches he should apply to his.

"I'm not satisfied with this tree, I'll try that one" is perfectly OK.
"I'm not satisfied with either, so bend the fsck over and change your
tree the way I want" is _NOT_.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 19:43                 ` Chris Wedgwood
@ 2002-11-01 15:25                   ` Linus Torvalds
  2002-11-01 15:35                     ` bert hubert
                                       ` (4 more replies)
  0 siblings, 5 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-01 15:25 UTC (permalink / raw)
  To: linux-kernel

In article <20021031194351.GA24676@tapu.f00f.org>,
Chris Wedgwood  <cw@f00f.org> wrote:
>On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:
>
>> Any hardware that needs to go off and think about how to encrypt
>> something sounds like it's so slow as to be unusable. I suspect that
>> anything that is over the PCI bus is already so slow (even if it
>> adds no extra cycles of its own) that you're better off using the
>> CPU for the encryption rather than some external hardware.
>
>Except almost all hardware out there that does this stuff is async to
>some extent...

That's not my argument.  I realize that external hardware on a PCI bus
_has_ to be asynchronous, simply because it is so slow. 

The question I have is whether such external hardware is even worth it
any more for any standard crypto work.  With a regular PCI bus
fundamentally limiting throughput to something like a maximum of 66MB/s
(copy-in and copy-out, and that's so theoretical that it's not even
funny - I'd be surprised if RL throughput copying back and forth over a
PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
faster on the CPU directly these days. 

Maybe not. The only numbers I have are for the slowness of PCI.
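
For what it's worth, the 66MB/s figure above falls out of simple arithmetic. The sketch below assumes a conventional 32-bit, 33 MHz PCI bus (the message does not name the bus, so both constants are assumptions) and that every byte crosses the bus twice, once in and once out. The function names are made up for illustration.

```python
# Back-of-the-envelope PCI throughput.
# Assumption: a conventional 32-bit, 33 MHz PCI bus; neither figure is
# stated in the message above.
BUS_WIDTH_BYTES = 4        # 32-bit data path
CLOCK_HZ = 33_000_000      # nominal 33 MHz clock


def peak_bytes_per_sec(width_bytes, clock_hz):
    """Theoretical peak: one transfer of `width_bytes` on every clock."""
    return width_bytes * clock_hz


def round_trip_bytes_per_sec(peak):
    """Data is copied in and copied back out, so useful throughput halves."""
    return peak / 2


peak = peak_bytes_per_sec(BUS_WIDTH_BYTES, CLOCK_HZ)
useful = round_trip_bytes_per_sec(peak)
print(f"peak {peak / 1e6:.0f} MB/s, crypto round trip {useful / 1e6:.0f} MB/s")
```

This is the best case with no arbitration, setup, or burst overhead, which is why the real-life estimate in the message is so much lower.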

		Linus

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:16                           ` Alexander Viro
@ 2002-11-01 15:27                             ` Patrick Finnegan
  2002-11-01 16:16                             ` Patrick Finnegan
  1 sibling, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-01 15:27 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel

On Fri, 1 Nov 2002, Alexander Viro wrote:

> On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
> > No, vendor == people who sold or gave us the software.  Right now, Linus is
> > acting like he's a big evil corporation that won't add the change no
> > matter what we say:
>
> ... to his tree.  Geez, why could that be?  Maybe because you don't have
> any right to decide what patches anybody else applies to their trees?
>
> It's not a fscking public service.  Linus has full control over his
> tree.  You have equally full control over your tree.  Linus can't
> tell you what patches to apply in your tree.  You can't tell Linus
> what patches he should apply to his.
>
> "I'm not satisfied with this tree, I'll try that one" is perfectly OK.
> "I'm not satisfied with either, so bend the fsck over and change your
> tree the way I want" is _NOT_.

Yes, I recognise it's his right.  But what bothers me is that he says "I
want users to say they want it" and when users say they want it he says
"It's a vendor thing, no users want it."

Linus, if you say you're going to listen, please try and listen.  This is
annoying and dissatisfying to all of us when you say you'll listen and you
blatantly ignore people.  Your tree is your tree, for now it's going to be
patching our own kernel, and then possibly moving to another vendor who
listens to their users.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 14:55                         ` Patrick Finnegan
  2002-11-01 15:16                           ` Alexander Viro
@ 2002-11-01 15:32                           ` Richard B. Johnson
  1 sibling, 0 replies; 333+ messages in thread
From: Richard B. Johnson @ 2002-11-01 15:32 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel

On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
[SNIPPED...] 
> You know, pissing off core developers of projects that have been shown to
> be (1) desired (2) potentially useful in Linux, even as an aid to other
> Linux subsystem developers and (3) empirically shown to be useful for other
> Free *nixes such as the BSDs, is not what I would be doing as a project
> maintainer.  Of course, I'm not Linus...
> 
> Pat

Maybe somebody should at least say what it is that is:
"(1) desired (2) potentially useful in Linux, even as an aid to
other..."

It might be that you guys are so close to the project that you
lose sight of the fact that others, including Linus, might not
understand how important it is. It is quite possible that somebody
has developed a lot of excellent code that has absolutely no use
to anybody except a small group of intellectuals who use the
kernel to write poetry. In that case, regardless of how excellent
it is, it really should not be in the standard kernel. OTOH, it
might be useful to the whole world, but nobody has bothered to
explain how this may be so.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
   Bush : The Fourth Reich of America



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:25                   ` Linus Torvalds
@ 2002-11-01 15:35                     ` bert hubert
  2002-11-01 15:50                     ` Gerald Britton
                                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 333+ messages in thread
From: bert hubert @ 2002-11-01 15:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:


> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days. 

I'd be amazed if current CPUs would be able to do asymmetric encryption at
anywhere within an order of magnitude of those rates.

Symmetric encryption is something else. This is the reason many encryption
products (e.g., PGP) only use asymmetric encryption for encrypting a symmetric
session key, and not encrypting the entire message.
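
A toy sketch of that hybrid scheme, for illustration only: the slow asymmetric step touches just a small session key, and the bulk of the message goes through a cheap symmetric step. Everything here is an assumption made for the example. It uses textbook RSA with tiny primes and a SHA-256 based keystream, neither of which is what PGP actually does, and none of it is secure.

```python
import hashlib
import secrets

# Textbook RSA with tiny primes: insecure, for illustration only.
P, Q = 61, 53
N = P * Q      # modulus, 3233
E = 17         # public exponent
D = 2753       # private exponent, E * D == 1 mod (P - 1) * (Q - 1)


def keystream(session_key, length):
    """Derive `length` pad bytes from the session key (SHA-256, counter mode)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{session_key}:{counter}".encode()).digest()
        counter += 1
    return out[:length]


def xor_bytes(data, pad):
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, pad))


def hybrid_encrypt(message):
    """Return (wrapped session key, symmetrically encrypted body)."""
    session_key = secrets.randbelow(N - 2) + 2    # random value in 2..N-1
    wrapped_key = pow(session_key, E, N)          # asymmetric step, tiny input
    body = xor_bytes(message, keystream(session_key, len(message)))
    return wrapped_key, body


def hybrid_decrypt(wrapped_key, body):
    """Unwrap the session key with the private exponent, then undo the XOR."""
    session_key = pow(wrapped_key, D, N)
    return xor_bytes(body, keystream(session_key, len(body)))
```

The cost of the asymmetric step is fixed no matter how long the message is, which is the point being made above.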

Regards,

bert hubert

-- 
http://www.PowerDNS.com          Versatile DNS Software & Services
http://lartc.org           Linux Advanced Routing & Traffic Control HOWTO

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:25                   ` Linus Torvalds
  2002-11-01 15:35                     ` bert hubert
@ 2002-11-01 15:50                     ` Gerald Britton
  2002-11-01 18:17                       ` Matt Porter
  2002-11-01 16:15                     ` Michael Clark
                                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 333+ messages in thread
From: Gerald Britton @ 2002-11-01 15:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> The question I have is whether such external hardware is even worth it
> any more for any standard crypto work.  With a regular PCI bus
> fundamentally limiting throughput to something like a maximum of 66MB/s
> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days. 

This may be true of a typical workstation or large server, but your router
may not have such a modern CPU in it.  Crypto accelerators are likely a
much bigger win on embedded routers or other small appliances with CPUs such
as the AMD Elan or other 486 to Pentium class processors.

				-- Gerald


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:25                   ` Linus Torvalds
  2002-11-01 15:35                     ` bert hubert
  2002-11-01 15:50                     ` Gerald Britton
@ 2002-11-01 16:15                     ` Michael Clark
  2002-11-01 16:16                     ` Erik Andersen
  2002-11-01 20:43                     ` romieu
  4 siblings, 0 replies; 333+ messages in thread
From: Michael Clark @ 2002-11-01 16:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Chris Wedgwood

On 11/01/02 23:25, Linus Torvalds wrote:
> In article <20021031194351.GA24676@tapu.f00f.org>,
> Chris Wedgwood  <cw@f00f.org> wrote:
> 
>>On Thu, Oct 31, 2002 at 10:49:10AM -0800, Linus Torvalds wrote:
>>
>>
>>>Any hardware that needs to go off and think about how to encrypt
>>>something sounds like it's so slow as to be unusable. I suspect that
>>>anything that is over the PCI bus is already so slow (even if it
>>>adds no extra cycles of its own) that you're better off using the
>>>CPU for the encryption rather than some external hardware.
>>
>>Except almost all hardware out there that does this stuff is async to
>>some extent...
> 
> 
> That's not my argument.  I realize that external hardware on a PCI bus
> _has_ to be asynchronous, simply because it is so slow. 
> 
> The question I have is whether such external hardware is even worth it
> any more for any standard crypto work.  With a regular PCI bus
> fundamentally limiting throughput to something like a maximum of 66MB/s
> (copy-in and copy-out, and that's so theoretical that it's not even
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days. 
> 
> Maybe not. The only numbers I have is the slowness of PCI.

A 1GHz PIII will do about 8MBytes/sec of 3DES.

Plug a 2.4Gb/s Broadcom crypto chip into a 64-bit PCI-X slot with the
same CPU and you should be capable of doing at least 10 times that.

Stuff like RSA is much slower (and benefits more from hardware)
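
The figures traded in this exchange can be lined up with a little
arithmetic.  What follows is only a rough sketch, assuming an offload
engine is limited by the slower of the chip and the bus; the function
names are made up for illustration, and the only inputs are the numbers
quoted in this thread (8MB/s for 3DES on a 1GHz PIII, a 2.4Gb/s chip,
and a PCI ceiling of 66MB/s with 25-30MB/s guessed for real life).

```python
# Back-of-the-envelope comparison using only figures quoted in this thread.
cpu_3des = 8.0                     # MB/s, 3DES on a 1GHz PIII
chip_rate = 2.4e9 / 8 / 1e6        # 2.4Gb/s chip: bits -> bytes -> MB/s
pci_theoretical = 66.0             # MB/s, quoted ceiling for copy-in plus copy-out
pci_real_low, pci_real_high = 25.0, 30.0   # MB/s, quoted real-life guess


def effective_rate(chip, bus):
    """Assumption: an offload engine runs at the slower of chip and bus."""
    return min(chip, bus)


def speedup(chip, bus, cpu):
    """How many times faster the offload path is than doing it on the CPU."""
    return effective_rate(chip, bus) / cpu


def bus_needed_for(factor, cpu):
    """Bus throughput (MB/s) needed before a given speedup is possible."""
    return factor * cpu


if __name__ == "__main__":
    for label, bus in [("PCI, pessimistic", pci_real_low),
                       ("PCI, optimistic", pci_real_high),
                       ("PCI, theoretical", pci_theoretical)]:
        print(label, speedup(chip_rate, bus, cpu_3des))
    print("bus needed for 10x:", bus_needed_for(10, cpu_3des), "MB/s")
```

Under these assumptions the chip still beats software 3DES by roughly 3x
even at the pessimistic 25MB/s, while the "10 times" figure needs about
80MB/s of bus throughput, which is more than the 66MB/s quoted for
regular PCI and is presumably why the PCI-X slot matters.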

BTW - there are some outdated cryptolib patches with an async
interface around somewhere (along with patches for freeswan to use
the async api).

I guess the crypto guys like Chris will add the async API if they need
it (which they do, I think ;).

~mc


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:25                   ` Linus Torvalds
                                       ` (2 preceding siblings ...)
  2002-11-01 16:15                     ` Michael Clark
@ 2002-11-01 16:16                     ` Erik Andersen
  2002-11-01 20:43                     ` romieu
  4 siblings, 0 replies; 333+ messages in thread
From: Erik Andersen @ 2002-11-01 16:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> funny - I'd be surprised if RL throughput copying back and forth over a
> PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> faster on the CPU directly these days. 
> 
> Maybe not. The only numbers I have is the slowness of PCI.

It may be faster on your beefy 8 CPU boxes.  But many people are
creating, for example, little wireless access points with 200 MHz
StrongARM CPUs and similar little devices that lack the major CPU
horsepower of big-iron systems.  Such boxes would be far better
off offloading crypto to a little crypto chip, right?

 -Erik

--
Erik B. Andersen             http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:16                           ` Alexander Viro
  2002-11-01 15:27                             ` Patrick Finnegan
@ 2002-11-01 16:16                             ` Patrick Finnegan
  2002-11-01 16:32                               ` Larry McVoy
                                                 ` (3 more replies)
  1 sibling, 4 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-01 16:16 UTC (permalink / raw)
  To: linux-kernel

What I'm going to say may not be popular, and probably won't win me
friends, but here it is anyhow:

On Fri, 1 Nov 2002, Alexander Viro wrote:

> On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>
> > No, vendor == people who sold or gave us the software.  Right now, Linus is
> > acting like he's a big evil corporation that won't add the change no
> > matter what we say:
>
> ... to his tree.  Geez, why could that be?  Maybe because you don't have
> any rights to decide what patches does anybody else apply to their trees?
>
> It's not a fscking public service.  Linus has full control over his
> tree.  You have equally full control over your tree.  Linus can't
> tell you what patches to apply in your tree.  You can't tell Linus
> what patches he should apply to his.

I'm sorry, it _is_ a public service.  Once tens of people started
contributing to it, it became one.  This is like saying that the
Washington Monument belongs to the people that maintain it, or that any
building belongs to the repair crews and janitors.  I'm not saying that
Linus is necessarily a janitor, but when you consider how much of the
Linux kernel he didn't write, you may realize that it's not just his
kernel.  It also belongs to every single person that has written even a
single line of code in it.

BTW, "My opinions do not represent the opinions of my employer" for at
least this email..

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 16:16                             ` Patrick Finnegan
@ 2002-11-01 16:32                               ` Larry McVoy
  2002-11-01 16:44                                 ` Linux without Linus was " Brian Jackson
  2002-11-01 19:14                                 ` Shawn
  2002-11-01 17:56                               ` Nicolas Pitre
                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 333+ messages in thread
From: Larry McVoy @ 2002-11-01 16:32 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel

On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
> On Fri, 1 Nov 2002, Alexander Viro wrote:
> > It's not a fscking public service.  Linus has full control over his
> > tree.  You have equally full control over your tree.  Linus can't
> > tell you what patches to apply in your tree.  You can't tell Linus
> > what patches he should apply to his.
> 
> I'm sorry it _is_ a public service.  Once tens of people started
> contributing to it, it became one.  

Pat, the public service that Linus provides is doing exactly what he does.
He's acting as a filter.  You may or may not agree with the things he
lets in or does not.  That's fine, if you think you can do a better job
you have that option.  I can imagine your answer is "I think he's doing
a fine job except for my project which isn't getting in" or something
like that.  That's a bummer for you but keep the big picture in mind.
Linus is the glue which keeps the Linux world from turning into the
BSD mess.  He is the acknowledged leader.  Without him we have a bunch
of semi-leaders, with him we have a real leader.  The fact that Linus
is here, leading this herd of cats, is a gift to the world.  Try and
imagine Linux without him, it's not a pretty picture.

So figure out a way to work with him, don't stress him out, he's a
critical resource without a viable replacement.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Linux without Linus was Re: What's left over.
  2002-11-01 16:32                               ` Larry McVoy
@ 2002-11-01 16:44                                 ` Brian Jackson
  2002-11-01 16:58                                   ` Paul Fulghum
  2002-11-01 19:14                                 ` Shawn
  1 sibling, 1 reply; 333+ messages in thread
From: Brian Jackson @ 2002-11-01 16:44 UTC (permalink / raw)
  To: linux-kernel

Larry McVoy writes: 

> On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
>> On Fri, 1 Nov 2002, Alexander Viro wrote:
>> > It's not a fscking public service.  Linus has full control over his
>> > tree.  You have equally full control over your tree.  Linus can't
>> > tell you what patches to apply in your tree.  You can't tell Linus
>> > what patches he should apply to his. 
>> 
>> I'm sorry it _is_ a public service.  Once tens of people started
>> contributing to it, it became one.  
> 
> Pat, the public service that Linus provides is doing exactly what he does.
> He's acting as a filter.  You may or may not agree with the things he
> lets in or does not.  That's fine, if you think you can do a better job
> you have that option.  I can imagine your answer is "I think he's doing
> a fine job except for my project which isn't getting in" or something
> like that.  That's a bummer for you but keep the big picture in mind.
> Linus is the glue which keeps the Linux world from turning into the
> BSD mess.  He is the acknowledged leader.  Without him we have a bunch
> of semi-leaders, with him we have a real leader.  The fact that Linus
> is here, leading this herd of cats, is a gift to the world.  Try and
> imagine Linux without him, it's not a pretty picture. 
> 

What, something like:
Virox
Hellwigix
Alanix
KHix 

eeewww, I can't bring myself to think about it 

> So figure out a way to work with him, don't stress him out, he's a
> critical resource without a viable replacement.
> -- 
> ---
> Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 
 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: Linux without Linus was Re: What's left over.
  2002-11-01 16:44                                 ` Linux without Linus was " Brian Jackson
@ 2002-11-01 16:58                                   ` Paul Fulghum
  0 siblings, 0 replies; 333+ messages in thread
From: Paul Fulghum @ 2002-11-01 16:58 UTC (permalink / raw)
  To: linux-kernel

>> The fact that Linus is here, leading this herd of cats,
>> is a gift to the world.  Try and imagine Linux without
>> him, it's not a pretty picture. 
> 
> What, something like:
> Virox...

That's actually a pretty cool name.

> ...Alanix

Sounds too much like a Canadian musician.

Oh well, back to hacking hairballs. Meow.

Paul Fulghum, paulkf@microgate.com
Microgate Corporation, www.microgate.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 16:16                             ` Patrick Finnegan
  2002-11-01 16:32                               ` Larry McVoy
@ 2002-11-01 17:56                               ` Nicolas Pitre
  2002-11-01 18:23                               ` Shane R. Stixrud
  2002-11-04  2:13                               ` Rob Landley
  3 siblings, 0 replies; 333+ messages in thread
From: Nicolas Pitre @ 2002-11-01 17:56 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: lkml

On Fri, 1 Nov 2002, Patrick Finnegan wrote:

> What I'm going to say may not be popular, and probably won't win me
> friends, but here it is anyhow:
> 
> On Fri, 1 Nov 2002, Alexander Viro wrote:
> 
> > On Fri, 1 Nov 2002, Patrick Finnegan wrote:
> >
> > > No, vendor == people who sold or gave us the software.  Right now, Linus is
> > > acting like he's a big evil corporation that won't add the change no
> > > matter what we say:
> >
> > ... to his tree.  Geez, why could that be?  Maybe because you don't have
> > any rights to decide what patches does anybody else apply to their trees?
> >
> > It's not a fscking public service.  Linus has full control over his
> > tree.  You have equally full control over your tree.  Linus can't
> > tell you what patches to apply in your tree.  You can't tell Linus
> > what patches he should apply to his.
> 
> I'm sorry, it _is_ a public service.  Once tens of people started
> contributing to it, it became one.  This is like saying that the
> Washington Monument belongs to the people that maintain it, or that any
> building belongs to the repair crews and janitors.

But then would you agree to let anybody, and I mean anybody, come along
with a "good idea" for altering the Washington Monument and do what
they want?

> I'm not saying that Linus is
> necessarily a janitor, but when you consider how much of the Linux kernel
> he didn't write, you may realize that it's not just his kernel.  It
> also belongs to every single person that has written even a single
> line of code in it.

It is _his_ copy of the kernel, just as you have your own copy.

Linus' tree is known to be the main reference tree, no more.

If your patch is so valuable (and I don't mean it's not), you should be able
to convince vendors to include it in their own tree.  If _then_ it happens
to be a major feature with a large user base I'm sure it'll make the
reference tree.  But in the meantime, a few scattered users aren't enough.


Nicolas


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:50                     ` Gerald Britton
@ 2002-11-01 18:17                       ` Matt Porter
  0 siblings, 0 replies; 333+ messages in thread
From: Matt Porter @ 2002-11-01 18:17 UTC (permalink / raw)
  To: Gerald Britton; +Cc: Linus Torvalds, linux-kernel

On Fri, Nov 01, 2002 at 10:50:45AM -0500, Gerald Britton wrote:
> On Fri, Nov 01, 2002 at 03:25:01PM +0000, Linus Torvalds wrote:
> > The question I have is whether such external hardware is even worth it
> > any more for any standard crypto work.  With a regular PCI bus
> > fundamentally limiting throughput to something like a maximum of 66MB/s
> > (copy-in and copy-out, and that's so theoretical that it's not even
> > funny - I'd be surprised if RL throughput copying back and forth over a
> > PCI bus is more than 25-30MB/s), I suspect that you can do most crypto
> > faster on the CPU directly these days. 
> 
> This may be true of a typical workstation or large server, but your router
> may not have such a modern CPU in it.  Crypto accelerators are likely a
> much bigger win on embedded routers or other small appliances with CPUs such
> as the AMD Elan or other 486 to Pentium class processors.

Yes, and as a tangent, the same class of embedded devices also benefit
from TCP/IP offload facilities.  The same argument against a crypto-api
supporting crypto hardware has been used in the past to argue against
a Linux kernel TCP/IP hardware offload layer.  The argument is
completely invalid once one considers the typically lower speed of an
embedded processor going into a crypto or network-edge device.

Even better, synthesizable SoC designs like IBM PPC4xx and reconfigurable
processors architectures have opened further the concept of an on-chip
crypto or tcp/ip offload macro cell which virtually eliminates PCI
speed/latency concerns for these assist engines.  It should be no
surprise that embedded Linux is highly desired in these application
specific processors.

Regards,
-- 
Matt Porter
porter@cox.net
This is Linux Country. On a quiet night, you can hear Windows reboot.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 16:16                             ` Patrick Finnegan
  2002-11-01 16:32                               ` Larry McVoy
  2002-11-01 17:56                               ` Nicolas Pitre
@ 2002-11-01 18:23                               ` Shane R. Stixrud
  2002-11-01 19:18                                 ` John Alvord
  2002-11-04  2:13                               ` Rob Landley
  3 siblings, 1 reply; 333+ messages in thread
From: Shane R. Stixrud @ 2002-11-01 18:23 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel


On Fri, 1 Nov 2002, Patrick Finnegan wrote:
> 
> I'm sorry, it _is_ a public service.  Once tens of people started
> contributing to it, it became one.  This is like saying that the
> Washington Monument belongs to the people that maintain it, or that any
> building belongs to the repair crews and janitors.  I'm not saying that
> Linus is necessarily a janitor, but when you consider how much of the
> Linux kernel he didn't write, you may realize that it's not just his
> kernel.  It also belongs to every single person that has written even a
> single line of code in it.
>

The logic you seem to be missing is that the Washington Monument is a
physical object.  Linus's source tree is a collection of "copied" parts
from other people's source trees.  You obviously see his source copy
as special, more so than, say, my copy.  This is true _ONLY_ because
Linus's copy commands more respect than yours or mine.
If you think about it, the respect Linus's copy has is _PURELY_
the result of his past _choices_ over how he maintains it.


In effect you are saying: 

Patrick: "Everyone trusts your source tree, I think LKCD 
is SUPER DUPER important and should get the exposure and trust 
that being in your tree commands." 

Linus: "I think LKCD is a bad idea, until I am convinced otherwise I 
will not merge it."  

Patrick: "You are wrong, LKCD should be in your copy of the kernel source.
It is your job, Linus, to add things to _your_ copy which others find
important; what you think is secondary."


You cannot have it both ways: either Linus's tree is a dumping
ground for all ideas (both good and bad) or it is a place for good
ideas (good defined by Linus) where people who trust Linus's judgment can
work from.

In truth you can have it both ways.  Take Linus's existing copy, add the
features you think are important.  If your choices prove to be superior,
you can expect that people (over time) will begin to trust/respect your
copy more than Linus's.

-- 
Shane R. Stixrud        "Nothing would please me more than being able to 
shane@stixrud.org       hire ten programmers and deluge the hobby market 
                        with good software." -- Bill Gates 1976

                        We are still waiting ....






^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 13:26                 ` Alan Cox
@ 2002-11-01 19:00                   ` Joel Becker
  2002-11-01 19:18                     ` Linus Torvalds
  2002-11-03 13:48                   ` Bill Davidsen
  1 sibling, 1 reply; 333+ messages in thread
From: Joel Becker @ 2002-11-01 19:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bill Davidsen, Linus Torvalds, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Fri, Nov 01, 2002 at 01:26:44PM +0000, Alan Cox wrote:
> My concerns are solely with things like the correctness of the disk
> dumper. Its obviously a good way to do a lot more damage if it isnt done
> carefully.

	I always liked the AIX dumper choices.  You could either dump to
the swap area (and startup detects the dump and moves it to the
filesystem before swapon) or provide a dedicated dump partition.  The
latter was preferred.
	Either of these methods merely requires the dumper to correctly
write to one disk partition.  This is about as simple as you are going
to get in disk dumping.
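
The startup half of the dump-to-swap scheme described above might look
roughly like this.  It is a minimal sketch, not how AIX or LKCD does it:
the signature, the header layout and the function names are all
hypothetical, and an ordinary file stands in for the swap partition.

```python
import os
import struct
import tempfile

DUMP_MAGIC = b"HYPODUMP"          # hypothetical signature, not a real format
HEADER = struct.Struct("<8sQ")    # signature, then payload length in bytes


def write_dump(device_path, payload):
    """Stand-in for the crash-time dumper: header followed by the image."""
    # getsize() works for the regular file used here; a real block device
    # would need a different way to query its size.
    if HEADER.size + len(payload) > os.path.getsize(device_path):
        raise ValueError("dump does not fit in the dump area")
    with open(device_path, "r+b") as dev:
        dev.write(HEADER.pack(DUMP_MAGIC, len(payload)))
        dev.write(payload)


def recover_dump(device_path, save_path):
    """Run at startup, before swapon.  Returns True if a dump was saved."""
    with open(device_path, "r+b") as dev:
        raw = dev.read(HEADER.size)
        if len(raw) < HEADER.size:
            return False
        magic, length = HEADER.unpack(raw)
        if magic != DUMP_MAGIC:
            return False              # plain swap area, nothing to do
        data = dev.read(length)
        if len(data) != length:
            return False              # truncated dump: leave it untouched
        with open(save_path, "wb") as out:
            out.write(data)
        # Wipe the header so the same dump is not recovered on every boot.
        dev.seek(0)
        dev.write(b"\0" * HEADER.size)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        swap = os.path.join(tmp, "swap.img")
        with open(swap, "wb") as f:
            f.write(b"\0" * 4096)
        write_dump(swap, b"pretend memory image")
        print(recover_dump(swap, os.path.join(tmp, "vmcore")))
```

The size check in write_dump() is the one place where the dump area has
to be big enough for whatever is being dumped.  Everything hard about
actually getting the bytes onto the disk from a crashed kernel is outside
what this sketch covers.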

Joel

-- 

"You must remember this:
 A kiss is just a kiss,
 A sigh is just a sigh.
 The fundamental rules apply
 As time goes by."

Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 16:32                               ` Larry McVoy
  2002-11-01 16:44                                 ` Linux without Linus was " Brian Jackson
@ 2002-11-01 19:14                                 ` Shawn
  2002-11-01 19:36                                   ` Shawn
  1 sibling, 1 reply; 333+ messages in thread
From: Shawn @ 2002-11-01 19:14 UTC (permalink / raw)
  To: Larry McVoy, Patrick Finnegan, linux-kernel

On 11/01, Larry McVoy said something like:
> On Fri, Nov 01, 2002 at 11:16:20AM -0500, Patrick Finnegan wrote:
> > On Fri, 1 Nov 2002, Alexander Viro wrote:
> > > It's not a fscking public service.  Linus has full control over his
> > > tree.  You have equally full control over your tree.  Linus can't
> > > tell you what patches to apply in your tree.  You can't tell Linus
> > > what patches he should apply to his.
> > 
> > I'm sorry it _is_ a public service.  Once tens of people started
> > contributing to it, it became one.  
> 
> Pat, the public service that Linus provides is doing exactly what he does.
> He's acting as a filter.  You may or may not agree with the things he

cat name-your.patch | Linus --please-dont-delete-your-inbox-again

--
Shawn Leas
core@enodev.com

My friend has a baby.  I'm recording all the noises he makes so later I can
ask him what he meant.
						-- Stephen Wright

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 18:23                               ` Shane R. Stixrud
@ 2002-11-01 19:18                                 ` John Alvord
  0 siblings, 0 replies; 333+ messages in thread
From: John Alvord @ 2002-11-01 19:18 UTC (permalink / raw)
  To: Shane R. Stixrud; +Cc: Patrick Finnegan, linux-kernel

On Fri, 1 Nov 2002 10:23:01 -0800 (PST), "Shane R. Stixrud"
<shane@stixrud.org> wrote:

>
>On Fri, 1 Nov 2002, Patrick Finnegan wrote:
>> 
>> I'm sorry, it _is_ a public service.  Once tens of people started
>> contributing to it, it became one.  This is like saying that the
>> Washington Monument belongs to the people that maintain it, or that any
>> building belongs to the repair crews and janitors.  I'm not saying that
>> Linus is necessarily a janitor, but when you consider how much of the
>> Linux kernel he didn't write, you may realize that it's not just his
>> kernel.  It also belongs to every single person that has written even a
>> single line of code in it.
>>
>
>The logic you seem to be missing is that the Washington Monument is a
>physical object.  Linus's source tree is a collection of "copied" parts
>from other people's source trees.  You obviously see his source copy
>as special, more so than, say, my copy.  This is true _ONLY_ because
>Linus's copy commands more respect than yours or mine.
>If you think about it, the respect Linus's copy has is _PURELY_
>the result of his past _choices_ over how he maintains it.
>
>
>In effect you are saying: 
>
>Patrick: "Everyone trusts your source tree, I think LKCD 
>is SUPER DUPER important and should get the exposure and trust 
>that being in your tree commands." 
>
>Linus: "I think LKCD is a bad idea, until I am convinced otherwise I 
>will not merge it."  
>
>Patrick: "You are wrong, LKCD should be in your copy of the kernel source.
>It is your job, Linus, to add things to _your_ copy which others find
>important; what you think is secondary."
>
>
>You cannot have it both ways: either Linus's tree is a dumping
>ground for all ideas (both good and bad) or it is a place for good
>ideas (good defined by Linus) where people who trust Linus's judgment can
>work from.
>
>In truth you can have it both ways.  Take Linus's existing copy, add the
>features you think are important.  If your choices prove to be superior,
>you can expect that people (over time) will begin to trust/respect your
>copy more than Linus's.

This also explains why Linus said it was a vendor push situation. If
vendors pick it up, find it useful (as I am sure they will), and tell
Linus about that usage... LKCD will become part of the mainline tree.
I suspect for most vendors, it would be part of their extra cost
"server" package and the Linux/390 package... It clearly has the
potential to enhance service and buyers of server packages need it.

If along the way, significant numbers of "big users" like Purdue adopt
it, use it, and reflect back to L-K the diagnostic successes and fixes
which result, that could speed the decision. If Linus has a tough bug,
installs LKCD, sends the dump to a wizard and gets a fix, that would
definitely speed the decision.

john alvord

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 19:00                   ` Joel Becker
@ 2002-11-01 19:18                     ` Linus Torvalds
  2002-11-01 20:06                       ` Steven King
                                         ` (3 more replies)
  0 siblings, 4 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-01 19:18 UTC (permalink / raw)
  To: Joel Becker
  Cc: Alan Cox, Bill Davidsen, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel


On Fri, 1 Nov 2002, Joel Becker wrote:
> 
> 	I always liked the AIX dumper choices.  You could either dump to
> the swap area (and startup detects the dump and moves it to the
> filesystem before swapon) or provide a dedicated dump partition.  The
> latter was preferred.
> 	Either of these methods merely requires the dumper to correctly
> write to one disk partition.  This is about as simple as you are going
> to get in disk dumping.

Ehh.. That was on closed hardware that was largely designed with and for
the OS.

Alan isn't worried about the "which sector do I write" kind of thing.  
That's the trivial part. Alan is worried about the fact that once you know
which sector to write, actually _doing_ so is a really hard thing. You
have bounce buffers, you have exceedingly complex drivers that work
differently in PIO and DMA modes and are more likely than not the _cause_
of a number of problems etc.

And you have a situation where interrupts are not likely to work well
(because you crashed with various locks held), so the regular driver
simply isn't likely to work all that well.

And you have a situation where there are hundreds of different kinds of 
device drivers for the disk.

In other words, the AIX situation isn't even _remotely_ comparable. A
large portion of the complexity in the PC stability space is in device
drivers. It's the thing I worry most about for 2.6.x stabilization, by 
_far_.

And if you get these things wrong, you're quite likely to stomp on your
disk. Hard. You may be trying to write the swap partition, but if the
driver gets confused, you just overwrote all your important data. At which
point it doesn't matter if your filesystem is journaling or not, since you
just potentially overwrote it.

In other words: it's a huge risk to play with the disk when the system is
already known to be unstable. The disk drivers tend to be one of the main
issues even when everything else is _stable_, for chrissake!

To add insult to injury, you will not be able to actually _test_ any of 
the real error paths in real life. Sure, you will be able to test forced 
dumps on _your_ hardware, but while that is fine in the AIX model ("we 
control the hardware, and charge the user five times what it is worth"), 
again that doesn't mean _squat_ in the PC hardware space.

See?

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 19:14                                 ` Shawn
@ 2002-11-01 19:36                                   ` Shawn
  0 siblings, 0 replies; 333+ messages in thread
From: Shawn @ 2002-11-01 19:36 UTC (permalink / raw)
  To: Shawn; +Cc: Larry McVoy, Patrick Finnegan, linux-kernel

On 11/01, Shawn said something like:
> > Pat, the public service that Linus provides is doing exactly what he does.
> > He's acting as a filter.  You may or may not agree with the things he
> 
> cat name-your.patch | Linus --please-dont-delete-your-inbox-again

Maybe "piping" things to Linus is a little rude... :O

--
Shawn Leas
core@enodev.com

While I was gone, somebody rearranged on the furniture in my
bedroom.  They put it in _exactly_ the same place it was.
When I told my roommate, he said: Do I know you?
						-- Stephen Wright

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 19:18                     ` Linus Torvalds
@ 2002-11-01 20:06                       ` Steven King
  2002-11-02  5:17                         ` Bill Davidsen
  2002-11-01 20:21                       ` David Lang
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 333+ messages in thread
From: Steven King @ 2002-11-01 20:06 UTC (permalink / raw)
  To: Linus Torvalds, Joel Becker
  Cc: Alan Cox, Bill Davidsen, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Friday 01 November 2002 11:18 am, Linus Torvalds wrote:

> To add insult to injury, you will not be able to actually _test_ any of
> the real error paths in real life. Sure, you will be able to test forced
> dumps on _your_ hardware, but while that is fine in the AIX model ("we
> control the hardware, and charge the user five times what it is worth"),
> again that doesn't mean _squat_ in the PC hardware space.

  On the other hand, ISC's system 5 r3 ran on commodity x86 hardware and the 
crash dumper worked on the various disk hardware I had occasion to use it on 
(mfm, scsi, ide), although one did need to make sure swap was larger than ram 
or bad things would happen. 8-{.  

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 19:18                     ` Linus Torvalds
  2002-11-01 20:06                       ` Steven King
@ 2002-11-01 20:21                       ` David Lang
  2002-11-01 22:25                         ` Werner Almesberger
  2002-11-01 20:22                       ` [lkcd-devel] " Matt D. Robinson
  2002-11-01 20:37                       ` Hugh Dickins
  3 siblings, 1 reply; 333+ messages in thread
From: David Lang @ 2002-11-01 20:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Alan Cox, Bill Davidsen, Chris Friesen,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

One question I have is how much of the driver problem you refer to is
because of optimizations that the various drivers have.  Could you fall
back to the simplest, works-with-everything,
all-timeouts-longer-than-the-slowest-disk slug of a driver that could be
used to do this dump?
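
To make the question concrete, a dumb fallback writer of that kind could
be sketched as below: no interrupts, one sector at a time, a deliberately
generous timeout, and a refusal to write outside the dump partition.
This is only an illustration against a fake in-memory device; FakeDisk,
polled_write and their parameters are invented for the example, and
nothing here answers the objection that a confused controller may ignore
all of it.

```python
import time

SECTOR_SIZE = 512


class FakeDisk:
    """Hypothetical controller that reports ready after a few status polls."""

    def __init__(self, sectors, polls_until_ready=3):
        self.data = bytearray(sectors * SECTOR_SIZE)
        self.polls_until_ready = polls_until_ready
        self._polls = 0

    def ready(self):
        self._polls += 1
        return self._polls >= self.polls_until_ready

    def write_sector(self, lba, buf):
        self.data[lba * SECTOR_SIZE:(lba + 1) * SECTOR_SIZE] = buf
        self._polls = 0               # device is busy again after a write


def polled_write(disk, first, last, start_lba, payload,
                 timeout_s=30.0, poll_interval_s=0.0):
    """Write payload sector by sector, busy-waiting instead of using IRQs."""
    padded = payload + b"\0" * (-len(payload) % SECTOR_SIZE)
    count = len(padded) // SECTOR_SIZE
    # Refuse anything that would land outside the dump partition.
    if start_lba < first or start_lba + count - 1 > last:
        raise ValueError("write would leave the dump partition")
    for i in range(count):
        deadline = time.monotonic() + timeout_s
        while not disk.ready():
            if time.monotonic() > deadline:
                raise TimeoutError("device never became ready")
            time.sleep(poll_interval_s)
        chunk = padded[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
        disk.write_sector(start_lba + i, chunk)
    return count


if __name__ == "__main__":
    disk = FakeDisk(sectors=16)
    print(polled_write(disk, first=4, last=11, start_lba=4,
                       payload=b"x" * 1000))
```

The trade-off is the one implied by the question: throughput is thrown
away in exchange for having very little code on the dump path.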

David Lang

On Fri, 1 Nov 2002, Linus Torvalds wrote:

> Alan isn't worried about the "which sector do I write" kind of thing.
> That's the trivial part. Alan is worried about the fact that once you know
> which sector to write, actually _doing_ so is a really hard thing. You
> have bounce buffers, you have exceedingly complex drivers that work
> differently in PIO and DMA modes and are more likely than not the _cause_
> of a number of problems etc.
>
> And you have a situation where interrupts are not likely to work well
> (because you crashed with various locks held), so the regular driver
> simply isn't likely to work all that well.
>
> And you have a situation where there are hundreds of different kinds of
> device drivers for the disk.
>
> In other words, the AIX situation isn't even _remotely_ comparable. A
> large portion of the complexity in the PC stability space is in device
> drivers. It's the thing I worry most about for 2.6.x stabilization, by
> _far_.
>
> And if you get these things wrong, you're quite likely to stomp on your
> disk. Hard. You may be trying to write the swap partition, but if the
> driver gets confused, you just overwrote all your important data. At which
> point it doesn't matter if your filesystem is journaling or not, since you
> just potentially overwrote it.
>
> In other words: it's a huge risk to play with the disk when the system is
> already known to be unstable. The disk drivers tend to be one of the main
> issues even when everything else is _stable_, for chrissake!
>
> To add insult to injury, you will not be able to actually _test_ any of
> the real error paths in real life. Sure, you will be able to test forced
> dumps on _your_ hardware, but while that is fine in the AIX model ("we
> control the hardware, and charge the user five times what it is worth"),
> again that doesn't mean _squat_ in the PC hardware space.
>
> See?
>
> 		Linus

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-01 19:18                     ` Linus Torvalds
  2002-11-01 20:06                       ` Steven King
  2002-11-01 20:21                       ` David Lang
@ 2002-11-01 20:22                       ` Matt D. Robinson
  2002-11-02 13:02                         ` Kai Henningsen
  2002-11-01 20:37                       ` Hugh Dickins
  3 siblings, 1 reply; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-01 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Alan Cox, Bill Davidsen, Chris Friesen,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Fri, 1 Nov 2002, Linus Torvalds wrote:
|>Alan isn't worried about the "which sector do I write" kind of thing.  
|>That's the trivial part. Alan is worried about the fact that once you know
|>which sector to write, actually _doing_ so is a really hard thing. You
|>have bounce buffers, you have exceedingly complex drivers that work
|>differently in PIO and DMA modes and are more likely than not the _cause_
|>of a number of problems etc.

[ preamble - this is only a technical discussion, I'm interested
  in feedback on what we can improve upon ]

I agree with you.  We'd prefer to have a better low-level driver
primitive sitting on top of two low-level disk drivers (IDE and
SCSI).  Fundamentally, though, this is difficult to do:

0) There's a lot of early stuff you take risks with, such as the
   partition size (assuming you can probe it), knowing that it
   hasn't changed since boot, and pre-allocating buffers for disk
   I/O operations.  You always take the partition risk no matter
   what.

1) You have to establish that the IDE or SCSI device can be reset
   into an appropriate mode for seek/write -- if a DMA operation
   fails to the drive, and you can't reset the drive, you may be stuck.

2) Once the hardware reports back success, it is a matter of how
   you write the blocks.  I once wrote the low-level IDE driver
   below request structures, writing sequentially to the drive,
   and ran into occasional drive lock-ups while writing during
   interrupt crashes.  This was more likely due to my inexperience
   with the IDE driver than anything else.

|>And you have a situation where interrupts are not likely to work well
|>(because you crashed with various locks held), so the regular driver
|>simply isn't likely to work all that well.

This is simply an avoidance of certain code paths.  We saw this
problem earlier in 2.2 using kiobufs and got around it for the
most part by doing our best to avoid the io_request_lock.  That's
why we haven't seen the lock contention problems for 2.5.

|>And you have a situation where there are hundreds of different kinds of 
|>device drivers for the disk.

This is the biggest problem, absolutely.  Our idea moving forward
was to create a _dump() primitive with drivers that allows you to
determine, upon configuration of a disk dump device, whether the
low-level driver supports dumping.  I suggested this to Al Viro a
long time ago on this list, but it didn't go anywhere.

That way the driver itself knows that it can support a low-level
page-write method.  If it doesn't, you can't use disk dumping to
that device.
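
A minimal sketch of that configure-time check, in user-space C so it can be
compiled and tried. Every name here (struct dump_capable_driver, dump_page,
dump_configure, dump_write, the toy in-memory "disk") is hypothetical and
not taken from the LKCD patches; it only illustrates "no primitive, no disk
dump to that device".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver descriptor: dump_page is NULL when the driver
 * offers no low-level page-write path usable at crash time. */
struct dump_capable_driver {
    const char *name;
    int (*dump_page)(unsigned long sector, const void *page, size_t len);
};

/* Toy in-memory "disk" standing in for real hardware (assumption). */
#define TOY_SECTOR_SIZE 512
#define TOY_SECTORS 8
static unsigned char toy_disk[TOY_SECTORS * TOY_SECTOR_SIZE];

/* Toy page-write method: bounds-checked copy into the fake disk. */
static int toy_dump_page(unsigned long sector, const void *page, size_t len)
{
    if (sector >= TOY_SECTORS ||
        len > (TOY_SECTORS - sector) * TOY_SECTOR_SIZE)
        return -1;
    memcpy(toy_disk + sector * TOY_SECTOR_SIZE, page, len);
    return 0;
}

static const struct dump_capable_driver *configured_dump_dev;

/* Configuration time: refuse any device whose driver lacks the primitive. */
int dump_configure(const struct dump_capable_driver *drv)
{
    assert(drv != NULL);
    if (drv->dump_page == NULL)
        return -1;              /* driver cannot dump: reject up front */
    configured_dump_dev = drv;
    return 0;
}

/* Crash time: only ever goes through the method accepted above. */
int dump_write(unsigned long sector, const void *page, size_t len)
{
    if (configured_dump_dev == NULL)
        return -1;              /* nothing was configured */
    return configured_dump_dev->dump_page(sector, page, len);
}
```

The point of the split is that the failure shows up when the admin
configures the dump device, not in the middle of a crash.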

I'm willing to re-open this effort.

|>And if you get these things wrong, you're quite likely to stomp on your
|>disk. Hard. You may be trying to write the swap partition, but if the
|>driver gets confused, you just overwrote all your important data. At which
|>point it doesn't matter if your filesystem is journaling or not, since you
|>just potentially overwrote it.

We haven't seen this before, but it is always a possibility for any
dump scenario.  That's why some choose netdump instead. :)

|>In other words: it's a huge risk to play with the disk when the system is
|>already known to be unstable. The disk drivers tend to be one of the main
|>issues even when everything else is _stable_, for chrissake!
|>
|>To add insult to injury, you will not be able to actually _test_ any of 
|>the real error paths in real life. Sure, you will be able to test forced 
|>dumps on _your_ hardware, but while that is fine in the AIX model ("we 
|>control the hardware, and charge the user five times what it is worth"), 
|>again that doesn't mean _squat_ in the PC hardware space.

We have actually done a lot of testing with injection of failures
into the middle of VM, network drivers, etc., in conjunction with
disk dumping.  Certainly it doesn't cover all the cases, but nothing
ever will.

|>		Linus

--Matt


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 19:18                     ` Linus Torvalds
                                         ` (2 preceding siblings ...)
  2002-11-01 20:22                       ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-01 20:37                       ` Hugh Dickins
  2002-11-02 18:23                         ` Geert Uytterhoeven
  2002-11-03  2:25                         ` Horst von Brand
  3 siblings, 2 replies; 333+ messages in thread
From: Hugh Dickins @ 2002-11-01 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joel Becker, Alan Cox, Bill Davidsen, Chris Friesen,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Linus Torvalds wrote:
> On Fri, 1 Nov 2002, Joel Becker wrote:
> > 
> > 	I always liked the AIX dumper choices.  You could either dump to
> > the swap area (and startup detects the dump and moves it to the
> > filesystem before swapon) or provide a dedicated dump partition.  The
> > latter was preferred.
> 
> Ehh.. That was on closed hardware that was largely designed with and for
> the OS.
>... 
> In other words: it's a huge risk to play with the disk when the system is
> already known to be unstable. The disk drivers tend to be one of the main
> issues even when everything else is _stable_, for chrissake!
> 
> To add insult to injury, you will not be able to actually _test_ any of 
> the real error paths in real life. Sure, you will be able to test forced 
> dumps on _your_ hardware, but while that is fine in the AIX model ("we 
> control the hardware, and charge the user five times what it is worth"), 
> again that doesn't mean _squat_ in the PC hardware space.

I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
OpenServer and UnixWare, which addressed the PC market, not the
vendor's own hardware.

It's a real worry that writing a crash dump to disk might stomp in the
wrong place, but I don't recall it ever happening in practice.  But
occasionally, yes, a dump was not generated at all, or not completed.

Of course, you could argue that SCO's disk drivers were more stable :-)
which might or might not be a compliment to them.

Hugh


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 15:25                   ` Linus Torvalds
                                       ` (3 preceding siblings ...)
  2002-11-01 16:16                     ` Erik Andersen
@ 2002-11-01 20:43                     ` romieu
  4 siblings, 0 replies; 333+ messages in thread
From: romieu @ 2002-11-01 20:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus Torvalds <torvalds@transmeta.com> :
[...]
> Maybe not. The only numbers I have is the slowness of PCI.

Issue 'openssl speed' and wait for more numbers.

Short lived hybrid sessions kill (not that this or any of the current 
reasons for asynchronous crypto really matters imho).

Instant benchmark:
                  sign    verify    sign/s verify/s
rsa 1024 bits   0.0148s   0.0008s     67.7   1198.6 (PIV 2GHz)
rsa 1024 bits   0.0478s   0.0026s     20.9    381.6 (PII 350MHz)

The 'numbers' are in 1000s of bytes per second processed.
type              8 bytes  64 bytes  256 bytes  1024 bytes  8192 bytes
des ede3         3930.00k  4027.43k   4032.30k    4002.19k    3973.12k (PIV)
des ede3         1058.51k  1061.25k   1090.70k    1097.44k    1091.36k (PII)

blowfish is ~10x faster btw.

--
Ueimor 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 19:04         ` Alan Cox
  2002-10-31 19:42           ` Michael Shuey
@ 2002-11-01 22:25           ` Pavel Machek
  2002-11-02 13:30             ` Michael Shuey
  1 sibling, 1 reply; 333+ messages in thread
From: Pavel Machek @ 2002-11-01 22:25 UTC (permalink / raw)
  To: Alan Cox
  Cc: shuey, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Hi!

> > I'm a user, and I request that LKCD get merged into the kernel. :-)
> > Do you feel like donating a 700-port console server?  Right, so it's LKCD
> > for me then.
> 
> Wouldn't you rather they neatly tftp'd dumps to a nominated central
> server which noticed the arrival, did the initial processing with a perl
> script and mailed you a summary ?

Out of interest, what does such "initial processing" look like?

Of course I'd like perl script to tell me

"hey, at vicam.c:715 you are freeing memory that is still in use by
usb.c; that crashed your machine 5 times during last week",

but I guess your perl scripts can't do that, right?
								Pavel
-- 
When do you have heart between your knees?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 20:21                       ` David Lang
@ 2002-11-01 22:25                         ` Werner Almesberger
  2002-11-01 22:42                           ` Karim Yaghmour
  0 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-01 22:25 UTC (permalink / raw)
  To: David Lang; +Cc: Linux Kernel Mailing List, lkcd-general, lkcd-devel

[ Cc: trimmed ]

David Lang wrote:
> One question I have is how much of the driver problem you refer to is
> because of optimizations that the various drivers have? Could you fall
> back to the simplest, works-with-everything,
> all-timeouts-longer-than-the-slowest-disk slug of a driver that could be
> used to do this dump?

Welcome to the wonderful world of code duplication. And don't forget
the "simplified" TCP/IP stack for network dumps. Uh, USB-attached
storage, anyone ? :-)

Special-case dump drivers make perfect sense in isolated cases (e.g.
narrowly specified boxes) or as a band-aid solution.

But for a general solution, it seems more appropriate to me to solve
the problem of moving the kernel data from the damaged system to an
intact system only once, e.g. using the MCORE approach, than over
and over again for all possible types of hardware and attachment
methods.

The only inherent weakness I see in MCORE is the need to reliably
reset a device, either to the point where it is operational (if
used in the process of dumping), or at least to the point where it
doesn't get in the way (if not used for the dump, e.g. video, HID,
etc.).

But this should still be significantly easier than introducing
"dumb" versions for all drivers. Besides, having a way for cleanly
shutting down or resetting devices is desirable in other contexts,
too (e.g. kexec).

- Werner (disclaimer: not affiliated with Mission Critical Linux,
	 any vendor, or any other form of gainful employment)

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 13:30             ` Alan Cox
@ 2002-11-01 22:28               ` Rusty Russell
  0 siblings, 0 replies; 333+ messages in thread
From: Rusty Russell @ 2002-11-01 22:28 UTC (permalink / raw)
  To: Alan Cox
  Cc: Jeff Garzik, Linus Torvalds, Matt D. Robinson,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

In message <1036157418.12693.19.camel@irongate.swansea.linux.org.uk> you write:
> On Thu, 2002-10-31 at 21:02, Jeff Garzik wrote:
> > hosed/screaming, and various mid-layers are dying.  For LKCD to be of 
> > any use, it needs to _skip_ the block layer and talk directly to 
> > low-level drivers.
> 
> Rusty wrote a polled IDE driver that should handle some subset of that

Yes, patch has bitrotted but updating should be trivial.  There's
enough there that you get the idea though: frankly, it's noninvasive
enough for entry during the 2.6.x series, so it's been down on my
list:

	http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Misc/oopser.patch.gz

I'd love someone to take this for a spin and tweak it up...
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 22:25                         ` Werner Almesberger
@ 2002-11-01 22:42                           ` Karim Yaghmour
  2002-11-01 22:54                             ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: Karim Yaghmour @ 2002-11-01 22:42 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: David Lang, Linux Kernel Mailing List, lkcd-general, lkcd-devel


Werner Almesberger wrote:
> But for a general solution, it seems more appropriate to me to solve
> the problem of moving the kernel data from the damaged system to an
> intact system only once, e.g. using the MCORE approach, than over
> and over again for all possible types of hardware and attachment
> methods.

This is just a random tangential thought here, but FWIW:

Why not just have a simple backup stripped-down "hardened" copy of Linux
lying around in a physical RAM region not used by the copy of Linux
actually running? Provided the running Linux doesn't do random physical
accesses when dying, the crash handler could then just boot that
secondary Linux, which would have a RAM disk containing the
appropriate scripts and binaries to handle the actual crash. Given the
cost of RAM these days, reserving a MB or two for this purpose should
probably not be that bad.

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 22:42                           ` Karim Yaghmour
@ 2002-11-01 22:54                             ` Werner Almesberger
  2002-11-01 23:10                               ` Karim Yaghmour
  0 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-01 22:54 UTC (permalink / raw)
  To: Karim Yaghmour
  Cc: David Lang, Linux Kernel Mailing List, lkcd-general, lkcd-devel

Karim Yaghmour wrote:
> Why not just have a simple backup stripped-down "hardened" copy of Linux
> lying around in a physical RAM region not used by the copy of Linux
> actually running.

Congratulations, you've just re-invented MCORE :-) That's exactly
what they do on systems where rebooting through the firmware
doesn't preserve RAM.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 22:54                             ` Werner Almesberger
@ 2002-11-01 23:10                               ` Karim Yaghmour
  0 siblings, 0 replies; 333+ messages in thread
From: Karim Yaghmour @ 2002-11-01 23:10 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: David Lang, Linux Kernel Mailing List, lkcd-general, lkcd-devel


Werner Almesberger wrote:
> Karim Yaghmour wrote:
> > Why not just have a simple backup stripped-down "hardened" copy of Linux
> > lying around in a physical RAM region not used by the copy of Linux
> > actually running.
> 
> Congratulations, you've just re-invented MCORE :-) That's exactly
> what they do on systems where rebooting through the firmware
> doesn't preserve RAM.

Oh well, can't have a freshmeat db in my head I guess ;) That said,
I like this approach since you don't need to care about new drivers
and so on ... but since it's already out there I guess its
advantages have been covered elsewhere ...

Karim

===================================================
                 Karim Yaghmour
               karim@opersys.com
      Embedded and Real-Time Linux Expert
===================================================

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over. - Dave's crash code supports a gdb interface for LKCD crash dumps.
  2002-10-31 18:15           ` Andrew Morton
  2002-10-31 19:58             ` Bernhard Kaindl
@ 2002-11-02  0:49             ` Piet Delaney
  1 sibling, 0 replies; 333+ messages in thread
From: Piet Delaney @ 2002-11-02  0:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Anderson, kgdb, Linus Torvalds, piet, Matt D. Robinson,
	Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

On Thu, 2002-10-31 at 10:15, Andrew Morton wrote:
 
> (Disclaimer: I've never used lkcd.  I'm assuming that it's
> possible to gdb around in a dump)

I updated Dave Anderson's (Mission Critical) crash code to work 
with LKCD core dumps when I updated LKCD to support the ia64. 
Dave's crash code uses gdb as a command interpreter. It's not quite 
as flexible as using gdb macros on core dumps but it's very close 
and has lots of support for various kernel structures. For example, 
you can't just have ddd walk through data structures by simply 
clicking on pointers in data structures like you normally can.
 
> 
> >         In particular when it comes to this project, I'm told about
> >         "netdump", which doesn't try to dump to a disk, but over the net.
> 
> It could help.  But like serial console, the random person whose
> kernel just died often can't be bothered setting it up, or simply
> doesn't have the gear, or the crash is not repeatable.

Yes, ideally I'd like to have an integration between live gdb stub
debugging and crash debugging. I'd like to even be able to use ddd/gdb
on a core file and simulate execution. When using gdb on the kernel 
I've found it nice to move the cursor over the PC and move it to the 
end of panic(). Then single step back out of panic and re-execute 
the code that returned the error code that caused us to decide to panic.
Doing this in asm language with an asm debugger is too difficult for 
most folks.

I really liked HP's kwdb approach. kwdb has a tiny TCP/IP stack and
has direct hooks into the trap vectors like a normal kgdb stub. The
nice thing is you can attach to a crashed system over the internet
from anywhere in the world to debug the panic. I wasn't able to get
HP to release the kwdb gdb stub into the public domain. The gdb hacks
are available at:
 
http://h21007.www2.hp.com/dspp/tech/tech_TechSoftwareDetailPage_IDX/1,1703,257,00.html

but are based on a very old version of gdb and ia64 libraries.
	


> So.  _If_ lkcd gives me gdb-able images from time-of-crash, I'd
> like it please.  And I'm the grunt who spent nearly two years
> doing not much else apart from working 2.3/2.4 oops reports.

You can snarf a copy from:

	ftp://people.redhat.com/anderson

One area that I'm not sure of is if the lkcd kernel changes are a
problem with the kgdb patch (http://kgdb.sourceforge.net/). Perhaps
I can check into that in the near future.

I'd prefer to have both kgdb (http://kgdb.sourceforge.net/)
remote debugging and kgdb crash support available in stock kernels 
like the BSD kernels (NetBSD, FreeBSD). I don't know why the kgdb 
stub wasn't integrated into the kernel for the ia32 and ia64 platforms.
I suppose for reasons like we are hearing now on the LKCD kernel hooks.
The current LKCD code is at least a step in that direction.

-- 
piet@www.piet.net


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  8:23               ` Craig I. Hagan
  2002-11-01 14:03                 ` Patrick Finnegan
@ 2002-11-02  4:57                 ` Bill Davidsen
  1 sibling, 0 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-02  4:57 UTC (permalink / raw)
  To: Craig I. Hagan
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Craig I. Hagan wrote:

> If it becomes apparent through empirical data that crash dumps are a useful
> tool, I'm sure that Linus will become far more amenable. Until then, let's let
> him handle all of his other work which needs to get done.

Since he doesn't have the problem he will ignore the proof. Better be sure
we can generate ksymoops reports from the dump, so we can post them asking
for help. Anything else will get the old "I don't use that tool, can't
help." Or like Nvidia problems the "try it without the crash dump code,"
routine.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 13:28               ` Alan Cox
@ 2002-11-02  5:00                 ` Bill Davidsen
  2002-11-02 15:30                   ` Alan Cox
  2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-02  5:00 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On 1 Nov 2002, Alan Cox wrote:

> On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> > This never works. Be honest. Nobody takes out features, they are stuck 
> > once they get in. 
> 
> Linus I've asked a couple of times about killing sound/oss off now ALSA
> is integrated 8) While you are on the rant how about that ;)

Good point, that continues to disprove the theory that having one thing in
the kernel prevents development of a similar feature.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 20:06                       ` Steven King
@ 2002-11-02  5:17                         ` Bill Davidsen
  2002-11-02  5:36                           ` Zwane Mwaikambo
  2002-11-02 15:29                           ` Alan Cox
  0 siblings, 2 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-02  5:17 UTC (permalink / raw)
  To: Steven King
  Cc: Linus Torvalds, Joel Becker, Alan Cox, Chris Friesen,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Steven King wrote:

> On Friday 01 November 2002 11:18 am, Linus Torvalds wrote:
> 
> > To add insult to injury, you will not be able to actually _test_ any of
> > the real error paths in real life. Sure, you will be able to test forced
> > dumps on _your_ hardware, but while that is fine in the AIX model ("we
> > control the hardware, and charge the user five times what it is worth"),
> > again that doesn't mean _squat_ in the PC hardware space.
> 
>   On the other hand, ISC's system 5 r3 ran on commodity x86 hardware and the 
> crash dumper worked on the various disk hardware I had occasion to use it on 
> (mfm, scsi, ide), although one did need to make sure swap was larger than ram 
> or bad things would happen. 8-{.  

  The thing is that Solaris, AIX, and ISC are written by commercial
companies, they realize that customers need to be able to debug systems
which don't have a screen, a serial printer, etc. They do have disk. 

  I was hoping Alan would push Redhat to put this in their Linux so we
could resolve some of the ongoing problems which don't write an oops to a
log, but I guess none of the developers has to actually support production
servers and find out why they crash.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02  5:17                         ` Bill Davidsen
@ 2002-11-02  5:36                           ` Zwane Mwaikambo
  2002-11-03 14:08                             ` Bill Davidsen
  2002-11-02 15:29                           ` Alan Cox
  1 sibling, 1 reply; 333+ messages in thread
From: Zwane Mwaikambo @ 2002-11-02  5:36 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Steven King, Linus Torvalds, Joel Becker, Alan Cox,
	Chris Friesen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Sat, 2 Nov 2002, Bill Davidsen wrote:

>   The thing is that Solaris, AIX, and ISC are written by commercial
> companies, they realize that customers need to be able to debug systems
> which don't have a screen, a serial printer, etc. They do have disk. 
> 
>   I was hoping Alan would push Redhat to put this in their Linux so we
> could resolve some of the ongoing problems which don't write an oops to a
> log, but I guess none of the developers has to actually support production
> servers and find out why they crash.

Perhaps i'm being grossly naive here, but do none of these presumably x86 
production servers have a serial port? Not even PCI/ISA slots to 
add one? Serial would catch most of your oopsen anyway, and if you were 
borked enough that serial couldn't get the entire output, i somehow doubt 
dumping to disk could manage. And no i don't see anything wrong nor 
consider it studly to use oopses only for debugging...

	Zwane

-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  2:01           ` Matt D. Robinson
@ 2002-11-02 10:36             ` Brad Hards
  2002-11-02 19:28               ` [lkcd-devel] " Matt D. Robinson
  0 siblings, 1 reply; 333+ messages in thread
From: Brad Hards @ 2002-11-02 10:36 UTC (permalink / raw)
  To: Matt D. Robinson; +Cc: Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 1 Nov 2002 13:01, Matt D. Robinson wrote:
<snip>
> Uh ... have you read the patches?  Do you see how few the
> changes are to non-dump code?  Do you know that most of those
> changes only get triggered in a crash situation anyway?
I applied the patches, and reported some issues.
http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
I see no signs that any of them have been addressed, although I haven't tried 
a really recent set.

> Breakage occurs when people change code areas that are used
> all the time, like VM, network, block layer, etc.
Actually, this is the area that Linux is best at. If you break it, some poor 
sod will hit the problem, and you'll know really soon.

> Look at the patches and tell me where we are causing overhead
> and serious potential breakage.  If you find problems,
> then tell us, don't just comment on breakage scenarios.

I'm a fairly typical user - I just have a couple of desktop machines and a 
server/firewall. 

I don't have 700 nodes in a cluster, and when my machines break, it's normally 
something I did. Sometimes the desktop locks up (say every second month, 
unless I'm dicking with the kernel), but I reboot and everything is happy.

LKCD doesn't really seem to do anything for me - it wouldn't really worry me 
if it went in (since I don't have to maintain it - it isn't near any of my 
code), but I'd really prefer that having the _CONFIG option set to N didn't 
make the kernel any bigger, or change any code paths.

Is this unreasonable?

Brad

BTW: I admit that I'd be pretty pissed if Linus said that my code was 
"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for 
a few walks, and worry about how to get it integrated after that.


- -- 
http://linux.conf.au. 22-25Jan2003. Perth, Aust. I'm registered. Are you?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE9w6rCW6pHgIdAuOMRAlI5AJ48ELVdExIeCr5C5HtDpU5+1ZnuBQCdEji0
t4q2NjZQVGEumrz6b+CqEEs=
=xtYY
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-01 20:22                       ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-02 13:02                         ` Kai Henningsen
  0 siblings, 0 replies; 333+ messages in thread
From: Kai Henningsen @ 2002-11-02 13:02 UTC (permalink / raw)
  To: linux-kernel

yakker@aparity.com (Matt D. Robinson)  wrote on 01.11.02 in <Pine.LNX.4.44.0211011205330.26575-100000@nakedeye.aparity.com>:

> On Fri, 1 Nov 2002, Linus Torvalds wrote:

> |>And if you get these things wrong, you're quite likely to stomp on your
> |>disk. Hard. You may be trying to write the swap partition, but if the
> |>driver gets confused, you just overwrote all your important data. At which
> |>point it doesn't matter if your filesystem is journaling or not, since you
> |>just potentially overwrote it.
>
> We haven't seen this before, but it is always a possibility for any
> dump scenario.  That's why some choose netdump instead. :)

*If* you want safe dumping to a partition, it seems wrong to me to try to  
figure that out after the crash.

Instead,

* configure the crash space with a user-mode app or possibly a kernel  
command line arg
* Whenever repartitioning, check if the crash dump partition is affected,  
and if so, clear it until it is explicitly reconfigured
* Save a good checksum (say, md5 or sha1) of the crash partition config,  
and only dump if that checksum checks out

You might want to checksum even more than that, of course :-)
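
The three steps above could be sketched roughly as follows, in user-space C.
All names (struct dump_config, dump_config_seal, dump_config_on_repartition,
dump_config_may_dump) are hypothetical, and FNV-1a is used only as a small
self-contained stand-in for the md5 or sha1 suggested above; it is not a
cryptographic hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where a dump is allowed to go. */
struct dump_config {
    uint32_t device_id;     /* which disk */
    uint64_t start_sector;  /* first sector of the dump area */
    uint64_t sector_count;  /* length of the dump area */
};

struct sealed_dump_config {
    struct dump_config cfg;
    uint64_t checksum;      /* taken when the area was configured */
    int valid;              /* cleared when repartitioning touches the area */
};

/* Fold one 64-bit value into an FNV-1a hash, low byte first. */
static uint64_t fnv1a_u64(uint64_t hash, uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        hash ^= (value >> (8 * i)) & 0xff;
        hash *= 1099511628211ULL;           /* FNV-1a 64-bit prime */
    }
    return hash;
}

/* Hash the fields one by one so struct padding never enters the sum. */
uint64_t dump_config_checksum(const struct dump_config *cfg)
{
    uint64_t hash = 14695981039346656037ULL; /* FNV-1a 64-bit offset basis */
    hash = fnv1a_u64(hash, cfg->device_id);
    hash = fnv1a_u64(hash, cfg->start_sector);
    hash = fnv1a_u64(hash, cfg->sector_count);
    return hash;
}

/* Step 1: configure the crash space ahead of time and seal it. */
void dump_config_seal(struct sealed_dump_config *s,
                      const struct dump_config *cfg)
{
    assert(s != NULL && cfg != NULL);
    s->cfg = *cfg;
    s->checksum = dump_config_checksum(cfg);
    s->valid = 1;
}

/* Step 2: on repartitioning, clear the config if the changed range
 * [start, start + count) overlaps the dump area on the same device. */
void dump_config_on_repartition(struct sealed_dump_config *s,
                                uint32_t device_id,
                                uint64_t start, uint64_t count)
{
    uint64_t dump_end = s->cfg.start_sector + s->cfg.sector_count;
    uint64_t change_end = start + count;

    if (device_id == s->cfg.device_id &&
        start < dump_end && s->cfg.start_sector < change_end)
        s->valid = 0;
}

/* Step 3: at crash time, dump only if the sealed config still checks out. */
int dump_config_may_dump(const struct sealed_dump_config *s)
{
    return s->valid && s->checksum == dump_config_checksum(&s->cfg);
}
```

A crash handler would call dump_config_may_dump() before touching the disk
and simply skip the dump when it returns 0.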

But there's certainly a reason Netware liked to crash dump to a series of  
floppies - too bad those are much too small for today's machines. When  
floppy sizes stopped being slightly larger than standard RAM sizes[*], the  
computing public lost big time, and we haven't recovered from that.

[*] Apple ][+: 48 KB RAM, 140 KB floppy. IBM PC: 640 KB RAM, 1.2 MB  
floppy. (Yes, I know there were other combinations as well.) Where's my  
approximately-1-GB floppy that everyone and their aunt have installed  
today? No, CD writers are *not* universal. And burn-once CDs aren't much  
like floppies.

Of course, the same problem exists with general backup technology - tape  
the size of modern disks is not really affordable anymore.

MfG Kai

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 22:25           ` Pavel Machek
@ 2002-11-02 13:30             ` Michael Shuey
  0 siblings, 0 replies; 333+ messages in thread
From: Michael Shuey @ 2002-11-02 13:30 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Fri, Nov 01, 2002 at 11:25:04PM +0100, Pavel Machek wrote:
> > Wouldn't you rather they neatly tftp'd dumps to a nominated central
> > server which noticed the arrival, did the initial processing with a perl
> > script and mailed you a summary ?
> 
> Out of interest, what does such "initial processing" look like?

Toss an email to root and the operations staff including the name of the
machine that crashed and the output of lcrash's "report" command, as well
as the location of the dumps (ie, where they were saved on the machine that
died and where they are on an optional netdump server).

-- 
Mike Shuey

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02  5:17                         ` Bill Davidsen
  2002-11-02  5:36                           ` Zwane Mwaikambo
@ 2002-11-02 15:29                           ` Alan Cox
  2002-11-03  1:24                             ` [lkcd-general] " Matt D. Robinson
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-02 15:29 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Steven King, Linus Torvalds, Joel Becker, Chris Friesen,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Sat, 2002-11-02 at 05:17, Bill Davidsen wrote:
>   I was hoping Alan would push Redhat to put this in their Linux so we
> could resolve some of the ongoing problems which don't write an oops to a
> log, but I guess none of the developers has to actually support production
> servers and find out why they crash.

I think several Red Hat people would disagree very strongly. Red Hat
shipped with the kernel symbol decoding oops reporter for a good reason,
and also acquired netdump for a good reason. 

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02  5:00                 ` Bill Davidsen
@ 2002-11-02 15:30                   ` Alan Cox
  2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-02 15:30 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Sat, 2002-11-02 at 05:00, Bill Davidsen wrote:
> > Linus I've asked a couple of times about killing sound/oss off now ALSA
> > is integrated 8) While you are on the rant how about that ;)
> 
> Good point, that continues to disprove the theory that having one thing in
> the kernel prevents development of a similar feature.

It's preventing testing and it's making parallel fixing hard to manage.
I'd really like to kill off the OSS drivers to make sure the ALSA ones
are tested and anything only in OSS does get ported over.



^ permalink raw reply	[flat|nested] 333+ messages in thread

* RE: What's left over.
  2002-10-31  7:42             ` Alexander Viro
  2002-10-31 16:24               ` Stephen Wille Padnos
@ 2002-11-02 17:35               ` LA Walsh
  2002-11-02 20:44                 ` Chris Wedgwood
  1 sibling, 1 reply; 333+ messages in thread
From: LA Walsh @ 2002-11-02 17:35 UTC (permalink / raw)
  To: 'Alexander Viro', 'Dax Kelson'
  Cc: 'Chris Wedgwood', 'Rik van Riel',
	'Linus Torvalds', 'Rusty Russell',
	linux-kernel

	Then why do we need 'non-repudiation' w/r/t certificates?  Isn't the
idea to provide a way to isolate "bugs" in the "security" system?  If
something is written to a file by the group signon, who wrote it?

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of 
> Alexander Viro
> Sent: Wednesday, October 30, 2002 11:43 PM
> Then give them all the same account and be done with that.  Effect will
> be the same.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 20:37                       ` Hugh Dickins
@ 2002-11-02 18:23                         ` Geert Uytterhoeven
  2002-11-03  2:25                         ` Horst von Brand
  1 sibling, 0 replies; 333+ messages in thread
From: Geert Uytterhoeven @ 2002-11-02 18:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Joel Becker, Alan Cox, Bill Davidsen,
	Chris Friesen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Fri, 1 Nov 2002, Hugh Dickins wrote:
> I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
> OpenServer and UnixWare: which were addressing the PC market, not
> own hardware.
> 
> It's a real worry that writing a crash dump to disk might stomp in the
> wrong place, but I don't recall it ever happening in practice.  But
> occasionally, yes, a dump was not generated at all, or not completed.

IIRC, some years ago wuarchive.wustl.edu went down for a few days because the
machine panicked and dumped to the wrong partition...

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02  5:00                 ` Bill Davidsen
  2002-11-02 15:30                   ` Alan Cox
@ 2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
  2002-11-02 19:19                     ` romieu
  2002-11-02 20:31                     ` Alan Cox
  1 sibling, 2 replies; 333+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-11-02 18:55 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Alan Cox, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Em Sat, Nov 02, 2002 at 12:00:18AM -0500, Bill Davidsen escreveu:
> On 1 Nov 2002, Alan Cox wrote:
> 
> > On Fri, 2002-11-01 at 06:36, Linus Torvalds wrote:
> > > This never works. Be honest. Nobody takes out features, they are stuck 
> > > once they get in. 
> > 
> > Linus I've asked a couple of times about killing sound/oss off now ALSA
> > is integrated 8) While you are on the rant how about that ;)
> 
> Good point, that continues to disprove the theory that having one thing in
> the kernel prevents development of a similar feature.

SPX was also removed (hey, it never worked anyway) and probably econet and
ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
drivers/atm, but I'm not sure the latter will be useful without the former).

- Arnaldo

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
@ 2002-11-02 19:19                     ` romieu
  2002-11-02 19:21                       ` Arnaldo Carvalho de Melo
  2002-11-02 20:31                     ` Alan Cox
  1 sibling, 1 reply; 333+ messages in thread
From: romieu @ 2002-11-02 19:19 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Bill Davidsen, Alan Cox,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, linux-atm-general

[Cc: changed]

Arnaldo Carvalho de Melo <acme@conectiva.com.br> :
> Em Sat, Nov 02, 2002 at 12:00:18AM -0500, Bill Davidsen escreveu:
[...]
> > Good point, that continues to disprove the theory that having one thing in
> > the kernel prevents development of a similar feature.
>
> SPX was also removed (hey, it never worked anyway) and probably econet and
> ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> drivers/atm, but I'm not sure the latter will be useful without the former).

What's the deadline ?

--
Ueimor

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 19:19                     ` romieu
@ 2002-11-02 19:21                       ` Arnaldo Carvalho de Melo
  2002-11-02 19:32                         ` romieu
  0 siblings, 1 reply; 333+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-11-02 19:21 UTC (permalink / raw)
  To: romieu
  Cc: Bill Davidsen, Alan Cox, Linus Torvalds, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, linux-atm-general

Em Sat, Nov 02, 2002 at 08:19:17PM +0100, romieu@fr.zoreil.com escreveu:
> > SPX was also removed (hey, it never worked anyway) and probably econet and
> > ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> > drivers/atm, but I'm not sure the latter will be useful without the former).
> 
> What's the deadline ?

Plan was for 2.6.0

- Arnaldo

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-02 10:36             ` Brad Hards
@ 2002-11-02 19:28               ` Matt D. Robinson
  0 siblings, 0 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-02 19:28 UTC (permalink / raw)
  To: Brad Hards; +Cc: Linus Torvalds, linux-kernel, lkcd-general, lkcd-devel

On Sat, 2 Nov 2002, Brad Hards wrote:
|>I applied the patches, and reported some issues.
|>http://marc.theaimsgroup.com/?l=linux-kernel&m=103520434201014&w=2
|>I see no signs that any of them have been addressed, although I haven't tried 
|>a really recent set.

We did put your fixes in, if they don't work, let me know.

|>LKCD doesn't really seem to do anything for me - it wouldn't really worry me 
|>if it went in (since I don't have to maintain it - it isn't near any of my 
|>code), but I'd really prefer that having the _CONFIG option set to N didn't 
|>make the kernel any bigger, or change any code paths.
|>
|>Is this unreasonable?

Absolutely not.  I would expect most people to not use it, and I
would hope that most distributions would build it as a module but
not turn it on (unless they really wanted it on by default).

|>Brad
|>
|>BTW: I admit that I'd be pretty pissed if Linus said that my code was 
|>"stupid", but life isn't reasonable or fair. Take a few days off LKCD, go for 
|>a few walks, and worry about how to get it integrated after that.

It's neither here nor there anymore.  I think if companies like
Red Hat don't want it turned on, that's fine, but they should at
least allow their customers to have it available to them for
use, if that's what they want.

Of course, I'm not going to go through all the reasons why there's
a major disconnect between Linux distributions and hardware vendors,
but suffice it to say that's the root of the problem here.

--Matt


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 19:21                       ` Arnaldo Carvalho de Melo
@ 2002-11-02 19:32                         ` romieu
  2002-11-02 19:42                           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 333+ messages in thread
From: romieu @ 2002-11-02 19:32 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Bill Davidsen, Alan Cox,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, linux-atm-general

Arnaldo Carvalho de Melo <acme@conectiva.com.br> :
> Em Sat, Nov 02, 2002 at 08:19:17PM +0100, romieu@fr.zoreil.com escreveu:
[...]
> > What's the deadline ?
> 
> Plan was for 2.6.0

:o)
Is there a lower bound for its estimated arrival date ?

--
Ueimor

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 19:32                         ` romieu
@ 2002-11-02 19:42                           ` Arnaldo Carvalho de Melo
  2002-11-02 20:23                             ` romieu
  0 siblings, 1 reply; 333+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-11-02 19:42 UTC (permalink / raw)
  To: romieu
  Cc: Bill Davidsen, Alan Cox, Linus Torvalds, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, linux-atm-general

Em Sat, Nov 02, 2002 at 08:32:23PM +0100, romieu@fr.zoreil.com escreveu:
> Arnaldo Carvalho de Melo <acme@conectiva.com.br> :
> > Em Sat, Nov 02, 2002 at 08:19:17PM +0100, romieu@fr.zoreil.com escreveu:
> [...]
> > > What's the deadline ?
> > 
> > Plan was for 2.6.0
> 
> :o)
> Is there a lower bound for its estimated arrival date ?

:-) I think that if you state that you plan to work on it RSN we can forget
about removing it for now.

- Arnaldo

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 20:31                     ` Alan Cox
@ 2002-11-02 20:12                       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 333+ messages in thread
From: Arnaldo Carvalho de Melo @ 2002-11-02 20:12 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bill Davidsen, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Em Sat, Nov 02, 2002 at 08:31:29PM +0000, Alan Cox escreveu:
> On Sat, 2002-11-02 at 18:55, Arnaldo Carvalho de Melo wrote:
> > SPX was also removed (hey, it never worked anyway) and probably econet and
> > ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> > drivers/atm, but I'm not sure the latter will be useful without the former).
 
> ATM is actively used by large numbers of people [1]. It's in the fix
> rather than remove category. Econet should be trivial and might as well
> just be marked CONFIG_OBSOLETE until someone does deal with it.

Oh, cool, way more motivation to fix that stuff 8)

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 19:42                           ` Arnaldo Carvalho de Melo
@ 2002-11-02 20:23                             ` romieu
  0 siblings, 0 replies; 333+ messages in thread
From: romieu @ 2002-11-02 20:23 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Bill Davidsen, Alan Cox,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, linux-atm-general

Arnaldo Carvalho de Melo <acme@conectiva.com.br> :
[...]
> :-) I think that if you state that you plan to work on it RSN we can forget
> about removing it for now.

$*@#&%@ !

Will have to setup a burnproof testbed for ATM then.

--
Ueimor

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
  2002-11-02 19:19                     ` romieu
@ 2002-11-02 20:31                     ` Alan Cox
  2002-11-02 20:12                       ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-02 20:31 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Bill Davidsen, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Sat, 2002-11-02 at 18:55, Arnaldo Carvalho de Melo wrote:
> SPX was also removed (hey, it never worked anyway) and probably econet and
> ATM will be removed as well if nobody jumps to fix it (I mean net/atm, not
> drivers/atm, but I'm not sure the latter will be useful without the former).

ATM is actively used by large numbers of people [1]. It's in the fix
rather than remove category. Econet should be trivial and might as well
just be marked CONFIG_OBSOLETE until someone does deal with it.

Alan
[1] PPPoATM is used for a large number of DSL connections


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 17:35               ` LA Walsh
@ 2002-11-02 20:44                 ` Chris Wedgwood
  0 siblings, 0 replies; 333+ messages in thread
From: Chris Wedgwood @ 2002-11-02 20:44 UTC (permalink / raw)
  To: LA Walsh
  Cc: 'Alexander Viro', 'Dax Kelson',
	'Rik van Riel', 'Linus Torvalds',
	'Rusty Russell',
	linux-kernel

On Sat, Nov 02, 2002 at 09:35:17AM -0800, LA Walsh wrote:

> Then why do we need 'non-repudiation' w/r/t certificates?

we dont


  --cw

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 17:54           ` Matt D. Robinson
  2002-10-31 17:54             ` Linus Torvalds
@ 2002-11-02 23:44             ` Horst von Brand
  2002-11-03  1:14               ` Matt D. Robinson
  1 sibling, 1 reply; 333+ messages in thread
From: Horst von Brand @ 2002-11-02 23:44 UTC (permalink / raw)
  To: Matt D. Robinson; +Cc: linux-kernel

"Matt D. Robinson" <yakker@aparity.com> dijo:

[...]

> This isn't bloat.  If you want, it can be built as a module, and
> not as part of your kernel.  How can that be bloat?  People who
> build kernels can optionally build it in, but we're not asking
> that it be turned on by default, rather, built as a module so
> people can load it if they want to.  We made it into a module
> because 18 months ago you complained about it being bloat.  We
> addressed your concerns.

Bloat is not just RAM/CPU/... usage when in use, it is much more about
developers who have to understand, work with, and consider how to use or
interface with the new code. Even more so when it is not builtin, as this
creates _two_ scenarios to consider.

This is the sense of "bloat" that Linus is most worried about (and very
rightly so, IMVHO). At least that is my observation over the years.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02 23:44             ` Horst von Brand
@ 2002-11-03  1:14               ` Matt D. Robinson
  0 siblings, 0 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-03  1:14 UTC (permalink / raw)
  To: Horst von Brand; +Cc: linux-kernel

I'm not sure I understand your point, Horst.  There are four
primary mechanisms which would invoke a dump:

	die() (or die_if_kernel())
	panic()
	interrupt-driven dumps
	sysrq()

Assuming you call these functions, there is a single dump()
call that will perform dumping, if the dump_function_ptr
(which is assigned when the dump module is loaded) is set.
dump() is a simple function that basically says:

static inline void dump(char * str, struct pt_regs * regs)
{
	if (dump_function_ptr) {
		dump_function_ptr((char *)str, regs);
	}
}

str is for the panic() string, and regs are so you can create
a proper stack trace for the failing task on the correct CPU.

I don't see how that can be attributed to bloating the kernel.
If you don't panic(), the code is never invoked.  If you don't load
the dump module, dump_function_ptr isn't assigned.  It's meant
to be non-invasive, off to the side and called when required
(or requested).
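
To make the "off to the side" behaviour concrete, here is a minimal
user-space sketch of the same hook pattern. It is not the LKCD code:
dump_register(), dump_unregister() and example_dump_handler() are
hypothetical names added for illustration, and struct pt_regs is left
opaque because only a pointer to it is passed around.

```c
#include <stddef.h>

struct pt_regs; /* opaque here; the kernel type holds the saved registers */

typedef void (*dump_fn_t)(const char *str, struct pt_regs *regs);

/* Stays NULL until a dump "module" registers a handler. */
static dump_fn_t dump_function_ptr;

/* The only code on the common path: one pointer test, then the call. */
static void dump(const char *str, struct pt_regs *regs)
{
	if (dump_function_ptr)
		dump_function_ptr(str, regs);
}

/* Hypothetical stand-ins for what module load and unload would do. */
static void dump_register(dump_fn_t fn)
{
	dump_function_ptr = fn;
}

static void dump_unregister(void)
{
	dump_function_ptr = NULL;
}

/* Example handler: records that a dump was requested, and why. */
static int dumps_taken;
static const char *last_reason;

static void example_dump_handler(const char *str, struct pt_regs *regs)
{
	(void)regs; /* a real handler would walk the stack from these */
	dumps_taken++;
	last_reason = str;
}
```

With no handler registered, dump() does nothing beyond the pointer test;
after dump_register(example_dump_handler) each call reaches the handler.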

There is some additional code put in the kernel to disable
interrupts, quiesce the system, and I think there are a few projects
that can probably use the same code base (such as the suspend-to-ram
project, which I was just informed about).  All of that is called
within the dump driver itself, otherwise it sits quietly off to
the side, never getting called.

Using the dump driver infrastructure is like writing any plain-jane
driver.  You set up the _open(), _close(), etc., functions,
assigning the ops table based on the dump method you want to use
(disk, network, mini-oopser, etc.)  This isn't that difficult,
and it should only be loaded for those customer systems that want
a specific dump style.
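
As a rough illustration of the ops-table idea, the sketch below picks a
table by method name and runs open/write/close through it. The struct
layout, the method names and the byte-counting stubs are all assumptions
made for this example; they are not the LKCD driver interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-method operations table. */
struct dump_ops {
	const char *name;
	int (*open)(void);
	int (*write)(const void *buf, size_t len);
	int (*close)(void);
};

/* Stub "disk" method: only counts the bytes it was handed. */
static size_t disk_bytes;
static int disk_open(void) { disk_bytes = 0; return 0; }
static int disk_write(const void *buf, size_t len)
{
	(void)buf;
	disk_bytes += len;
	return 0;
}
static int disk_close(void) { return 0; }

/* Stub "network" method, same shape as the disk one. */
static size_t net_bytes;
static int net_open(void) { net_bytes = 0; return 0; }
static int net_write(const void *buf, size_t len)
{
	(void)buf;
	net_bytes += len;
	return 0;
}
static int net_close(void) { return 0; }

static const struct dump_ops dump_methods[] = {
	{ "disk",    disk_open, disk_write, disk_close },
	{ "network", net_open,  net_write,  net_close  },
};

/* Look up the ops table for a method name; NULL if unknown. */
static const struct dump_ops *dump_select(const char *method)
{
	size_t i;

	for (i = 0; i < sizeof dump_methods / sizeof dump_methods[0]; i++)
		if (strcmp(dump_methods[i].name, method) == 0)
			return &dump_methods[i];
	return NULL;
}

/* Drive one dump through whichever method was selected. */
static int dump_run(const struct dump_ops *ops, const void *buf, size_t len)
{
	int err;

	if (!ops)
		return -1;
	err = ops->open();
	if (err)
		return err;
	err = ops->write(buf, len);
	ops->close(); /* always close once open succeeded */
	return err;
}
```

Adding another method (a mini-oopser, say) would then mean one more
entry in dump_methods[] and nothing else on the calling side.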

--Matt

Standard disclaimer:  I'm not trying anymore to get this into the
kernel at this time (via Linus).  This is purely for educating
those that aren't familiar with crash dumping for Linux.

On Sat, 2 Nov 2002, Horst von Brand wrote:
|>"Matt D. Robinson" <yakker@aparity.com> dijo:
|>
|>[...]
|>
|>> This isn't bloat.  If you want, it can be built as a module, and
|>> not as part of your kernel.  How can that be bloat?  People who
|>> build kernels can optionally build it in, but we're not asking
|>> that it be turned on by default, rather, built as a module so
|>> people can load it if they want to.  We made it into a module
|>> because 18 months ago you complained about it being bloat.  We
|>> addressed your concerns.
|>
|>Bloat is not just RAM/CPU/... usage when in use, it is much more about
|>developers who have to understand, work with, and consider how to use or
|>interface with the new code. Even more so when it is not builtin, as this
|>creates _two_ scenarios to consider.
|>
|>This is the sense of "bloat" that Linus is most worried about (and very
|>rightly so, IMVHO). At least that is my observation over the years.
|>

-- 


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-02 15:29                           ` Alan Cox
@ 2002-11-03  1:24                             ` Matt D. Robinson
  2002-11-03  1:49                               ` Alan Cox
  2002-11-03  3:10                               ` Christoph Hellwig
  0 siblings, 2 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-03  1:24 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bill Davidsen, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On 2 Nov 2002, Alan Cox wrote:
|>On Sat, 2002-11-02 at 05:17, Bill Davidsen wrote:
|>>   I was hoping Alan would push Redhat to put this in their Linux so we
|>> could resolve some of the ongoing problems which don't write an oops to a
|>> log, but I guess none of the developers has to actually support production
|>> servers and find out why they crash.
|>
|>I think several Red Hat people would disagree very strongly. Red Hat
|>shipped with the kernel symbol decoding oops reporter for a good reason,
|>and also acquired netdump for a good reason. 

It would be great if crash dumping were an option, at the very least
to unify the netdump, oops reporter and disk dumping (for those that
want it) into a single infrastructure.  Long term, that's probably
where this is going anyway.  It takes away the religious "who is right"
argument, which is fundamentally silly.

Maybe one day.  I think quite a few Red Hat customers would
appreciate it.

--Matt

P.S.  IBM shouldn't have signed a contract with Red Hat without
      requiring certain features in Red Hat's OS(es).  Pushing for
      LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
      variety of cases if that had been done in the first place.

P.S.  As an aside, too many engineers try and make product marketing
      decisions at Red Hat.  I personally think that's really bad for
      their business model as a whole (and I'm not referring to LKCD).


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03  1:24                             ` [lkcd-general] " Matt D. Robinson
@ 2002-11-03  1:49                               ` Alan Cox
  2002-11-03  9:34                                 ` [lkcd-devel] " Matt D. Robinson
  2002-11-03 14:33                                 ` Bill Davidsen
  2002-11-03  3:10                               ` Christoph Hellwig
  1 sibling, 2 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-03  1:49 UTC (permalink / raw)
  To: Matt D. Robinson
  Cc: Bill Davidsen, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Sun, 2002-11-03 at 01:24, Matt D. Robinson wrote:
> P.S.  IBM shouldn't have signed a contract with Red Hat without
>       requiring certain features in Red Hat's OS(es).  Pushing for
>       LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
>       variety of cases if that had been done in the first place.

I would hope IBM have more intelligence than to attempt to destroy the
product by trying to force all sorts of junk into it. The Linux world
has a process for filtering crap; it isn't IBM applying force. That path
leads to Star Office 5.2, Netscape 4 and other similar scales of horror
code that become unmaintainably bad.

> P.S.  As an aside, too many engineers try and make product marketing
>       decisions at Red Hat.  I personally think that's really bad for
>       their business model as a whole (and I'm not referring to LKCD).

You think things like EVMS are a product marketing decision. I'm very
glad you don't run a Linux distro. It would turn into something like the
old 3com rapops rather rapidly by your models (3com rapops btw ceased to
exist and for good reasons)

Alan


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 20:37                       ` Hugh Dickins
  2002-11-02 18:23                         ` Geert Uytterhoeven
@ 2002-11-03  2:25                         ` Horst von Brand
  2002-11-04 16:18                           ` Hugh Dickins
  1 sibling, 1 reply; 333+ messages in thread
From: Horst von Brand @ 2002-11-03  2:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel

Hugh Dickins <hugh@veritas.com> said:

[...]

> I dealt with crash dumps quite a lot over 10 years with SCO UNIX,
> OpenServer and UnixWare: which were addressing the PC market, not
> own hardware.

What I remember about hardware compatibility for SCO Unix and Solaris on
ia32 is _not_ funny. Lightyears from what Linux handles today without
breaking a sweat.

> It's a real worry that writing a crash dump to disk might stomp in the
> wrong place, but I don't recall it ever happening in practice.  But
> occasionally, yes, a dump was not generated at all, or not completed.

How do you test that? Not in some contrived situation, under real crashes.
Don't just consider crashes in the official $DISTRIBUTION kernel, but in
Linus' BK tree, or some of the random, two-or-three-letter-trees of the day
(_that_ is where crashes happen, _that_ is where the info would be most
valuable). It gets _real_ hairy _real_ fast to make sure you don't scribble
over /home or /etc on the user's disk...

> Of course, you could argue that SCO's disk drivers were more stable :-)

If you only handle a few, thoroughly tested, high-end controllers and
disks, that is not too hard to do.
--
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03  1:24                             ` [lkcd-general] " Matt D. Robinson
  2002-11-03  1:49                               ` Alan Cox
@ 2002-11-03  3:10                               ` Christoph Hellwig
  1 sibling, 0 replies; 333+ messages in thread
From: Christoph Hellwig @ 2002-11-03  3:10 UTC (permalink / raw)
  To: Matt D. Robinson
  Cc: Alan Cox, Bill Davidsen, Steven King, Linus Torvalds,
	Joel Becker, Chris Friesen, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Sat, Nov 02, 2002 at 05:24:17PM -0800, Matt D. Robinson wrote:
> P.S.  IBM shouldn't have signed a contract with Red Hat without
>       requiring certain features in Red Hat's OS(es).  Pushing for
>       LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
>       variety of cases if that had been done in the first place.

Bah, it's enough that IBM's money totally fucked up the tree of one popular
distribution..


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: [lkcd-general] Re: What's left over.
  2002-11-03  1:49                               ` Alan Cox
@ 2002-11-03  9:34                                 ` Matt D. Robinson
  2002-11-03 14:33                                 ` Bill Davidsen
  1 sibling, 0 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-03  9:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bill Davidsen, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On 3 Nov 2002, Alan Cox wrote:
|>On Sun, 2002-11-03 at 01:24, Matt D. Robinson wrote:
|>> P.S.  IBM shouldn't have signed a contract with Red Hat without
|>>       requiring certain features in Red Hat's OS(es).  Pushing for
|>>       LKCD, kprobes, LTT, etc., wouldn't be on this list for a whole
|>>       variety of cases if that had been done in the first place.
|>
|>I would hope IBM have more intelligence than to attempt to destroy the
|>product by trying to force all sorts of junk into it. The Linux world
|>has a process for filtering crap; it isn't IBM applying force. That path
|>leads to Star Office 5.2, Netscape 4 and other similar scales of horror
|>code that become unmaintainably bad.

I think you misunderstand me.  If IBM considers a feature to be useful,
they should require distributions to put it into a release from a contractual
standpoint.  That doesn't mean Red Hat has to put it into all their
distributions -- it just means they have to produce something that
IBM wants.  If nobody else uses it, that's fine.  IBM gets what they
want, and Red Hat gets what they want.  End of story.

You're looking at this from an engineering perspective and open source
philosophy rather than a business unit at a company like IBM might look
at it.  That's not a bad thing to do, but the two concepts are very
different from each other.  The Linux world may filter "crap", which
is great, but some of that "crap" is important to companies like IBM,
and if they were smart they'd use their leverage ($$$) to make sure the
"crap" ends up in the products they care to use/support.  The rest of
Linux can do whatever it wants, doing things the "Linux world" way.

|>> P.S.  As an aside, too many engineers try and make product marketing
|>>       decisions at Red Hat.  I personally think that's really bad for
|>>       their business model as a whole (and I'm not referring to LKCD).
|>
|>You think things like EVMS are a product marketing decision. I'm very
|>glad you don't run a Linux distro. It would turn into something like the
|>old 3com rapops rather rapidly by your models (3com rapops btw ceased to
|>exist and for good reasons)

Again, I wasn't mentioning any product in particular.  Making decisions
like GPL-only as an engineering philosophy rather than as a product
marketing decision are more problematic than looking at EVMS vs. anything
else as a question of which is technically better.

But again, that's a complete aside and would probably open up a plethora
of opinions from people who care about both sides of that argument, and
would inevitably head down a rathole infinitely deep.

--Matt


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 13:26                 ` Alan Cox
  2002-11-01 19:00                   ` Joel Becker
@ 2002-11-03 13:48                   ` Bill Davidsen
  2002-11-03 14:26                     ` yodaiken
  2002-11-04  2:44                     ` [lkcd-general] " Jennie Haywood
  1 sibling, 2 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-03 13:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Chris Friesen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On 1 Nov 2002, Alan Cox wrote:

> On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> >   From the standpoint of just the driver that's true. However, the remote
> > machine and all the network bits between them are a string of single
> > points of failure. Isn't it good that both disk and network can be
> > supported.
> 
> My concerns are solely with things like the correctness of the disk
> dumper. Its obviously a good way to do a lot more damage if it isnt done
> carefully. Quite clearly your dump system wants to support multiple dump
> targets so you can dump to pci battery backed ram, down the parallel
> port to an analysing box etc

Quite clearly SCO, Sun, and IBM have been doing this for years without
offering dozens of options. I don't need it to sing and dance, I just need
a way to put the dump where I can find it. I'm not going to put another
box in at the end of a serial or parallel port, I don't have NVram, I do
have lots of disk, and so does almost everyone else. I have remote
systems in wiring closets all over the country (all four time zones). They
are at the end of open net connections, unreliable and untrusted. I don't
want to bet that I have a working VPN, or that I can safely send all that
data without it being read by someone other than me.

The AIX support has a group just to beat on dumps customers send. What
more evidence is needed that people can and do use the capability.

I had hoped that someone would do this for Linux, I never dreamed that
it would be kept out of the kernel by people who clearly don't understand
the problems of distributed and clustered headless systems.

I guess the development folks are working on more important things like
xiafs and morse code dumps to the keyboard LEDs.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-02  5:36                           ` Zwane Mwaikambo
@ 2002-11-03 14:08                             ` Bill Davidsen
  0 siblings, 0 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-03 14:08 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: Steven King, Linus Torvalds, Joel Becker, Alan Cox,
	Chris Friesen, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Sat, 2 Nov 2002, Zwane Mwaikambo wrote:

> On Sat, 2 Nov 2002, Bill Davidsen wrote:
> 
> >   The thing is that Solaris, AIX, and ISC are written by commercial
> > companies, they realize that customers need to be able to debug systems
> > which don't have a screen, a serial printer, etc. They do have disk. 
> > 
> >   I was hoping Alan would push Redhat to put this in their Linux so we
> > could resolve some of the ongoing problems which don't write an oops to a
> > log, but I guess none of the developers has to actually support production
> > servers and find out why they crash.
> 
> Perhaps i'm being grossly naive here, but none of these presumably x86 
> productions servers don't have a serial port? Not even PCI/ISA slots to 
> add one? Serial would catch most of your oopsen anyway, and if you were 
> borked enough that serial couldn't get the entire output, i somehow doubt 
> dumping to disk could manage. And no i don't see anything wrong nor 
> consider it studly to use oopses only for debugging...

I have distributed servers in 15 locations, six states, four timezones. In
secure unattended locations like wiring closets. What do I do with the
serial port? Do I double my colocation costs and have another system there
to listen? Is the code on a sick system going to dial the modem on the
serial line and establish a connection?

I have a mix of Linux, Solaris, and AIX systems deployed, and only the
Linux systems don't have this capability. Actually for the most part only
the Linux systems NEED it, that's another problem, but reliability would
go up if I could see the problem.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-03 13:48                   ` Bill Davidsen
@ 2002-11-03 14:26                     ` yodaiken
  2002-11-05 17:09                       ` Bill Davidsen
  2002-11-04  2:44                     ` [lkcd-general] " Jennie Haywood
  1 sibling, 1 reply; 333+ messages in thread
From: yodaiken @ 2002-11-03 14:26 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Alan Cox, Linus Torvalds, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> On 1 Nov 2002, Alan Cox wrote:
> 
> > On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> > >   From the standpoint of just the driver that's true. However, the remote
> > > machine and all the network bits between them are a string of single
> > > points of failure. Isn't it good that both disk and network can be
> > > supported.
> > 
> > My concerns are solely with things like the correctness of the disk
> > dumper. Its obviously a good way to do a lot more damage if it isnt done
> > carefully. Quite clearly your dump system wants to support multiple dump
> > targets so you can dump to pci battery backed ram, down the parallel
> > port to an analysing box etc
> 
> Quite clearly SCO, Sun, and IBM have been doing this for years without
> offering dozens of options. I don't need it to sing and dance, I just need
> a way to put the dump where I can find it. I'm not going to put another
> box in at the end of a serial or parallel port, I don't have NVram, I do
> have lots of disk, and so does almost everyone else. I have remote
> systems in wiring closets all over the country (all four time zones). They
> are at the end of open net connections, unreliable and untrusted. I don't
> want to bet that I have a working VPN, or that I can safely send all that
> data without it being read by someone other than me.
> 
> The AIX support has a group just to beat on dumps customers send. What
> more evidence is needed that people can and do use the capability.
> 
> I had hoped that someone would do this for Linux, I never dreamed that

You paid someone for this for AIX. So the solution is obvious for Linux.



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03  1:49                               ` Alan Cox
  2002-11-03  9:34                                 ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-03 14:33                                 ` Bill Davidsen
  2002-11-03 15:34                                   ` Bernd Eckenfels
  2002-11-03 16:32                                   ` Alan Cox
  1 sibling, 2 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-03 14:33 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matt D. Robinson, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On 3 Nov 2002, Alan Cox wrote:

> I would hope IBM have more intelligence than to attempt to destroy the
> product by trying to force all sorts of junk into it. The Linux world
> has a process for filtering crap, it isnt IBM applying force. That path
> leads to Star Office 5.2, Netscape 4 and other similar scales of horror
> code that become unmaintainably bad.

If you define "unmaintainably bad" as "having features you don't need"
then I agree. But since dump to disk is in almost every other commercial
UNIX, maybe someone would question why it's good for others but not for
Linux.

I can agree on stuff the non-hacker wouldn't use, but that is exactly who
uses the crash dump in AIX, the person who wants to send a compressed dump
and money to IBM and get back a fix. Netdump assumes external resources
and a functional secure network (is the dump encrypted and I missed it?)
which home users surely don't have, and remote servers often lack as well.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03 14:33                                 ` Bill Davidsen
@ 2002-11-03 15:34                                   ` Bernd Eckenfels
  2002-11-03 16:32                                   ` Alan Cox
  1 sibling, 0 replies; 333+ messages in thread
From: Bernd Eckenfels @ 2002-11-03 15:34 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.3.96.1021103092330.5197D-100000@gatekeeper.tmr.com> you wrote:
> If you define "unmaintainably bad" as "having features you don't need"
> then I agree. But since dump to disk is in almost every other commercial
> UNIX, maybe someone would question why it's good for others but not for
> Linux.

It is even in FreeBSD or Windows > ME

Greetings
Bernd

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03 14:33                                 ` Bill Davidsen
  2002-11-03 15:34                                   ` Bernd Eckenfels
@ 2002-11-03 16:32                                   ` Alan Cox
  2002-11-03 17:08                                     ` [lkcd-devel] " Matt D. Robinson
  2002-11-05 18:07                                     ` Bill Davidsen
  1 sibling, 2 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-03 16:32 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Matt D. Robinson, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Sun, 2002-11-03 at 14:33, Bill Davidsen wrote:
> If you define "unmaintainably bad" as "having features you don't need"
> then I agree. But since dump to disk is in almost every other commercial
> UNIX, maybe someone would question why it's good for others but not for
> Linux.

It isn't about features, it's about clean maintainable code. netdump to me
doesn't mean no dump to disk option. In fact I'd rather like to be able
to insmod dump-foo.o. The correctness issues are hard but if the
dump-foo is standalone, resets the hardware and has an SHA integrity
check then it can be done (think of it as a post crash variant of the
trusted computing TCB verification problem)

> uses the crash dump in AIX, the person who wants to send a compressed dump
> and money to IBM and get back a fix. Netdump assumes external resources

Lots of interesting legal issues but yes you can do it sometimes (DMCA,
privacy, financial duties sometimes make it horribly complex). Even in
the case where you only dump the oops it's still valuable.

> and a functional secure network (is the dump encrypted and I missed it?)
> which home users surely don't have, and remote servers often lack as well.

Encrypting the dump with the new crypto lib in the kernel would be easy,
right now it doesn't.

My disk dump concerns are purely those of correctness. That means

1.	After loading the module getting the block list for the dump target

2.	Resetting and scratch initializing the dump device

3.	Not relying on any code outside of the dump TCB that may have
been corrupted

4.	At dump time turning off all bus masters, doing the dump TCB
verification and then dumping

Most of the pieces already exist.
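
As a rough illustration of points 1 and 4 above, here is a minimal Python
sketch of the idea: record the block list and a digest of the dump code at
load time, then refuse to dump if the code no longer matches. This is not
LKCD or netdump code. Every name in it (DumpTarget, prepare, safe_to_dump)
is hypothetical, and SHA-256 is an assumption, since the text only says
"SHA".

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DumpTarget:
    """State captured when the (hypothetical) dump module is loaded."""
    block_list: tuple   # step 1: blocks of the dump target, resolved up front
    code_digest: bytes  # reference digest of the dump code (the "TCB")


def prepare(dump_code: bytes, blocks) -> DumpTarget:
    """Load time: remember where to write and what the dump code looked like."""
    return DumpTarget(tuple(blocks), hashlib.sha256(dump_code).digest())


def safe_to_dump(dump_code: bytes, target: DumpTarget) -> bool:
    """Dump time (step 4): verify the dump code before trusting it."""
    return hashlib.sha256(dump_code).digest() == target.code_digest


if __name__ == "__main__":
    code = b"\x90" * 64                      # stand-in for the dump code image
    target = prepare(code, [10, 11, 12])
    print(safe_to_dump(code, target))        # unchanged code passes the check
    corrupted = code[:-1] + b"\x00"          # simulate post-crash corruption
    print(safe_to_dump(corrupted, target))   # corrupted code is rejected
```

A real implementation would also have to reset the device and quiesce bus
masters (steps 2 and 4), which a userspace sketch cannot show.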



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: [lkcd-general] Re: What's left over.
  2002-11-03 16:32                                   ` Alan Cox
@ 2002-11-03 17:08                                     ` Matt D. Robinson
  2002-11-05 18:07                                     ` Bill Davidsen
  1 sibling, 0 replies; 333+ messages in thread
From: Matt D. Robinson @ 2002-11-03 17:08 UTC (permalink / raw)
  To: Alan Cox
  Cc: Bill Davidsen, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On 3 Nov 2002, Alan Cox wrote:
|>Encrypting the dump with the new crypto lib in the kernel would be easy,
|>right now it doesnt. 

Piece of cake.  It's like adding a dump compression module.  You
can load dump_gzip.o or dump_rle.o to specify the kind of compression
you want to use.  dump_crypto.o would be the same kind of thing.  Just
add another flag and away you go.
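
To make the "load a module, set a flag" idea concrete, here is a small
Python sketch of a pluggable per-page compressor table. It only
illustrates the pattern and is not the LKCD interface. The names
(COMPRESSORS, compress_page, rle_encode, rle_decode) and the
(count, byte) RLE format are assumptions made for this example.

```python
import gzip


def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, byte) pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(data: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    for j in range(0, len(data), 2):
        out += bytes([data[j + 1]]) * data[j]
    return bytes(out)


# Each "module" registers one function under a name; the dump path only
# looks the name up, so adding a new method does not touch the dump code.
COMPRESSORS = {
    "none": lambda page: page,
    "rle": rle_encode,
    "gzip": gzip.compress,
}


def compress_page(page: bytes, method: str = "none") -> bytes:
    """Compress one page of the dump with the selected method."""
    return COMPRESSORS[method](page)


if __name__ == "__main__":
    page = b"\x00" * 4096                    # a zero-filled page compresses well
    packed = compress_page(page, "rle")
    print(len(packed), rle_decode(packed) == page)
```

An encryption step would slot into the same table as one more entry.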

|>My disk dump concerns are purely those of correctness. That means
|>
|> [ ... ]
|>
|>Most of the pieces already exist.

It's just a matter of time, then.

--Matt


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01 16:16                             ` Patrick Finnegan
                                                 ` (2 preceding siblings ...)
  2002-11-01 18:23                               ` Shane R. Stixrud
@ 2002-11-04  2:13                               ` Rob Landley
  2002-11-04 14:58                                 ` Patrick Finnegan
  3 siblings, 1 reply; 333+ messages in thread
From: Rob Landley @ 2002-11-04  2:13 UTC (permalink / raw)
  To: Patrick Finnegan, linux-kernel

On Friday 01 November 2002 16:16, Patrick Finnegan wrote:

> > It's not a fscking public service.  Linus has full control over his
> > tree.  You have equally full control over your tree.  Linus can't
> > tell you what patches to apply in your tree.  You can't tell Linus
> > what patches he should apply to his.
>
> I'm sorry it _is_ a public service.  Once tens of people started
> contributing to it, it became one.  This is like saying that the
> Washington Monument belongs to the people that maintain it, any building
> belongs to the repair crews and janitors.

You pay taxes to support the Washington Monument.  When's the last time you
paid a tax to Linus?

> I'm not saying that Linus is
> necessarily a janitor, but when you consider how much of the Linux kernel
> that he didn't write, you may realize that it's not just his kernel.

He's the editor of a periodical publication.  A cross between an academic 
technical journal which people contribute to for professional reasons, and a 
hobbyist fanzine that people contribute to 'cause it's cool.  This is not a 
new thing, there are real-world precedents for this sort of relationship 
going back hundreds of years, to the invention of the printing press...

Linus's editorial decisions are as final and unappealable as any other 
editorial decision at a magazine or newspaper.  You can publish your article 
elsewhere, and if it doesn't have the same prestige as the Harvard Law Review 
or the New England Journal of Medicine, tough.  They said no.

And like ALL editors, his job isn't to write a significant portion of the 
articles in the publication, but to be a Sturgeon's Law filter throwing out 
99% of the submissions in the slush pile, correcting the spelling and grammar 
of the remaining few, and trying to stitch them together into a coherent 
whole.

Go track down somebody with a Journalism degree if you want to understand 
Linus's job.

>  It
> also belongs to every single person that has written even a single
> line of code in it.

If you get an article published in Time magazine, and you say that this gives 
you the right to print your own copies of Time and distribute them yourself, 
Time's lawyers are going to come after you.

The GPL gives you the ability to do this, but it doesn't obligate the 
publication's editor to listen to you.  If next month's issue contains a huge 
rebuttal to one of your articles, calling you a boogerhead, tough.  The 
editor doesn't owe you anything as a previous contributor, and certainly 
doesn't owe you anything as someone from whom he did NOT take a submission.

What Linus basically said was that if a significant number of distributions 
integrated it, he might take another look at the thing in the future.  But it
wasn't going into 2.5.

Now, thanks to people pestering him beyond the Annoyance Event Horizon, he's 
got his fingers in his ears.  Congratulations.  Hopefully, he'll calm down a 
bit in a few months, but there's no guarantee.  In the mean time, the most 
productive thing to do is drop the topic and work on the Red Hat, SuSE, and 
Debian guys.  (Mandrake feeds from Red Hat, and SuSE is now making kernels
for Conectiva and TurboLinux.  Gentoo and Slackware might be good to bug as
well...)

See if you can convince Alan Cox to pick up your patch.  That'll get you Red 
Hat, and the single largest concentration of roll-your-own kernel guys 
outside of Linus's own tree.

Rob

-- 
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad, 
CmdrTaco, liquid nitrogen ice cream, and caffienated jello.  Well why not?




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03 13:48                   ` Bill Davidsen
  2002-11-03 14:26                     ` yodaiken
@ 2002-11-04  2:44                     ` Jennie Haywood
  2002-11-04 14:45                       ` Henning P. Schmiedehausen
  1 sibling, 1 reply; 333+ messages in thread
From: Jennie Haywood @ 2002-11-04  2:44 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Alan Cox, Linus Torvalds, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

Bill Davidsen wrote:

>
> On 1 Nov 2002, Alan Cox wrote:
>
> > On Fri, 2002-11-01 at 06:34, Bill Davidsen wrote:
> > >   From the standpoint of just the driver that's true. However, the remote
> > > machine and all the network bits between them are a string of single
> > > points of failure. Isn't it good that both disk and network can be
> > > supported.
> >
> The AIX support has a group just to beat on dumps customers send. What
> more evidence is needed that people can and do use the capability.
>

AIX has 4 people doing dumps in Austin (otherwise known as ZTRANS).  There
are others in other countries.  The folks from other countries were brought
to Austin for training (usually for 3 months).  There is usually one person
in L3 doing dumps in Austin for service, although every subsystem has
someone that specializes in reading dumps for that subsystem.

The first 4 people only do a scan of the dump to see if it's a known
problem.  If it's not a known problem AND it's in AIX code it goes to
whoever it is that owns that subsystem.

Dumps are only the beginning with AIX.  Trace hooks along with dumps are
VERY useful.  The trace hooks are also what the performance people use.

The Linux kernel  is _extremely_  painful to debug compared to AIX.


--
Jennie Haywood
jehaywood@compuserve.com
Everyone is crazy. It's just a matter of degree.
jehaywood@yahoo.com
-
The oak tree in your backyard is just a nut that held its ground.



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-04 14:58                                 ` Patrick Finnegan
@ 2002-11-04 12:59                                   ` Rob Landley
  0 siblings, 0 replies; 333+ messages in thread
From: Rob Landley @ 2002-11-04 12:59 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linux-kernel

On Monday 04 November 2002 14:58, Patrick Finnegan wrote:
> First I want to apologize to anyone I've pissed off too badly with this.

Sorry, didn't mean to bring up an old issue.  DSL was out over the weekend 
here (thank you Southwestern Bell), so some stuff queued up in my laptop's 
outbox...

Rob

-- 
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad, 
CmdrTaco, liquid nitrogen ice cream, and caffienated jello.  Well why not?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04  2:44                     ` [lkcd-general] " Jennie Haywood
@ 2002-11-04 14:45                       ` Henning P. Schmiedehausen
  2002-11-04 15:29                         ` Alan Cox
  2002-11-05  4:57                         ` Werner Almesberger
  0 siblings, 2 replies; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-11-04 14:45 UTC (permalink / raw)
  To: linux-kernel

Jennie Haywood <jehaywood@compuserve.com> writes:

>The Linux kernel  is _extremely_  painful to debug compared to AIX.

Good! This means, people debugging the code have actually to think and
don't produce "turn on debugger, step here, there, patch a band aid,
done" solutions you see with various other "commercial products" (can
anyone really say "Internet Explorer" on this list and live? ;-) )

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-04  2:13                               ` Rob Landley
@ 2002-11-04 14:58                                 ` Patrick Finnegan
  2002-11-04 12:59                                   ` Rob Landley
  0 siblings, 1 reply; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-04 14:58 UTC (permalink / raw)
  To: Rob Landley; +Cc: linux-kernel

First I want to apologize to anyone I've pissed off too badly with this.
Another note - I have no relation to the LKCD developers, other than a
very satisfied, and sometimes excessively vehement, user. I was about to
respond to this message in detail, but I don't need to put more Magnesium
on the flames.

Pat

On Mon, 4 Nov 2002, Rob Landley wrote:

> On Friday 01 November 2002 16:16, Patrick Finnegan wrote:
>
> > > It's not a fscking public service.  Linus has full control over his
> > > tree.  You have equally full control over your tree.  Linus can't
> > > tell you what patches to apply in your tree.  You can't tell Linus
> > > what patches he should apply to his.
> >
> > I'm sorry it _is_ a public service.  Once tens of people started
> > contributing to it, it became one.  This is like saying that the
> > Washington Monument belongs to the people that maintain it, any building
> > belongs to the repair crews and janitors.
>
> You pay taxes to support the washington monument.  When's the last time you
> paid a tax to Linus?
>
> > I'm not saying that Linus is
> > necessarily a janitor, but when you consider how much of the Linux kernel
> > that he didn't write, you may realize that it's not just his kernel.
>
> He's the editor of a periodical publication.  A cross between an academic
> technical journal which people contribute to for professional reasons, and a
> hobbyist fanzine that people contribute to 'cause it's cool.  This is not a
> new thing, there are real-world precedents for this sort of relationship
> going back hundreds of years, to the invention of the printing press...
>
> Linus's editorial decisions are as final and unappealable as any other
> editorial decision at a magazine or newspaper.  You can publish your article
> elsewhere, and if it doesn't have the same prestige as the Harvard Law Review
> or the New England Journal of Medicine, tough.  They said no.
>
> And like ALL editors, his job isn't to write a significant portion of the
> articles in the publication, but to be a Sturgeon's Law filter throwing out
> 99% of the submissions in the slush pile, correcting the spelling and grammar
> of the remaining few, and trying to stitch them together into a coherent
> whole.
>
> Go track down somebody with a Journalism degree if you want to understand
> Linus's job.
>
> >  It
> > also belongs to every single person that has written even a single
> > line of code in it.
>
> If you get an article published in Time magazine, and you say that this gives
> you the right to print your own copies of Time and distribute them yourself,
> Time's lawyers are going to come after you.
>
> The GPL gives you the ability to do this, but it doesn't obligate the
> publication's editor to listen to you.  If next month's issue contains a huge
> rebuttal to one of your articles, calling you a boogerhead, tough.  The
> editor doesn't owe you anything as a previous contributor, and certainly
> doesn't owe you anything as someone from whom he did NOT take a submission.
>
> What Linus basically said was that if a significant number of distributions
> integrated it, he might take another look at the thing in the future.  But
> wasn't going into 2.5.
>
> Now, thanks to people pestering him beyond the Annoyance Event Horizon, he's
> got his fingers in his ears.  Congratulations.  Hopefully, he'll calm down a
> bit in a few months, but there's no guarantee.  In the mean time, the most
> productive thing to do is drop the topic and work on the Red Hat, SuSE, and
> Debian guys.  (Mandrake feeds from Red Hat, and SuSE is now making kernels
> for Conectiva and TurboLinux.  Gentoo and Slackware might be good to bug as
> well...)
>
> See if you can convince Alan Cox to pick up your patch.  That'll get you Red
> Hat, and the single largest concentration of roll-your-own kernel guys
> outside of Linus's own tree.
>
> Rob


--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04 15:29                         ` Alan Cox
@ 2002-11-04 15:27                           ` Henning P. Schmiedehausen
  2002-11-04 15:38                             ` Patrick Finnegan
  0 siblings, 1 reply; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-11-04 15:27 UTC (permalink / raw)
  To: linux-kernel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

>On Mon, 2002-11-04 at 14:45, Henning P. Schmiedehausen wrote:
>> Good! This means, people debugging the code have actually to think and
>> don't produce "turn on debugger, step here, there, patch a band aid,

>Some of us debug hardware. Regardless of the nice theories about
>reviewing your code they don't actually work on hardware because no
>amount of code review will let you discover things like undocumented 
>2uS deskew delays, or errors in DMA engines

A debugger won't help you here either. A pci bus probe, a 'scope and a
logic analyzer do.

(And experience, elbow grease, experience and a nice amount of ESP :-)
I do hate hardware. Had to debug too much of it (and just on
m68k/MCS-51 where the clock rates are low and the parts easy to
solder...).

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04 14:45                       ` Henning P. Schmiedehausen
@ 2002-11-04 15:29                         ` Alan Cox
  2002-11-04 15:27                           ` Henning P. Schmiedehausen
  2002-11-05  4:57                         ` Werner Almesberger
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-04 15:29 UTC (permalink / raw)
  To: hps; +Cc: Linux Kernel Mailing List

On Mon, 2002-11-04 at 14:45, Henning P. Schmiedehausen wrote:
> Good! This means, people debugging the code have actually to think and
> don't produce "turn on debugger, step here, there, patch a band aid,

Some of us debug hardware. Regardless of the nice theories about
reviewing your code they don't actually work on hardware because no
amount of code review will let you discover things like undocumented 
2uS deskew delays, or errors in DMA engines



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04 15:27                           ` Henning P. Schmiedehausen
@ 2002-11-04 15:38                             ` Patrick Finnegan
  2002-11-04 16:51                               ` Henning P. Schmiedehausen
  0 siblings, 1 reply; 333+ messages in thread
From: Patrick Finnegan @ 2002-11-04 15:38 UTC (permalink / raw)
  To: linux-kernel

On Mon, 4 Nov 2002, Henning P. Schmiedehausen wrote:

> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>
> >On Mon, 2002-11-04 at 14:45, Henning P. Schmiedehausen wrote:
> >> Good! This means, people debugging the code have actually to think and
> >> don't produce "turn on debugger, step here, there, patch a band aid,
>
> >Some of us debug hardware. Regardless of the nice theories about
> >reviewing your code they don't actually work on hardware because no
> >amount of code review will let you discover things like undocumented
> >2uS deskew delays, or errors in DMA engines
>
> A debugger won't help you here either. A pci bus probe, a 'scope and a
> logic analyzer do.
>
> (And experience, elbow grease, experience and a nice amount of ESP :-)
> I do hate hardware. Had to debug too much of it (and just on
> m68k/MCS-51 where the clock rates are low and the parts easy to
> solder...).

I find that hard to believe.  You're saying it's impossible to use a
software debugger to debug the interface between the software and the
hardware?  Eg. errors in the hardware that cause periodic anomalies in the
output read by the software would be one thing they could catch, along
with diagnosing that a problem is caused by flaky hardware rather than the
latest not-well-tested VM code.  In that last case, since bad hardware can
usually cause a panic, I see crash dumps as an invaluable resource ;-).
(No Linus, I'm not pushing them, just stating my opinion.)

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-03  2:25                         ` Horst von Brand
@ 2002-11-04 16:18                           ` Hugh Dickins
  0 siblings, 0 replies; 333+ messages in thread
From: Hugh Dickins @ 2002-11-04 16:18 UTC (permalink / raw)
  To: Horst von Brand; +Cc: linux-kernel

On Sat, 2 Nov 2002, Horst von Brand wrote:
> Hugh Dickins <hugh@veritas.com> said:
> 
> > It's a real worry that writing a crash dump to disk might stomp in the
> > wrong place, but I don't recall it ever happening in practice.  But
> > occasionally, yes, a dump was not generated at all, or not completed.
> 
> How do you test that? Not in some contrived situation, under real crashes.

Sorry for being unclear: by "in practice" I meant "under real crashes" i.e.
I was referring more to what we heard back from users than my own testing.

Hugh


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04 15:38                             ` Patrick Finnegan
@ 2002-11-04 16:51                               ` Henning P. Schmiedehausen
  0 siblings, 0 replies; 333+ messages in thread
From: Henning P. Schmiedehausen @ 2002-11-04 16:51 UTC (permalink / raw)
  To: linux-kernel

Patrick Finnegan <pat@purdueriots.com> writes:

>On Mon, 4 Nov 2002, Henning P. Schmiedehausen wrote:

>> Alan Cox <alan@lxorguk.ukuu.org.uk> writes:
>>
>> >On Mon, 2002-11-04 at 14:45, Henning P. Schmiedehausen wrote:
>> >> Good! This means, people debugging the code have actually to think and
>> >> don't produce "turn on debugger, step here, there, patch a band aid,
>>
>> >Some of us debug hardware. Regardless of the nice theories about
>> >reviewing your code they don't actually work on hardware because no
>> >amount of code review will let you discover things like undocumented
>> >2uS deskew delays, or errors in DMA engines
>>
>> A debugger won't help you here either. A pci bus probe, a 'scope and a
>> logic analyzer do.
>>
>> (And experience, elbow grease, experience and a nice amount of ESP :-)
>> I do hate hardware. Had to debug too much of it (and just on
>> m68k/MCS-51 where the clock rates are low and the parts easy to
>> solder...).

>I find that hard to believe.  You're saying it's impossible to use a
>software debugger to debug the interface between the software and the

No. IMHO it is impossible to use a software debugger to catch 2uS
deskew delays or errors in DMA engines. That's what logic analyzers
are for. If you attach or fire up the debugger, the timing changes and
you're no longer testing the failure case but something different.

>(No Linus, I'm not pushing them, just stating my opinion.)

I am, BTW, completely of your opinion. Personally I find it horrid that
"the XIAFS resurrection" is waved through with "will be probably
accepted for the hack value" and LKCD is rejected with "bloat"
arguments.

But hey, it _is_ Linus' kernel and he may choose as he likes. I
e.g. run vendor kernels (for 2.4).

	Regards
		Henning

-- 
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen       -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH     hps@intermeta.de

Am Schwabachgrund 22  Fon.: 09131 / 50654-0   info@intermeta.de
D-91054 Buckenhof     Fax.: 09131 / 50654-20   

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31  6:21       ` Chris Wedgwood
@ 2002-11-05  3:38         ` Andreas Gruenbacher
  0 siblings, 0 replies; 333+ messages in thread
From: Andreas Gruenbacher @ 2002-11-05  3:38 UTC (permalink / raw)
  To: Chris Wedgwood; +Cc: linux-kernel

On Thursday 31 October 2002 07:21, Chris Wedgwood wrote:
> Don't get me wrong, I'm not against sane ACLs (POSIX ACLs are not) or
> EAs [...]

POSIX ACLs are more complicated than what would be inherently necessary, if we 
were in a situation where we could design from scratch. Unfortunately we are 
not in that situation. I've heard dozens of people complain about POSIX ACLs 
(and other kinds as well); nobody was able to come up with something truly 
better so far.

--Andreas.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-01  0:54               ` john stultz
  2002-11-01  1:31                 ` Werner Almesberger
@ 2002-11-05  3:58                 ` Andreas Gruenbacher
  1 sibling, 0 replies; 333+ messages in thread
From: Andreas Gruenbacher @ 2002-11-05  3:58 UTC (permalink / raw)
  To: john stultz, Werner Almesberger; +Cc: lkml

On Friday 01 November 2002 01:54, john stultz wrote:
> I probably should just go read the specs. Anyone have a pointer, or care
> to explain what the differences are between AFS's ACLs and POSIX ACLs?

POSIX 1003.1e draft 17 (withdrawn) is available at 
<http://wt.xpilot.org/publications/posix.1e/>.

--Andreas.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-04 14:45                       ` Henning P. Schmiedehausen
  2002-11-04 15:29                         ` Alan Cox
@ 2002-11-05  4:57                         ` Werner Almesberger
  1 sibling, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-05  4:57 UTC (permalink / raw)
  To: Henning P. Schmiedehausen; +Cc: linux-kernel

Henning P. Schmiedehausen wrote:
> Good! This means, people debugging the code have actually to think and
> don't produce "turn on debugger, step here, there, patch a band aid,
> done" solutions you see with various other "commercial products"

Unfortunately, just making it hard doesn't guarantee that they
won't try anyway. If you're lucky, at least their band aid will
be so disgusting that you won't be fooled into thinking they
might be right.

But ultimately, it's an attitude problem. Even people who learn
about their bugs by source code reading may then produce a
shabby fix.

Hmm, I wonder if Linus has ever done any protocol design,
followed by validation. I always find the havoc a protocol
validator (e.g. Spin) wreaks a very instructive demonstration
of how much source code level "correctness" really buys you :-)
(Or what chances you'd stand of realizing what happened just
from an Oops.)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 22:37             ` Werner Almesberger
@ 2002-11-05 11:42               ` Suparna Bhattacharya
  2002-11-05 18:00                 ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-05 11:42 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	linux-kernel, lkcd-general, lkcd-devel

On Thu, Oct 31, 2002 at 07:37:05PM -0300, Werner Almesberger wrote:
> Jeff Garzik wrote:
> > That said, I used to be an LKCD cheerleader until a couple people made 
> > some good points to me:  it is not nearly low-level enough to truly be 
> > of use in crash situations.
> 
> I'm not so convinced about this. I like the Mission Critical
> approach: save the dump to memory, then either boot through the
> firmware or through bootimg (nowadays, that would be kexec),
> then retrieve the dump from memory, and do whatever you like
> with it.
> 
> The huge advantage here is that you don't need a ton of
> specialized dump drivers and/or have much of the original kernel
> infrastructure to be in a usable state. The rebooted system will
> typically be stable enough to offer the full range of utilities,
> including up to date drivers for all possible devices, so you
> can safely write to disk, scp all the mess to your support
> critter, or post an automatic flame to linux-kernel :-)
> 
> The weak points of the Mission Critical design are that early
> memory allocation in the kernel needs to be tightly controlled,
> that architectures that wipe CPU caches on reboot need to
> commit them to memory before the firmware restart, and that
> drivers need to be able to recover from an "unclean" hardware
> state. (I think we'll see much of the latter happen as kexec
> advances. The other two issues aren't really special.)
> 
> Actually, at the RAS BOF I thought that IBM were developing LKCD
> in this direction, and had also eliminated a few not so elegant
> choices of Mission Critical's original design. I haven't looked

Yes, we are putting that in as one of the alternative dump targets
available. I have done quite a bit of work on that implementing the
ideas we talked about at OLS, and that's what I've been referring
to as the memory dump target.  It's not quite ready yet and we
need something like kexec to be available which we can use on Intel 
systems to achieve the softboot (the acceptance status of that still
doesn't seem to be clear), so I was looking at this as a
follow-on thing once the core infrastructure is there. More so 
because we probably need to give it some time to stabilize and try 
it on different environments and look at the issues you mention.
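
A minimal sketch of the memory dump target idea, assuming a small
header (magic, length, checksum) sits at the start of a memory region
that survives the soft boot. Every name here (dump_header, DUMP_MAGIC,
dump_save, dump_retrieve) is hypothetical; this is not LKCD or Mission
Critical code, and the additive checksum is only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DUMP_MAGIC 0x44554d50u	/* "DUMP"; hypothetical marker value */

/* Hypothetical header placed at the start of the preserved region. */
struct dump_header {
	uint32_t magic;		/* marks a dump left by the crashed kernel */
	uint32_t length;	/* payload bytes following the header */
	uint32_t checksum;	/* additive checksum over the payload */
};

static uint32_t dump_checksum(const unsigned char *p, size_t n)
{
	uint32_t sum = 0;

	while (n--)
		sum += *p++;
	return sum;
}

/* Crashing side: copy the payload into the region. Returns 0 or -1. */
int dump_save(unsigned char *region, size_t region_size,
	      const void *payload, size_t len)
{
	struct dump_header h;

	if (len > UINT32_MAX || region_size < sizeof(h) ||
	    len > region_size - sizeof(h))
		return -1;
	h.magic = DUMP_MAGIC;
	h.length = (uint32_t)len;
	h.checksum = dump_checksum(payload, len);
	memcpy(region + sizeof(h), payload, len);
	memcpy(region, &h, sizeof(h));	/* header last: marks dump complete */
	return 0;
}

/* Rebooted side: return the payload, or NULL if no valid dump is found. */
const unsigned char *dump_retrieve(const unsigned char *region,
				   size_t region_size, size_t *len)
{
	struct dump_header h;

	if (region_size < sizeof(h))
		return NULL;
	memcpy(&h, region, sizeof(h));
	if (h.magic != DUMP_MAGIC || h.length > region_size - sizeof(h))
		return NULL;
	if (dump_checksum(region + sizeof(h), h.length) != h.checksum)
		return NULL;
	*len = h.length;
	return region + sizeof(h);
}
```

The rebooted kernel would call dump_retrieve() on the preserved region
and, if it returns non-NULL, hand the payload to ordinary user-space
tools. Writing the header last is one way to avoid treating a
half-written dump as valid.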

Why do we even consider the other options when we are doing 
this already ? Well, as we discussed earlier there's non-disruptive dumps 
for one, where this wouldn't work. The other is that before overwriting 
memory we need to be able to stop all activity in the system for certain
(system may appear hung/locked up) and I'm not fully certain about
how to do this for all environments. Maybe an answer lies in 
rethinking some parts of the algorithm a bit.

Also having the interface allows people to develop more specific/
reliable solutions for their environment. So we do not necessitate 
code duplication, but if something exists, then the infrastructure 
can use it. 

The general feeling here is that a one solution fits all thing 
may not work best right now ... and hence the focus on an interface 
based approach that gives us the needed flexibility. 

> at the LKCD code, but the descriptions sound as if all the
> special-case cruft seems to be back again, which I would find a
> little disappointing.

Hope that helps a bit.

Regards
Suparna

> 
> There might be a case for specialized low-overhead dump handlers
> for small embedded systems and such, but they're probably better
> maintained outside of the mainstream kernel. (They're more like
> firmware anyway.)
> 
> - Werner
> 
> -- 
>   _________________________________________________________________________
>  / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
> 
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-03 14:26                     ` yodaiken
@ 2002-11-05 17:09                       ` Bill Davidsen
  2002-11-05 17:36                         ` yodaiken
  0 siblings, 1 reply; 333+ messages in thread
From: Bill Davidsen @ 2002-11-05 17:09 UTC (permalink / raw)
  To: yodaiken
  Cc: Alan Cox, Linus Torvalds, Chris Friesen, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Sun, 3 Nov 2002 yodaiken@fsmlabs.com wrote:

> On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> > Quite clearly SCO, Sun, and IBM have been doing this for years without
> > offering dozens of options. I don't need it to sing and dance, I just need
> > a way to put the dump where I can find it. I'm not going to put another
> > box in at the end of a serial or parallel port, I don't have NVram, I do
> > have lots of disk, and so does almost everyone else. I have remote
> > systems in wiring closets all over the country (all four time zones). They
> > are at the end of open net connections, unreliable and untrusted. I don't
> > want to bet that I have a working VPN, or that I can safely send all that
> > data without it being read by someone other than me.
> > 
> > The AIX support has a group just to beat on dumps customers send. What
> > more evidence is needed that people can and do use the capability.

> You paid someone for this for AIX. So the solution is obvious for Linux.

No, it's included in AIX, SCO and Solaris. And analysis is included in
support contracts. With all the stuff added to Linux to keep up with both
M$ and commercial UNIX, I can't imagine why anyone would be against this.
At least anyone who wanted Linux to compete in the commercial server
market.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* kexec (was: Re: What's left over.)
  2002-10-31  2:07 What's left over Rusty Russell
  2002-10-31  2:31 ` Linus Torvalds
@ 2002-11-05 17:29 ` Werner Almesberger
  2002-11-05 18:10   ` Benjamin LaHaise
  2002-11-05 19:06   ` Martin J. Bligh
  1 sibling, 2 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-05 17:29 UTC (permalink / raw)
  To: Rusty Russell; +Cc: torvalds, linux-kernel, ebiederm

By the way, let's not forget Eric Biederman's kexec. While not
perfect, it's definitely usable, and looks good enough for
inclusion as an experimental feature.

As to why we need it, I've explained this in my OLS 2000 paper,
sections 2.6 and 5:

http://www.almesberger.net/cv/papers/ols2k-9.ps

My approach was called "bootimg". kexec is similar, but does a few
things related to page sorting/moving better, and it's much smarter
about quiescing the system before trying to reboot.

I view kexec as an "enabler", much like initrd, which had to be
part of the kernel for a while before people started to figure out
how to use it. (At this year's OLS, somebody told me they just
"discovered" initrd and are now using it. Oh well, it's only been
around for six years ;-)

It should be "experimental", because some compatibility issues
still have to be addressed, but most of this can be done in user
space, and shouldn't require significant changes in the kernel
part of kexec, or in its interface to user space.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-11-05 17:09                       ` Bill Davidsen
@ 2002-11-05 17:36                         ` yodaiken
  0 siblings, 0 replies; 333+ messages in thread
From: yodaiken @ 2002-11-05 17:36 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: yodaiken, Alan Cox, Linus Torvalds, Chris Friesen,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Tue, Nov 05, 2002 at 12:09:17PM -0500, Bill Davidsen wrote:
> On Sun, 3 Nov 2002 yodaiken@fsmlabs.com wrote:
> 
> > On Sun, Nov 03, 2002 at 08:48:30AM -0500, Bill Davidsen wrote:
> > > Quite clearly SCO, Sun, and IBM have been doing this for years without
> > > offering dozens of options. I don't need it to sing and dance, I just need
> > > a way to put the dump where I can find it. I'm not going to put another
> > > box in at the end of a serial or parallel port, I don't have NVram, I do
> > > have lots of disk, and so does almost everyone else. I have remote
> > > systems in wiring closets all over the country (all four time zones). They
> > > are at the end of open net connections, unreliable and untrusted. I don't
> > > want to bet that I have a working VPN, or that I can safely send all that
> > > data without it being read by someone other than me.
> > > 
> > > The AIX support has a group just to beat on dumps customers send. What
> > > more evidence is needed that people can and do use the capability.
> 
> > You paid someone for this for AIX. So the solution is obvious for Linux.
> 
> No, it's included in AIX, SCO and Solaris. And analysis is included in

None of those are free.

> support contracts. With all the stuff added to Linux to keep up with both
> M$ and commercial UNIX, I can't imagine why anyone would be against this.
> At least anyone who wanted Linux to compete in the commercial server
> market.

So buy your Linux from a vendor who supports it. 




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 11:42               ` [lkcd-devel] " Suparna Bhattacharya
@ 2002-11-05 18:00                 ` Werner Almesberger
  2002-11-05 18:36                   ` Alan Cox
  2002-11-09 21:21                   ` Pavel Machek
  0 siblings, 2 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-05 18:00 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	linux-kernel, lkcd-general, lkcd-devel

Suparna Bhattacharya wrote:
> Yes, we are putting [MCORE] in as one of the alternative dump targets
> available.

Great !

> It's not quite ready yet and we need something like kexec to be
> available which we can use on Intel systems to achieve the softboot
> (the acceptance status of that still doesn't seem to be clear),

Yes, I've just checked with Eric, and he hasn't received any
indication from Linus so far. I posted a reminder to linux-kernel.
I'd really hate to see kexec miss 2.6.

> Why do we even consider the other options when we are doing 
> this already ? Well, as we discussed earlier there's non-disruptive
> dumps for one, where this wouldn't work.

But they're very different anyway, aren't they ? I mean, you could
even implement them (well, almost) from user space, with today's
kernels.

> The other is that before overwriting 
> memory we need to be able to stop all activity in the system for certain
> (system may appear hung/locked up) and I'm not fully certain about
> how to do this for all environments. Maybe an answer lies in 
> rethinking some parts of the algorithm a bit.

This is certainly the hairiest part, yes. I think we have about
four types of devices/elements to worry about:

 - those that just sit there, and never talk unless spoken to
 - those that may generate interrupts
 - those that DMA if you ask them nicely
 - those that DMA when they feel like it (e.g. copy an incoming
   network packet to the next buffer in the free list)

The latter are the real problem. I see the following possibilities
for dealing with them:

 - faith-based computing: pray that nothing bad will befall your
   system :-)
 - de-activate them individually. There should be a lot of work
   that can be shared with power management. And that's one of
   the reasons why I think the memory target should be available
   early, or convergence will take forever.
 - try to reset them, without necessarily knowing what they are
   or what they do. I don't know if there is a useful way of
   resetting the PCI bus by software, but if there is one, this
   looks like the most promising strategy to me, even if it may
   be somewhat lacking in elegance.
 - if all else fails, maybe introduce an "unsafe" memory type
   for potential DMA targets of unpredictable devices, that will
   not be re-used. I hope we won't need this, though. (But in case
   such a memory type gets introduced by the memory-scrubbers, at
   least you could blame _them_ :-)
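
The two lists above can be put side by side in a small sketch. This is
only an illustration: the enum and function names are made up, and the
actions chosen for the first three device types are my assumption (the
text only says the last type is the real problem). For devices that
DMA unprompted it prefers a driver-level deactivation and falls back
to a blind reset.

```c
#include <stddef.h>

/* The four device behaviours listed above. */
enum dev_behaviour {
	DEV_PASSIVE,		/* never talks unless spoken to */
	DEV_INTERRUPTS,		/* may generate interrupts */
	DEV_DMA_ON_REQUEST,	/* DMAs only when asked */
	DEV_DMA_UNPROMPTED	/* DMAs on its own, e.g. incoming packets */
};

/* Hypothetical quiescing actions, roughly ordered by bluntness. */
enum quiesce_action {
	ACT_NONE,		/* nothing to do */
	ACT_MASK_IRQ,		/* mask or disable its interrupt */
	ACT_STOP_REQUESTS,	/* stop issuing DMA requests */
	ACT_DEACTIVATE,		/* driver-specific deactivation */
	ACT_RESET		/* blind reset, no driver knowledge needed */
};

struct device_info {
	const char *name;
	enum dev_behaviour behaviour;
	int can_deactivate;	/* nonzero if the driver can silence it */
};

/* Pick the least blunt action that should silence the device. */
enum quiesce_action pick_action(const struct device_info *d)
{
	switch (d->behaviour) {
	case DEV_PASSIVE:
		return ACT_NONE;
	case DEV_INTERRUPTS:
		return ACT_MASK_IRQ;
	case DEV_DMA_ON_REQUEST:
		return ACT_STOP_REQUESTS;
	case DEV_DMA_UNPROMPTED:
		return d->can_deactivate ? ACT_DEACTIVATE : ACT_RESET;
	}
	return ACT_RESET;	/* unknown behaviour: be conservative */
}

/* Count the devices that are left with the blind-reset fallback. */
size_t count_needing_reset(const struct device_info *devs, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (pick_action(&devs[i]) == ACT_RESET)
			count++;
	return count;
}
```

A nonzero count_needing_reset() would be the set of devices for which
either a bus-level reset or the "unsafe" memory type remains the only
option.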

> The general feeling here is that a one solution fits all thing 
> may not work best right now ... and hence the focus on an interface 
> based approach that gives us the needed flexibility. 

Yes, this is certainly important. I just think that the "memory
target" concept is closer to a general solution than disk dumps.
But there are always those 5% with special needs, and it's good
if they can use the same framework.

Thanks,
- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-general] Re: What's left over.
  2002-11-03 16:32                                   ` Alan Cox
  2002-11-03 17:08                                     ` [lkcd-devel] " Matt D. Robinson
@ 2002-11-05 18:07                                     ` Bill Davidsen
  1 sibling, 0 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-05 18:07 UTC (permalink / raw)
  To: Alan Cox
  Cc: Matt D. Robinson, Steven King, Linus Torvalds, Joel Becker,
	Chris Friesen, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On 3 Nov 2002, Alan Cox wrote:

> On Sun, 2002-11-03 at 14:33, Bill Davidsen wrote:
> > If you define "unmaintainably bad" as "having features you don't need"
> > then I agree. But since dump to disk is in almost every other commercial
> > UNIX, maybe someone would question why it's good for others but not  for
> > Linux.
> 
> It isn't about features, it's about clean maintainable code. netdump to me
> doesn't mean no dump to disk option. In fact I'd rather like to be able
> to insmod dump-foo.o. The correctness issues are hard but if the
> dump-foo is standalone, resets the hardware and has an SHA integrity
> check then it can be done (think of it as a post crash variant of the
> trusted computing TCB verification problem)

I certainly don't disagree, but the one critical problem is writing the
dump to the right place, or at least not writing to the wrong place. I'd
love to have disk, net, NVram, whatever choices, but disk is the one which
would help the most. AIX and ISC have dump to swap, and the swapon copies
the data back or clears it, with a fresh O/S load to ensure writing the
right place.
 
> > uses the crash dump in AIX, the person who wants to send a compressed dump
> > and money to IBM and get back a fix. Netdump assumes external resources
> 
> Lots of interesting legal issues but yes you can do it sometimes (DMCA,
> privacy, financial duties sometimes make it horribly complex). Even in
> the case where you only dump the oops it's still valuable.

Agreed, I would think about doing that with a mail server. But even an
oops like ksymoops would be helpful. I started on systems with dumps,
ksymoops is wonderful by comparison.
 
> > and a functional secure network (is the dump encrypted and I missed it?)
> > which home users surely don't have, and remote servers often lack as well.
> 
> Encrypting the dump with the new crypto lib in the kernel would be easy,
> right now it doesn't. 
> 
> My disk dump concerns are purely those of correctness. That means
> 
> 1.	After loading the module getting the block list for the dump target

That could all be built as part of init, clearly you can't depend on
demand loading the module.
 
> 2.	Resetting and scratch initializing the dump device

If the modules are to be really self-sufficient it would have to include
the driver. I'll let someone tell me that's not always the case if the
driver can have its own data area.
 
> 3.	Not relying on any code outside of the dump TCB that may have
> been corrupted

Yes, although with separate code, stack and data that's less likely. In
the bad old days self-modifying code was common.
 
> 4.	At dump time turning off all bus masters, doing the dump TCB
> verification and then dumping

The first part of that looks medium hard, particularly if the code has to
be part of the dump module.
 
> Most of the pieces already exist.

Clearly it can be done even better than the current implementation, and
given an interface standard a replacement in the whole could be done.
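
The integrity check on the dump code (points 3 and 4 above) could be
sketched as follows. This is a hedged illustration, not code from any
dump patch: the names dump_tcb, tcb_seal and tcb_verify are
hypothetical, and FNV-1a is used as a short stand-in for the SHA
digest Alan mentions, so it detects accidental corruption but is not
cryptographically strong.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32 bit: a compact stand-in for a real SHA digest. */
static uint32_t fnv1a(const unsigned char *p, size_t n)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	while (n--) {
		h ^= *p++;
		h *= 16777619u;		/* FNV prime */
	}
	return h;
}

/* Hypothetical description of the dump "trusted computing base". */
struct dump_tcb {
	const unsigned char *image;	/* dump code plus block list */
	size_t size;
	uint32_t digest;		/* recorded when the module loads */
};

/* At module load time: remember what the dump code looked like. */
void tcb_seal(struct dump_tcb *t, const unsigned char *image, size_t size)
{
	t->image = image;
	t->size = size;
	t->digest = fnv1a(image, size);
}

/* At dump time: returns 1 if the image is unchanged, 0 otherwise. */
int tcb_verify(const struct dump_tcb *t)
{
	return fnv1a(t->image, t->size) == t->digest;
}
```

The dump path would call tcb_verify() after the bus masters are off
and refuse to write anything if it returns 0, on the theory that a
corrupted block list is worse than no dump at all.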

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: kexec (was: Re: What's left over.)
  2002-11-05 17:29 ` kexec (was: Re: What's left over.) Werner Almesberger
@ 2002-11-05 18:10   ` Benjamin LaHaise
  2002-11-05 19:06   ` Martin J. Bligh
  1 sibling, 0 replies; 333+ messages in thread
From: Benjamin LaHaise @ 2002-11-05 18:10 UTC (permalink / raw)
  To: Werner Almesberger; +Cc: Rusty Russell, torvalds, linux-kernel, ebiederm

On Tue, Nov 05, 2002 at 02:29:43PM -0300, Werner Almesberger wrote:
> I view kexec as an "enabler", much like initrd, which had to be
> part of the kernel for a while before people started to figure out
> how to use it. (At this year's OLS, somebody told me they just
> "discovered" initrd and are now using it. Oh well, it's only been
> around for six years ;-)

kexec is also a great enabler for a non-intrusive kernel dump 
facility done correctly by booting into a new kernel image (which 
avoids the whole difficulty on x86 with BIOSes wiping out RAM at 
reboot).

		-ben

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 18:00                 ` Werner Almesberger
@ 2002-11-05 18:36                   ` Alan Cox
  2002-11-05 19:19                     ` Werner Almesberger
                                       ` (2 more replies)
  2002-11-09 21:21                   ` Pavel Machek
  1 sibling, 3 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-05 18:36 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.

Let me ask the same dumb question - what does kexec need that a dumper
doesn't. In other words given reboot/trap hooks can kexec happily live
as a standalone module ?


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: kexec (was: Re: What's left over.)
  2002-11-05 17:29 ` kexec (was: Re: What's left over.) Werner Almesberger
  2002-11-05 18:10   ` Benjamin LaHaise
@ 2002-11-05 19:06   ` Martin J. Bligh
  1 sibling, 0 replies; 333+ messages in thread
From: Martin J. Bligh @ 2002-11-05 19:06 UTC (permalink / raw)
  To: Werner Almesberger, Rusty Russell; +Cc: torvalds, linux-kernel, ebiederm

> By the way, let's not forget Eric Biederman's kexec. While not
> perfect, it's definitely usable, and looks good enough for
> inclusion as an experimental feature.

Another me too for this feature. I really want to be able to use this
on the large NUMA boxes - it takes me 5 minutes to do a full reboot
cycle, and I can't even do an init 6 due to some firmware complications,
I have to do init 0, power off, power on, boot, etc. Whilst I have
a remote power interface, it's still a pain in the butt.
kexec would be Nirvana ;-) 

M.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 18:36                   ` Alan Cox
@ 2002-11-05 19:19                     ` Werner Almesberger
  2002-11-05 20:10                       ` Alan Cox
  2002-11-06  0:21                       ` Andy Pfiffer
  2002-11-06  2:48                     ` Eric W. Biederman
  2002-11-06  4:29                     ` Eric W. Biederman
  2 siblings, 2 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-05 19:19 UTC (permalink / raw)
  To: Alan Cox
  Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

Alan Cox wrote:
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't.

kexec needs:
 - a system call to set it up
 - a way to silence devices (difference to dumper: kexec normally
   operates under an intact system, so it's more similar to, say,
   swsusp. But I expect that cleaning up device power management
   would also clear the path for more reliable dumpers.)
 - a bit of glue, e.g. to switch to "real mode", etc. AFAIK, none
   of this touches other code, but there are of course some
   assumptions here on what other code provides or does.
 - device drivers that can bring silent devices back to life
   (normally, device drivers do this already, but kexec may
   uncover dormant bugs in this area)

Since recent kernels already preserve memory areas with BIOS data,
kexec is actually quite a bit less intrusive than bootimg was.

> In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?

"Module", as in "software package": yes, the main problem spot
would be the system call allocation, which is also inconvenient
for other developers. By the way, kexec does not tap into the
kernel's reboot process, so no such hooks are needed. If LKCD
wants to use kexec, it can do whatever it does to be invoked at
the time of a crash, and then call kexec directly.

"Module", as in "loadable kernel module": I think so, although
it's currently "bool", not "tristate". Also, you'd have the
system call issue again.

So not merging it mainly makes it inconvenient to use, adds the system
call allocation as a continuous annoyance, and makes it a little
harder to work on the related infrastructure. But then, despite
being somewhat obscure, bootimg and kexec have been in use for
years, the design is about as lean as it can get, and it's cool.
What more could you ask for ? :-)

What kexec needs now is more exposure, so that the BIOS
compatibility issues get noticed and fixed, it is ported to other
architectures, and that more people can start figuring out how to
use it, and how to build a boot environment. Then, maybe in a
year or two, we can send "Methuselah" LILO and "nice little OS"
GRUB off to their well-deserved retirement.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 19:19                     ` Werner Almesberger
@ 2002-11-05 20:10                       ` Alan Cox
  2002-11-05 23:25                         ` Werner Almesberger
  2002-11-06  0:21                       ` Andy Pfiffer
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-05 20:10 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Tue, 2002-11-05 at 19:19, Werner Almesberger wrote:
> kexec needs:
>  - a system call to set it up

Device, file, insmod...

>  - a way to silence devices (difference to dumper: kexec normally
>    operates under an intact system, so it's more similar to, say,
>    swsusp. But I expect that cleaning up device power management
>    would also clear the path for more reliable dumpers.)

So you need to register with the power management as the last thing to
be suspended and do a suspend before kexec.

> So not merging it mainly makes it inconvenient to use, adds the system
> call allocation as a continuous annoyance, and makes it a little
> harder to work on the related infrastructure. But then, despite
> being somewhat obscure, bootimg and kexec have been in use for
> years, the design is about as lean as it can get, and it's cool.
> What more could you ask for ? :-)

I'm mostly worried about how to make these things fit the least
intrusively into the kernel.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 20:10                       ` Alan Cox
@ 2002-11-05 23:25                         ` Werner Almesberger
  0 siblings, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-05 23:25 UTC (permalink / raw)
  To: Alan Cox
  Cc: ebiederm, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

Alan Cox wrote:
>>  - a system call to set it up
> 
> Device, file, insmod...

I don't know what Eric thinks about using something else than a
system call, but I think he made a quite reasonable choice.

The data structure isn't entirely trivial, so a misc device plus
ioctl would be a bit on the ugly side. I vaguely remember having
proposed something like this a while ago (may have been for
pivot_root), and everybody went "noooo!!" ;-)

insmod would be possible, although with a rather unusual parameter
passing scheme. Also, when using kexec from inside the kernel (e.g.
MCORE), the insmod solution would have to split kexec into the
interface and the kexec core.

But yes, there's always a means to avoid adding a new system
call. /dev/syscall with an ioctl

struct syscall_ioctl {
	const char *symbol_name;
	va_list ap;
};

anyone ? :-) (Implementing it might be a bit of a challenge :)

> So you need to register with the power management as the last thing to
> be suspended and do a suspend before kexec.

Well, kexec just calls device_shutdown. The problem isn't the
interface, it's that device_shutdown apparently doesn't work too
well (devices not supporting it, some semantics mixup, etc.).
But this is general infrastructure work, that should be done
with or without kexec.
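
To make the "devices not supporting it" problem concrete, here is a
simplified user-space model of a shutdown pass. It is not the kernel's
device_shutdown; the struct, the hook, and the choice to walk the
table in reverse registration order are all assumptions of this
sketch. Devices without a hook are skipped and counted, and those are
the ones that may still be active when the new kernel starts.

```c
#include <stddef.h>

/* Hypothetical model of a device with an optional shutdown hook. */
struct model_device {
	const char *name;
	void (*shutdown)(struct model_device *dev);	/* may be NULL */
	int quiet;	/* set by the hook once the device is silenced */
};

/* Example hook: a real driver would stop DMA and mask interrupts. */
void model_mark_quiet(struct model_device *dev)
{
	dev->quiet = 1;
}

/*
 * Walk the table from the most recently registered device to the
 * oldest, calling each hook. Returns how many devices had no hook.
 */
size_t model_shutdown_all(struct model_device *devs, size_t n)
{
	size_t missing = 0;

	for (size_t i = n; i-- > 0; ) {
		if (devs[i].shutdown)
			devs[i].shutdown(&devs[i]);
		else
			missing++;
	}
	return missing;
}
```

A nonzero return value is exactly the case described above: the
interface was called, but some devices were never told to be quiet.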

> I'm mostly worried about how to make these things fit the least
> intrusively into the kernel.

Just look at Eric's kexec patch. It isn't particularly intrusive:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103604471723358&w=2

(For 2.5.45. The patch fails for 2.5.46, because new system calls
were added ...)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 19:19                     ` Werner Almesberger
  2002-11-05 20:10                       ` Alan Cox
@ 2002-11-06  0:21                       ` Andy Pfiffer
  2002-11-06  1:10                         ` Werner Almesberger
  2002-11-10 18:35                         ` Pavel Machek
  1 sibling, 2 replies; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-06  0:21 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Tue, 2002-11-05 at 11:19, Werner Almesberger wrote:
> Alan Cox wrote:
> > Let me ask the same dumb question - what does kexec need that a dumper
> > doesn't.
> 
> kexec needs:
>  - a system call to set it up
>  - a way to silence devices <snip>
<snip>
>  - a bit of glue <snip>
>  - device drivers that can bring silent devices back to life
<snip>

> > In other words given reboot/trap hooks can kexec happily live
> > as a standalone module ?

You could probably skip the system call to set it up.  Example: I could
imagine a bizarre set of pseudo-devices:

	# insmod kexec
	# cat bzImage > /proc/kexec/next-image
	# echo "root=805" > /proc/kexec/next-cmndline
	# echo 1 > /proc/kexec/reboot

and hide away that dirty little sequence with a nice kexec(3) library
routine.
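A minimal user-space sketch of such a wrapper, assuming the three pseudo-files behave as in the shell sequence above. The function names and the idea of passing the directory as a parameter are hypothetical (the directory would be /proc/kexec in the example); nothing here is taken from an existing library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write len bytes to <dir>/<name>; returns 0 on success. */
static int write_file(const char *dir, const char *name,
                      const void *data, size_t len)
{
    char path[512];
    FILE *f;
    int ok;

    if (snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int)sizeof(path))
        return -1;                      /* path would have been truncated */
    f = fopen(path, "wb");
    if (!f)
        return -1;
    ok = fwrite(data, 1, len, f) == len;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Hypothetical kexec(3)-style wrapper around the three pseudo-files.
 * "dir" stands in for /proc/kexec so the sketch can be exercised
 * against an ordinary directory.
 */
int kexec_via_files(const char *dir, const void *image, size_t image_len,
                    const char *cmdline)
{
    if (write_file(dir, "next-image", image, image_len) < 0)
        return -1;
    if (write_file(dir, "next-cmndline", cmdline, strlen(cmdline)) < 0)
        return -1;
    /* Writing "1" is the trigger; on a real system this would not return. */
    return write_file(dir, "reboot", "1\n", 2);
}
```

A caller would read the kernel image into memory and then call kexec_via_files("/proc/kexec", image, len, "root=805").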

The Two Kernel Monte trick (which, when insmod'ed, rewrote the kernel's
function pointers for sys_reboot) was also effective, but that
apparently isn't an option any longer.


> What kexec needs now is more exposure, so that the BIOS
> compatibility issues get noticed and fixed, it is ported to other
> architectures, and that more people can start figuring out how to
> use it, and how to build a boot environment.

I'll 2nd that sentiment, and add another big one: fixing (apparent)
problems with drivers and chipset-munging code, so that devices can be
reliably re-probed/re-inited/etc. after the reboot.

Long term, I think it would be advantageous to be able to avoid SCSI and
other time consuming device probes for the common and simple reboot case
of 1) the currently running kernel is being rebooted, and 2) no changes
to the device configuration have occurred.  Shouldn't we be able to "save
away" what is in sysfs, and then re-inject that state after a fast
reboot?

Andy




* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  0:21                       ` Andy Pfiffer
@ 2002-11-06  1:10                         ` Werner Almesberger
  2002-11-06  1:37                           ` Alexander Viro
  2002-11-10 18:35                         ` Pavel Machek
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-06  1:10 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

Andy Pfiffer wrote:
> You could probably skip the system call to set it up.

Yes, yes, there are many ways to do this. This isn't the issue. The
questions regarding this are:

 - is kexec allowed to use a system call ?
 - if yes, is a system call the technically right solution ?
 - if yes, is it a practical solution ?

So far, it hasn't been considered inherently wrong to use system
calls, even for highly Linux-specific functions, and even if they
aren't performance-critical (just think of pivot_root). (*)

If this perception has changed, such a change of policy would also
affect kexec, but then we don't need to discuss kexec but the
policy change. (I don't know - is such a change in the air ?)

(*) By the way, I remember now where I brought up some hack to
    avoid using a system call - it was for bootimg :-)

Now, if we assume that it's okay for kexec to use a system call,
the next question is whether kexec should indeed use it, i.e.
whether a system call makes sense for what it is trying to do.
Since there are no device files or network elements naturally
involved here (i.e. other major kernel function interfaces),
the answer seems to be "yes".

Last but not least, we need to decide whether using a system
call would be painful for Eric or for kexec users. This would be
the case if kexec isn't merged, and the kexec patch would need
frequent updates because system calls have changed.

I understand Alan's question as the "what if ... ?" type. If
kexec is indeed rejected for merging, it may make sense to change
the interface to something which may be technically less elegant,
but which makes patch maintenance easier to handle.

> I'll 2nd that sentiment, and add another big one: fixing (apparent)
> problems with drivers and chipset-munging code, so that devices can be
> reliably re-probed/re-inited/etc. after the reboot.

Yes, kexec is likely to turn up a few problems in this area, too.
Right now, we only hear about such issues if some BIOS lets
something slip through. With kexec, such problems should show up
sooner.

> Long term, I think it would be advantageous to be able to avoid SCSI and
> other time consuming device probes

Definitely. May I refer you to my booting paper, which discusses
all this in section 5 ? :-)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  1:10                         ` Werner Almesberger
@ 2002-11-06  1:37                           ` Alexander Viro
  2002-11-06  2:05                             ` Werner Almesberger
  2002-11-06  4:07                             ` Eric W. Biederman
  0 siblings, 2 replies; 333+ messages in thread
From: Alexander Viro @ 2002-11-06  1:37 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Andy Pfiffer, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel



On Tue, 5 Nov 2002, Werner Almesberger wrote:

> Now, if we assume that it's okay for kexec to use a system call,
> the next question is whether kexec should indeed use it, i.e.
> whether a system call makes sense for what it is trying to do.
> Since there are no device files or network elements naturally
> involved here (i.e. other major kernel function interfaces),
> the answer seems to be "yes".

That's not obvious.  By the same logic, we would need syscalls for
turning off overcommit, etc., etc.

FWIW, I suspect that
	open("/proc/image", O_EXCL|O_WRONLY);
	bunch of lseek()/write()
	close()
would be more natural - definitely easier to understand than arguments of
your sys_kexec().  It's easy to switch from your code to that - you
put initialization into ->open(), pulling segments from userland into
->write(), use default ->lseek() and do actual work on ->close() if
no errors had happened.  file->private_data will point to intermediate
state you need.

After all, that's what happens - you form an image, writing to it user-supplied
data from given buffers at given offsets and when you are done with that you
commit the changes.  IMO special syscall is less natural match for that
than sequence above - commit-on-close is not something unusual, so it matches
the semantics of all syscalls involved...
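Seen from user space, the sequence above could look roughly like the sketch below. The chunk descriptor and function name are hypothetical, and the path is a parameter (it would be /proc/image in the proposal) so the code can be tried against a regular file; extra_flags carries O_EXCL or whatever else the caller wants.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor: len bytes at data belong at offset dest. */
struct image_chunk {
    off_t       dest;
    const void *data;
    size_t      len;
};

/*
 * Load chunks with open/lseek/write/close.  In the proposal a
 * successful close() would be the commit point.
 * Returns 0 on success, -1 on the first failure.
 */
int load_image_by_write(const char *path, int extra_flags,
                        const struct image_chunk *chunks, size_t n)
{
    int fd = open(path, O_WRONLY | extra_flags, 0600);
    size_t i;

    if (fd < 0)
        return -1;
    for (i = 0; i < n; i++) {
        const char *p = chunks[i].data;
        size_t left = chunks[i].len;

        if (lseek(fd, chunks[i].dest, SEEK_SET) == (off_t)-1)
            goto fail;
        while (left > 0) {              /* write() may be partial */
            ssize_t w = write(fd, p, left);

            if (w <= 0)
                goto fail;
            p += w;
            left -= (size_t)w;
        }
    }
    return close(fd) == 0 ? 0 : -1;
fail:
    /* With commit-on-close the kernel side has to remember the error. */
    close(fd);
    return -1;
}
```

Chunks may be written in any order because each one carries its own destination offset.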





* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  1:37                           ` Alexander Viro
@ 2002-11-06  2:05                             ` Werner Almesberger
  2002-11-07  6:04                               ` Eric W. Biederman
  2002-11-06  4:07                             ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-06  2:05 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andy Pfiffer, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Alexander Viro wrote:
> That's not obvious.  By the same logic, we would need syscalls for
> turning off overcommit, etc., etc.

Okay okay, add file system specific ioctls and sysctl to my list
of alternative mechanisms :-)

> FWIW, I suspect that
> 	open("/proc/image", O_EXCL|O_WRONLY);
> 	bunch of lseek()/write()
> 	close()

Hmm, interesting. Yes, that should work. One would of course have
to retain the current interface for in-kernel use (e.g. MCORE), but
that's probably okay. Let's see what Eric thinks about it - it's
his code after all.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 18:36                   ` Alan Cox
  2002-11-05 19:19                     ` Werner Almesberger
@ 2002-11-06  2:48                     ` Eric W. Biederman
  2002-11-06  4:29                     ` Eric W. Biederman
  2 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06  2:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
> 
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?

Kexec primarily needs the reboot/trap hooks in working order, and exported,
for it to live externally to the kernel.  

Currently the reboot_notifier call chain is private to sys.c, and is not
exported even to other parts of the kernel.

Even together, device_shutdown and the reboot_notifier do not properly shut down
the cpus on an SMP system.

Plus we are missing quite a few ->shutdown methods at random in the kernel, and if
kexec is not easily available someone might not get around to writing
and debugging them.

Plus a system call seems the natural interface for something that
appears to be a reboot.

Eric


* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  1:37                           ` Alexander Viro
  2002-11-06  2:05                             ` Werner Almesberger
@ 2002-11-06  4:07                             ` Eric W. Biederman
  2002-11-06  4:47                               ` Eric W. Biederman
  2002-11-06 19:24                               ` Rob Landley
  1 sibling, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06  4:07 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Werner Almesberger, Andy Pfiffer, Alan Cox, Suparna Bhattacharya,
	Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Alexander Viro <viro@math.psu.edu> writes:

> On Tue, 5 Nov 2002, Werner Almesberger wrote:
> 
> > Now, if we assume that it's okay for kexec to use a system call,
> > the next question is whether kexec should indeed use it, i.e.
> > whether a system call makes sense for what it is trying to do.
> > Since there are no device files or network elements naturally
> > involved here (i.e. other major kernel function interfaces),
> > the answer seems to be "yes".
> 
> That's not obvious.  By the same logic, we would need syscalls for
> turning off overcommit, etc., etc.
>
> FWIW, I suspect that
> 	open("/proc/image", O_EXCL|O_WRONLY);
> 	bunch of lseek()/write()
> 	close()
> would be more natural - definitely easier to understand than arguments of
> your sys_kexec().  It's easy to switch from your code to that - you
> put initialization into ->open(), pulling segments from userland into
> ->write(), use default ->lseek() and do actual work on ->close() if
> no errors had happened.  file->private_data will point to intermediate
> state you need.
> 
> After all, that's what happens - you form an image, writing to it user-supplied
> data from given buffers at given offsets and when you are done with that you
> commit the changes.  IMO special syscall is less natural match for that
> than sequence above - commit-on-close is not something unusual, so it matches
> the semantics of all syscalls involved...

First take a look at an ELF header.  There is a one to one mapping between
the arguments to kexec and the segments found there.  

Second lseek()/write() pairs do not have the capacity to specify holes/bss
segments kexec does, so it would not be a 1 to 1 transform.  But I can
live without holes.
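To make the ELF correspondence concrete, here is a sketch that turns PT_LOAD program headers into per-segment load descriptors. The struct and function are hypothetical and are not taken from the kexec patch; only the Elf32_Phdr fields come from the standard Linux <elf.h>. The gap between p_filesz and p_memsz is the hole/bss that a plain lseek()/write() pair has no way to express.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor, one per PT_LOAD program header. */
struct load_segment {
    const void *buf;    /* bytes supplied by user space */
    size_t      bufsz;  /* bytes to copy (p_filesz) */
    uint32_t    mem;    /* physical destination (p_paddr) */
    size_t      memsz;  /* size in memory (p_memsz); excess is bss */
};

/*
 * Convert the PT_LOAD headers of an in-memory ELF file to segments.
 * Returns the number of segments, or -1 if a header is malformed or
 * "out" is too small.
 */
int phdrs_to_segments(const unsigned char *file,
                      const Elf32_Phdr *ph, size_t nph,
                      struct load_segment *out, size_t max_out)
{
    size_t i, count = 0;

    for (i = 0; i < nph; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;                   /* notes etc. are not loaded */
        if (ph[i].p_filesz > ph[i].p_memsz || count == max_out)
            return -1;
        out[count].buf   = file + ph[i].p_offset;
        out[count].bufsz = ph[i].p_filesz;
        out[count].mem   = ph[i].p_paddr;
        out[count].memsz = ph[i].p_memsz;
        count++;
    }
    return (int)count;
}
```

For each segment, memsz - bufsz bytes after the copied data would have to be zeroed by whoever performs the load.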

Third I am not fully certain it makes sense to implement a function that will
boot into a user specified image remotely.  If the export process has
too many capabilities we could be in trouble.

Are you arguing for more /proc files?  Where does the magic file come
from?   I cannot request the allocation of a device number because the 
allocation was frozen before 2.4 started.  Though char 1 minor 11
seems the obvious choice.   Or should it be a magic file in sysfs
instead of procfs?  All of these require the code to live someplace
where I need to allocate a place in the namespace.  So there is no
inherent advantage over a system call.  And unless someone exports the
hooks to properly shutdown the system to modules it is useless.

Given that this is a seldom used system function I agree that it does not
need to be optimized.

I do not have any problem with changing the interface to something
more palatable to other kernel developers.  But I will only do it for
one of two reasons: either my patch will never get accepted in any
reasonable time frame and the change makes maintenance easier for me,
or the change makes the interface palatable for acceptance into the
kernel.  Neither currently appears to be the case.

----------
Now to dig into the heart of the issue.

I could write the new kernel image into /dev/mem and just jump to
it.  Because that is really all I want an interface to do.  There
are several practical problems, with something quite that simple.

No kernel shutdown code is run, so I am left with processors flying
all over the place, devices doing all manner of things, after their
device drivers have walked away.  Something needs to put the system in
a quiescent state.  The fix I call the reboot notifiers, and
device_shutdown.  (And then implement a bunch of ->shutdown() methods)

As we all know writing to /dev/mem is not safe because the memory is
being used for other things.  So I need a way to safely use memory
during the transition, from one kernel to another.  

Personally I would love to be able to allocate one big contiguous
buffer that the kernel is not using and neither is the image I will
eventually load.  Then I could just memcpy from that buffer and I
would be done.

Alas memory management in the kernel is done in pages, and can be
fragmented after running for many moons.  So I need to allocate all of
my memory in pages, and I need to let the kernel know where it will
all eventually live so I can correctly order the memcpy operations.

Once all the memory copying is sorted out I need to jump to the new
kernel (a kernel being anything that runs without an OS).  Logically
all you should have to do is do a single jump instruction but in
practice there is much more that has to be done.  The kernel when it
loads up looks around and enables all sorts of cpu optimizations so
the kernel runs as well as possible on the processor.  The new kernel
image needs to be given a least common denominator interface so it can
enable what it is prepared to take advantage of.   In addition to what
the normal shutdown path can accomplish on x86 this involves disabling
paging, changing the gdt, and changing the idt, and possibly disabling
SMP.  It should be possible to enhance device_shutdown so it can
properly disable SMP, though whether that will happen still remains up
in the air.

-----------------------------------------

So kexec needs:

- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
  for jumping to another kernel.
- Awareness of its existence.

To some extent every piece of this is intimately tied to the kernel
implementation, from the ability to modify page tables, when jumping
to a new kernel, to the best algorithm for finding a safe memory
buffer, to the proper way to shutdown devices this week, and being
intimately tied to the kernel the code needs to find a home in the
kernel.


Eric
 


* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 18:36                   ` Alan Cox
  2002-11-05 19:19                     ` Werner Almesberger
  2002-11-06  2:48                     ` Eric W. Biederman
@ 2002-11-06  4:29                     ` Eric W. Biederman
  2002-11-06  6:25                       ` Linus Torvalds
  2 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06  4:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Tue, 2002-11-05 at 18:00, Werner Almesberger wrote:
> > Yes, I've just checked with Eric, and he hasn't received any
> > indication from Linus so far. I posted a reminder to linux-kernel.
> > I'd really hate to see kexec miss 2.6.
> 
> Let me ask the same dumb question - what does kexec need that a dumper
> doesn't. In other words given reboot/trap hooks can kexec happily live
> as a standalone module ?

In replying to another post by Al Viro I managed to think this through.
kexec needs:

- An allocated slot in some namespace.
- The ability to request the kernel devices shut themselves down.
- Buffers that are safe to use.
- The ability to transition the cpu into a state that is suitable
  for jumping to another kernel.
- Awareness of its existence.

Most of this code is intimate with how the kernel currently behaves
and needs at least minor adjustments for things like living in PAE
mode.

The safe buffers a kernel can probably avoid.

I cannot see the core functionality of kexec ever living happily as a
standalone module.  The kexec code accomplishes nothing.  If there is
something useful it does it can probably be moved elsewhere and the
line count reduced.  The entire code base is basically obsessed with
getting safe temporary buffers for the new kernel to live in, and
given improvements to how the kernel allocates memory that are
theoretically possible with rmap I could remove that code as well.

The only thing that keeps kexec at all maintainable outside the kernel
is that big fundamental changes do not happen often.  But the kernel
must be tracked, closely.  I don't see that as a recipe for a
standalone module.  I can barely even see making the code a module, or
what the point would be.

The reason kmonte fails in so many cases where kexec succeeds is
precisely because kmonte is a module.

If we add machine_kexec, or something very similar but more
generalized, to the list of exported functions, perhaps kexec could
just have the buffer allocation code and live happily outside of the
kernel.  But as it is, if we want to factor kexec into pieces so one
piece can live happily as a standalone module it will take some
serious design work, and a total rethink of the implementation.  And
we will still have to add code to the kernel.

Eric


* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  4:07                             ` Eric W. Biederman
@ 2002-11-06  4:47                               ` Eric W. Biederman
  2002-11-06 19:24                               ` Rob Landley
  1 sibling, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06  4:47 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Werner Almesberger, Andy Pfiffer, Alan Cox, Suparna Bhattacharya,
	Jeff Garzik, Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel


And the question I was building up to, but forgot to ask.

Given that the kexec code is tied intimately to the kernel
implementation.

Given that there is no real advantage in an incremental write
model for kexec users (except not needing to allocate a syscall
number).

Do you see a better way to structure the kexec interface?

Another file in /proc, not carefully placed, is just a hair better than
an ioctl.  Using /proc is not desirable because there are uses of
kexec that need a very small kernel, and /proc is a pig that is
otherwise useless size bloat.

For some uses including the one that drove me to write it CONFIG_KEXEC
and CONFIG_TINY will both be defined.

Eric


* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  4:29                     ` Eric W. Biederman
@ 2002-11-06  6:25                       ` Linus Torvalds
  2002-11-06  6:38                         ` Suparna Bhattacharya
                                           ` (3 more replies)
  0 siblings, 4 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-06  6:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel


On 5 Nov 2002, Eric W. Biederman wrote:
> 
> In replying to another post by Al Viro I managed to think this through.
> kexec needs:

Note that kexec doesn't bother me at all, and I might find myself using it 
myself.

From a sanity standpoint, I think the thing already _has_ a system call, 
though: clearly "sys_reboot()" is the place to add a case for "reboot into 
this image". No? That's where we shut down devices anyway, and it's the 
sane place to say "reboot into the kexec image"

Which still leaves you with a real sys_kexec() to actually _load_ the
image, of course. I think loading of the image should be a totally
separate event from the actual booting of the image, since we may want to
load the image early, then do various user-level shutdown (unmounting 
etc), and then reboot.

Right now the kexec() stuff seems to mix up the loading and rebooting, but
I didn't take a very deep look, maybe I'm wrong.

Anyway, I don't really get why the kexec() system call would not just be

	void *kexec_image = NULL;
	unsigned long kexec_size;

	int sys_kexec(void *uaddr, size_t len)
	{
		void *new;

		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;

		/* Get rid of old image if any.. */
		if (kexec_image) {
			vfree(kexec_image);
			kexec_image = NULL;
		}

		/* Zero length just meant "get rid of it" */
		if (!len)
			return 0;

		if (!access_ok(VERIFY_READ, uaddr, len))
			return -EFAULT;

		new = vmalloc(len);
		if (!new)
			return -ENOMEM;

		if (copy_from_user(new, uaddr, len)) {
			vfree(new);
			return -EFAULT;
		}

		kexec_image = new;
		kexec_size = len;
		return 0;
	}

and be done with it that way? Then the actual "reboot" (and that would be
in the existing "sys_reboot()") basically just does something like

	memcpy(kernelbase, kexec_image, kexec_size);

at the very end (while obviously having to be careful about itself being
out of the way. It can avoid the page table issue by using the "page *"
array that vmalloc uses internally anyway: see "area->pages[]" in
vmalloc).

Note that the two-phase boot means that you can load the new kernel early, 
which allows you to later on use it for oops handling (it's a bit late to 
try to set up the kernel to be loaded at that time ;)
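The two-phase split can be modelled in user space as below. This is only a sketch of the control flow: malloc/free stand in for vmalloc/vfree, the function names are hypothetical, and the final memcpy targets an ordinary buffer rather than the kernel base.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static void  *staged_image;     /* private copy made in phase 1 */
static size_t staged_size;

/* Phase 1: stage a private copy of the image; len 0 discards it. */
int stage_image(const void *src, size_t len)
{
    void *copy;

    free(staged_image);         /* get rid of any old image */
    staged_image = NULL;
    staged_size = 0;
    if (len == 0)
        return 0;
    copy = malloc(len);
    if (!copy)
        return -ENOMEM;
    memcpy(copy, src, len);
    staged_image = copy;
    staged_size = len;
    return 0;
}

/*
 * Phase 2: copy the staged image to its final location.  Returns the
 * number of bytes copied, or 0 if nothing is staged or it does not
 * fit, in which case the caller falls back to a normal restart.
 */
size_t boot_staged_image(void *base, size_t base_len)
{
    if (!staged_image || staged_size > base_len)
        return 0;
    memcpy(base, staged_image, staged_size);
    return staged_size;
}
```

Because phase 1 keeps its own copy, the caller's buffer can be reused or freed long before phase 2 runs, which is the property the oops-handling case relies on.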

		Linus



* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  6:25                       ` Linus Torvalds
@ 2002-11-06  6:38                         ` Suparna Bhattacharya
  2002-11-06  7:48                         ` Eric W. Biederman
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-06  6:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Alan Cox, Werner Almesberger, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Tue, Nov 05, 2002 at 10:25:35PM -0800, Linus Torvalds wrote:
> 
> On 5 Nov 2002, Eric W. Biederman wrote:
> > 
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
> 
> Note that kexec doesn't bother me at all, and I might find myself using it 
> myself.
> 
> From a sanity standpoint, I think the thing already _has_ a system call, 
> though: clearly "sys_reboot()" is the place to add a case for "reboot into 
> this image". No? That's where we shut down devices anyway, and it's the 
> sane place to say "reboot into the kexec image"
> 
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, of course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting 
> etc), and then reboot.
> 
> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.
> 
> Anyway, I don't really get why the kexec() system call would not just be
> 
> 	void *kexec_image = NULL;
> 	unsigned long kexec_size;
> 
> 	int sys_kexec(void *uaddr, size_t len)
> 	{
> 		void *new;
> 
> 		if (!capable(CAP_ADMIN))
> 			return -EPERM;
> 
> 		/* Get rid of old image if any.. */
> 		if (kexec_image) {
> 			vfree(kexec_image);
> 			kexec_image = NULL;
> 		}
> 
> 		/* Zero length just meant "get rid of it" */
> 		if (!len)
> 			return 0;
> 
> 		if (!access_ok(VERIFY_READ, uaddr, len))
> 			return -EFAULT;
> 
> 		new = vmalloc(len);
> 		if (!new)
> 			return -ENOMEM;
> 
> 		if (memcpy_from_user(new, uaddr, len)) {
> 			vfree(new);
> 			return -EFAULT;
> 		}
> 
> 		kexec_image = new;
> 		kexec_size = len;
> 		return 0;
> 	}
> 
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
> 
> 	memcpy(kernelbase, kexec_image, kexec_size);
> 
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).
> 
> Note that the two-phase boot means that you can load the new kernel early, 
> which allows you to later on use it for oops handling (it's a bit late to 
> try to set up the kernel to be loaded at that time ;)

Yes, that's exactly what we need to support a soft-boot based dump
mechanism, much like the Mission Critical folks split up the bootimg
syscall to do the early load on a sane system, and the actual soft-boot
at crash time. And it fits in naturally as you point out ..

Regards
Suparna

> 
> 		Linus
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  6:25                       ` Linus Torvalds
  2002-11-06  6:38                         ` Suparna Bhattacharya
@ 2002-11-06  7:48                         ` Eric W. Biederman
  2002-11-06  9:11                           ` Suparna Bhattacharya
  2002-11-06 22:05                           ` Michal Jaegermann
  2002-11-06 16:13                         ` Eric W. Biederman
  2002-11-07  8:50                         ` Eric W. Biederman
  3 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06  7:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

Linus Torvalds <torvalds@transmeta.com> writes:

> On 5 Nov 2002, Eric W. Biederman wrote:
> > 
> > In replying to another post by Al Viro I managed to think this through.
> > kexec needs:
> 
> Note that kexec doesn't bother me at all, and I might find myself using it 
> myself.

Good.  Just before I saw this message I sent you my patch ported to 2.5.46,
and from the feed back on this one it looks like people would
appreciate a tweak or two.
 
> From a sanity standpoint, I think the thing already _has_ a system call, 
> though: clearly "sys_reboot()" is the place to add a case for "reboot into 
> this image". No? That's where we shut down devices anyway, and it's the 
> sane place to say "reboot into the kexec image"
> 
> Which still leaves you with a real sys_kexec() to actually _load_ the
> image, of course. I think loading of the image should be a totally
> separate event from the actual booting of the image, since we may want to
> load the image early, then do various user-level shutdown (unmounting 
> etc), and then reboot.

That sounds reasonable to me.  Especially as that lines up a little more
with what the mcore people want as well.  Until today I hadn't realized
they were using a spare kernel to process oopses.  For just booting
another kernel all of the staging can currently be done by reading the
new kernel into your process before calling the user-level shutdown code.

> Right now the kexec() stuff seems to mix up the loading and rebooting, but
> I didn't take a very deep look, maybe I'm wrong.

It currently happens all in one step because I had never gotten
feedback that people wanted it in two steps.   

> Note that the two-phase boot means that you can load the new kernel early, 
> which allows you to later on use it for oops handling (it's a bit late to 
> try to set up the kernel to be loaded at that time ;)

Given that, it is definitely a good idea to split the patch up into two
pieces.  And a kernel for oops handling should work once all of the other
problems are resolved.

> Anyway, I don't really get why the kexec() system call would not just be
> 
> 	void *kexec_image = NULL;
> 	unsigned long kexec_size;
> 
> 	int sys_kexec(void *uaddr, size_t len)
> 	{
> 		void *new;
> 
> 		if (!capable(CAP_ADMIN))
> 			return -EPERM;
> 
> 		/* Get rid of old image if any.. */
> 		if (kexec_image) {
> 			vfree(kexec_image);
> 			kexec_image = NULL;
> 		}
> 
> 		/* Zero length just meant "get rid of it" */
> 		if (!len)
> 			return 0;
> 
> 		if (!access_ok(VERIFY_READ, uaddr, len))
> 			return -EFAULT;
> 
> 		new = vmalloc(len);
> 		if (!new)
> 			return -ENOMEM;
> 
> 		if (memcpy_from_user(new, uaddr, len)) {
> 			vfree(new);
> 			return -EFAULT;
> 		}
> 
> 		kexec_image = new;
> 		kexec_size = len;
> 		return 0;
> 	}
> 
> and be done with it that way? Then the actual "reboot" (and that would be
> in the existing "sys_reboot()") basically just does something like
> 
> 	memcpy(kernelbase, kexec_image, kexec_size);
> 
> at the very end (while obviously having to be careful about itself being
> out of the way. It can avoid the page table issue by using the "page *"
> array that vmalloc uses internally anyway: see "area->pages[]" in
> vmalloc).

Using area->pages[] is an interesting idea.

From my current interface this is missing the following pieces.
1) The address or addresses to load the new kernel at.  (Think kernel + ramdisk)
2) The address to jump to start the new kernel.
3) My interesting buffer handling.

The question is how much of that do we need.

Thinking out loud, and hopefully answering your question.
- We need a working stack when the new kernel is jumped to so PIC code
  can exist at the entry point.

- An oops processing kernel needs to load at an address other than 1MB,
  or at the very least its boot sequence needs to squirrel away the
  old contents of the kernel text and data segments, which reside at
  1MB, before it moves down to 1MB.

- When we transfer control to the trampoline in machine_kexec we need
  to be able to refer to everything with physical addresses.

- I do not see a way out of running my buffer verifier algorithm.
  The problem is that I do not want to put complex logic in the
  assembly machine_kexec trampoline.  So I want to be able to pass
  it something it can just memcpy to its final resting place.  Which
  means the buffer pages either need to be the final resting place of
  the new kernel (ideal) or must not be pages that are part of the
  final resting place.

- I can dig up area->pages[] but I don't see vmalloc buying me
  anything.  Doing the copies and allocations a page at a time is not
  hard.   I have to sort the contents of the pages and where they
  are located, so I need to undo the virtual mapping.
  area->pages[] is an array of struct page *, which is most inconvenient
  when you are tearing down page tables; I would need to put the pages
  into another data structure that just had the page frame number or
  physical page address anyway.

- Once I am using my own data structure to track the pages, and I am
  already vetting the pages for safe locations, going the rest of the
  way to my current interface is not a big step, and I have already
  tested that code.

So either I have blinders on, or there is very little percentage in
changing how I load an image.  But to make the oops processing easier
I will split it up into two parts.

Then I guess the reasonable thing to do is to modify sys_reboot to
call machine_kexec instead of machine_restart when a kexec_image is
present.  Or should I add another magic number, and another case to
sys_reboot?  

	case LINUX_REBOOT_CMD_RESTART:
		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
		system_running = 0;
		device_shutdown();
		printk(KERN_EMERG "Restarting system.\n");
+		if (kexec_image)
+			machine_kexec(kexec_image);
		machine_restart(NULL);
		break;
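
As a rough illustration of the second option (another magic number
and another case), here is a minimal user-space sketch, not kernel
code.  CMD_KEXEC, its value, enum action and reboot_dispatch() are
hypothetical names made up for the example; only the numeric value of
LINUX_REBOOT_CMD_RESTART is real.

```c
#include <assert.h>
#include <stddef.h>

#define CMD_RESTART 0x01234567u  /* value of LINUX_REBOOT_CMD_RESTART */
#define CMD_KEXEC   0x45584543u  /* hypothetical: ASCII "EXEC", an assumption */

enum action { ACT_RESTART, ACT_KEXEC, ACT_EINVAL };

/* Set by a separate load step; NULL means nothing is staged. */
const void *kexec_image = NULL;

/* Decide what a sys_reboot()-like entry point would do for cmd. */
enum action reboot_dispatch(unsigned int cmd)
{
	switch (cmd) {
	case CMD_RESTART:
		/* A plain restart never turns into a kexec behind the caller's back. */
		return ACT_RESTART;
	case CMD_KEXEC:
		/* Asking for kexec with no staged image is an error, not a fallback. */
		return kexec_image ? ACT_KEXEC : ACT_EINVAL;
	default:
		return ACT_EINVAL;
	}
}
```

The difference from the patch fragment above is only in who decides:
with a separate command the caller asks for kexec explicitly, while
with the "if (kexec_image)" test a staged image changes what an
ordinary restart does.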


O.k.  In the next couple of days I will split the loading, and
executing phase of my kexec code into parts, and resubmit the code.
Then we can dig in on what it takes to make kexec run stably.

Eric



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  7:48                         ` Eric W. Biederman
@ 2002-11-06  9:11                           ` Suparna Bhattacharya
  2002-11-06 22:05                           ` Michal Jaegermann
  1 sibling, 0 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-06  9:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@transmeta.com> writes:
> 
> > On 5 Nov 2002, Eric W. Biederman wrote:
> > > 
> > > In replying to another post by Al Viro I managed to think this through.
> > > kexec needs:
> > 
> > Note that kexec doesn't bother me at all, and I might find myself using it 
> > myself.
> 
> Good.  Just before I saw this message I sent you my patch ported to 2.5.46,
> and from the feed back on this one it looks like people would
> appreciate a tweak or two.
>  
> 
> That sounds reasonable to me.  Especially as that lines up a little more
> with what the mcore people want as well.  Until today I hadn't realized
> they were using a spare current to process oopses.  For just booting
> another kernel all of the staging can currently be done by reading the
> new kernel into your process before calling the user-level shutdown code.
> 
> > Right now the kexec() stuff seems to mix up the loading and rebooting, but
> > I didn't take a very deep look, maybe I'm wrong.
> 
> It currently happens all in one step because I had never gotten
> feedback that people wanted it in two steps.   

I'd mentioned it a few times in the context of mcore, but probably 
didn't explain myself clearly enough then. 

> 
> > Note that the two-phase boot means that you can load the new kernel early, 
> > which allows you to later on use it for oops handling (it's a bit late to 
> > try to set up the kernel to be loaded at that time ;)
> 
> Given that it is definitely a good idea to split the patch up into two
> pieces.  And a kernel for oops handling should work once all of other
> problems are resolved.

Yes, this fits the model we need.

> 
> The question is how much of that do we need.
> 
> Thinking out loud, and hopefully answering your question.
> - We need a working stack when the new kernel is jumped to so PIC code
>   can exist at the entry point.
> 
> - An oops processing kernel needs to load at an address other than 1MB,
>   or at the very least its boot sequence needs to squirrel away the
>   old contents of the kernel text and data segments, which reside at
>   1MB, before it moves down to 1MB.

Yes, that bit of memory save logic exists in the mcore mechanism. These
pages are saved away in compressed form in memory and written out
later after dump.  

Now, to keep these pages from being used by the new kernel until
the dump is safely written out to disk, mcore patches some of
the initialization code to mark these pages (containing the saved
dump) as reserved.

> O.k.  In the next couple of days I will split the loading, and
> executing phase of my kexec code into parts, and resubmit the code.

Great !

> Then we can dig in on what it takes to make kexec run stably.
> 

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  6:25                       ` Linus Torvalds
  2002-11-06  6:38                         ` Suparna Bhattacharya
  2002-11-06  7:48                         ` Eric W. Biederman
@ 2002-11-06 16:13                         ` Eric W. Biederman
  2002-11-07  8:50                         ` Eric W. Biederman
  3 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-06 16:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

Linus Torvalds <torvalds@transmeta.com> writes:

> From a sanity standpoint, I think the thing already _has_ a system call, 
> though: clearly "sys_reboot()" is the place to add a case for "reboot into 
> this image". No? That's where we shut down devices anyway, and it's the 
> sane place to say "reboot into the kexec image"

When kexec is separated into two pieces I agree.  As I had it
initially in one step it does not look at all like reboot.    Now I
just need to think up a new magic number for sys_reboot.

[snip wonderful vision of the theoretical simplicity of sys_kexec].

In case I was not sufficiently clear last night.  It could be as
simple as your example code if I replaced vmalloc by
__get_free_pages/alloc_pages, and allocated a large contiguous area of
ram.  But MAX_ORDER limits me to 8MB images, and allocating an 8MB
chunk is unreliable, and even a 2MB chunk is dangerous.    

So I must use some form of scatter/gather list of pages, like
area->pages[], to make it work.  Things get tricky because I gather
(via memcpy) the pages at a location that potentially overlaps the
source pages.  So I must walk through the list of pages making certain
that when I gather (memcpy) the buffer pages into their final location
I will not stomp on a buffer page I have not come to yet.  Correctly
doing that untangling is where the complexity in kernel/kexec.c comes
from.
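
To make the overlap problem concrete, here is a minimal user-space
sketch.  It is not the code in kernel/kexec.c; it assumes a toy
"physical memory" of 16 tiny frames, unique destinations, and that any
frame which is neither a source nor a destination is free for the
taking.  The names (struct move, untangle, gather) are made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NFRAMES   16	/* size of the toy "physical memory" */
#define PAGE_SIZE 8	/* tiny pages keep the example readable */

unsigned char mem[NFRAMES][PAGE_SIZE];

struct move {
	size_t src;	/* frame currently holding the data */
	size_t dst;	/* frame where the data must end up */
};

/* True if frame f is the destination of some move. */
bool is_dest(const struct move *m, size_t n, size_t f)
{
	for (size_t i = 0; i < n; i++)
		if (m[i].dst == f)
			return true;
	return false;
}

/* True if frame f currently holds a source page. */
bool is_src(const struct move *m, size_t n, size_t f)
{
	for (size_t i = 0; i < n; i++)
		if (m[i].src == f)
			return true;
	return false;
}

/*
 * Relocate every source page that sits on a destination frame (other
 * than its own) to a frame that is neither source nor destination.
 * Returns 0 on success, -1 if no safe frame is left.
 */
int untangle(struct move *m, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (m[i].src == m[i].dst || !is_dest(m, n, m[i].src))
			continue;
		size_t f;
		for (f = 0; f < NFRAMES; f++)
			if (!is_dest(m, n, f) && !is_src(m, n, f))
				break;
		if (f == NFRAMES)
			return -1;
		memcpy(mem[f], mem[m[i].src], PAGE_SIZE);
		m[i].src = f;
	}
	return 0;
}

/* What a simple trampoline could then do: plain copies, in list order. */
void gather(const struct move *m, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (m[i].src != m[i].dst)
			memcpy(mem[m[i].dst], mem[m[i].src], PAGE_SIZE);
}
```

Once untangle() has run, no source page lives on another page's
destination, so gather() can copy in any order without stomping on a
page it has not come to yet.  All the vetting happens before the final
copy, which keeps the copy loop itself trivial.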

Eric


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  4:07                             ` Eric W. Biederman
  2002-11-06  4:47                               ` Eric W. Biederman
@ 2002-11-06 19:24                               ` Rob Landley
  1 sibling, 0 replies; 333+ messages in thread
From: Rob Landley @ 2002-11-06 19:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Suparna Bhattacharya, Jeff Garzik, Rusty Russell,
	Linux Kernel Mailing List

On Wednesday 06 November 2002 04:07, Eric W. Biederman wrote:

> Personally I would love to be able to allocate one big contiguous
> buffer that the kernel is not using and neither is the image I will
> eventually load.  Then I could just memcpy from that buffer and I
> would be done.
>
> Alas memory management in the kernel is done in pages, and can be
> fragmented after running for many moons.  So I need to allocate all of
> my memory in pages, and I need to let the kernel know where it will
> all eventually live so I can correctly order the memcpy operations.

Reverse Mappings are cool, and one reason they're cool is, in theory, you can 
grab a page of physical memory away from something else.  In theory code 
could be written to ask the kernel "could you please swap this the heck out, 
pin the page in memory, and give it to me instead now?"  And it can refuse 
("it's already pinned by something else, maybe it's a kernel page, go away"), 
it can block a bit ("gotta flush it to disk, wait until DMA is done"), or it 
could immediately comply ("it was a clean buffer, have it, keep it, stuff it 
and mount it on the wall for all I care...").

This means you can retroactively get contiguous areas of memory by shoving 
stuff aside.  If it's in use, it'll swap back in immediately.  (An obvious 
optimization occurs, but that's not necessary for minimal functionality.)

So the whole problem of needing contiguous areas of memory could, in 
theory, be addressed using RMAP.

-- 
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad, 
CmdrTaco, liquid nitrogen ice cream, and caffienated jello.  Well why not?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: What's left over.
  2002-10-31 18:00         ` Oliver Xymoron
@ 2002-11-06 20:52           ` Florian Weimer
  0 siblings, 0 replies; 333+ messages in thread
From: Florian Weimer @ 2002-11-06 20:52 UTC (permalink / raw)
  To: linux-kernel

Oliver Xymoron <oxymoron@waste.org> writes:

> - /tmp-style symlink issues on shared directories
> - vast majority of software (including security tools) ACL-unaware
> - much harder to check for correctness

 - surprising inheritance of the ACL of the directory

This is a known problem in NTFS land, and some people suggest that
per-directory ACLs are enough for everyone for exactly this reason.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  7:48                         ` Eric W. Biederman
  2002-11-06  9:11                           ` Suparna Bhattacharya
@ 2002-11-06 22:05                           ` Michal Jaegermann
  1 sibling, 0 replies; 333+ messages in thread
From: Michal Jaegermann @ 2002-11-06 22:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Linux Kernel Mailing List,
	lkcd-general, lkcd-devel

On Wed, Nov 06, 2002 at 12:48:36AM -0700, Eric W. Biederman wrote:
> 
> Then I guess the reasonable thing to do is to modify sys_reboot to
> call machine_kexec instead of machine_restart when a kexec_image is
> present.  Or should I add another magic number, and another case to
> sys_reboot?  

Given that "bird's-eye" description, why not make a "normal" restart
a particular case of kexec where you just have one kernel loaded
from external storage?  It does not seem to be that much
different, although some issues are skipped or taken for granted.  Or
am I talking nonsense?

   Michal

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  2:05                             ` Werner Almesberger
@ 2002-11-07  6:04                               ` Eric W. Biederman
  2002-11-07 12:17                                 ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-07  6:04 UTC (permalink / raw)
  To: Werner Almesberger; +Cc: Alexander Viro, Linux Kernel Mailing List

Werner Almesberger <wa@almesberger.net> writes:

> Alexander Viro wrote:
> > That's not obvious.  By the same logics, we would need syscalls for
> > turning off overcommit, etc., etc.
> 
> Okay okay, add file system specific ioctls and sysctl to my list
> of alternative mechanisms :-)
> 
> > FWIW, I suspect that
> > 	open("/proc/image", O_EXCL|O_WRONLY);
> > 	bunch of lseek()/write()
> > 	close()
> 
> Hmm, interesting. Yes, that should work. One would of course have
> to retain the current interface for in-kernel use (e.g. MCORE), but
> that's probably okay. Let's see what Eric thinks about it - it's
> his code after all.

For the record my opinion is there is extra code bloat but it is ok
if it is built as kexecfs.  Any other way of getting a magic file
to work with seems currently insane.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  6:25                       ` Linus Torvalds
                                           ` (2 preceding siblings ...)
  2002-11-06 16:13                         ` Eric W. Biederman
@ 2002-11-07  8:50                         ` Eric W. Biederman
  2002-11-07 15:44                           ` Linus Torvalds
                                             ` (3 more replies)
  3 siblings, 4 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-07  8:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel


I am now officially grumpy.  From a code perspective splitting kexec
into two phases, load and execute, is a simple change to make.  From a
semantics standpoint things get ugly and messy.  And that means I
can't just dash off another patch.

There are currently 2 cases that it would be nice to have work.
1) Load a new kernel and immediately execute it.
2) Load a new kernel and execute it on panic.

At first glance splitting the code into load and execute phases allows
us to use one mechanism to accomplish both goals.  In practice
that does not work.  There are 2 problems.

panic does not call sys_reboot; it rolls that functionality by hand.
And to a certain extent it makes sense for panic to take a different
path because we know something is badly wrong so we need to be extra
careful.

In staging the image we allocate a whole pile of pages, and keep them
locked in place.  Waiting for years potentially until the machine
reboots or panics.   This memory is not accounted for anywhere so no
one can see that we have it allocated, which makes debugging hard.
Additionally in locking up megabytes for a long period of time we
create unsolvable fragmentation issues for the mm layer to deal with.

In a unified design I can buffer the image in the anonymous pages of a
user space process just as well as I can in locked down kernel memory.
So factoring sys_kexec into load and execute pieces only helps for
executing the new image on a kernel panic, and that case does not
actually work.

So currently factoring kexec looks like a pointless exercise, that
will just lead to more pain.

I am left with the following questions.
- How should the pages allocated to an early loaded image be accounted
  for?
- How do we avoid making life hard for the mm system with an early
  loaded image?
- Is it safe to call sys_reboot from panic?
- Can we simply factor out the sequence:
		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
		system_running = 0;
		device_shutdown();
  And place it into its own subroutine?
- What does the current mcore implementation do?  And is that a good
  model to follow to resolve some of these issues?


Eric


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-07  6:04                               ` Eric W. Biederman
@ 2002-11-07 12:17                                 ` Werner Almesberger
  0 siblings, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-07 12:17 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Alexander Viro, Linux Kernel Mailing List

Eric W. Biederman wrote:
[ Al's FS-based kexec interface ]

> For the record my opinion is there is extra code bloat but it is ok
> if it is built as kexecfs.  Any other way of getting a magic file
> to work with seems currently insane.

Yes, such an interface change would only make sense if you couldn't
get the system call, or if there would actually be a useful way for
setting up kexec using "third party" programs. But it seems unlikely
to me that somebody could get all the magic right just by using dd.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-07  8:50                         ` Eric W. Biederman
@ 2002-11-07 15:44                           ` Linus Torvalds
  2002-11-09 23:05                             ` Eric W. Biederman
  2002-11-07 15:48                           ` Linus Torvalds
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-11-07 15:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel


On 7 Nov 2002, Eric W. Biederman wrote:
> 
> There are currently 2 cases that it would be nice to have work.
> 1) Load a new kernel and immediately execute it.
> 2) Load a new kernel and execute it on panic.

I really don't think (1) is _ever_ a valid thing to do.

The fact is, loading a new kernel wants filesystems and a fully working 
system. While executing it wants the filesystems quiescent.

> panic does not call sys_reboot it rolls that functionality by hand.

Forget about panic for now. It's a design issue - it should be possible to 
work, but somebody else can do it if the infrastructure is done right.

> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.

And in a unified design, I won't apply the patches. It's that simple.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-07  8:50                         ` Eric W. Biederman
  2002-11-07 15:44                           ` Linus Torvalds
@ 2002-11-07 15:48                           ` Linus Torvalds
  2002-11-07 19:32                           ` kexec (was: [lkcd-devel] Re: What's left over.) Andy Pfiffer
  2002-11-08 18:01                           ` [lkcd-devel] Re: What's left over Alan Cox
  3 siblings, 0 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-07 15:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel


On 7 Nov 2002, Eric W. Biederman wrote:
> 
> In staging the image we allocate a whole pile of pages, and keep them
> locked in place.  Waiting for years potentially until the machine
> reboots or panics.   This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.

So how about facing the fact that my "vmalloc()" approach actually solves
all these issues. The memory is visible to the rest of the system (few
things care about it right now, but it _is_ accounted for and things like
/dev/kmem will actually see it etc).

And the vmalloc() approach is even portable, so one of the two phases is 
something that is totally generic (and the second phase is almost totally 
architecture-dependent anyway). 

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: kexec (was: [lkcd-devel] Re: What's left over.)
  2002-11-07  8:50                         ` Eric W. Biederman
  2002-11-07 15:44                           ` Linus Torvalds
  2002-11-07 15:48                           ` Linus Torvalds
@ 2002-11-07 19:32                           ` Andy Pfiffer
  2002-11-07 22:13                             ` Andy Pfiffer
                                               ` (2 more replies)
  2002-11-08 18:01                           ` [lkcd-devel] Re: What's left over Alan Cox
  3 siblings, 3 replies; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-07 19:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Thu, 2002-11-07 at 00:50, Eric W. Biederman wrote:

> In staging the image we allocate a whole pile of pages, and keep them
> locked in place.  Waiting for years potentially until the machine
> reboots or panics.   This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
> Additionally in locking up megabytes for a long period of time we
> create unsolvable fragmentation issues for the mm layer to deal with.

Just an idea:

Could a new, unrunnable process be created to "hold" the image?

<hand-wave>
Use a hypothetical sys_kexec() to:
1. create an empty process.
2. copy the kernel image and parameters into the process's address
space.
3. put the process to sleep.
</hand-wave>

If it's floating out there for weeks or years, the data could get paged
out and not wired down.  It would show up in ps, so you'd have at least
some visibility into the allocation.

Change your mind?  Kill the process.

It might be complicated (or unworkable) to handle the panic case
properly, but for the case where a fast reboot is requested by calling
sys_reboot(), one should be able to fault-in and read back the image
from the "kexec holder" process' address space, copying it to the final
destination as you go.

You might even be able to go the next step, and if the process were
crafted carefully, waking it and running it would trigger the "copyin,
quiesce, and go" behavior.

Just a thought.

Andy



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: kexec (was: [lkcd-devel] Re: What's left over.)
  2002-11-07 19:32                           ` kexec (was: [lkcd-devel] Re: What's left over.) Andy Pfiffer
@ 2002-11-07 22:13                             ` Andy Pfiffer
  2002-11-07 22:56                               ` Werner Almesberger
  2002-11-11 17:03                             ` Bill Davidsen
       [not found]                             ` <200211080536.31287.landley@trommello.org>
  2 siblings, 1 reply; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-07 22:13 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Eric W. Biederman, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Thu, 2002-11-07 at 11:32, Andy Pfiffer wrote:
> On Thu, 2002-11-07 at 00:50, Eric W. Biederman wrote:
> 
> > In staging the image we allocate a whole pile of pages, and keep them
> > locked in place.

> Just an idea:
> 
> Could a new, unrunnable process be created to "hold" the image?
> 
> <hand-wave>
> Use a hypothetical sys_kexec() to:
> 1. create an empty process.
> 2. copy the kernel image and parameters into the processes' address
> space.
> 3. put the process to sleep.
> </hand-wave>

A further refinement to the above:

1. make sys_kexec() a blocking call.  The caller reads the image into
their address space prior to making the call, and passes the same kind
of information (number of segments, segment pointer, etc.) to the
syscall in the same manner.  Then, it sets a well-known global variable
that means "there is a kexec image available", and then blocks.

2. add code to sys_reboot() under a CONFIG_KEXEC to check the global
variable from step 1, and if a kexec image is available, wake the
process blocked in step 1.

3. the reawakened sys_kexec() then proceeds to copy-in and lay down the
new image in memory, shutdown the devices, and go.

I'm still pondering the kexec-ish reboot after panic() with this kind of
mechanism.  Ah well, it's just an idea.

Andy



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: kexec (was: [lkcd-devel] Re: What's left over.)
  2002-11-07 22:13                             ` Andy Pfiffer
@ 2002-11-07 22:56                               ` Werner Almesberger
  0 siblings, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-07 22:56 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Eric W. Biederman, Linus Torvalds, Alan Cox,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

Andy Pfiffer wrote:
> I'm still pondering the kexec-ish reboot after panic() with this kind of
> mechanism.  Ah well, it's just an idea.

Yes, that's where the problems get really nasty. Also, for such
cases, you want the pages to be mlock'ed. Furthermore, you'd
have to tell init about this magic process. (Which would be
tricky, because e.g. sysvinit simply uses kill(-1,...).)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-07  8:50                         ` Eric W. Biederman
                                             ` (2 preceding siblings ...)
  2002-11-07 19:32                           ` kexec (was: [lkcd-devel] Re: What's left over.) Andy Pfiffer
@ 2002-11-08 18:01                           ` Alan Cox
  3 siblings, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-08 18:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

On Thu, 2002-11-07 at 08:50, Eric W. Biederman wrote:
> panic does not call sys_reboot it rolls that functionality by hand.
> And to a certain extent it makes sense for panic to take a different
> path because we know something is badly wrong so we need to be extra
> careful.

However, both of them should use the same end point routines, and the
hooks should go there.

> reboots or panics.   This memory is not accounted for anywhere so no
> one can see that we have it allocated, which makes debugging hard.
> Additionally in locking up megabytes for a long period of time we
> create unsolvable fragmentation issues for the mm layer to deal with.

We have an MMU, so if you just make n thousand "get me a page" calls
it's quite happy.

> In a unified design I can buffer the image in the anonymous pages of a
> user space process just as well as I can in locked down kernel memory.
> So factoring sys_kexec in to load and execute pieces only helps for
> executing the new image on a kernel panic, and that case does not
> actually work.

What if your user space is swapped out?  You can't page it back in
safely.

> - How should the pages allocated to an early loaded image be accounted
>   for?

Just get_free_page them - if you can handle it over 4Gb then specify
that high pages are fine and kmap them to copy into them - that makes
the VM on giant boxes way happier. For bonus points also adjust the
virtual memory accounting.

> - How do we avoid making life hard for the mm system with an early
>   loaded image?

Not really a problem, especially if you are allowing high pages.

> - Is it safe to call sys_reboot from panic?

No, but both can call sys_machine_restart or whatever.

> - Can we simply factor out the sequence:
> 		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
> 		system_running = 0;
> 		device_shutdown();
>   And place it into its own subroutine?

Don't do that sequence on a panic IMHO (this is a standing issue, we
should not pass NULL but REBOOT/PANIC/KEXEC/... so the drivers can make
that decision - then we can do it right).
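
A minimal user-space sketch of that idea, assuming a much simplified
hook list rather than the real notifier chain.  Everything here
(enum shutdown_reason, register_shutdown_hook, prepare_shutdown and
the two example hooks) is a hypothetical name for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Why the machine is going down; passed to every hook. */
enum shutdown_reason { SR_REBOOT, SR_PANIC, SR_KEXEC };

typedef void (*shutdown_hook)(enum shutdown_reason why);

#define MAX_HOOKS 8
shutdown_hook hooks[MAX_HOOKS];
size_t nr_hooks;

/* Add a hook to the list; returns -1 when the list is full. */
int register_shutdown_hook(shutdown_hook fn)
{
	if (nr_hooks == MAX_HOOKS)
		return -1;
	hooks[nr_hooks++] = fn;
	return 0;
}

/* The factored-out subroutine: one entry point, reason passed along. */
void prepare_shutdown(enum shutdown_reason why)
{
	for (size_t i = 0; i < nr_hooks; i++)
		hooks[i](why);
}

/* Counters standing in for real hardware actions. */
int dma_stops, cache_flushes;

/* A NIC-like driver: always stop DMA, whatever the reason. */
void nic_hook(enum shutdown_reason why)
{
	(void)why;
	dma_stops++;
}

/* A disk-like driver: skip the slow, risky flush when panicking. */
void disk_hook(enum shutdown_reason why)
{
	if (why != SR_PANIC)
		cache_flushes++;
}
```

With the reason passed in, sys_reboot, panic and kexec could share the
same subroutine while each driver decides for itself how much work is
safe to do.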

Alan


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-05 18:00                 ` Werner Almesberger
  2002-11-05 18:36                   ` Alan Cox
@ 2002-11-09 21:21                   ` Pavel Machek
  2002-11-11 16:27                     ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Pavel Machek @ 2002-11-09 21:21 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Suparna Bhattacharya, Jeff Garzik, Linus Torvalds,
	Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general,
	lkcd-devel

Hi!

> > Yes, we are putting [MCORE] in as one of the alternative dump targets
> > available.
> 
> Great !
> 
> > Its not quite ready yet and we need something like kexec to be
> > available which we can use on Intel systems to achieve the softboot
> > (the acceptance status of that still doesn't seem to be clear),
> 
> Yes, I've just checked with Eric, and he hasn't received any
> indication from Linus so far. I posted a reminder to linux-kernel.
> I'd really hate to see kexec miss 2.6.
> 
> > Why do we even consider the other options when we are doing 
> > this already ? Well, as we discussed earlier there's non-disruptive
> > dumps for one, where this wouldn't work.
> 
> But they're very different anyway, aren't they ? I mean, you could
> even implement them (well, almost) from user space, with today's
> kernels.
> 
> > The other is that before overwriting 
> > memory we need to be able to stop all activity in the system for certain
> > (system may appear hung/locked up) and I'm not fully certain about
> > how to do this for all environments. Maybe an answer lies in 
> > rethinking some parts of the algorithm a bit.
> 
> This is certainly the hairiest part, yes. I think we have about
> four types of devices/elements to worry about:
> 
>  - those that just sit there, and never talk unless spoken to
>  - those that may generate interrupts
>  - those that DMA if you ask them nicely
>  - those that DMA when they feel like it (e.g. copy an incoming
>    network packet to the next buffer in the free list)
> 
> The latter are the real problem. I see the following possibilities
> for dealing with them:
> 
>  - faith-based computing: pray that nothing bad will befall your
>    system :-)
>  - de-activate them individually. There should be a lot of work
>    that can be shared with power management. And that's one of
>    the reasons why I think the memory target should be available
>    early, or convergence will take forever.

I have a very similar problem in swsusp (need to deactivate DMA
devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
for that.

								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-07 15:44                           ` Linus Torvalds
@ 2002-11-09 23:05                             ` Eric W. Biederman
  2002-11-09 23:33                               ` Linus Torvalds
                                                 ` (3 more replies)
  0 siblings, 4 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-09 23:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

There are two cases I am seeing users wanting.
1) Load a new kernel on panic.
   - Extra care must be taken so what broke the first kernel does
     not break this one, and so that the shards of the old kernel
     do not break it.
   - Care must be taken so that loading the second kernel does not
     erase valuable data that is desirable to place in a crash dump.
   - This kernel cannot live at the same address as the old one, (at
     least not initially).

2) Load a new kernel under normal operating conditions.
   And when you have a normal user space that boils down to:
   - Acquire the kernel you are going to boot.
   - Run the user space shutdown scripts, so the system is in
     a clean state.
   - Execute the new kernel.
   - The normal case is that the newly loaded kernel will live at the 
     same physical location where the current kernel lives.


Currently my code handles starting a new kernel under normal operating
conditions.  There are currently two ways I can implement a clean user
space shutdown without needing locked buffers in the kernel until the
very last moment.

Method 1 (This works today with my sample user space):
- copy the kernel to /newkernel
- init 6
- if [ -r /newkernel ]; then
        /sbin/kexec /newkernel
  else
        /sbin/reboot
  fi
- /sbin/kexec reads in /newkernel
- /newkernel is parsed to figure out how it should be loaded
- sys_kexec is called to copy the kernel from user space anonymous
  memory to temporary kernel buffers.

Method 2 (For people with read only roots):
- /sbin/delayed_kexec /path/to/new/kernel 
- Read in the /path/to/new/kernel into anonymous pages
- Parse it and figure out how it should be loaded
- Run the shutdown scripts from /etc/rc6.d/ 
- Call sys_kexec, which will copy the data from user space anonymous
  pages, to kernel space.

This is to just make it clear that I am not working from a
FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
maintenance.  I am quite willing to make changes assuming I understand
what is gained with the change.  



What I currently support is a moderately nice interface, that has the
kernel doing as much as it can without being bogged down in the
specific details in any one file format, or needing something besides
a 32bit entry point to jump to.

I model an image as a set of segments of physical memory.  And I copy
the image loaded with sys_kexec to its final location, before jumping
to the new image.  There are two reasons for this.  It takes 3
segments to load a bzImage (setup.S, vmlinux, and an initrd).  And an
arbitrary number of segments maps cleanly to a static ELF binary.

There is only one difficult case: what happens when the buffers the
kernel allocates physically lie in one of the memory segments of the
new kernel image?  This is possible especially for the initrd, which
is loaded at the end of memory.
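
The overlap problem described above can be made concrete with a small
user-space sketch.  It is not taken from the patch: struct
sketch_segment, sketch_page_is_destination() and the 4096-byte page
size are assumptions made purely for illustration.  The helper answers
the question that matters here: does a page the allocator handed out
fall inside a range the new image wants to occupy?

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/* Hypothetical segment descriptor: where the data must end up in
 * physical memory and how large that range is.  This is not the
 * structure layout used by the patch.
 */
struct sketch_segment {
	uintptr_t mem;		/* physical destination address */
	size_t memsz;		/* size of the destination range in bytes */
};

/* Return 1 if the page starting at phys overlaps the destination
 * range of any segment, 0 otherwise.
 */
static int sketch_page_is_destination(const struct sketch_segment *segs,
				      size_t nr_segs, uintptr_t phys)
{
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		uintptr_t start = segs[i].mem;
		uintptr_t end = start + segs[i].memsz;	/* exclusive */

		if (phys + SKETCH_PAGE_SIZE > start && phys < end)
			return 1;
	}
	return 0;
}
```

A buffer page for which this returns 1 is exactly the kind of page the
final pass has to move out of the way before the copy starts.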

I then use the following algorithm to sort the potential mess out
before I jump to the new code.  And since this code depends on
swapping the contents of pages, knowing the physical location of
the pages, and is not limited to 128MB I am reluctant to look at a
vmalloc variant.  I can more easily get my pages from a slab if I
need to report that I have them.

static int kimage_get_off_destination_pages(struct kimage *image)
{
	kimage_entry_t *ptr, *cptr, entry;
	unsigned long buffer, page;
	unsigned long destination = 0;

	/* Here we implement safeguards to ensure that
	 * a source page is not copied to its destination
	 * page before the data on the destination page is
	 * no longer useful.
	 *
	 * To make it work we actually wind up with a
	 * stronger condition.  For every page considered
	 * it is either its own destination page or it is
	 * not a destination page of any page considered.
	 *
	 * Invariants 
	 * 1. buffer is not a destination of a previous page.
	 * 2. page is not a destination of a previous page.
	 * 3. destination is not a previous source page.
	 *
	 * Result: Either a source page and a destination page 
	 * are the same or the page is not a destination page.
	 *
	 * These checks could be done when we allocate the pages,
	 * but doing it as a final pass allows us more freedom
	 * on how we allocate pages.
	 * 
	 * Also while the checks are necessary, in practice nothing
	 * happens.  The destination kernel wants to sit in the
	 * same physical addresses as the current kernel so we never
	 * actually allocate a destination page.
	 *
	 * BUGS: This is a O(N^2) algorithm.
	 */

	
	buffer = __get_free_page(GFP_KERNEL);
	if (!buffer) {
		return -ENOMEM;
	}
	buffer = virt_to_phys((void *)buffer);
	for_each_kimage_entry(image, ptr, entry) {
		/* Here we check to see if an allocated page */
		kimage_entry_t *limit;
		if (entry & IND_DESTINATION) {
			destination = entry & PAGE_MASK;
		}
		else if (entry & IND_INDIRECTION) {
			/* Indirection pages must include all of their
			 * contents in limit checking.
			 */
			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
		}
		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
			continue;
		}
		page = entry & PAGE_MASK;
		limit = ptr;

		/* See if a previous page has the current page as its
		 * destination.
		 * i.e. invariant 2
		 */
		cptr = kimage_dst_conflict(image, page, limit);
		if (cptr) {
			unsigned long cpage;
 			kimage_entry_t centry;
			centry = *cptr;
			cpage = centry & PAGE_MASK;
			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
			*cptr = page | (centry & ~PAGE_MASK);
			*ptr = buffer | (entry & ~PAGE_MASK);
			buffer = cpage;
		}
		if (!(entry & IND_SOURCE)) {
			continue;
		}

		/* See if a previous page is our destination page.
		 * If so claim it now.
		 * i.e. invariant 3
		 */
		cptr = kimage_src_conflict(image, destination, limit);
		if (cptr) {
			unsigned long cpage;
 			kimage_entry_t centry;
			centry = *cptr;
			cpage = centry & PAGE_MASK;
			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
			*cptr = buffer | (centry & ~PAGE_MASK);
			*ptr = cpage | ( entry & ~PAGE_MASK);
			buffer = page;
		}
		/* If the buffer is my destination page do the copy now 
		 * i.e. invariant 3 & 1
		 */
		if (buffer == destination) {
			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
			*ptr = buffer | (entry & ~PAGE_MASK);
			buffer = page;
		}
	}
	free_page((unsigned long)phys_to_virt(buffer));
	return 0;
}


static kimage_entry_t *kimage_dst_conflict(
	struct kimage *image, unsigned long page, kimage_entry_t *limit)
{
	kimage_entry_t *ptr, entry;
	unsigned long destination = 0;
	for_each_kimage_entry(image, ptr, entry) {
		if (ptr == limit) {
			return 0;
		}
		else if (entry & IND_DESTINATION) {
			destination = entry & PAGE_MASK;
		}
		else if (entry & IND_SOURCE) {
			if (page == destination) {
				return ptr;
			}
			destination += PAGE_SIZE;
		}
	}
	return 0;
}


static kimage_entry_t *kimage_src_conflict(
	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
{
	kimage_entry_t *ptr, entry;
	for_each_kimage_entry(image, ptr, entry) {
		unsigned long page;
		if (ptr == limit) {
			return 0;
		}
		else if (entry & IND_DESTINATION) {
			/* nop */
		}
		else if (entry & IND_DONE) {
			/* nop */
		}
		else {
			/* SOURCE & INDIRECTION */
			page = entry & PAGE_MASK;
			if (page == destination) {
				return ptr;
			}
		}
	}
	return 0;
}
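
For readers who want to experiment with the scan outside the kernel,
here is a minimal user-space model of kimage_dst_conflict().  It
assumes a flat array terminated by a "done" entry in place of the
chained pages that for_each_kimage_entry walks, and the SK_* flag
values are made up for the sketch.  Only the control flow mirrors the
function above.

```c
#include <stddef.h>

#define SK_PAGE_SIZE	4096UL
#define SK_PAGE_MASK	(~(SK_PAGE_SIZE - 1))

/* Flag values are assumptions chosen for this sketch. */
#define SK_DESTINATION	0x1UL	/* entry sets the next destination address */
#define SK_INDIRECTION	0x2UL	/* unused in the flat model */
#define SK_DONE		0x4UL	/* end of list */
#define SK_SOURCE	0x8UL	/* entry names a page holding image data */

typedef unsigned long sk_entry_t;

/* Scan the entries before limit and return the first source entry
 * whose destination is the physical page `page`, or NULL if no such
 * entry exists.
 */
static sk_entry_t *sk_dst_conflict(sk_entry_t *list, unsigned long page,
				   sk_entry_t *limit)
{
	unsigned long destination = 0;
	sk_entry_t *ptr;

	for (ptr = list; !(*ptr & SK_DONE); ptr++) {
		if (ptr == limit)
			return NULL;
		if (*ptr & SK_DESTINATION) {
			destination = *ptr & SK_PAGE_MASK;
		} else if (*ptr & SK_SOURCE) {
			if (page == destination)
				return ptr;
			/* Each source page fills one destination page. */
			destination += SK_PAGE_SIZE;
		}
	}
	return NULL;
}
```

Because every call restarts from the head of the list and the caller
invokes it once per entry, the quadratic cost noted in the BUGS comment
is visible directly in this model.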





Having had time to digest the idea of starting a new kernel on panic
I can now make some observations about what I believe it would take to
make it as robust as possible.

- On panic because random pieces of the kernel may be broken we want
  to use as little of the kernel as possible.  

- Therefore machine_kexec should not allocate any memory, and as
  quickly as possible it should transition to the new kernel

- So a new page table should be sitting around with the new kernel
  already mapped, and likewise other important tables like the
  gdt, and the idt, should be pre-allocated.

- Then machine_kexec can just switch stacks, page tables, and other
  machine dependent tables and jump to the new kernel.

- The load stage needs to load everything at the physical location it
  will initially run at.  This would likely need support from rmap.

- The load stage needs to preallocate page tables and buffers.

- The load stage would likely work easiest by either requiring a mem=xxx
  line, reserving some of physical memory for the new kernel.  Or
  alternatively using some rmap support to clear out a swath of
  physical memory the new kernel can be loaded into.  

- The new kernel loaded on panic must know about the previous kernel,
  and have various restrictions because of that.


Supporting a kernel loaded from a normal environment is a rather
different problem.  

- It cannot be loaded at its run location (The current kernel is
  sitting there).

- It should not need to know about the previously executing kernel.

- Work can be done in machine_kexec to allocate memory so everything
  does not need to be preallocated.

- I can safely use multiple calls to the page allocator instead of
  needing a special mechanism to allocate memory.



And now I go back to the silly exercise of factoring my code so the
new kernel can be kept in locked kernel memory, instead of in a file
while the shutdown scripts are run.

Unless the linux kernel is modified to copy itself to the top of
physical memory when it loads I have trouble seeing how any of this
will help make the panic case easier to implement.

Eric


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:05                             ` Eric W. Biederman
@ 2002-11-09 23:33                               ` Linus Torvalds
  2002-11-10  1:37                                 ` Eric W. Biederman
  2002-11-09 23:39                               ` [lkcd-devel] Re: What's left over Randy.Dunlap
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-11-09 23:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel


On 9 Nov 2002, Eric W. Biederman wrote:
> 
> Currently my code handles starting a new kernel under normal operating
> conditions.  There are currently two ways I can implement a clean user
> space shutdown without needing locked buffers in the kernel until the
> very last moment.

PLEASE tell me why you don't just use the 20-line "vmalloc()" function I 
already wrote for you?

It works for all cases - and since you do need to load the kernel into 
memory anyway, it's not using any more memory than your existing code. And 
it's infinitely more flexible to have a clearly separated load-process, 
than having to have some load happen at reboot time (whether by init or by 
anything else).

And since the kernel is fully working at the load time, you can even do
things like swap out pages in order to make room for the kernel at the 
right place.  So you can even do something like this:

	int alloc_kernel_pages(unsigned long *array, int nr_pages,
		unsigned long min_address)
	{
		void *bad_page_list = NULL;
		int i = 0, retval;

		while (i < nr_pages) {
			unsigned long page = __get_free_page(GFP_USER);

			if (!page)
				goto oom;

			if (page < min_address) {
				*(void **)page = bad_page_list;
				bad_page_list = (void *)page;
				continue;
			}
			array[i] = page;
			i++;
		}
		retval = 0;
	out:
		while (bad_page_list) {
			unsigned long page = (unsigned long) bad_page_list;
			bad_page_list = *(void **)bad_page_list;
			free_page(page);
		}
		return retval;
	oom:
		while (i > 0)
			free_page(array[--i]);
		retval = -ENOMEM;
		goto out;
	}

and now you are guaranteed that all the allocated pages are above a
certain mark (change the "min_address" to be a "validity callback" or
whatever if you want to be fancy and allow arbitrary rules, which is good
if you want to allow pages in the low 1M on x86, for example), which means
that your final reboot stage is _much_much_ simpler and you don't ever 
have to worry about overlap. 
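
A user-space sketch of the same loop with the "validity callback"
variant may help when reasoning about its behaviour.  Everything named
sk_* is hypothetical: an eight-page simulated pool stands in for
__get_free_page()/free_page(), page k of the pool is given the fake
physical address (k + 1) * 4096, and the allocator always hands out the
lowest free page.  As in the function above, rejected pages are chained
through their own first word and only released at the end, so the
allocator cannot hand the same bad page back again.

```c
#include <stddef.h>

#define SK_PAGE_SIZE	4096UL
#define SK_POOL_PAGES	8

/* Simulated memory.  Address 0 is reserved to signal failure. */
static unsigned long sk_pool[SK_POOL_PAGES][SK_PAGE_SIZE / sizeof(unsigned long)];
static int sk_used[SK_POOL_PAGES];

static unsigned long sk_get_page(void)	/* lowest free page, or 0 */
{
	int k;

	for (k = 0; k < SK_POOL_PAGES; k++) {
		if (!sk_used[k]) {
			sk_used[k] = 1;
			return (k + 1) * SK_PAGE_SIZE;
		}
	}
	return 0;
}

static void sk_put_page(unsigned long page)
{
	sk_used[page / SK_PAGE_SIZE - 1] = 0;
}

static unsigned long *sk_page_ptr(unsigned long page)
{
	return sk_pool[page / SK_PAGE_SIZE - 1];
}

/* Example rule (an assumption): accept pages at or above 0x4000. */
static int sk_above_mark(unsigned long page)
{
	return page >= 4 * SK_PAGE_SIZE;
}

/* Returns 0 with array[] filled, or -1 if the pool runs dry. */
static int sk_alloc_pages(unsigned long *array, int nr_pages,
			  int (*valid)(unsigned long page))
{
	unsigned long bad_list = 0;	/* rejected pages, chained via word 0 */
	int i = 0, retval = 0;

	while (i < nr_pages) {
		unsigned long page = sk_get_page();

		if (!page) {
			while (i > 0)
				sk_put_page(array[--i]);
			retval = -1;
			break;
		}
		if (!valid(page)) {
			sk_page_ptr(page)[0] = bad_list;
			bad_list = page;
			continue;
		}
		array[i++] = page;
	}
	while (bad_list) {
		unsigned long next = sk_page_ptr(bad_list)[0];

		sk_put_page(bad_list);
		bad_list = next;
	}
	return retval;
}
```

In this model a request for three pages with sk_above_mark() holds
three rejected pages while it runs, which is the temporary overhead the
approach trades for a simpler final copy.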

Use one of the pages to allocate the memcpy() trampoline and the actual 
data structures used for the copy, for example. Use the rest for the 
actual kernel data.

Keep it simple. 

			Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:05                             ` Eric W. Biederman
  2002-11-09 23:33                               ` Linus Torvalds
@ 2002-11-09 23:39                               ` Randy.Dunlap
  2002-11-10  2:58                                 ` Eric W. Biederman
  2002-11-10  1:31                               ` Werner Almesberger
  2002-11-10  2:08                               ` Alan Cox
  3 siblings, 1 reply; 333+ messages in thread
From: Randy.Dunlap @ 2002-11-09 23:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
	Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel

{warning: cc: list too large :}

On 9 Nov 2002, Eric W. Biederman wrote:

| There are two cases I am seeing users wanting.
| 1) Load a new kernel on panic.
|    - Extra care must be taken so what broke the first kernel does
|      not break this one, and so that the shards of the old kernel
|      do not break it.
|    - Care must be taken so that loading the second kernel does not
|      erase valuable data that is desirable to place in a crash dump.
|    - This kernel cannot live at the same address as the old one, (at
|      least not initially).

Conceptually we would like a new kernel on panic, although
I doubt that it's normally safe to "load a new kernel on panic."
Or maybe it depends on the definition of "load."

What I'm trying to say is that I think the new kernel must
already be loaded when the panic happens.
Is that what you describe later (below)?

| 2) Load a new kernel under normal operating conditions.
|    And when you have a normal user space that boils down to:
|    - Acquire the kernel you are going to boot.
|    - Run the user space shutdown scripts, so the system is in
|      a clean state.
|    - Execute the new kernel.
|    - The normal case is that the newly loaded kernel will live at the
|      same physical location where the current kernel lives.
|
|
| Currently my code handles starting a new kernel under normal operating
| conditions.  There are currently two ways I can implement a clean user
| space shutdown without needing locked buffers in the kernel until the
| very last moment.
|
| Method 1 (This works today with my sample user space):
| - copy the kernel to /newkernel
| - init 6
| - if [ -r /newkernel ]; then
|         /sbin/kexec /newkernel
|   else
|         /sbin/reboot
|   fi
| - /sbin/kexec reads in /newkernel
| - /newkernel is parsed to figure out how it should be loaded
| - sys_kexec is called to copy the kernel from user space anonymous
|   memory to temporary kernel buffers.
|
| Method 2 (For people with read only roots):
| - /sbin/delayed_kexec /path/to/new/kernel
| - Read in the /path/to/new/kernel into anonymous pages
| - Parse it and figure out how it should be loaded
| - Run the shutdown scripts from /etc/rc6.d/
| - Call sys_kexec, which will copy the data from user space anonymous
|   pages, to kernel space.
|
| This is to just make it clear that I am not working from a
| FUNDAMENTALLY BROKEN interface, nor from a broken model of machine
| maintenance.  I am quite willing to make changes assuming I understand
| what is gained with the change.
|
|
| What I currently support is a moderately nice interface, that has the
| kernel doing as much as it can without being bogged down in the
| specific details in any one file format, or needing something besides
| a 32bit entry point to jump to.
|
| I model an image as a set of segments of physical memory.  And I copy
| the image loaded with sys_kexec to its final location, before jumping
| to the new image.  There are two reasons for this.  It takes 3
| segments to load a bzImage (setup.S, vmlinux, and an initrd).  And an
| arbitrary number of segments maps cleanly to a static ELF binary.
|
| There is only one difficult case: what happens when the buffers the
| kernel allocates physically lie in one of the memory segments of the
| new kernel image?  This is possible especially for the initrd, which
| is loaded at the end of memory.
|
| I then use the following algorithm to sort the potential mess out
| before I jump to the new code.  And since this code depends on
| swapping the contents of pages, knowing the physical location of
| the pages, and is not limited to 128MB I am reluctant to look at a
| vmalloc variant.  I can more easily get my pages from a slab if I
| need to report that I have them.
|
[code deleted]
|
| Having had time to digest the idea of starting a new kernel on panic
| I can now make some observations about what I believe it would take to
| make it as robust as possible.
|
| - On panic because random pieces of the kernel may be broken we want
|   to use as little of the kernel as possible.
|
| - Therefore machine_kexec should not allocate any memory, and as
|   quickly as possible it should transition to the new kernel
|
| - So a new page table should be sitting around with the new kernel
|   already mapped, and likewise other important tables like the
|   gdt, and the idt, should be pre-allocated.
|
| - Then machine_kexec can just switch stacks, page tables, and other
|   machine dependent tables and jump to the new kernel.
|
| - The load stage needs to load everything at the physical location it
|   will initially run at.  This would likely need support from rmap.
|
| - The load stage needs to preallocate page tables and buffers.
|
| - The load stage would likely work easiest by either requiring a mem=xxx
|   line, reserving some of physical memory for the new kernel.  Or
|   alternatively using some rmap support to clear out a swath of
|   physical memory the new kernel can be loaded into.
|
| - The new kernel loaded on panic must know about the previous kernel,
|   and have various restrictions because of that.
|
|
| Supporting a kernel loaded from a normal environment is a rather
| different problem.
|
| - It cannot be loaded at its run location (The current kernel is
|   sitting there).
|
| - It should not need to know about the previously executing kernel.
|
| - Work can be done in machine_kexec to allocate memory so everything
|   does not need to be preallocated.
|
| - I can safely use multiple calls to the page allocator instead of
|   needing a special mechanism to allocate memory.
|
|
| And now I go back to the silly exercise of factoring my code so the
| new kernel can be kept in locked kernel memory, instead of in a file
| while the shutdown scripts are run.
|
| Unless the linux kernel is modified to copy itself to the top of
| physical memory when it loads I have trouble seeing how any of this
| will help make the panic case easier to implement.
|
| Eric
| -

-- 
~Randy


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:05                             ` Eric W. Biederman
  2002-11-09 23:33                               ` Linus Torvalds
  2002-11-09 23:39                               ` [lkcd-devel] Re: What's left over Randy.Dunlap
@ 2002-11-10  1:31                               ` Werner Almesberger
  2002-11-10  3:10                                 ` Eric W. Biederman
  2002-11-10  2:08                               ` Alan Cox
  3 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-10  1:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

Eric W. Biederman wrote:
>    - Extra care must be taken so what broke the first kernel does
>      not break this one, and so that the shards of the old kernel
>      do not break it.

For this, you should checksum the data that you've pre-loaded, and
verify it before rebooting. If the pre-loaded kernel has been hit,
you just do a normal reboot. (In the case of a bzImage, you'd
probably fail uncompression anyway.)
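
The checksum-then-verify idea can be sketched in a few lines of
user-space C.  This is an illustration only: struct sk_staged_image and
the sk_* helpers are hypothetical, and 32-bit FNV-1a is used simply
because it is short, not because it is the right choice for a crashed
kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a pre-loaded image. */
struct sk_staged_image {
	const unsigned char *data;	/* pre-loaded bytes */
	size_t len;
	uint32_t sum;			/* checksum recorded at load time */
};

/* 32-bit FNV-1a over a byte range. */
static uint32_t sk_fnv1a(const unsigned char *p, size_t len)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;		/* FNV prime */
	}
	return h;
}

/* Load time: remember where the image is and what it hashed to. */
static void sk_stage(struct sk_staged_image *img,
		     const unsigned char *data, size_t len)
{
	img->data = data;
	img->len = len;
	img->sum = sk_fnv1a(data, len);
}

/* Panic time: nonzero if the image still matches its recorded sum.
 * A zero result would mean "fall back to a normal reboot".
 */
static int sk_image_intact(const struct sk_staged_image *img)
{
	return sk_fnv1a(img->data, img->len) == img->sum;
}
```

The check itself allocates nothing and touches only the staged data,
which fits the requirement that the panic path use as little of the
old kernel as possible.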

Alternatively, you could also wire this into the uncompression
functions (i.e. reboot if bzImage or initrd don't uncompress
cleanly), but this would be more intrusive.

>    - Care must be taken so that loading the second kernel does not
>      erase valuable data that is desirable to place in a crash dump.

Or copy all "interesting" memory to a safe place before the kexec.
I don't quite like the idea of building a kernel that "knows" which
addresses it isn't supposed to touch, and I think being able to use
the same kernel binary for regular and panic use would be a
desirable feature.

Also, firmware may not give you the choice of preserving all memory,
so you need that "copy memory to a safe place" functionality anyway.
Furthermore, you most likely want to checksum that memory, too.

But ... I think you're designing too far ahead. The "load kernel on
panic" part isn't trivial, and I think it would be better to tackle
this in a second phase. For now, having a reasonably generic kexec
mechanism would be all that's needed in terms of building blocks.

> Method 2 (For people with read only roots):
> - /sbin/delayed_kexec /path/to/new/kernel 
> - Read in the /path/to/new/kernel into anonymous pages

There's no delayed_kexec in kexec-tools 1.4, so let me guess how
this would work: as far as I know, there's no way for regular
user space to create a persistent unreferenced memory object, so
you'd probably load the data, perhaps mlock the pages, and then
fork a process that keeps the data in memory. Then, this process
would probably call sys_kexec upon reception of a signal, or
such.

Unfortunately, init assumes that it can SIGKILL all non-init
processes (that is, all processes with PID != 1). Worse yet, this
assumption makes sense, because walking the process list and
killing each of them individually would be racy.

So you'd either have to add this race condition to init, add some
magic to make this type of killing atomic, teach the kernel that
your kexec memory keeper process is somehow magic too, or merge
kexec into init. Not nice.

> I then use the following algorithm to sort the potential mess out
> before I jump to the new code.

I like this approach. It gives you complete freedom of where to
load data. This also makes it future-proof. But I don't see the
reason why you couldn't do the same thing with vmalloc. Using
vmalloc may actually simplify your code a little.

> Having had time to digest the idea of starting a new kernel on panic
> I can now make some observations about what I believe it would take to
> make it as robust as possible.

That pretty much sums it up, yes. But as I've said, this isn't
really something that needs to be implemented at the same time
as the basic kexec functionality. A two-phase kexec with
unrestricted copying capabilities should be a good enough
building block that only minor changes, if any, would be needed
when adding kexec-on-panic.

> And now I go back to the silly exercise of factoring my code so the
> new kernel can be kept in locked kernel memory, instead of in a file
> while the shutdown scripts are run.

Not silly :-)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:33                               ` Linus Torvalds
@ 2002-11-10  1:37                                 ` Eric W. Biederman
  2002-11-10  2:12                                   ` Alan Cox
  2002-11-10  3:17                                   ` Linus Torvalds
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  1:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Linus Torvalds <torvalds@transmeta.com> writes:

> On 9 Nov 2002, Eric W. Biederman wrote:
> > 
> > Currently my code handles starting a new kernel under normal operating
> > conditions.  There are currently two ways I can implement a clean user
> > space shutdown without needing locked buffers in the kernel until the
> > very last moment.
> 
> PLEASE tell me why you don't just use the 20-line "vmalloc()" function I 
> already wrote for you?

The reasons I don't jump on board:
- It does not handle multiple segments.
  Without multiple segments I think I simply leave out essential
  complexity of the problem.  A bzImage has at least 2.

- vmalloc is artificially limited to 128MB.

- I still have to run code to prevent imperfect overlaps.  A perfect
  overlap being a source buffer living in its destination address.

- I still have to run code to find the physical addresses of the
  pages, and locate those in non-destination pages, and form a linked
  list of pages for that.

> It works for all cases - and since you do need to load the kernel into 
> memory anyway, it's not using any more memory than your existing code. And 
> it's infinitely more flexible to have a clearly separated load-process, 
> than having to have some load happen at reboot time (whether by init or by 
> anything else).

I am trying to process it but I don't see why having the load happen
as a separate syscall is clearer.  Having it happen as a separate
architecture independent function I understand.

asmlinkage long sys_kexec(unsigned long entry, long nr_segments, 
	struct kexec_segment *segments)
{
	/* Am I using too much stack space here? */
	struct kimage image;
	int result;

	/* We only trust the superuser with rebooting the system. */
	if (!capable(CAP_SYS_BOOT))
		return -EPERM;

	lock_kernel();

////  This chunk does the load and there is no kernel shutdown code
////  run yet.
	kimage_init(&image);
	result = do_kexec(entry, nr_segments, segments, &image);
	if (result) {
		kimage_free(&image);
		unlock_kernel();
		return result;
	}

//// ----------- I can snip here for your two syscall version -----------

////  This part is the kernel shutdown
	
	/* The point of no return is here... */
	notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
	system_running = 0;
	device_shutdown();
	printk(KERN_EMERG "Starting new kernel\n");

//// And here is where I start the new kernel.

	machine_kexec(&image);
}

>
> And since the kernel is fully working at the load time, you can even do
> things like swap out pages in order to make room for the kernel at the 
> right place.  So you can even do something like this:

I have clearly separated load code that runs before any of the kernel
starts to shut down.  Until it completes successfully I do not start
to shut down the kernel.  My user space is shut down but that is a
different story.

Swapping out pages is nice, but when user space is shut down there
shouldn't be any extra pages in the kernel to swap out, and if you are
so tight on memory that you need to swap, it won't work anyway.

> 	int alloc_kernel_pages(unsigned long *array, int nr_pages,
> 		unsigned long min_address)
> 	{
> 		void *bad_page_list = NULL;
> 		int i = 0, retval;
> 
> 		while (i < nr_pages) {
> 			unsigned long page = __get_free_page(GFP_USER);
> 
> 			if (!page)
> 				goto oom;
> 
> 			if (page < min_address) {
> 				*(void **)page = bad_page_list;
> 				bad_page_list = (void *)page;
> 				continue;
> 			}
> 			array[i] = page;
> 			i++;
> 		}
> 		retval = 0;
> 	out:
> 		while (bad_page_list) {
> 			unsigned long page = (unsigned long) bad_page_list;
> 			bad_page_list = *(void **)bad_page_list;
> 			free_page(page);
> 		}
> 		return retval;
> 	oom:
> 		while (i > 0)
> 			free_page(array[--i]);
> 		retval = -ENOMEM;
> 		goto out;
> 	}

Which is a good algorithm, but it has the potential to allocate a lot
of extra pages, and I have implemented it in the past.  Its worst
case is just nasty.

My current code allocates at most 1 extra page and works gracefully if
it happens to allocate the pages it really wanted to use.  It is just
a hair more complex, and it makes everything else very simple.

> and now you are guaranteed that all the allocated pages are above a
> certain mark (change the "min_address" to be a "validity callback" or
> whatever if you want to be fancy and allow arbitrary rules, which is good
> if you want to allow pages in the low 1M on x86, for example), which means
> that your final reboot stage is _much_much_ simpler and you don't ever 
> have to worry about overlap. 

Exactly and that is why I do it where I do it.  In the C load code.
In the kernel so it has to be written only once.
 
> Use one of the pages to allocate the memcpy() trampoline and the actual 
> data structures used for the copy, for example. Use the rest for the 
> actual kernel data.
> 
> Keep it simple. 

Yep.  

After loading everything I have a total of 243 lines of code.
100 lines of assembly doing the copies in the trampoline.
143 lines of C modifying the page tables, the gdt, and the idt,
copying the trampoline to the correct place, and going for it.

Despite my utter puzzlement about why you want the syscall cut in two,
I will now go cut along the dotted line.  If that is all it takes to
have peace I can do that.  I have a sore head from all of the
scratching, trying to figure out why it needs to be cut in two, but I
can cut sys_kexec in two.
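
To make the "cut along the dotted line" concrete, here is a user-space
model of the split, offered as a sketch rather than as the interface
that was eventually merged.  The names sk_kexec_load() and
sk_kexec_exec(), the struct layout and the choice of -EINVAL are all
assumptions.  The first call only validates and stages the image, the
second is the point of no return and refuses to run unless something
was staged.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of a staged image. */
struct sk_image {
	unsigned long entry;	/* address to jump to */
	long nr_segments;	/* number of segments loaded */
};

static struct sk_image sk_staged;
static int sk_have_image;

/* Phase 1: validate and stage.  Nothing is shut down here, so a
 * failure leaves the running system untouched.
 */
static int sk_kexec_load(unsigned long entry, long nr_segments)
{
	if (nr_segments <= 0)
		return -EINVAL;
	sk_staged.entry = entry;
	sk_staged.nr_segments = nr_segments;
	sk_have_image = 1;
	return 0;
}

/* Phase 2: in a real kernel this would run the reboot notifiers,
 * shut down devices and jump.  The model only reports the entry
 * point it would jump to.
 */
static int sk_kexec_exec(unsigned long *entry_out)
{
	if (!sk_have_image)
		return -EINVAL;
	*entry_out = sk_staged.entry;
	return 0;
}
```

The split means the load can fail, be retried, or be replaced long
before any shutdown work begins.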

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:05                             ` Eric W. Biederman
                                                 ` (2 preceding siblings ...)
  2002-11-10  1:31                               ` Werner Almesberger
@ 2002-11-10  2:08                               ` Alan Cox
  2002-11-10  2:18                                 ` Eric W. Biederman
  3 siblings, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-10  2:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> There are two cases I am seeing users wanting.
> 1) Load a new kernel on panic.

Load a new *something* on panic. That something might be a new kernel
but it might also be a kernel dump system like LKCD or a debugger front
end for something like kdb, or a network dump module, or ...

Alan


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  1:37                                 ` Eric W. Biederman
@ 2002-11-10  2:12                                   ` Alan Cox
  2002-11-10  2:16                                     ` Eric W. Biederman
  2002-11-10  3:17                                   ` Linus Torvalds
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-10  2:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> The reasons I don't jump on board:
> - It does not handle multiple segments.
>   Without multiple segments I think I simply leave out essential complexity
>   of the problem.  A bzImage has at least 2.

That's a matter for user space and the unpacker.

> - vmalloc is artificially limited to 128MB.

Just grabbing a load of pages and using kmap/scatter gather by hand is
not



^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:12                                   ` Alan Cox
@ 2002-11-10  2:16                                     ` Eric W. Biederman
  2002-11-10  3:03                                       ` Werner Almesberger
  2002-11-10 14:30                                       ` Alan Cox
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  2:16 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Sun, 2002-11-10 at 01:37, Eric W. Biederman wrote:
> > The reasons I don't jump on board:
> > - It does not handle multiple segments.
> >   Without multiple segments I think I simply leave out essential complexity
> >   of the problem.  A bzImage has at least 2.
> 
> That's a matter for user space and the unpacker.
> 
> > - vmalloc is artificially limited to 128MB.
> 
> Just grabbing a load of pages and using kmap/scatter gather by hand is
> not

To use kmapped memory I need to setup a page table to do the final copy.
And to setup a page table I need to know where the memory is going to be copied
to.

So my gut impression at least says an interface that ignores where
the image wants to live just adds complexity in other places, and
makes for an interface that is hard to maintain long term, because
you depend on a lot of kernel implementation details, which are likely
to change in arbitrary ways.

Eric


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:08                               ` Alan Cox
@ 2002-11-10  2:18                                 ` Eric W. Biederman
  2002-11-10 14:31                                   ` Alan Cox
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  2:18 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Sat, 2002-11-09 at 23:05, Eric W. Biederman wrote:
> > There are two cases I am seeing users wanting.
> > 1) Load a new kernel on panic.
> 
> Load a new *something* on panic. That something might be a new kernel
> but it might also be a kernel dump system like LKCD or a debugger front
> end for something like kdb, or a network dump module, or ...

And if it isn't a kernel why not load it as a module?  The code
has to come preloaded anyway.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 23:39                               ` [lkcd-devel] Re: What's left over Randy.Dunlap
@ 2002-11-10  2:58                                 ` Eric W. Biederman
  2002-11-10 14:35                                   ` Alan Cox
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  2:58 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Eric W. Biederman, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
	Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel

"Randy.Dunlap" <rddunlap@osdl.org> writes:

> {warning: cc: list too large :}
> 
> On 9 Nov 2002, Eric W. Biederman wrote:
> 
> | There are two cases I am seeing users wanting.
> | 1) Load a new kernel on panic.
> |    - Extra care must be taken so what broke the first kernel does
> |      not break this one, and so that the shards of the old kernel
> |      do not break it.
> |    - Care must be taken so that loading the second kernel does not
> |      erase valuable data that is desirable to place in a crash dump.
> |    - This kernel cannot live at the same address as the old one, (at
> |      least not initially).
> 
> Conceptually we would like a new kernel on panic, although
> I doubt that it's normally safe to "load a new kernel on panic."
> Or maybe it depends on the definition of "load."
> 
> What I'm trying to say is that I think the new kernel must
> already be loaded when the panic happens.
> Is that what you describe later (below)?

Yes that was my meaning.   The new kernel must be preloaded.
And only started on panic.

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:16                                     ` Eric W. Biederman
@ 2002-11-10  3:03                                       ` Werner Almesberger
  2002-11-10  3:23                                         ` Eric W. Biederman
  2002-11-10 14:30                                       ` Alan Cox
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-10  3:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Linus Torvalds, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Eric W. Biederman wrote:
> So my gut impression at least says an interface that ignores where
> the image wants to live just adds complexity in other places,

Linus' alloc_kernel_pages function would actually be able to handle
this, provided that the "validity callback" checks if the allocated
page happens to be in one of the destination areas.

I'm not so sure if this implementation is really that much more
compact than your current conflict resolution, though. Also, it may
be hairy in scenarios where you actually expect to fill more than
50% of system memory. (But your concerns about a 128MB limit scare
me, too. I realize that people have taken initrds to extremes I
never quite imagined, but that still looks a little excessive :-)
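
A minimal sketch of the "validity callback" idea, modelled in user space
on plain page frame numbers. Everything here is hypothetical illustration
(struct dest_range, page_is_safe, alloc_safe_pages); it is not Linus'
alloc_kernel_pages and not Eric's conflict resolution, only the shape of
the check: a candidate page is rejected if it lies in a destination area.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one destination area, in page frame numbers. */
struct dest_range {
    unsigned long start_pfn;   /* first page of the area */
    unsigned long nr_pages;    /* length of the area in pages */
};

/* Validity callback: reject a page that lies inside any destination area. */
static bool page_is_safe(unsigned long pfn,
                         const struct dest_range *dests, size_t nr_dests)
{
    for (size_t i = 0; i < nr_dests; i++) {
        if (pfn >= dests[i].start_pfn &&
            pfn - dests[i].start_pfn < dests[i].nr_pages)
            return false;
    }
    return true;
}

/*
 * Stand-in for the allocator loop: walk candidate pages (here simply
 * first_pfn, first_pfn + 1, ...) and keep only those the callback accepts.
 * Returns the number of pages stored in out[].
 */
static size_t alloc_safe_pages(unsigned long first_pfn, unsigned long max_pfn,
                               const struct dest_range *dests, size_t nr_dests,
                               unsigned long *out, size_t wanted)
{
    size_t got = 0;

    for (unsigned long pfn = first_pfn; pfn < max_pfn && got < wanted; pfn++) {
        if (page_is_safe(pfn, dests, nr_dests))
            out[got++] = pfn;
    }
    return got;
}
```

When the destination areas cover most of memory the loop simply comes up
short, which is the "more than 50%" case: with 8 pages of which 6 are
destinations, a request for 4 holding pages can only return 2.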

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  1:31                               ` Werner Almesberger
@ 2002-11-10  3:10                                 ` Eric W. Biederman
  2002-11-10  3:30                                   ` Werner Almesberger
  2002-11-10  3:49                                   ` Linus Torvalds
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  3:10 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

Werner Almesberger <wa@almesberger.net> writes:
> 
> But ... I think you're designing too far ahead. The "load kernel on
> panic" part isn't trivial, and I think it would be better to tackle
> this in a second phase. For now, having a reasonably generic kexec
> mechanism would be all that's needed in term of building blocks.

I'm not designing yet, just looking and what I see says that it
does not very much resemble the non panic case.
 
> > Method 2 (For people with read only roots):
> > - /sbin/delayed_kexec /path/to/new/kernel 
> > - Read in the /path/to/new/kernel into anonymous pages
> 
> There's no delayed_kexec in kexec-tools 1.4, so let me guess how
> this would work: as far as I know, there's no way for regular
> user space to create a persistent unreferenced memory object, so
> you'd probably load the data, perhaps mlock the pages, and then
> fork a process that keeps the data in memory. Then, this process
> would probably call sys_kexec upon reception of a signal, or
> such.

What I was thinking is that the process would fork and exec
something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
And that script would do all of the user space shutdown.

No need to mlock any pages, or hack init, or special hacks.
Just user space cleanly shutting itself down.

> 
> > I then use the following algorithm to sort the potential mess out
> > before I jump to the new code.
> 
> I like this approach. It gives you complete freedom of where to
> load data. This also makes it future-proof. But I don't see the
> reason why you couldn't do the same thing with vmalloc. Using
> vmalloc may actually simplify your code a little.

Mostly it's a bird in the hand versus a bird in the bush.  I simply
see nowhere that vmalloc makes my code simpler.

> > Having had time to digest the idea of starting a new kernel on panic
> > I can now make some observations and what I believe it would take to
> > make it as robust as possible.
> 
> That pretty much sums it up, yes. But as I've said, this isn't
> really something that needs to be implemented at the same time
> as the basic kexec functionality. A two-phase kexec with
> unrestricted copying capabilities should be a good enough
> building block that only minor changes, if any, would be needed
> when adding kexec-on-panic.

My feel is that kexec-on-panic is a rather different problem.  Which
is why I thought it all through, to see if they felt close.  At the
very least you almost need to know that it is the same.

> 
> > And now I go back to the silly exercise of factoring my code so the
> > new kernel can be kept in locked kernel memory, instead of in a file
> > while the shutdown scripts are run.
> 
> Not silly :-)

Except for the part about getting Linus to accept it I don't see
the advantage.  kexec-on-panic looks different enough that I don't
think it will help at all with that case.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  1:37                                 ` Eric W. Biederman
  2002-11-10  2:12                                   ` Alan Cox
@ 2002-11-10  3:17                                   ` Linus Torvalds
  2002-11-10  4:26                                     ` Eric W. Biederman
                                                       ` (3 more replies)
  1 sibling, 4 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-10  3:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh


On 9 Nov 2002, Eric W. Biederman wrote:
> 
> And despite my utter puzzlement on why you want the syscall cut in two.

I'm amazed about your puzzlement, since everybody else seems to get my 
arguments, but as long as you play along I don't much care.

I will explain once more why it needs to be cut into two, even if you're 
apparently willing to do it even without understanding:

When you reboot, you often cannot load the image.

	This is _trivially_ true for panics or things like 

	 - I don't understand why you do not want to accept this. Even if 
	   your code doesn't even _handle_ panics, it's so obvious that 
	   this is true that I don't understand why you want a limitation
	   in your particular current implementation to be a fundamental
	   flaw of the whole idea.

	But it is _also_ true for any standard setup where you don't have
	a special "init" that knows about loading the kernel, and where to
	load it from.

	 - Do you want to rewrite every "init" setup out there, adding 
	   some way to tell init where to load the kernel from?

	   Or do you want to just split the thing in two, so that you can 
	   load the kernel _before_ you ask init to shut down, and just 
	   happily use bog-standard tools that everybody is already 
	   familiar with..

The two-part loader can clearly handle both cases. And if _you_ don't want
a two-part loader, you can do exactly what you do now by just doing two 
system calls. 

As to vmalloc - I don't actually much care how the first and second parts
are implemented. I suggested a vmalloc()-like approach just because your
patch looks unnecessarily complicated to me. But while I am convinced that 
the two-phase loading/exec is absolutely the way to do it, the actual 
low-level implementation is just a detail.

			Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:03                                       ` Werner Almesberger
@ 2002-11-10  3:23                                         ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  3:23 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Alan Cox, Linus Torvalds, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Werner Almesberger <wa@almesberger.net> writes:

> Eric W. Biederman wrote:
> > So my gut impression at least says an interface that ignores where
> > the image wants to live just adds complexity in other places,
> 
> Linus' alloc_kernel_pages function would actually be able to handle
> this, provided that the "validity callback" checks if the allocated
> page happens to be in one of the destination areas.
> 
> I'm not so sure if this implementation is really that much more
> compact than your current conflict resolution, though. Also, it may
> be hairy in scenarios where you actually expect to fill more than
> 50% of system memory. (But your concerns about a 128MB limit scare
> me, too. I realize that people have taken initrds to extremes I
> never quite imagined, but that still looks a little excessive :-)

I have not heard of more than about 90MB.  One of the things I would
not be surprised to see in the next couple of years as memory gets
cheaper is diskless systems that don't even bother doing NFS root and
just put everything in an initrd.  But that is not the main concern.

There are already more polite ways of allocating memory implemented,
and sucking up a 16MB hunk of someone's vmalloc space is
quite rude.  Currently the limit is pretty much 50% of system memory
or 1GB whichever is less because the code must be loaded into user
space first, and I don't touch high memory.  Although I guess if it
was mmaped read only the limit may be higher. 

I don't expect to come close to using all of system memory
except on limited memory systems.  But it is always nice to be
polite.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:10                                 ` Eric W. Biederman
@ 2002-11-10  3:30                                   ` Werner Almesberger
  2002-11-10  3:49                                     ` Eric W. Biederman
  2002-11-10  3:49                                   ` Linus Torvalds
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-10  3:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

Eric W. Biederman wrote:
> What I was thinking is that the process would fork and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.

Yes, but init also does a kill(-1,...) to get rid of all processes,
before the last steps of system shutdown. So you have to somehow
make your "page holding" process survive beyond this point.

> My feel is that kexec-on-panic is a rather different problem.

You make it a different problem by assuming that you'd have a
kernel that is specifically built for running at a "safe"
location. If you assume that you're just using your normal
kernel, the two problems converge again. There are still a
few things that are different, like the checksumming, but
they can safely be added at a later time.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:30                                   ` Werner Almesberger
@ 2002-11-10  3:49                                     ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  3:49 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Linus Torvalds, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

Werner Almesberger <wa@almesberger.net> writes:

> Eric W. Biederman wrote:
> > What I was thinking is that the process would fork and exec
> > something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> > And that script would do all of the user space shutdown.
> 
> Yes, but init also does a kill(-1,...) to get rid of all processes,
> before the last steps of system shutdown. So you have to somehow
> make your "page holding" process survive beyond this point.

True.  But it is just as easy to drop the file into something like
ramfs, or to keep it as a file on the read-only root filesystem.  Now
that we can, having shutdown do a pivot_root and totally unmount
the root filesystem is probably a good idea.

> > My feel is that kexec-on-panic is a rather different problem.
> 
> You make it a different problem by assuming that you'd have a
> kernel that is specifically built for running at a "safe"
> location.  

Well, at least the part that cleans up after the running kernel.  That is
what I think it takes to make it stable.  Perhaps I am wrong, but
I think getting other architectures stable is very hard.

> If you assume that you're just using your normal
> kernel, the two problems converge again. There are still a
> few things that are different, like the checksumming, but
> they can safely be added at a later time.

I guess I can be proven wrong.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:10                                 ` Eric W. Biederman
  2002-11-10  3:30                                   ` Werner Almesberger
@ 2002-11-10  3:49                                   ` Linus Torvalds
  1 sibling, 0 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-11-10  3:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Werner Almesberger, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel


On 9 Nov 2002, Eric W. Biederman wrote:
> 
> What I was thinking is that the process would fork and exec
> something like "/etc/rc 6" or maybe "/etc/rc 7" to be clean.
> And that script would do all of the user space shutdown.
> 
> No need to mlock any pages, or hack init, or special hacks.
> Just user space cleanly shutting itself down.

Ehh.. You do realize that the above doesn't actually _work_?

First off, "all the user space shutdown" includes things like turning off 
networking. Oh, and if you're on a NFS-root system, your process is now 
officially _toast_.

Unless you do the "mlockall()" etc that you seem to think that you don't 
need, that is.

In other words: oh, yes, you do need those mlock() calls.

And never mind the fact that everybody has a slightly different "init" 
setup, so executing "/etc/rc 6" may not actually _do_ anything. So now you 
need to learn about all the different initscripts, and get that right. 

And btw, thanks to the mlockall() requirements, you actually end up
pinning more memory than the two-phase approach ever would have done while 
you do all this.

You then need to do the pre-loading anyway for the "kexec-on-panic" case
that you think is so different (I don't understand why you think a reboot
is different from another reboot, but whatever). So now you not only
maintain something that knows about many different init scripts and uses
more memory, it also ends up doing the same reboot thing at least two
different ways.

  -- meanwhile, in another universe --

With the two-way separation, you don't have any of these problems. Your
maintenance nightmare has now become _one_ added script:

	/etc/rc.d/rc6.d/K00loadkernel

and people using other init script variants can trivially add a script to
match their setup. Done. No maintenance headache, no special init
binaries, no torn-out-hair.

And by adding a startup script, you can have a _different_ small "debug
dump" kernel loaded early, so that if the machine reboots without going
through the controlled sequence, it will automatically boot into that
debug kernel..
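
A toy model of the two-phase split, assuming the kernel keeps a single
"staged image" slot. The names (struct staged_image, stage_image,
start_staged) are hypothetical and only illustrate the ordering: staging
happens while files are still reachable, and starting touches nothing but
what is already in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the single "staged image" slot the kernel would keep. */
struct staged_image {
    char name[64];  /* label only; a real slot would hold the image pages */
    int  loaded;    /* nonzero once something has been staged */
};

/* Phase one: stage an image while files and network are still available. */
static void stage_image(struct staged_image *slot, const char *name)
{
    strncpy(slot->name, name, sizeof(slot->name) - 1);
    slot->name[sizeof(slot->name) - 1] = '\0';
    slot->loaded = 1;
}

/*
 * Phase two: on reboot or panic, start whatever is staged; touches no files.
 * Returns the name of the image that would be started, or NULL if none.
 */
static const char *start_staged(const struct staged_image *slot)
{
    return slot->loaded ? slot->name : NULL;
}
```

A startup script would call the first phase with the debug dump kernel,
and a shutdown script such as K00loadkernel would call it again with the
kernel to reboot into; whichever image is staged when the machine goes
down is the one that runs.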

			Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:17                                   ` Linus Torvalds
@ 2002-11-10  4:26                                     ` Eric W. Biederman
  2002-11-10 18:07                                     ` Kexec 2.5.46-b6 Eric W. Biederman
                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Linus Torvalds <torvalds@transmeta.com> writes:

> On 9 Nov 2002, Eric W. Biederman wrote:
> > 
> > And despite my utter puzzlement on why you want the syscall cut in two.
> 
> I'm amazed about your puzzlement, since everybody else seem to get my 
> arguments, but as long as you play along I don't much care.
> 
> I will explain once more why it needs to be cut into two, even if you're 
> apparently willing to do it even without understanding:
> 
> When you reboot, you often cannot load the image.
> 
> 	This is _trivially_ true for panics or things like 

That the load needs to be separate for handling panics is trivially
true.  I simply have a very hard time believing that the load you want
for the normal case will be the load you want for a panic.  I think
I want to be much more paranoid in preparing for the kernel to blow
up.  And I want to move data around much more carefully.  And that
care adds restrictions I want for the normal case.

So splitting it up to prepare for panic handling just looks like over
design. 

> 	But it is _also_ true for any standard setup where you don't have
> 	a special "init" that knows about loading the kernel, and where to
> 	load it from.
> 
> 	 - Do you want to rewrite every "init" setup out there, adding 
> 	   some way to tell init where to load the kernel from?
> 
> 	   Or do you want to just split the thing in two, so that you can 
> 	   load the kernel _before_ you ask init to shut down, and just 
> 	   happily use bog-standard tools that everybody is already 
> 	   familiar with..

When you can change the init setup with just a couple of lines of
shell script seeing if a file exists in a magic location (say a special
ramfs or tmpfs), I guess it does not look hard to me.

> The two-part loader can clearly handle both cases. And if _you_ don't want
> a two-part loader, you can do exactly what you do now by just doing two 
> system calls. 

Right which is why I don't much care, so long as I don't have to test
reboot on panic any time soon...

I doubt we will see eye to eye on this one.  So I will now finish up
the patch to split this all up.
 
> As to vmalloc - I don't actually much care how the first and second parts
> are implemented. I suggested a vmalloc()-like approach just because your
> patch looks unnecessarily complicated to me. 

I'd love to make it simpler as well if I saw a clear opportunity that
wasn't just moving the complexity somewhere else.  But when I really
look at it I think that I am just wordy.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:16                                     ` Eric W. Biederman
  2002-11-10  3:03                                       ` Werner Almesberger
@ 2002-11-10 14:30                                       ` Alan Cox
  2002-11-10 16:56                                         ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-10 14:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> To use kmapped memory I need to setup a page table to do the final copy.
> And to setup a page table I need to know where the memory is going to be copied
> to.

And ?

I find it hard to believe you can't drive an MMU if you can write code
that boots one Linux from another


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:18                                 ` Eric W. Biederman
@ 2002-11-10 14:31                                   ` Alan Cox
  0 siblings, 0 replies; 333+ messages in thread
From: Alan Cox @ 2002-11-10 14:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh,
	lkcd-general, lkcd-devel

On Sun, 2002-11-10 at 02:18, Eric W. Biederman wrote:
> > Load a new *something* on panic. That something might be a new kernel
> > but it might also be a kernel dump system like LKCD or a debugger front
> > end for something like kdb, or a network dump module, or ...
> 
> And if it isn't a kernel why not load it as a module?  The code
> has to come preloaded anyway.

You may want to load it as a module or via syscall request. Doesn't
matter which really. But you do want all the intelligence in the loaded
code not in the reboot stub of the dying code.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  2:58                                 ` Eric W. Biederman
@ 2002-11-10 14:35                                   ` Alan Cox
  2002-11-10 18:13                                     ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Alan Cox @ 2002-11-10 14:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Randy.Dunlap, Linus Torvalds, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
	Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel

On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > What I'm trying to say is that I think the new kernel must
> > already be loaded when the panic happens.
> > Is that what you describe later (below)?
> 
> Yes that was my meaning.   The new kernel must be preloaded.
> And only started on panic.

Another question from the point of view of unifying things. What is
wrong with

	insmod kexec
		creates /dev/kexec (or kexecfs if you are Al Viro)
		hooks the reboot and panic final notifiers
	user copies file to /dev/kexec (which stuffs it into ram)

	reboot
		kexec module handler jumps to the first page of the
		kexec data in a defined state assuming it's PIC


At which point we have clearly reduced kexec/oops reporter/lkcd/netdump 
to a single common tiny interface.
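
The flow above can be modelled in user space as follows. This is a sketch
under stated assumptions, not module code: kexec_dev_write stands in for
writes to /dev/kexec, kexec_final_handler stands in for the hook on the
reboot and panic notifiers, and all names and the 4096-byte cap are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum final_event { EVENT_REBOOT, EVENT_PANIC };  /* hypothetical event codes */

#define IMAGE_MAX 4096  /* illustrative cap on the staged payload */

static unsigned char image[IMAGE_MAX];  /* stands in for pages pinned in RAM */
static size_t image_len;

/* Models write(2) on /dev/kexec: append the caller's bytes to the image. */
static long kexec_dev_write(const void *buf, size_t len)
{
    if (len > IMAGE_MAX - image_len)
        return -1;                       /* payload would not fit */
    memcpy(image + image_len, buf, len);
    image_len += len;
    return (long)len;
}

/*
 * Models the handler hooked on both final notifiers. The stub makes no
 * decisions: it only hands over the address of the first byte of the
 * payload. Returns that entry address, or NULL if nothing was loaded.
 */
static const unsigned char *kexec_final_handler(enum final_event ev)
{
    (void)ev;  /* reboot and panic take the same path */
    return image_len ? image : NULL;
}
```

Reboot and panic deliberately share one path here, so whatever the loaded
payload is (a kernel, a dump tool, a debugger front end), the dying
kernel's stub stays the same.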


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10 14:30                                       ` Alan Cox
@ 2002-11-10 16:56                                         ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10 16:56 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Werner Almesberger, Suparna Bhattacharya,
	Jeff Garzik, Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Sun, 2002-11-10 at 02:16, Eric W. Biederman wrote:
> > To use kmapped memory I need to setup a page table to do the final copy.
> > And to setup a page table I need to know where the memory is going to be copied
> > to.
> 
> And ?
> 
> I find it hard to believe you can't drive an MMU if you can write code
> that boots one Linux from another

One of the simplifying things I do between OS's is turn off the MMU, or
at least give it a 1-1 trivial mapping with physical memory.

If all of that memory is hanging out there forever. It probably makes sense
to be high memory capable.  But for the first rev of this I won't be.
Addresses > 4GB are a major pain to work with on x86.  

But I do have a test machine that can reproduce that so I can test for
strange bugs.  I added a BIOS option to put all but 512M out of 4GB
above the 4GB line.
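
For illustration, a 1-1 mapping of the low 4GB on 32-bit x86 can be built
from 1024 page directory entries using 4MB pages. This is a sketch under
the assumption that PSE (4MB pages) is enabled; it is not taken from the
patch, and build_identity_map and translate are hypothetical helper names.
The flag bits are the architectural PDE bits.

```c
#include <assert.h>
#include <stdint.h>

#define PDE_PRESENT  0x001u  /* bit 0: entry is valid */
#define PDE_WRITABLE 0x002u  /* bit 1: writes allowed */
#define PDE_LARGE    0x080u  /* bit 7: entry maps a 4MB page (needs CR4.PSE) */
#define NR_PDES      1024u   /* 1024 entries x 4MB = 4GB */

/* Fill a page directory so that virtual address == physical address. */
static void build_identity_map(uint32_t pgd[NR_PDES])
{
    for (uint32_t i = 0; i < NR_PDES; i++)
        pgd[i] = (i << 22) | PDE_PRESENT | PDE_WRITABLE | PDE_LARGE;
}

/* Translate an address through the table the way the MMU would. */
static uint32_t translate(const uint32_t pgd[NR_PDES], uint32_t vaddr)
{
    uint32_t pde = pgd[vaddr >> 22];

    return (pde & 0xFFC00000u) | (vaddr & 0x003FFFFFu);
}
```

A 32-bit table like this cannot name anything above 4GB, which matches
the remark that the first revision will not be high memory capable.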

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Kexec 2.5.46-b6
  2002-11-10  3:17                                   ` Linus Torvalds
  2002-11-10  4:26                                     ` Eric W. Biederman
@ 2002-11-10 18:07                                     ` Eric W. Biederman
  2002-11-11 18:03                                     ` [lkcd-devel] Re: What's left over Eric W. Biederman
  2002-11-11 18:15                                     ` Kexec for v2.5.47 Eric W. Biederman
  3 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10 18:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh


O.k.  Here is the split-up version of my kexec patch.
Added are:
sys_reboot(LINUX_REBOOT_CMD_KEXEC)
sys_kexec_load(unsigned long entry, unsigned long nr_segments, struct kexec *segments, unsigned long flags);

The flags field is currently enforced to be zero, but it leaves the window open to tweak
what the load does for the panic case.

Currently (because of missing hardware shutdown code) the code only approaches stable
in UP without APICs.  

Generating a patch to cleanly shutdown the apics, and releasing a sample user space
is the next step.
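
A user-space model of how the two calls fit together. It is a sketch only:
the segment layout, the segment limit, and the function names
(kexec_load_sketch, kexec_reboot_sketch) are hypothetical, since the real
struct kexec lives in include/linux/kexec.h and is not reproduced in this
message. The one rule taken from the text above is that flags must be zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical segment layout; the real struct kexec is defined in the
 * patch's include/linux/kexec.h, which is not reproduced here. */
struct segment_sketch {
    const void   *buf;   /* source bytes in the caller's memory */
    size_t        bufsz; /* number of source bytes */
    unsigned long mem;   /* physical destination address */
    size_t        memsz; /* size of the destination area */
};

#define MAX_SEGMENTS_SKETCH 8  /* illustrative limit, not from the patch */

static struct {
    unsigned long entry;
    unsigned long nr_segments;
    struct segment_sketch segments[MAX_SEGMENTS_SKETCH];
    int loaded;
} staged;

/* Model of the load step: validate everything, then remember the image. */
static int kexec_load_sketch(unsigned long entry, unsigned long nr_segments,
                             const struct segment_sketch *segments,
                             unsigned long flags)
{
    if (flags != 0)                      /* flags must be zero for now */
        return -EINVAL;
    if (nr_segments == 0 || nr_segments > MAX_SEGMENTS_SKETCH || !segments)
        return -EINVAL;
    for (unsigned long i = 0; i < nr_segments; i++) {
        if (segments[i].bufsz > segments[i].memsz)
            return -EINVAL;              /* source must fit its destination */
    }
    for (unsigned long i = 0; i < nr_segments; i++)
        staged.segments[i] = segments[i];
    staged.entry = entry;
    staged.nr_segments = nr_segments;
    staged.loaded = 1;
    return 0;
}

/* Model of the exec step: only succeeds if an image was staged earlier. */
static int kexec_reboot_sketch(unsigned long *entry_out)
{
    if (!staged.loaded)
        return -EINVAL;
    *entry_out = staged.entry;  /* a real kernel would jump here, not return */
    return 0;
}
```

Validation runs to completion before anything is copied, so a rejected
load leaves a previously staged image untouched.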

Eric



 MAINTAINERS                        |    7 
 arch/i386/Kconfig                  |   17 
 arch/i386/kernel/Makefile          |    1 
 arch/i386/kernel/entry.S           |    1 
 arch/i386/kernel/machine_kexec.c   |  142 ++++++++
 arch/i386/kernel/relocate_kernel.S |   99 +++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    1 
 include/linux/kexec.h              |   46 ++
 include/linux/reboot.h             |    2 
 kernel/Makefile                    |    1 
 kernel/kexec.c                     |  643 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   23 +
 13 files changed, 1008 insertions

diff -uNr linux-2.5.46-bk6/MAINTAINERS linux-2.5.46-bk6.x86kexec/MAINTAINERS
--- linux-2.5.46-bk6/MAINTAINERS	Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/MAINTAINERS	Sun Nov 10 10:05:32 2002
@@ -968,6 +968,13 @@
 W:	http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S:	Maintained
 
+KEXEC
+P:	Eric Biederman
+M:	ebiederm@xmission.com
+M:	ebiederman@lnxi.com
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+
 LANMEDIA WAN CARD DRIVER
 P:	Andrew Stanley-Jones
 M:	asj@lanmedia.com
diff -uNr linux-2.5.46-bk6/arch/i386/Kconfig linux-2.5.46-bk6.x86kexec/arch/i386/Kconfig
--- linux-2.5.46-bk6/arch/i386/Kconfig	Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/Kconfig	Sun Nov 10 10:05:32 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shut down your
+	  current kernel, and to start another kernel.  It is like a reboot
+	  but it is independent of the system firmware.  And like a reboot
+	  you can start any kernel with it, not just Linux.
+	
+	  The name comes from the similarity to the exec system call.
+	
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shut down, so do not be surprised if this code does not
+	  initially work for you.  It may help to enable device hotplugging
+	  support.  As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
 
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/Makefile linux-2.5.46-bk6.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.46-bk6/arch/i386/kernel/Makefile	Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/Makefile	Sun Nov 10 10:05:32 2002
@@ -24,6 +24,7 @@
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND)	+= suspend.o
 obj-$(CONFIG_X86_NUMAQ)		+= numaq.o
 obj-$(CONFIG_PROFILING)		+= profile.o
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/entry.S linux-2.5.46-bk6.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.46-bk6/arch/i386/kernel/entry.S	Sun Nov 10 10:04:38 2002
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/entry.S	Sun Nov 10 10:05:32 2002
@@ -743,6 +743,7 @@
 	.long sys_epoll_ctl	/* 255 */
 	.long sys_epoll_wait
  	.long sys_remap_file_pages
+	.long sys_kexec_load
 
 
 	.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/machine_kexec.c linux-2.5.46-bk6.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.46-bk6/arch/i386/kernel/machine_kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/machine_kexec.c	Sun Nov 10 10:05:32 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+	unsigned char curidt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curidt)) = limit;
+	(*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+	__asm__ __volatile__ (
+		"lidt %0\n"
+		: /* no outputs; lidt only reads */ : "m" (curidt)
+		);
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+	unsigned char curgdt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curgdt)) = limit;
+	(*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+	__asm__ __volatile__ (
+		"lgdt %0\n"
+		: /* no outputs; lgdt only reads */ : "m" (curgdt)
+		);
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+	__asm__ __volatile__ (
+		"\tljmp $"STR(__KERNEL_CS)",$1f\n"
+		"\t1:\n"
+		"\tmovl $"STR(__KERNEL_DS)",%eax\n"
+		"\tmovl %eax,%ds\n"
+		"\tmovl %eax,%es\n"
+		"\tmovl %eax,%fs\n"
+		"\tmovl %eax,%gs\n"
+		"\tmovl %eax,%ss\n"
+		);
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+	/* This code is x86 specific...
+	 * general purpose code must be more careful
+	 * of caches and tlbs...
+	 */
+	pgd_t *pgd;
+	pmd_t *pmd;
+	struct mm_struct *mm = current->mm;
+	spin_lock(&mm->page_table_lock);
+	
+	pgd = pgd_offset(mm, address);
+	pmd = pmd_alloc(mm, pgd, address);
+
+	if (pmd) {
+		pte_t *pte = pte_alloc_map(mm, pmd, address);
+		if (pte) {
+			set_pte(pte, 
+				mk_pte(virt_to_page(phys_to_virt(address)), 
+					PAGE_SHARED));
+			__flush_tlb_one(address);
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+	unsigned long indirection_page, unsigned long reboot_code_buffer,
+	unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+	unsigned long *indirection_page;
+	void *reboot_code_buffer;
+	relocate_new_kernel_t rnk;
+
+	/* Interrupts aren't acceptable while we reboot */
+	local_irq_disable();
+	reboot_code_buffer = image->reboot_code_buffer;
+	indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+	identity_map_page(virt_to_phys(reboot_code_buffer));
+
+	/* copy it out */
+	memcpy(reboot_code_buffer, relocate_new_kernel, 
+		relocate_new_kernel_size);
+
+	/* The segment registers are funny things, they are
+	 * automatically loaded from a table in memory whenever you
+	 * set them to a specific selector, but this table is never
+	 * accessed again until you set the segment to a different selector.
+	 *
+	 * The more common model is a cache, where the behind
+	 * the scenes work is done, but is also dropped at arbitrary
+	 * times.
+	 *
+	 * I take advantage of this here by force loading the
+	 * segments, before I zap the gdt with an invalid value.
+	 */
+	load_segments();
+	/* The gdt & idt are now invalid.
+	 * If you want to load them you must set up your own idt & gdt.
+	 */
+	set_gdt(phys_to_virt(0),0);
+	set_idt(phys_to_virt(0),0);
+
+	/* now call it */
+	rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+	(*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), 
+		image->start);
+}
+
diff -uNr linux-2.5.46-bk6/arch/i386/kernel/relocate_kernel.S linux-2.5.46-bk6.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.46-bk6/arch/i386/kernel/relocate_kernel.S	Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/arch/i386/kernel/relocate_kernel.S	Sun Nov 10 10:05:32 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+	/* Must be relocatable PIC code callable as a C function, that once
+	 * it starts cannot use the previous process's stack.
+	 *
+	 */
+	.globl relocate_new_kernel
+relocate_new_kernel:
+	/* read the arguments and say goodbye to the stack */
+	movl  4(%esp), %ebx /* indirection_page */
+	movl  8(%esp), %ebp /* reboot_code_buffer */
+	movl  12(%esp), %edx /* start address */
+
+	/* zero out flags, and disable interrupts */
+	pushl $0
+	popfl
+
+	/* set a new stack at the bottom of our page... */
+	lea   4096(%ebp), %esp
+
+	/* store the parameters back on the stack */
+	pushl   %edx /* store the start address */
+
+	/* Set cr0 to a known state:
+	 * 31 0 == Paging disabled
+	 * 18 0 == Alignment check disabled
+	 * 16 0 == Write protect disabled
+	 * 3  0 == No task switch
+	 * 2  0 == Don't do FP software emulation.
+	 * 0  1 == Protected mode enabled
+	 */
+	movl	%cr0, %eax
+	andl	$~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+	orl	$(1<<0), %eax
+	movl	%eax, %cr0
+	jmp 1f
+1:	
+
+	/* Flush the TLB (needed?) */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* Do the copies */
+	cld
+0:	/* top, read another word from the indirection page */
+	movl    %ebx, %ecx
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+	testl	$0x1,   %ecx  /* is it a destination page */
+	jz	1f
+	movl	%ecx,	%edi
+	andl	$0xfffff000, %edi
+	jmp     0b
+1:
+	testl	$0x2,	%ecx  /* is it an indirection page */
+	jz	1f
+	movl	%ecx,	%ebx
+	andl	$0xfffff000, %ebx
+	jmp     0b
+1:
+	testl   $0x4,   %ecx /* is it the done indicator */
+	jz      1f
+	jmp     2f
+1:
+	testl   $0x8,   %ecx /* is it the source indicator */
+	jz      0b	     /* Ignore it otherwise */
+	movl    %ecx,   %esi /* For every source page do a copy */
+	andl    $0xfffff000, %esi
+
+	movl    $1024, %ecx
+	rep ; movsl
+	jmp     0b
+
+2:
+
+	/* To be certain of avoiding problems with self modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+	
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+	
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+	ret
+relocate_new_kernel_end:
+
+	.globl relocate_new_kernel_size
+relocate_new_kernel_size:	
+	.long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.46-bk6/include/asm-i386/kexec.h linux-2.5.46-bk6.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.46-bk6/include/asm-i386/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/include/asm-i386/kexec.h	Sun Nov 10 10:05:32 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET) 
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE	4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.46-bk6/include/asm-i386/unistd.h linux-2.5.46-bk6.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.46-bk6/include/asm-i386/unistd.h	Tue Nov  5 19:03:51 2002
+++ linux-2.5.46-bk6.x86kexec/include/asm-i386/unistd.h	Sun Nov 10 10:05:32 2002
@@ -262,6 +262,7 @@
 #define __NR_sys_epoll_ctl	255
 #define __NR_sys_epoll_wait	256
 #define __NR_remap_file_pages	257
+#define __NR_sys_kexec_load	258
 
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.46-bk6/include/linux/kexec.h linux-2.5.46-bk6.x86kexec/include/linux/kexec.h
--- linux-2.5.46-bk6/include/linux/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/include/linux/kexec.h	Sun Nov 10 10:05:32 2002
@@ -0,0 +1,46 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/* 
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+	struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+extern spinlock_t kexec_image_lock;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.46-bk6/include/linux/reboot.h linux-2.5.46-bk6.x86kexec/include/linux/reboot.h
--- linux-2.5.46-bk6/include/linux/reboot.h	Fri Oct 11 22:22:47 2002
+++ linux-2.5.46-bk6.x86kexec/include/linux/reboot.h	Sun Nov 10 10:05:32 2002
@@ -21,6 +21,7 @@
  * POWER_OFF   Stop OS and remove all power from system, if possible.
  * RESTART2    Restart system using given command string.
  * SW_SUSPEND  Suspend system using Software Suspend if compiled in
+ * KEXEC       Restart the system using a different kernel.
  */
 
 #define	LINUX_REBOOT_CMD_RESTART	0x01234567
@@ -30,6 +31,7 @@
 #define	LINUX_REBOOT_CMD_POWER_OFF	0x4321FEDC
 #define	LINUX_REBOOT_CMD_RESTART2	0xA1B2C3D4
 #define	LINUX_REBOOT_CMD_SW_SUSPEND	0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC		0x45584543
 
 
 #ifdef __KERNEL__
diff -uNr linux-2.5.46-bk6/kernel/Makefile linux-2.5.46-bk6.x86kexec/kernel/Makefile
--- linux-2.5.46-bk6/kernel/Makefile	Fri Oct 18 11:59:29 2002
+++ linux-2.5.46-bk6.x86kexec/kernel/Makefile	Sun Nov 10 10:05:32 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.46-bk6/kernel/kexec.c linux-2.5.46-bk6.x86kexec/kernel/kexec.c
--- linux-2.5.46-bk6/kernel/kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.46-bk6.x86kexec/kernel/kexec.c	Sun Nov 10 10:05:32 2002
@@ -0,0 +1,643 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access.  Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory.  And this page must be identity
+ * mapped.  Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ * 
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set 
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ * 
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+	struct kimage *image;
+	image = kmalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return 0;
+	memset(image, 0, sizeof(*image));
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (image->offset != 0) {
+		image->entry++;
+	}
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		ind_page = (void *)__get_free_page(GFP_KERNEL);
+		if (!ind_page) {
+			return -ENOMEM;
+		}
+		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = 
+			ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	image->offset = 0;
+	return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+	int result;
+	
+	/* Assume the page is bad unless we pass the checks */
+	result = -EADDRNOTAVAIL;
+
+	if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+		goto out;
+	}
+
+	/* NOTE: The caller is responsible for making certain we
+	 * don't attempt to load the new image into invalid or
+	 * reserved areas of RAM.
+	 */
+	result =  0;
+out:
+	return result;
+}
+
+static int kimage_set_destination(
+	struct kimage *image, unsigned long destination) 
+{
+	int result;
+	destination &= PAGE_MASK;
+	result = kimage_verify_destination(destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0) {
+		image->destination = destination;
+	}
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+	page &= PAGE_MASK;
+	result = kimage_verify_destination(image->destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0) {
+		image->destination += PAGE_SIZE;
+	}
+	return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+	int result;
+	result = kimage_add_entry(image, IND_DONE);
+	if (result == 0) {
+		/* Point at the terminating element */
+		image->entry--;
+	}
+	return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+	if (!image)
+		return;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION) {
+				free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+			}
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE) {
+			free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+		}
+	}
+	kfree(image);
+}
+
+static int kimage_is_destination_page(
+	struct kimage *image, unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination;
+	destination = 0;
+	page &= PAGE_MASK;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return 1;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_unused_area(
+	struct kimage *image, unsigned long size, unsigned long align,
+	unsigned long *area)
+{
+	/* Walk through mem_map and find the first chunk of
+	 * unused memory that is at least size bytes long.
+	 */
+	/* Since the kernel plays with PG_reserved, mem_map is less
+	 * than ideal for this purpose, but it will give us a correct
+	 * conservative estimate of what we need to do.
+	 */
+	/* For now we take advantage of the fact that all kernel pages
+	 * are marked with PG_reserved to allocate a large
+	 * contiguous area for the reboot code buffer.
+	 */
+	unsigned long addr;
+	unsigned long start, end;
+	unsigned long mask;
+	mask = ((1 << align) -1);
+	start = end = PAGE_SIZE;
+	for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long aligned_start;
+		page = virt_to_page(phys_to_virt(addr));
+		if (PageReserved(page) ||
+			kimage_is_destination_page(image, addr)) {
+			/* The current page is reserved so the start &
+			 * end of the next area must be at least at the
+			 * next page.
+			 */
+			start = end = addr + PAGE_SIZE;
+		}
+		else {
+			/* O.k.  The current page isn't reserved
+			 * so push up the end of the area.
+			 */
+			end = addr;
+		}
+		aligned_start = (start + mask) & ~mask;
+		if (aligned_start > start) {
+			continue;
+		}
+		if (aligned_start > end) {
+			continue;
+		}
+		if (end - aligned_start >= size) {
+			*area = aligned_start;
+			return 0;
+		}
+	}
+	*area = 0;
+	return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+	struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return ptr;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	for_each_kimage_entry(image, ptr, entry) {
+		unsigned long page;
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			/* nop */
+		}
+		else if (entry & IND_DONE) {
+			/* nop */
+		}
+		else {
+			/* SOURCE & INDIRECTION */
+			page = entry & PAGE_MASK;
+			if (page == destination) {
+				return ptr;
+			}
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+	kimage_entry_t *ptr, *cptr, entry;
+	unsigned long buffer, page;
+	unsigned long destination = 0;
+
+	/* Here we implement safeguards to ensure that
+	 * a source page is not copied to its destination
+	 * page before the data on the destination page is
+	 * no longer useful.
+	 *
+	 * To make it work we actually wind up with a
+	 * stronger condition.  For every page considered
+	 * it is either its own destination page or it is
+	 * not a destination page of any page considered.
+	 *
+	 * Invariants 
+	 * 1. buffer is not a destination of a previous page.
+	 * 2. page is not a destination of a previous page.
+	 * 3. destination is not a previous source page.
+	 *
+	 * Result: Either a source page and a destination page 
+	 * are the same or the page is not a destination page.
+	 *
+	 * These checks could be done when we allocate the pages,
+	 * but doing it as a final pass allows us more freedom
+	 * on how we allocate pages.
+	 * 
+	 * Also while the checks are necessary, in practice nothing
+	 * happens.  The destination kernel wants to sit in the
+	 * same physical addresses as the current kernel so we never
+	 * actually allocate a destination page.
+	 *
+	 * BUGS: This is an O(N^2) algorithm.
+	 */
+
+	
+	buffer = __get_free_page(GFP_KERNEL);
+	if (!buffer) {
+		return -ENOMEM;
+	}
+	buffer = virt_to_phys((void *)buffer);
+	for_each_kimage_entry(image, ptr, entry) {
+		/* Here we check to see if an allocated page conflicts */
+		kimage_entry_t *limit;
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_INDIRECTION) {
+			/* Indirection pages must include all of their
+			 * contents in limit checking.
+			 */
+			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+		}
+		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+			continue;
+		}
+		page = entry & PAGE_MASK;
+		limit = ptr;
+
+		/* See if a previous page has the current page as its
+		 * destination.
+		 * i.e. invariant 2
+		 */
+		cptr = kimage_dst_conflict(image, page, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+			*cptr = page | (centry & ~PAGE_MASK);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = cpage;
+		}
+		if (!(entry & IND_SOURCE)) {
+			continue;
+		}
+
+		/* See if a previous page is our destination page.
+		 * If so claim it now.
+		 * i.e. invariant 3
+		 */
+		cptr = kimage_src_conflict(image, destination, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+			*cptr = buffer | (centry & ~PAGE_MASK);
+			*ptr = cpage | ( entry & ~PAGE_MASK);
+			buffer = page;
+		}
+		/* If the buffer is my destination page do the copy now 
+		 * i.e. invariant 3 & 1
+		 */
+		if (buffer == destination) {
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = page;
+		}
+	}
+	free_page((unsigned long)phys_to_virt(buffer));
+	return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+	unsigned long len)
+{
+	unsigned long pos;
+	int result;
+	for(pos = 0; pos < len; pos += PAGE_SIZE) {
+		char *page;
+		result = -ENOMEM;
+		page = (void *)__get_free_page(GFP_KERNEL);
+		if (!page) {
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result) {
+			goto out;
+		}
+	}
+	result = 0;
+ out:
+	return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+	struct kexec_segment *segment)
+{	
+	unsigned long mstart;
+	int result;
+	unsigned long offset;
+	unsigned long offset_end;
+	unsigned char *buf;
+
+	result = 0;
+	buf = segment->buf;
+	mstart = (unsigned long)segment->mem;
+
+	offset_end = segment->memsz;
+
+	result = kimage_set_destination(image, mstart);
+	if (result < 0) {
+		goto out;
+	}
+	for(offset = 0;  offset < segment->memsz; offset += PAGE_SIZE) {
+		char *page;
+		size_t size, leader;
+		page = (char *)__get_free_page(GFP_KERNEL);
+		if (page == 0) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result < 0) {
+			goto out;
+		}
+		if (segment->bufsz < offset) {
+			/* We are past the end zero the whole page */
+			memset(page, 0, PAGE_SIZE);
+			continue;
+		}
+		size = PAGE_SIZE;
+		leader = 0;
+		if ((offset == 0)) {
+			leader = mstart & ~PAGE_MASK;
+		}
+		if (leader) {
+			/* We are on the first page zero the unused portion */
+			memset(page, 0, leader);
+			size -= leader;
+			page += leader;
+		}
+		if (size > (segment->bufsz - offset)) {
+			size = segment->bufsz - offset;
+		}
+		result = copy_from_user(page, buf + offset, size);
+		if (result) {
+			result = (result < 0)?result : -EIO;
+			goto out;
+		}
+		if (size < (PAGE_SIZE - leader)) {
+			/* zero the trailing part of the page */
+			memset(page + size, 0, (PAGE_SIZE - leader) - size);
+		}
+	}
+ out:
+	return result;
+}
+
+
+/* do_kexec executes a new kernel 
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+	struct kexec_segment *arg_segments, struct kimage *image)
+{
+	struct kexec_segment *segments;
+	size_t segment_bytes;
+	int i;
+
+	int result; 
+	unsigned long reboot_code_buffer;
+	kimage_entry_t *end;
+
+	/* Initialize variables */
+	segments = 0;
+
+	segment_bytes = nr_segments * sizeof(*segments);
+	segments = kmalloc(segment_bytes, GFP_KERNEL);
+	if (segments == 0) {
+		result = -ENOMEM;
+		goto out;
+	}
+	result = copy_from_user(segments, arg_segments, segment_bytes);
+	if (result) {
+		goto out;
+	}
+
+	/* Read in the data from user space */
+	image->start = start;
+	for(i = 0; i < nr_segments; i++) {
+		result = kimage_load_segment(image, &segments[i]);
+		if (result) {
+			goto out;
+		}
+	}
+	
+	/* Terminate early so I can get a place holder. */
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+	end = image->entry;
+
+	/* Usage of the reboot code buffer is subtle.  We first
+	 * find a contiguous area of ram, that is not one
+	 * of our destination pages.  We do not allocate the ram.
+	 *
+	 * The algorithm to make certain we do not have address
+	 * conflicts requires each destination region to have some
+	 * backing store so we allocate arbitrary source pages.
+	 *
+	 * Later in machine_kexec when we copy data to the
+	 * reboot_code_buffer it still may be allocated for other
+	 * purposes, but we do know there are no source or destination
+	 * pages in that area.  And since the rest of the kernel
+	 * is already shutdown those pages are free for use,
+	 * regardless of their page->count values.
+	 *
+	 * The kernel mapping of the reboot code buffer is passed to
+	 * the machine dependent code.  If it needs something else
+	 * it is free to set that up.
+	 */
+	result = kimage_get_unused_area(
+		image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+		&reboot_code_buffer);
+	if (result) 
+		goto out;
+
+	/* Allocating pages we should never need is silly but the
+	 * code won't work correctly unless we have dummy pages to
+	 * work with. 
+	 */
+	result = kimage_set_destination(image, reboot_code_buffer);
+	if (result) 
+		goto out;
+	result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+	if (result)
+		goto out;
+	image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = kimage_get_off_destination_pages(image);
+	if (result)
+		goto out;
+
+	/* Now hide the extra source pages for the reboot code buffer.
+	 */
+	image->entry = end;
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = 0;
+ out:
+	/* cleanup and exit */
+	if (segments)	kfree(segments);
+	return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ * 
+ * This call breaks up into three pieces.  
+ * - A generic part which loads the new kernel from the current
+ *   address space, and very carefully places the data in the
+ *   allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ *   the devices to shut down.  Preventing on-going dmas, and placing
+ *   the devices in a consistent state so a later kernel can
+ *   reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ *   and then copies the image to its final destination.  And
+ *   jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+spinlock_t kexec_image_lock = SPIN_LOCK_UNLOCKED;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, 
+	struct kexec_segment *segments, unsigned long flags)
+{
+	/* Am I using too much stack space here? */
+	struct kimage *image, *old_image;
+	int result;
+		
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* In case we need just a little bit of special behavior for
+	 * reboot on panic 
+	 */
+	if (flags != 0)
+		return -EINVAL;
+
+	image = 0;
+	if (nr_segments > 0) {
+		image = kimage_alloc();
+		if (!image) {
+			return -ENOMEM;
+		}
+		result = do_kexec(entry, nr_segments, segments, image);
+		if (result) {
+			kimage_free(image);
+			return result;
+		}
+	}
+
+	spin_lock(&kexec_image_lock);
+	old_image = kexec_image;
+	kexec_image = image;
+	spin_unlock(&kexec_image_lock);
+
+	kimage_free(old_image);
+	return 0;
+}
diff -uNr linux-2.5.46-bk6/kernel/sys.c linux-2.5.46-bk6.x86kexec/kernel/sys.c
--- linux-2.5.46-bk6/kernel/sys.c	Tue Nov  5 19:03:56 2002
+++ linux-2.5.46-bk6.x86kexec/kernel/sys.c	Sun Nov 10 10:05:32 2002
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/workqueue.h>
 #include <linux/device.h>
 #include <linux/times.h>
@@ -206,6 +207,7 @@
 cond_syscall(sys_lookup_dcookie)
 cond_syscall(sys_swapon)
 cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
 
 static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
@@ -414,6 +416,27 @@
 		machine_restart(buffer);
 		break;
 
+#ifdef CONFIG_KEXEC
+	case LINUX_REBOOT_CMD_KEXEC:
+	{
+		struct kimage *image;
+		spin_lock(&kexec_image_lock);
+		image = kexec_image;
+		if (!image || arg) {
+			spin_unlock(&kexec_image_lock);
+			unlock_kernel();
+			return -EINVAL;
+		}
+		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+		system_running = 0;
+		device_shutdown();
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_kexec(image);
+		/* We never get here... */
+		spin_unlock(&kexec_image_lock);
+		break;
+	}
+#endif
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		if (!software_suspend_enabled) {
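
The copy loop in relocate_kernel.S above can be modelled in user space. The
sketch below is only a model under stated assumptions, not code from the
patch: "physical memory" is a byte array, entries are 32-bit offsets into
that array, and bounds checks are added that the assembly does not have. The
IND_* values and the order of the flag tests follow include/linux/kexec.h
and relocate_new_kernel; the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IND_DESTINATION  0x1u
#define IND_INDIRECTION  0x2u
#define IND_DONE         0x4u
#define IND_SOURCE       0x8u

#define SKETCH_PAGE_SIZE 4096u		/* assumed page size */
#define SKETCH_PAGE_MASK 0xfffff000u	/* same mask the assembly uses */

typedef uint32_t sketch_entry_t;

/* Walk the entry list that starts at byte offset `list` inside `mem`
 * and perform the page copies it describes.  Returns the number of
 * pages copied.
 */
unsigned sketch_relocate(unsigned char *mem, uint32_t mem_size, uint32_t list)
{
	uint32_t dest = 0;
	unsigned copied = 0;

	assert(mem != NULL);
	while ((uint64_t)list + sizeof(sketch_entry_t) <= mem_size) {
		sketch_entry_t entry;

		memcpy(&entry, mem + list, sizeof(entry));
		list += sizeof(entry);

		if (entry & IND_DESTINATION) {
			/* Following source pages land here. */
			dest = entry & SKETCH_PAGE_MASK;
		} else if (entry & IND_INDIRECTION) {
			/* Continue reading entries from another page. */
			list = entry & SKETCH_PAGE_MASK;
		} else if (entry & IND_DONE) {
			break;
		} else if (entry & IND_SOURCE) {
			uint32_t src = entry & SKETCH_PAGE_MASK;

			/* Bounds checks exist only in this model. */
			if ((uint64_t)src + SKETCH_PAGE_SIZE > mem_size ||
			    (uint64_t)dest + SKETCH_PAGE_SIZE > mem_size)
				break;
			memmove(mem + dest, mem + src, SKETCH_PAGE_SIZE);
			dest += SKETCH_PAGE_SIZE;	/* like %edi after rep movsl */
			copied++;
		}
		/* Entries with none of the flags set are ignored. */
	}
	return copied;
}
```

Because the destination advances after every source page, a run of source
entries needs only one destination entry in front of it.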

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-11-10 14:35                                   ` Alan Cox
@ 2002-11-10 18:13                                     ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-10 18:13 UTC (permalink / raw)
  To: Alan Cox
  Cc: Randy.Dunlap, Linus Torvalds, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Andy Pfiffer, Linux Kernel Mailing List,
	Mike Galbraith, Martin J. Bligh, lkcd-general, lkcd-devel

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> On Sun, 2002-11-10 at 02:58, Eric W. Biederman wrote:
> > > What I'm trying to say is that I think the new kernel must
> > > already be loaded when the panic happens.
> > > Is that what you describe later (below)?
> > 
> > Yes that was my meaning.   The new kernel must be preloaded.
> > And only started on panic.
> 
> Another question from the point of view of unifying things. What is
> wrong with
> 
> 	insmod kexec
> 		creates /dev/kexec (or kexecfs if you are Al Viro)
> 		hooks the reboot and panic final notifiers
> 	user copies file to /dev/kexec (which stuffs it into ram)
> 
> 	reboot
> 		kexec module handler jumps to the first page of the
> 		kexec data in a defined state assuming its PIC
> 
> 
> At which point we have clearly reduced kexec/oops reporter/lkcd/netdump 
> to a single common tiny interface.

It would take a special hook that runs after both the notifiers and
device_shutdown.  At least in the normal case, running whatever shutdown
code we can is fairly important.  And hooking the notifier lists
would not guarantee that the hook runs last.

There is a long way to go in working with device drivers before even
the easy kexec case works stably in ordinary circumstances.

The kernel gets there fine, but it does not cope well with the APICs
activated and the legacy PIC disabled during bootup.

The additional device shutdown code is useful even in the normal
reboot path.  Most BIOSes don't care, but it should fix a few problems
with BIOSes that are not as paranoid about the state of the system as
they should be when reboot is called.  Little things like always
shutting down on the bootstrap CPU are on my todo list.

Eric
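
The ordering concern above can be illustrated with a small userland
model.  This is only a sketch: struct model_notifier, model_register and
model_run_shutdown are hypothetical names, and the insertion rule
(higher priority first, later registrations after earlier ones of equal
priority) is an assumption about how a priority-sorted chain behaves,
not code taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userland model of a priority-sorted notifier chain. */
struct model_notifier {
	const char *name;
	int priority;                 /* higher priority runs earlier */
	struct model_notifier *next;
};

struct model_notifier *model_chain;   /* head of the chain */
char model_trace[128];                /* records the order things ran in */

static void trace_append(const char *name)
{
	if (model_trace[0])
		strncat(model_trace, " ",
			sizeof(model_trace) - strlen(model_trace) - 1);
	strncat(model_trace, name,
		sizeof(model_trace) - strlen(model_trace) - 1);
}

/* Insert before the first entry with a strictly lower priority, so a
 * later registration at the same priority lands after earlier ones. */
void model_register(struct model_notifier *n)
{
	struct model_notifier **p = &model_chain;

	while (*p && (*p)->priority >= n->priority)
		p = &(*p)->next;
	n->next = *p;
	*p = n;
}

/* Run the chain, then explicit final steps that no notifier can
 * follow.  "jump" stands in for the jump to the new kernel. */
void model_run_shutdown(void)
{
	struct model_notifier *n;

	model_trace[0] = '\0';
	for (n = model_chain; n; n = n->next)
		trace_append(n->name);
	trace_append("device_shutdown");
	trace_append("jump");
}
```

If a hook registers at the lowest priority it knows of and another
driver later registers at the same priority, the later one runs after
it.  Only the explicit step placed after the chain and after the device
shutdown is guaranteed to go last in this model.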



* Re: [lkcd-devel] Re: What's left over.
  2002-11-06  0:21                       ` Andy Pfiffer
  2002-11-06  1:10                         ` Werner Almesberger
@ 2002-11-10 18:35                         ` Pavel Machek
  1 sibling, 0 replies; 333+ messages in thread
From: Pavel Machek @ 2002-11-10 18:35 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Werner Almesberger, Alan Cox, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell,
	Linux Kernel Mailing List, lkcd-general, lkcd-devel

Hi!

> > > Let me ask the same dumb question - what does kexec need that a dumper
> > > doesn't.
> > 
> > kexec needs:
> >  - a system call to set it up
> >  - a way to silence devices <snip>
> <snip>
> >  - a bit of glue <snip>
> >  - device drivers that can bring silent devices back to life
> <snip>
> 
> > > In other words given reboot/trap hooks can kexec happily live
> > > as a standalone module ?
> 
> You could probably skip the system call to set it up.  Example: I could
> imagine a bizarre set of pseudo-devices:
> 
> 	# insmod kexec
> 	# cat bzImage > /proc/kexec/next-image
> 	# echo "root=805" > /proc/kexec/next-cmndline
> 	# echo 1 > /proc/kexec/reboot
> 
> and hide away that dirty little sequence with a nice kexec(3) library
> routine.

Actually, sys_reboot has void * parameter. Reusing it as "next-image"
char * seems okay to me.
								Pavel
-- 
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?


* Re: [lkcd-devel] Re: What's left over.
  2002-11-09 21:21                   ` Pavel Machek
@ 2002-11-11 16:27                     ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-11 16:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Linus Torvalds, Matt D. Robinson, Rusty Russell, linux-kernel,
	lkcd-general, lkcd-devel

Pavel Machek <pavel@ucw.cz> writes:

> I have very similar problem in swsusp (need to deactivate DMA
> devices), and driverfs^H^H^H^H^Hsysfs framework seems to be suitable
> for that.

Yes.  The problem and the solutions are very similar.  Because you are
restoring the kernel code I don't think we can use the same functions,
but similar work needs to be done.  The correct hook for reboots,
halts, kexec, and other cases where the kernel is going away is
device_shutdown, which currently calls device->shutdown().  Since the
implementation has changed recently to avoid other problems, no one
actually implements the shutdown method at the moment.  Once drivers
do implement it we can probably kill the reboot notifiers.  But there
is a lot of driver work to do on that score.

Eric
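
A minimal sketch of the idea, as a userland model rather than kernel
code: each device carries an optional shutdown method, and the shutdown
pass calls it where it exists.  The names model_device, model_quiesce
and model_device_shutdown are hypothetical, and walking the devices in
array order is an arbitrary choice made for illustration.

```c
#include <stddef.h>

/* Hypothetical model of a device with an optional shutdown method. */
struct model_device {
	const char *name;
	void (*shutdown)(struct model_device *dev);  /* may be NULL */
	int quiesced;
};

/* Example shutdown method: mark the device as no longer active. */
void model_quiesce(struct model_device *dev)
{
	dev->quiesced = 1;
}

/* Call shutdown() on every device that provides one.
 * Returns how many devices had no method and were left untouched. */
int model_device_shutdown(struct model_device *devs, size_t count)
{
	int skipped = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (devs[i].shutdown)
			devs[i].shutdown(&devs[i]);
		else
			skipped++;
	}
	return skipped;
}
```

A device without a method is simply skipped, which is the situation
described above while no driver implements shutdown: the pass runs but
leaves the hardware in whatever state it was in.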


* Re: kexec (was: [lkcd-devel] Re: What's left over.)
  2002-11-07 19:32                           ` kexec (was: [lkcd-devel] Re: What's left over.) Andy Pfiffer
  2002-11-07 22:13                             ` Andy Pfiffer
@ 2002-11-11 17:03                             ` Bill Davidsen
       [not found]                             ` <200211080536.31287.landley@trommello.org>
  2 siblings, 0 replies; 333+ messages in thread
From: Bill Davidsen @ 2002-11-11 17:03 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Eric W. Biederman, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On 7 Nov 2002, Andy Pfiffer wrote:

> Just an idea:
> 
> Could a new, unrunnable process be created to "hold" the image?
> 
> <hand-wave>
> Use a hypothetical sys_kexec() to:
> 1. create an empty process.
> 2. copy the kernel image and parameters into the processes' address
> space.
> 3. put the process to sleep.
> </hand-wave>
> 
> If it's floating out there for weeks or years, the data could get paged
> out and not wired down.  It would show up in ps, so you'd have at least
> some visibility into the allocation.

  The only problem is that if you wanted it to run on panic, you really
couldn't trust the burning embers of a dying kernel to pull in the pages
and run them. I'd actually hope the init (and some cleanup??) code would
be there to get the new kernel going. Where kernel could be something
other than another kernel, hopefully.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: kexec (was: [lkcd-devel] Re: What's left over.)
       [not found]                             ` <200211080536.31287.landley@trommello.org>
@ 2002-11-11 17:58                               ` Andy Pfiffer
  2002-11-11 18:25                                 ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-11 17:58 UTC (permalink / raw)
  To: landley
  Cc: Eric W. Biederman, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

On Thu, 2002-11-07 at 21:36, Rob Landley wrote:

> It strikes me that "load a blob of data into physical memory and keep it there 
> until further notice" is actually relatively generic mechanism, and something 
> there might be other reasons for root or various devices to do.  (DSPs that 
> want their firmware in system ram?  3D models and textures for an onboard 
> video card?)  If I'm wrong, would somebody be kind enough to tell me why?
> 
> Rob


Yes, that is rather generic -- somewhat like a variable-sized ramdisk.  

I think the key difference is that the ramdisk wants to hold blobs of
data that will be accessed from user-mode by read & write.

A "blob of bytes" for kexec, and maybe for a few other uses, wants to be
accessed (perhaps a page at a time) by pointers while in kernel space.

I'm not so sure of the generality, though.  It's my guess that there
are many special-case requirements that might make it difficult for
this to become useful infrastructure (e.g., maybe the DSP on a
soundcard needs it to be 4M aligned and contiguous, or the texture
memory for the video card wants to be able to walk a data structure
that only it knows about).

Andy






* Re: [lkcd-devel] Re: What's left over.
  2002-11-10  3:17                                   ` Linus Torvalds
  2002-11-10  4:26                                     ` Eric W. Biederman
  2002-11-10 18:07                                     ` Kexec 2.5.46-b6 Eric W. Biederman
@ 2002-11-11 18:03                                     ` Eric W. Biederman
  2002-11-11 18:15                                     ` Kexec for v2.5.47 Eric W. Biederman
  3 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-11 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

Linus Torvalds <torvalds@transmeta.com> writes:

> On 9 Nov 2002, Eric W. Biederman wrote:
> > 
> > And despite my utter puzzlement on why you want the syscall cut in two.
> 
> I'm amazed about your puzzlement, since everybody else seem to get my 
> arguments, but as long as you play along I don't much care.

I think this comes from being the guy down in the trenches implementing
the code.  It is sometimes hard to look up far enough to have design
discussions.

I totally agree that having a load/exec split is the right
approach, now that I can imagine an implementation where the code will
actually work for the panic case.  Before, it felt like lying.  Doing
the split-up and promising that kexec on panic will work eventually,
when I could not even see it as a possibility, was at the core of my
objections.

What brought me around is that I can add a flag field to kexec_load.
With that flag field I can tell the kernel: please step extra carefully,
this code will be used to handle kexec on panic.  Without that I may
be up a creek without a paddle when it comes to debugging that code.

To be able to support this at all I have had to be very creative in
inventing debugging code, which is why I have the serial console
program kexec_test.  It provides visibility into what is happening
when nothing else will.  With that, and memtest86, which will
occasionally catch DMAs that have not been stopped (they show up as
memory errors on good RAM), I at least have a place to start rather
than a blank screen when guessing why the new kernel did not start up.

Eric


* Kexec for v2.5.47
  2002-11-10  3:17                                   ` Linus Torvalds
                                                       ` (2 preceding siblings ...)
  2002-11-11 18:03                                     ` [lkcd-devel] Re: What's left over Eric W. Biederman
@ 2002-11-11 18:15                                     ` Eric W. Biederman
  2002-11-11 22:52                                       ` Kexec for v2.5.47 (test feedback) Andy Pfiffer
  3 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-11 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Andy Pfiffer,
	Linux Kernel Mailing List, Mike Galbraith, Martin J. Bligh

kexec is a set of system calls that allows you to load another kernel
from the currently executing Linux kernel.  The current implementation
has only been tested, and had the kinks worked out, on x86, but the
generic code (kexec_load) should work on any architecture.

Some machines have BIOSes that are either extremely slow to reboot,
or that cannot reliably perform a reboot.  In that case kexec
may be the only way to reboot reliably and in a timely
manner.

The patch is archived at:
http://www.xmission.com/~ebiederm/files/kexec/

And is currently kept in two pieces.
The pure system call.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff

And the set of hardware fixes known to help kexec.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff

A compatible user space is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.5.tar.gz
This code boots either a static ELF executable or a bzImage.

A kernel reformatter that bypasses setup.S in favor of a version that
uses fewer BIOS calls (increasing the reliability) is at:
ftp://ftp.lnxi.com/pub/mkelfImage/mkelfImage-1.18.tar.gz

In bug reports please include the serial console output of 
kexec kexec_test.  kexec_test exercises most of the interesting code
paths that are needed to load a kernel (mainly BIOS calls) with lots
of debugging print statements, so hangs can easily be detected.  

To be polite to your user space there are now options:
--load (which just loads the new kernel)
--exec (which starts a previously loaded kernel).
I expect to integrate more gracefully with init as time goes on, but
this is what I can do in a timely manner.
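
The two-step interface can be modeled in a few lines of userland C.
This is a sketch under stated assumptions, not the patch's code: the
model_ names are hypothetical, the real load step takes segments rather
than a ready-made image, and only the command value and the
"no image or non-NULL argument gives -EINVAL" check are taken from the
reboot.h and kernel/sys.c hunks.

```c
#include <errno.h>
#include <stddef.h>

/* Value of LINUX_REBOOT_CMD_KEXEC in the patch's reboot.h hunk. */
#define MODEL_REBOOT_CMD_KEXEC 0x45584543UL

/* Hypothetical stand-in for the kernel's staged image. */
struct model_image {
	unsigned long start;   /* entry point of the new kernel */
};

struct model_image *model_staged;   /* NULL until a load succeeds */
int model_jumped;                   /* set when the new kernel would start */

/* Step 1 (--load): stage an image; nothing is started yet. */
int model_kexec_load(struct model_image *image)
{
	if (!image)
		return -EINVAL;
	model_staged = image;
	return 0;
}

/* Step 2 (--exec): start whatever was staged earlier.
 * Mirrors the check in the sys.c hunk: a missing image or a
 * non-NULL argument is rejected with -EINVAL. */
int model_reboot(unsigned long cmd, void *arg)
{
	if (cmd != MODEL_REBOOT_CMD_KEXEC)
		return -EINVAL;
	if (!model_staged || arg)
		return -EINVAL;
	model_jumped = 1;   /* the real code never returns from here */
	return 0;
}
```

Splitting the steps means the load can fail, or be repeated, long
before anything is shut down, and the exec step has nothing left to do
that needs user space.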

Without applying the hardware fixes you must build a kernel that is
uniprocessor and does not use an APIC, to have a chance at this code
working.  Cleaning up various hardware fixes and getting them
integrated into the kernel is the next step.

Hopefully this has an interface Linus likes now.
        
Eric
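
For readers following relocate_kernel.S in the patch, the copy loop can
be modeled in userland C.  This is a rough sketch, not the patch's
code: the IND_ flag values are those from include/linux/kexec.h, but
sim_mem, sim_write_entry and sim_relocate are hypothetical names,
"physical addresses" are just offsets into an array, and the bounds
checks are an addition the assembly does not have.

```c
#include <string.h>

#define IND_DESTINATION 0x1UL
#define IND_INDIRECTION 0x2UL
#define IND_DONE        0x4UL
#define IND_SOURCE      0x8UL

#define SIM_PAGE_SIZE 4096UL
#define SIM_PAGE_MASK (~(SIM_PAGE_SIZE - 1))
#define SIM_PAGES     8UL

/* Simulated physical memory; "addresses" are offsets into this array. */
unsigned char sim_mem[SIM_PAGES * SIM_PAGE_SIZE];

void sim_write_entry(unsigned long addr, unsigned long entry)
{
	memcpy(sim_mem + addr, &entry, sizeof(entry));
}

static unsigned long sim_read_entry(unsigned long addr)
{
	unsigned long entry;

	memcpy(&entry, sim_mem + addr, sizeof(entry));
	return entry;
}

/* Walk the entry list starting at ind_page and perform the copies.
 * Returns the number of pages copied, or -1 if the walk would leave
 * simulated memory. */
long sim_relocate(unsigned long ind_page)
{
	unsigned long ptr = ind_page, dst = 0;
	long copied = 0;

	while (ptr + sizeof(unsigned long) <= sizeof(sim_mem)) {
		unsigned long entry = sim_read_entry(ptr);

		ptr += sizeof(unsigned long);
		if (entry & IND_DESTINATION) {
			dst = entry & SIM_PAGE_MASK;   /* where copies go next */
		} else if (entry & IND_INDIRECTION) {
			ptr = entry & SIM_PAGE_MASK;   /* continue on another page */
		} else if (entry & IND_DONE) {
			return copied;
		} else if (entry & IND_SOURCE) {
			unsigned long src = entry & SIM_PAGE_MASK;

			if (dst > sizeof(sim_mem) - SIM_PAGE_SIZE ||
			    src > sizeof(sim_mem) - SIM_PAGE_SIZE)
				return -1;
			memcpy(sim_mem + dst, sim_mem + src, SIM_PAGE_SIZE);
			dst += SIM_PAGE_SIZE;   /* next source lands on the next page */
			copied++;
		}
		/* entries with none of the flags set are ignored */
	}
	return -1;
}
```

One destination entry followed by several source entries fills
consecutive destination pages, which is why the list only needs a new
destination entry when the target address jumps.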

 MAINTAINERS                        |    7 
 arch/i386/Kconfig                  |   17 
 arch/i386/kernel/Makefile          |    1 
 arch/i386/kernel/entry.S           |    1 
 arch/i386/kernel/machine_kexec.c   |  142 ++++++++
 arch/i386/kernel/relocate_kernel.S |   99 +++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    1 
 include/linux/kexec.h              |   46 ++
 include/linux/reboot.h             |    2 
 kernel/Makefile                    |    1 
 kernel/kexec.c                     |  643 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   23 +
 13 files changed, 1008 insertions

diff -uNr linux-2.5.47/MAINTAINERS linux-2.5.47.x86kexec/MAINTAINERS
--- linux-2.5.47/MAINTAINERS	Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/MAINTAINERS	Mon Nov 11 00:24:07 2002
@@ -968,6 +968,13 @@
 W:	http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S:	Maintained
 
+KEXEC
+P:	Eric Biederman
+M:	ebiederm@xmission.com
+M:	ebiederman@lnxi.com
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+
 LANMEDIA WAN CARD DRIVER
 P:	Andrew Stanley-Jones
 M:	asj@lanmedia.com
diff -uNr linux-2.5.47/arch/i386/Kconfig linux-2.5.47.x86kexec/arch/i386/Kconfig
--- linux-2.5.47/arch/i386/Kconfig	Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/Kconfig	Mon Nov 11 00:26:52 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shut down your
+	  current kernel, and to start another kernel.  It is like a reboot
+	  but it is independent of the system firmware.  And like a reboot
+	  you can start any kernel with it, not just Linux.
+	
+	  The name comes from the similarity to the exec system call.
+	
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shut down, so do not be surprised if this code does not
+	  initially work for you.  It may help to enable device hotplugging
+	  support.  As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
 
diff -uNr linux-2.5.47/arch/i386/kernel/Makefile linux-2.5.47.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.47/arch/i386/kernel/Makefile	Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/kernel/Makefile	Mon Nov 11 00:24:07 2002
@@ -24,6 +24,7 @@
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND)	+= suspend.o
 obj-$(CONFIG_X86_NUMAQ)		+= numaq.o
 obj-$(CONFIG_PROFILING)		+= profile.o
diff -uNr linux-2.5.47/arch/i386/kernel/entry.S linux-2.5.47.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.47/arch/i386/kernel/entry.S	Mon Nov 11 00:22:33 2002
+++ linux-2.5.47.x86kexec/arch/i386/kernel/entry.S	Mon Nov 11 00:24:07 2002
@@ -743,6 +743,7 @@
 	.long sys_epoll_ctl	/* 255 */
 	.long sys_epoll_wait
  	.long sys_remap_file_pages
+	.long sys_kexec_load
 
 
 	.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.47/arch/i386/kernel/machine_kexec.c linux-2.5.47.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.47/arch/i386/kernel/machine_kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/arch/i386/kernel/machine_kexec.c	Mon Nov 11 00:24:07 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+	unsigned char curidt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curidt)) = limit;
+	(*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+	__asm__ __volatile__ (
+		"lidt %0\n" 
+		: : "m" (curidt)
+		);
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+	unsigned char curgdt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curgdt)) = limit;
+	(*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+	__asm__ __volatile__ (
+		"lgdt %0\n" 
+		: : "m" (curgdt)
+		);
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+	__asm__ __volatile__ (
+		"\tljmp $"STR(__KERNEL_CS)",$1f\n"
+		"\t1:\n"
+		"\tmovl $"STR(__KERNEL_DS)",%eax\n"
+		"\tmovl %eax,%ds\n"
+		"\tmovl %eax,%es\n"
+		"\tmovl %eax,%fs\n"
+		"\tmovl %eax,%gs\n"
+		"\tmovl %eax,%ss\n"
+		);
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+	/* This code is x86 specific...
+	 * general purpose code must be more careful
+	 * of caches and tlbs...
+	 */
+	pgd_t *pgd;
+	pmd_t *pmd;
+	struct mm_struct *mm = current->mm;
+	spin_lock(&mm->page_table_lock);
+	
+	pgd = pgd_offset(mm, address);
+	pmd = pmd_alloc(mm, pgd, address);
+
+	if (pmd) {
+		pte_t *pte = pte_alloc_map(mm, pmd, address);
+		if (pte) {
+			set_pte(pte, 
+				mk_pte(virt_to_page(phys_to_virt(address)), 
+					PAGE_SHARED));
+			__flush_tlb_one(address);
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+	unsigned long indirection_page, unsigned long reboot_code_buffer,
+	unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+	unsigned long *indirection_page;
+	void *reboot_code_buffer;
+	relocate_new_kernel_t rnk;
+
+	/* Interrupts aren't acceptable while we reboot */
+	local_irq_disable();
+	reboot_code_buffer = image->reboot_code_buffer;
+	indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+	identity_map_page(virt_to_phys(reboot_code_buffer));
+
+	/* copy it out */
+	memcpy(reboot_code_buffer, relocate_new_kernel, 
+		relocate_new_kernel_size);
+
+	/* The segment registers are funny things, they are
+	 * automatically loaded from a table, in memory whenever you
+	 * set them to a specific selector, but this table is never
+	 * accessed again until you set the segment to a different selector.
+	 *
+	 * The more common model is a cache, where the behind
+	 * the scenes work is done, but is also dropped at arbitrary
+	 * times.
+	 *
+	 * I take advantage of this here by force loading the
+	 * segments, before I zap the gdt with an invalid value.
+	 */
+	load_segments();
+	/* The gdt & idt are now invalid.
+	 * If you want to load them you must set up your own idt & gdt.
+	 */
+	set_gdt(phys_to_virt(0),0);
+	set_idt(phys_to_virt(0),0);
+
+	/* now call it */
+	rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+	(*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), 
+		image->start);
+}
+
diff -uNr linux-2.5.47/arch/i386/kernel/relocate_kernel.S linux-2.5.47.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.47/arch/i386/kernel/relocate_kernel.S	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/arch/i386/kernel/relocate_kernel.S	Mon Nov 11 00:24:07 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+	/* Must be relocatable PIC code callable as a C function, that once
+	 * it starts can not use the previous process's stack.
+	 *
+	 */
+	.globl relocate_new_kernel
+relocate_new_kernel:
+	/* read the arguments and say goodbye to the stack */
+	movl  4(%esp), %ebx /* indirection_page */
+	movl  8(%esp), %ebp /* reboot_code_buffer */
+	movl  12(%esp), %edx /* start address */
+
+	/* zero out flags, and disable interrupts */
+	pushl $0
+	popfl
+
+	/* set a new stack at the bottom of our page... */
+	lea   4096(%ebp), %esp
+
+	/* store the parameters back on the stack */
+	pushl   %edx /* store the start address */
+
+	/* Set cr0 to a known state:
+	 * 31 0 == Paging disabled
+	 * 18 0 == Alignment check disabled
+	 * 16 0 == Write protect disabled
+	 * 3  0 == No task switch
+	 * 2  0 == Don't do FP software emulation.
+	 * 0  1 == Protected mode enabled
+	 */
+	movl	%cr0, %eax
+	andl	$~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+	orl	$(1<<0), %eax
+	movl	%eax, %cr0
+	jmp 1f
+1:	
+
+	/* Flush the TLB (needed?) */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* Do the copies */
+	cld
+0:	/* top, read another word from the indirection page */
+	movl    %ebx, %ecx
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+	testl	$0x1,   %ecx  /* is it a destination page */
+	jz	1f
+	movl	%ecx,	%edi
+	andl	$0xfffff000, %edi
+	jmp     0b
+1:
+	testl	$0x2,	%ecx  /* is it an indirection page */
+	jz	1f
+	movl	%ecx,	%ebx
+	andl	$0xfffff000, %ebx
+	jmp     0b
+1:
+	testl   $0x4,   %ecx /* is it the done indicator */
+	jz      1f
+	jmp     2f
+1:
+	testl   $0x8,   %ecx /* is it the source indicator */
+	jz      0b	     /* Ignore it otherwise */
+	movl    %ecx,   %esi /* For every source page do a copy */
+	andl    $0xfffff000, %esi
+
+	movl    $1024, %ecx
+	rep ; movsl
+	jmp     0b
+
+2:
+
+	/* To be certain of avoiding problems with self modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+	
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+	
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+	ret
+relocate_new_kernel_end:
+
+	.globl relocate_new_kernel_size
+relocate_new_kernel_size:	
+	.long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.47/include/asm-i386/kexec.h linux-2.5.47.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.47/include/asm-i386/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/include/asm-i386/kexec.h	Mon Nov 11 00:24:07 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return,
+ * i.e. the maximum page that is mapped directly into kernel memory,
+ * so that kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET) 
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE	4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.47/include/asm-i386/unistd.h linux-2.5.47.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.47/include/asm-i386/unistd.h	Tue Nov  5 19:03:51 2002
+++ linux-2.5.47.x86kexec/include/asm-i386/unistd.h	Mon Nov 11 00:24:07 2002
@@ -262,6 +262,7 @@
 #define __NR_sys_epoll_ctl	255
 #define __NR_sys_epoll_wait	256
 #define __NR_remap_file_pages	257
+#define __NR_sys_kexec_load	258
 
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.47/include/linux/kexec.h linux-2.5.47.x86kexec/include/linux/kexec.h
--- linux-2.5.47/include/linux/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/include/linux/kexec.h	Mon Nov 11 00:24:07 2002
@@ -0,0 +1,46 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/* 
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec(unsigned long entry, long nr_segments, 
+	struct kexec_segment *segments);
+extern struct kimage *kexec_image;
+extern spinlock_t kexec_image_lock;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.47/include/linux/reboot.h linux-2.5.47.x86kexec/include/linux/reboot.h
--- linux-2.5.47/include/linux/reboot.h	Fri Oct 11 22:22:47 2002
+++ linux-2.5.47.x86kexec/include/linux/reboot.h	Mon Nov 11 00:24:07 2002
@@ -21,6 +21,7 @@
  * POWER_OFF   Stop OS and remove all power from system, if possible.
  * RESTART2    Restart system using given command string.
  * SW_SUSPEND  Suspend system using Software Suspend if compiled in
+ * KEXEC       Restart the system using a different kernel.
  */
 
 #define	LINUX_REBOOT_CMD_RESTART	0x01234567
@@ -30,6 +31,7 @@
 #define	LINUX_REBOOT_CMD_POWER_OFF	0x4321FEDC
 #define	LINUX_REBOOT_CMD_RESTART2	0xA1B2C3D4
 #define	LINUX_REBOOT_CMD_SW_SUSPEND	0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC		0x45584543
 
 
 #ifdef __KERNEL__
diff -uNr linux-2.5.47/kernel/Makefile linux-2.5.47.x86kexec/kernel/Makefile
--- linux-2.5.47/kernel/Makefile	Fri Oct 18 11:59:29 2002
+++ linux-2.5.47.x86kexec/kernel/Makefile	Mon Nov 11 00:24:07 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.47/kernel/kexec.c linux-2.5.47.x86kexec/kernel/kexec.c
--- linux-2.5.47/kernel/kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47.x86kexec/kernel/kexec.c	Mon Nov 11 00:24:07 2002
@@ -0,0 +1,643 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access.  Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory.  And this page must be identity
+ * mapped.  Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ * 
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set 
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ * 
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+	struct kimage *image;
+	image = kmalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return 0;
+	memset(image, 0, sizeof(*image));
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (image->offset != 0) {
+		image->entry++;
+	}
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		ind_page = (void *)__get_free_page(GFP_KERNEL);
+		if (!ind_page) {
+			return -ENOMEM;
+		}
+		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = 
+			ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	image->offset = 0;
+	return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+	int result;
+	
+	/* Assume the page is bad unless we pass the checks */
+	result = -EADDRNOTAVAIL;
+
+	if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+		goto out;
+	}
+
+	/* NOTE: The caller is responsible for making certain we
+	 * don't attempt to load the new image into invalid or
+	 * reserved areas of RAM.
+	 */
+	result =  0;
+out:
+	return result;
+}
+
+static int kimage_set_destination(
+	struct kimage *image, unsigned long destination) 
+{
+	int result;
+	destination &= PAGE_MASK;
+	result = kimage_verify_destination(destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0) {
+		image->destination = destination;
+	}
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+	page &= PAGE_MASK;
+	result = kimage_verify_destination(image->destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0) {
+		image->destination += PAGE_SIZE;
+	}
+	return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+	int result;
+	result = kimage_add_entry(image, IND_DONE);
+	if (result == 0) {
+		/* Point at the terminating element */
+		image->entry--;
+	}
+	return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+	if (!image)
+		return;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION) {
+				free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+			}
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE) {
+			free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+		}
+	}
+	kfree(image);
+}
+
+static int kimage_is_destination_page(
+	struct kimage *image, unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination;
+	destination = 0;
+	page &= PAGE_MASK;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return 1;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_unused_area(
+	struct kimage *image, unsigned long size, unsigned long align,
+	unsigned long *area)
+{
+	/* Walk through mem_map and find the first chunk of
+	 * unused memory that is at least size bytes long.
+	 */
+	/* Since the kernel plays with PageReserved, mem_map is less
+	 * than ideal for this purpose, but it will give us a correct
+	 * conservative estimate of what we need to do. 
+	 */
+	/* For now we take advantage of the fact that all kernel pages
+	 * are marked with PG_reserved to allocate a large
+	 * contiguous area for the reboot code buffer.
+	 */
+	unsigned long addr;
+	unsigned long start, end;
+	unsigned long mask;
+	mask = ((1 << align) -1);
+	start = end = PAGE_SIZE;
+	for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long aligned_start;
+		page = virt_to_page(phys_to_virt(addr));
+		if (PageReserved(page) ||
+			kimage_is_destination_page(image, addr)) {
+			/* The current page is reserved so the start &
+			 * end of the next area must be at least at the
+			 * next page.
+			 */
+			start = end = addr + PAGE_SIZE;
+		}
+		else {
+			/* O.k.  The current page isn't reserved
+			 * so push up the end of the area.
+			 */
+			end = addr;
+		}
+		aligned_start = (start + mask) & ~mask;
+		if (aligned_start > start) {
+			continue;
+		}
+		if (aligned_start > end) {
+			continue;
+		}
+		if (end - aligned_start >= size) {
+			*area = aligned_start;
+			return 0;
+		}
+	}
+	*area = 0;
+	return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+	struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return ptr;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	for_each_kimage_entry(image, ptr, entry) {
+		unsigned long page;
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			/* nop */
+		}
+		else if (entry & IND_DONE) {
+			/* nop */
+		}
+		else {
+			/* SOURCE & INDIRECTION */
+			page = entry & PAGE_MASK;
+			if (page == destination) {
+				return ptr;
+			}
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+	kimage_entry_t *ptr, *cptr, entry;
+	unsigned long buffer, page;
+	unsigned long destination = 0;
+
+	/* Here we implement safeguards to ensure that
+	 * a source page is not copied to its destination
+	 * page before the data on the destination page is
+	 * no longer useful.
+	 *
+	 * To make it work we actually wind up with a
+	 * stronger condition.  For every page considered
+	 * it is either its own destination page or it is
+	 * not a destination page of any page considered.
+	 *
+	 * Invariants 
+	 * 1. buffer is not a destination of a previous page.
+	 * 2. page is not a destination of a previous page.
+	 * 3. destination is not a previous source page.
+	 *
+	 * Result: Either a source page and a destination page 
+	 * are the same or the page is not a destination page.
+	 *
+	 * These checks could be done when we allocate the pages,
+	 * but doing it as a final pass allows us more freedom
+	 * on how we allocate pages.
+	 * 
+	 * Also while the checks are necessary, in practice nothing
+	 * happens.  The destination kernel wants to sit in the
+	 * same physical addresses as the current kernel so we never
+	 * actually allocate a destination page.
+	 *
+	 * BUGS: This is an O(N^2) algorithm.
+	 */
+
+	
+	buffer = __get_free_page(GFP_KERNEL);
+	if (!buffer) {
+		return -ENOMEM;
+	}
+	buffer = virt_to_phys((void *)buffer);
+	for_each_kimage_entry(image, ptr, entry) {
+		/* Here we check to see if an allocated page is a destination page */
+		kimage_entry_t *limit;
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_INDIRECTION) {
+			/* Indirection pages must include all of their
+			 * contents in limit checking.
+			 */
+			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+		}
+		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+			continue;
+		}
+		page = entry & PAGE_MASK;
+		limit = ptr;
+
+		/* See if a previous page has the current page as its
+		 * destination.
+		 * i.e. invariant 2
+		 */
+		cptr = kimage_dst_conflict(image, page, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+			*cptr = page | (centry & ~PAGE_MASK);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = cpage;
+		}
+		if (!(entry & IND_SOURCE)) {
+			continue;
+		}
+
+		/* See if a previous page is our destination page.
+		 * If so claim it now.
+		 * i.e. invariant 3
+		 */
+		cptr = kimage_src_conflict(image, destination, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+			*cptr = buffer | (centry & ~PAGE_MASK);
+			*ptr = cpage | ( entry & ~PAGE_MASK);
+			buffer = page;
+		}
+		/* If the buffer is my destination page do the copy now 
+		 * i.e. invariant 3 & 1
+		 */
+		if (buffer == destination) {
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = page;
+		}
+	}
+	free_page((unsigned long)phys_to_virt(buffer));
+	return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+	unsigned long len)
+{
+	unsigned long pos;
+	int result;
+	for(pos = 0; pos < len; pos += PAGE_SIZE) {
+		char *page;
+		result = -ENOMEM;
+		page = (void *)__get_free_page(GFP_KERNEL);
+		if (!page) {
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result) {
+			goto out;
+		}
+	}
+	result = 0;
+ out:
+	return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+	struct kexec_segment *segment)
+{	
+	unsigned long mstart;
+	int result;
+	unsigned long offset;
+	unsigned long offset_end;
+	unsigned char *buf;
+
+	result = 0;
+	buf = segment->buf;
+	mstart = (unsigned long)segment->mem;
+
+	offset_end = segment->memsz;
+
+	result = kimage_set_destination(image, mstart);
+	if (result < 0) {
+		goto out;
+	}
+	for(offset = 0;  offset < segment->memsz; offset += PAGE_SIZE) {
+		char *page;
+		size_t size, leader;
+		page = (char *)__get_free_page(GFP_KERNEL);
+		if (page == 0) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result < 0) {
+			goto out;
+		}
+		if (segment->bufsz < offset) {
+			/* We are past the end zero the whole page */
+			memset(page, 0, PAGE_SIZE);
+			continue;
+		}
+		size = PAGE_SIZE;
+		leader = 0;
+		if ((offset == 0)) {
+			leader = mstart & ~PAGE_MASK;
+		}
+		if (leader) {
+			/* We are on the first page zero the unused portion */
+			memset(page, 0, leader);
+			size -= leader;
+			page += leader;
+		}
+		if (size > (segment->bufsz - offset)) {
+			size = segment->bufsz - offset;
+		}
+		result = copy_from_user(page, buf + offset, size);
+		if (result) {
+			result = (result < 0)?result : -EIO;
+			goto out;
+		}
+		if (size < (PAGE_SIZE - leader)) {
+			/* zero the trailing part of the page */
+			memset(page + size, 0, (PAGE_SIZE - leader) - size);
+		}
+	}
+ out:
+	return result;
+}
+
+
+/* do_kexec executes a new kernel 
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+	struct kexec_segment *arg_segments, struct kimage *image)
+{
+	struct kexec_segment *segments;
+	size_t segment_bytes;
+	int i;
+
+	int result; 
+	unsigned long reboot_code_buffer;
+	kimage_entry_t *end;
+
+	/* Initialize variables */
+	segments = 0;
+
+	segment_bytes = nr_segments * sizeof(*segments);
+	segments = kmalloc(segment_bytes, GFP_KERNEL);
+	if (segments == 0) {
+		result = -ENOMEM;
+		goto out;
+	}
+	if (copy_from_user(segments, arg_segments, segment_bytes)) {
+		result = -EFAULT;
+		goto out;
+	}
+
+	/* Read in the data from user space */
+	image->start = start;
+	for(i = 0; i < nr_segments; i++) {
+		result = kimage_load_segment(image, &segments[i]);
+		if (result) {
+			goto out;
+		}
+	}
+	
+	/* Terminate early so I can get a place holder. */
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+	end = image->entry;
+
+	/* Usage of the reboot code buffer is subtle.  We first
+	 * find a contiguous area of ram, that is not one
+	 * of our destination pages.  We do not allocate the ram.
+	 *
+	 * The algorithm to make certain we do not have address
+	 * conflicts requires each destination region to have some
+	 * backing store so we allocate arbitrary source pages.
+	 *
+	 * Later in machine_kexec when we copy data to the
+	 * reboot_code_buffer it still may be allocated for other
+	 * purposes, but we do know there are no source or destination
+	 * pages in that area.  And since the rest of the kernel
+	 * is already shutdown those pages are free for use,
+	 * regardless of their page->count values.
+	 *
+	 * The kernel mapping of the reboot code buffer is passed to
+	 * the machine dependent code.  If it needs something else
+	 * it is free to set that up.
+	 */
+	result = kimage_get_unused_area(
+		image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+		&reboot_code_buffer);
+	if (result) 
+		goto out;
+
+	/* Allocating pages we should never need is silly but the
+	 * code won't work correctly unless we have dummy pages to
+	 * work with.
+	 */
+	result = kimage_set_destination(image, reboot_code_buffer);
+	if (result) 
+		goto out;
+	result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+	if (result)
+		goto out;
+	image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = kimage_get_off_destination_pages(image);
+	if (result)
+		goto out;
+
+	/* Now hide the extra source pages for the reboot code buffer.
+	 */
+	image->entry = end;
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = 0;
+ out:
+	/* cleanup and exit */
+	if (segments)	kfree(segments);
+	return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ * 
+ * This call breaks up into three pieces.  
+ * - A generic part which loads the new kernel from the current
+ *   address space, and very carefully places the data in the
+ *   allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ *   the devices to shut down.  Preventing on-going dmas, and placing
+ *   the devices in a consistent state so a later kernel can
+ *   reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ *   and then copies the image to its final destination.  And
+ *   jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+spinlock_t kexec_image_lock = SPIN_LOCK_UNLOCKED;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, 
+	struct kexec_segment *segments, unsigned long flags)
+{
+	/* Am I using too much stack space here? */
+	struct kimage *image, *old_image;
+	int result;
+		
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* In case we need just a little bit of special behavior for
+	 * reboot on panic 
+	 */
+	if (flags != 0)
+		return -EINVAL;
+
+	image = 0;
+	if (nr_segments > 0) {
+		image = kimage_alloc();
+		if (!image) {
+			return -ENOMEM;
+		}
+		result = do_kexec(entry, nr_segments, segments, image);
+		if (result) {
+			kimage_free(image);
+			return result;
+		}
+	}
+
+	spin_lock(&kexec_image_lock);
+	old_image = kexec_image;
+	kexec_image = image;
+	spin_unlock(&kexec_image_lock);
+
+	kimage_free(old_image);
+	return 0;
+}
diff -uNr linux-2.5.47/kernel/sys.c linux-2.5.47.x86kexec/kernel/sys.c
--- linux-2.5.47/kernel/sys.c	Tue Nov  5 19:03:56 2002
+++ linux-2.5.47.x86kexec/kernel/sys.c	Mon Nov 11 00:24:07 2002
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/workqueue.h>
 #include <linux/device.h>
 #include <linux/times.h>
@@ -206,6 +207,7 @@
 cond_syscall(sys_lookup_dcookie)
 cond_syscall(sys_swapon)
 cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
 
 static int set_one_prio(struct task_struct *p, int niceval, int error)
 {
@@ -414,6 +416,27 @@
 		machine_restart(buffer);
 		break;
 
+#ifdef CONFIG_KEXEC
+	case LINUX_REBOOT_CMD_KEXEC:
+	{
+		struct kimage *image;
+		spin_lock(&kexec_image_lock);
+		image = kexec_image;
+		if (!image || arg) {
+			spin_unlock(&kexec_image_lock);
+			unlock_kernel();
+			return -EINVAL;
+		}
+		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+		system_running = 0;
+		device_shutdown();
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_kexec(image);
+		/* We never get here... */
+		spin_unlock(&kexec_image_lock);
+		break;
+	}
+#endif
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		if (!software_suspend_enabled) {
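
For readers following the page-list code in kernel/kexec.c above, here is a
minimal user-space sketch of how a flat list of tagged entries maps source
pages onto destinations, in the spirit of kimage_is_destination_page().  The
flag values, the 4 KiB page size, and the names model_entry_t,
model_is_destination_page and model_list are assumptions made for
illustration; the real definitions live in include/linux/kexec.h, and this
model leaves out indirection pages entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define MODEL_PAGE_SIZE   4096UL
#define MODEL_PAGE_MASK   (~(MODEL_PAGE_SIZE - 1))
#define MODEL_DESTINATION 0x1UL  /* entry sets the running destination */
#define MODEL_DONE        0x4UL  /* entry terminates the list */
#define MODEL_SOURCE      0x8UL  /* entry names one source page */

typedef unsigned long model_entry_t;

/* Return 1 if `page` is the destination of some source entry, else 0.
 * Each source entry consumes one page at the running destination.
 */
int model_is_destination_page(const model_entry_t *list, unsigned long page)
{
	unsigned long destination = 0;
	const model_entry_t *ptr;

	page &= MODEL_PAGE_MASK;
	for (ptr = list; *ptr && !(*ptr & MODEL_DONE); ptr++) {
		if (*ptr & MODEL_DESTINATION) {
			destination = *ptr & MODEL_PAGE_MASK;
		} else if (*ptr & MODEL_SOURCE) {
			if (page == destination)
				return 1;
			destination += MODEL_PAGE_SIZE;
		}
	}
	return 0;
}

/* Example list: two source pages destined for 0x100000 and 0x101000. */
const model_entry_t model_list[] = {
	0x100000UL | MODEL_DESTINATION,
	0x200000UL | MODEL_SOURCE,
	0x300000UL | MODEL_SOURCE,
	MODEL_DONE,
};
```

With this list, the pages at 0x100000 and 0x101000 are destinations, while
0x102000 and the source page 0x200000 are not.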


* Re: kexec (was: [lkcd-devel] Re: What's left over.)
  2002-11-11 17:58                               ` Andy Pfiffer
@ 2002-11-11 18:25                                 ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-11 18:25 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: landley, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, lkcd-general,
	lkcd-devel

Andy Pfiffer <andyp@osdl.org> writes:

> On Thu, 2002-11-07 at 21:36, Rob Landley wrote:
> 
> > It strikes me that "load a blob of data into physical memory and keep it
> > there until further notice" is actually relatively generic mechanism, and
> > something there might be other reasons for root or various devices to do.
> > (DSPs that want their firmware in system ram?  3D models and textures for
> > an onboard video card?)  If I'm wrong, would somebody be kind enough to
> > tell me why?
> > 
> > Rob
> 
> 
> Yes, that is rather generic -- somewhat like a variable-sized ramdisk.  
>
> I think the key difference is that the ramdisk wants to hold blobs of
> data that will be accessed from user-mode by read & write.

kexec, at least at the end and probably earlier for handling panics,
wants code to be in a very specific location in RAM.

If you want to hook that functionality in behind kexec_load, say when
KEXEC_FIXED is passed as a flag, go ahead.  There is enough other setup
needed to jump to the code loaded into memory that "load a blob of data
into physical memory and keep it there" is not a sufficient interface.

In the general case, using some kind of scatter-gather list seems
the most polite way to go.
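
As a rough illustration of the scatter-gather idea, a loader interface can
take an array of descriptors rather than one contiguous blob.  The sketch
below is user-space C; the struct mirrors the buf/bufsz/mem/memsz fields
that kimage_load_segment() reads in the patch, but the names sg_segment and
sg_total_pages and the 4 KiB page size are assumptions for illustration,
not part of kexec.

```c
#include <assert.h>
#include <stddef.h>

#define SG_PAGE_SIZE 4096UL  /* assumed page size */

/* One piece of the image: where the bytes come from and where they go. */
struct sg_segment {
	const void *buf;      /* source bytes in the caller's memory */
	size_t bufsz;         /* number of source bytes */
	unsigned long mem;    /* physical destination address */
	size_t memsz;         /* destination size; the tail past bufsz is zeroed */
};

/* Count the destination pages a segment list touches.
 * Returns 0 if any segment carries more data than its destination holds.
 */
size_t sg_total_pages(const struct sg_segment *seg, size_t nr)
{
	size_t i, pages = 0;

	for (i = 0; i < nr; i++) {
		unsigned long first, last;

		if (seg[i].bufsz > seg[i].memsz)
			return 0;
		if (seg[i].memsz == 0)
			continue;
		first = seg[i].mem / SG_PAGE_SIZE;
		last = (seg[i].mem + seg[i].memsz - 1) / SG_PAGE_SIZE;
		pages += last - first + 1;
	}
	return pages;
}
```

A segment that starts in the middle of a page and spans one page worth of
bytes touches two pages, which is why the count works on first and last
page numbers rather than dividing memsz by the page size.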

Eric


* Re: Kexec for v2.5.47 (test feedback)
  2002-11-11 18:15                                     ` Kexec for v2.5.47 Eric W. Biederman
@ 2002-11-11 22:52                                       ` Andy Pfiffer
  2002-11-12  7:22                                         ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-11 22:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh

On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> kexec is a set of system calls that allows you to load another kernel
> from the currently executing Linux kernel.

> And is currently kept in two pieces.
> The pure system call.
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff

FYI: that patch applies cleanly to pure 2.5.47 (bk ChangeSet@1.823).

The current front of the tree does not patch 100% cleanly (conflicts
with recent module changes).

Results on my usual problem machine:

# ./kexec-1.5 ./kexec_test-1.5
Shutting down devices
Debug: sleeping function called from illegal context at include/asm/semaphore.h9
Call Trace: [<c011a698>]  [<c0216193>]  [<c012b165>]  [<c0132dec>]  [<c0140357> Starting new kernel
kexec_test 1.5 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>

Sorry about the linewrap.

Same as last time, but the good news is that splitting the load and reboot
operations works as expected.

> And the set of hardware fixes known to help kexec.
> http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff

Missing or inaccessible.  I'll try some duct tape and the
linux-2.5.44.x86kexec-hwfixes.diff and see what happens.

Confirming some earlier suspicions:
CONFIG_SMP=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y

Last time I tried to run a UP kernel (and no APIC support) on this system
it wasn't pretty.  I'll add that to my list of combinations to try.

And as always:
% lspci 
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
% cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 8
model name	: Pentium III (Coppermine)
stepping	: 10
cpu MHz		: 799.957
cache size	: 256 KB
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips	: 1576.96
% 





* Re: Kexec for v2.5.47 (test feedback)
  2002-11-11 22:52                                       ` Kexec for v2.5.47 (test feedback) Andy Pfiffer
@ 2002-11-12  7:22                                         ` Eric W. Biederman
  2002-11-13  0:48                                           ` Andy Pfiffer
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-12  7:22 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh

Andy Pfiffer <andyp@osdl.org> writes:

> On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel.
> 
> > And is currently kept in two pieces.
> > The pure system call.
> > http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec.diff
> 
> FYI: that patch applies cleanly to pure 2.5.47 (bk ChangeSet@1.823).
> 
> The current front of the tree does not patch 100% cleanly (conflicts
> with recent module changes).

I will have to take a look next time a snapshot is uploaded.  bk and I have
not yet become friends.
 
> Results on my usual problem machine:
> 
> # ./kexec-1.5 ./kexec_test-1.5
> Shutting down devices
> Debug: sleeping function called from illegal context at include/asm/semaphore.h9
> 
> Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>

Hmm. I wonder what is doing that.  Do you have the semaphore problem on a normal reboot?

> Starting new kernel
> 
> kexec_test 1.5 starting...
> eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> idt: 00000000 C0000000
> gdt: 00000000 C0000000
> Switching descriptors.
> Descriptors changed.
> Legacy pic setup.
> In real mode.
> <hang>

Yep, it works until it runs into your APICs, which are not shut down.
That looks like one of the next things to tackle.
 
> Same as last time, but the good news is that splitting the load and reboot
> operations works as expected.

That is what my test machine said as well.  But the confirmation is nice. 
And it definitely means I uploaded a working sample user space.
 
> > And the set of hardware fixes known to help kexec.
> > http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.47.x86kexec-hwfixes.diff
> 
> Missing or inaccessible.  I'll try some duct tape and the
> linux-2.5.44.x86kexec-hwfixes.diff and see what happens.

The .47 version is pretty much just a forward port.  It is uploaded now.
My apologies for not getting to it earlier.

The challenge with the APIC shutdown is that the APICs are currently not
in the device tree, so that needs to happen before I can submit a good
version for 2.5.x.
 

> Confirming some earlier suspicions:
> CONFIG_SMP=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_X86_LOCAL_APIC=y
> CONFIG_X86_IO_APIC=y
> 
> Last time I tried to run a UP kernel (and no APIC support) on this system
> it wasn't pretty.  I'll add that to my list of combinations to try.

I would not worry about it too much.  I'm just happy my tools are good enough
that with a little thinking I can figure out what the problem is.
Getting there was hard.

Eric




* Re: Kexec for v2.5.47 (test feedback)
  2002-11-12  7:22                                         ` Eric W. Biederman
@ 2002-11-13  0:48                                           ` Andy Pfiffer
  2002-11-13  4:16                                             ` Eric W. Biederman
                                                               ` (2 more replies)
  0 siblings, 3 replies; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-13  0:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh

On Mon, 2002-11-11 at 23:22, Eric W. Biederman wrote:
> > On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > > kexec is a set of system calls that allows you to load another kernel
> > > from the currently executing Linux kernel.
> > 

> > Results on my usual problem machine:
> > 
> > # ./kexec-1.5 ./kexec_test-1.5
> > Shutting down devices
> > Debug: sleeping function called from illegal context at include/asm/semaphore.h9
> > 
> > Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>
> 
> Hmm. I wonder what is doing that.  Do you have the semaphore problem on a normal reboot?

No clue as of yet.  I do not see this information during a normal
reboot.


> > Starting new kernel
> > 
> > kexec_test 1.5 starting...
> > eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> > esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> > idt: 00000000 C0000000
> > gdt: 00000000 C0000000
> > Switching descriptors.
> > Descriptors changed.
> > Legacy pic setup.
> > In real mode.
> > <hang>
> 
> Yep it works until it runs into your apics that are not shutdown.
> That looks like one of the next things to tackle.

I used the linux-2.5.44.x86kexec-hwfixes.diff (it applied cleanly to
pure 2.5.47 + kexec); I'll try your updated version soon if there are
any major differences.

> The challenge is with the apic shutdown is that currently the apics are not
> in the device tree so that needs to happen before I can submit a good version
> for 2.5.x
>  
> 
> > Confirming some earlier suspicions:
> > CONFIG_SMP=y
> > CONFIG_X86_GOOD_APIC=y
> > CONFIG_X86_LOCAL_APIC=y
> > CONFIG_X86_IO_APIC=y
> > 
> > Last time I tried to run a UP kernel (and no APIC support) on this system
> > it wasn't pretty.  I'll add that to my list of combinations to try.

On this same system, I reconfigured and tried this:
    # CONFIG_SMP is not set
    CONFIG_X86_GOOD_APIC=y
    # CONFIG_X86_UP_APIC is not set
    # CONFIG_X86_LOCAL_APIC is not set
    # CONFIG_X86_IO_APIC is not set
    
None of the "ordinary" APIC initialization messages were output during
the regular BIOS->LILO boot of this kernel.

Using kexec on this kernel to run kexec_test-1.5 stops in the same way:
    # ./kexec-1.5 --debug ./kexec_test-1.5
    Shutting down devices
    Debug: sleeping function called from illegal context at
    include/asm/semaphore.h9Call Trace: [<c0113f7c>]  [<c01ec123>] 
    [<c0120af2>]  [<c0130d5d>]  [<c0130d5d> Starting new kernel
    kexec_test 1.5 starting...
    eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
    esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
    idt: 00000000 C0000000
    gdt: 00000000 C0000000
    Switching descriptors.
    Descriptors changed.
    Legacy pic setup.
    In real mode.
    <hang>

So, does this information suggest looking somewhere other than APIC
shutdown?

Andy




* Re: Kexec for v2.5.47 (test feedback)
  2002-11-13  0:48                                           ` Andy Pfiffer
@ 2002-11-13  4:16                                             ` Eric W. Biederman
  2002-11-13 13:26                                             ` Kexec for v2.5.47-bk2 Eric W. Biederman
  2002-11-18  0:07                                             ` [ANNOUNCE] kexec-tools-1.6 released Eric W. Biederman
  2 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-13  4:16 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh

Andy Pfiffer <andyp@osdl.org> writes:

> On Mon, 2002-11-11 at 23:22, Eric W. Biederman wrote:
> > > On Mon, 2002-11-11 at 10:15, Eric W. Biederman wrote:
> > > > kexec is a set of system calls that allows you to load another kernel
> > > > from the currently executing Linux kernel.
> > > 
> 
> > > Results on my usual problem machine:
> > > 
> > > # ./kexec-1.5 ./kexec_test-1.5
> > > Shutting down devices
> > > Debug: sleeping function called from illegal context at include/asm/semaphore.h9
> > > 
> > > Call Trace: [<c011a698>] [<c0216193>] [<c012b165>] [<c0132dec>] [<c0140357>
> > 
> > Hmm. I wonder what is doing that.  Do you have the semaphore problem on a normal reboot?
> 
> No clue as of yet.  I do not see this information during a normal
> reboot.

Doh.  I must compile that debugging in when I am testing.  I introduced a
spin lock, and then I called a function that might sleep while holding it...
I am puzzled about what in the device_shutdown and reboot notifier path is
actually sleeping, but that is academic.

The next version will use a semaphore to be polite.
I should have asked where those addresses map to in your System.map.

Anyway, this is one of the reasons I grumble about splitting it: more global
variables that have to be protected, and more chances to fumble
something.  Oh, well.

> > > Starting new kernel
> > > 
> > > kexec_test 1.5 starting...
> > > eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> > > esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> > > idt: 00000000 C0000000
> > > gdt: 00000000 C0000000
> > > Switching descriptors.
> > > Descriptors changed.
> > > Legacy pic setup.
> > > In real mode.
> > > <hang>
> > 
> > Yep it works until it runs into your apics that are not shutdown.
> > That looks like one of the next things to tackle.
> 
> I used the linux-2.5.44.x86kexec-hwfixes.diff (it applied cleanly to
> pure 2.5.47 + kexec); I'll try your updated version soon if there are
> any major differences.

I don't think there is anything significant.

> > The challenge is with the apic shutdown is that currently the apics are not
> > in the device tree so that needs to happen before I can submit a good version
> > for 2.5.x
> >  
> > 
> > > Confirming some earlier suspicions:
> > > CONFIG_SMP=y
> > > CONFIG_X86_GOOD_APIC=y
> > > CONFIG_X86_LOCAL_APIC=y
> > > CONFIG_X86_IO_APIC=y
> > > 
> > > Last time I tried to run a UP kernel (and no APIC support) on this system
> > > it wasn't pretty.  I'll add that to my list of combinations to try.
> 
> On this same system, I reconfigured and tried this:
>     # CONFIG_SMP is not set
>     CONFIG_X86_GOOD_APIC=y
>     # CONFIG_X86_UP_APIC is not set
>     # CONFIG_X86_LOCAL_APIC is not set
>     # CONFIG_X86_IO_APIC is not set
>     
> None of the "ordinary" APIC initialization messages were output during
> the regular BIOS->LILO boot of this kernel.
> 
> So, does this information suggest looking somewhere other than APIC
> shutdown?

I am not certain.  All that is certain is there is an unhandled
interrupt.  

Anyway the next step will be to enter the Linux kernel in 32bit mode
so I can avoid the whole mess of getting the BIOS working again.  That
should avoid most of these complications as I will be able to skip
the whole step of enabling interrupts.

Eric



* Kexec for v2.5.47-bk2
  2002-11-13  0:48                                           ` Andy Pfiffer
  2002-11-13  4:16                                             ` Eric W. Biederman
@ 2002-11-13 13:26                                             ` Eric W. Biederman
  2002-11-15  9:24                                               ` Suparna Bhattacharya
  2002-11-18  0:07                                             ` [ANNOUNCE] kexec-tools-1.6 released Eric W. Biederman
  2 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-13 13:26 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh


O.k., and now a version that applies cleanly to
v2.5.47-bk2, aka ChangeSet@1.845.

I killed all of the locks and used xchg. That is what I really wanted
anyway.
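
The lock-free swap mentioned here can be sketched in portable user-space
C11, where atomic_exchange() stands in for the kernel's xchg().  The names
struct image, current_image and install_image are hypothetical, and this is
only a model of the pattern, not the code in the patch below.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct image {
	unsigned long start;  /* entry point of the loaded image */
};

/* The one shared pointer; NULL means nothing is loaded. */
static _Atomic(struct image *) current_image = NULL;

/* Publish `new_image` and hand back whatever was installed before.
 * The exchange is a single atomic step, so two concurrent callers can
 * never both receive the same old pointer, and no lock is held while
 * the caller frees or uses it.
 */
struct image *install_image(struct image *new_image)
{
	return atomic_exchange(&current_image, new_image);
}
```

A consumer such as a reboot path would claim the image with
install_image(NULL) and own it exclusively from then on, so it is free to
call functions that sleep.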

Linus, care to comment on anything you see wrong?

Eric


 MAINTAINERS                        |    7 
 arch/i386/Kconfig                  |   17 
 arch/i386/kernel/Makefile          |    1 
 arch/i386/kernel/entry.S           |    1 
 arch/i386/kernel/machine_kexec.c   |  142 ++++++++
 arch/i386/kernel/relocate_kernel.S |   99 +++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    1 
 include/linux/kexec.h              |   45 ++
 include/linux/reboot.h             |    2 
 kernel/Makefile                    |    1 
 kernel/kexec.c                     |  640 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   23 +
 13 files changed, 1004 insertions


diff -uNr linux-2.5.47-bk2/MAINTAINERS linux-2.5.47-bk2.x86kexec/MAINTAINERS
--- linux-2.5.47-bk2/MAINTAINERS	Mon Nov 11 00:22:33 2002
+++ linux-2.5.47-bk2.x86kexec/MAINTAINERS	Wed Nov 13 06:08:52 2002
@@ -968,6 +968,13 @@
 W:	http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S:	Maintained
 
+KEXEC
+P:	Eric Biederman
+M:	ebiederm@xmission.com
+M:	ebiederman@lnxi.com
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+
 LANMEDIA WAN CARD DRIVER
 P:	Andrew Stanley-Jones
 M:	asj@lanmedia.com
diff -uNr linux-2.5.47-bk2/arch/i386/Kconfig linux-2.5.47-bk2.x86kexec/arch/i386/Kconfig
--- linux-2.5.47-bk2/arch/i386/Kconfig	Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/Kconfig	Wed Nov 13 06:08:52 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shut down your
+	  current kernel, and to start another kernel.  It is like a reboot
+	  but it is independent of the system firmware.  And like a reboot
+	  you can start any kernel with it, not just Linux.
+	
+	  The name comes from the similarity to the exec system call.
+	
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shut down, so do not be surprised if this code does not
+	  initially work for you.  It may help to enable device hotplugging
+	  support.  As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
 
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/Makefile linux-2.5.47-bk2.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.47-bk2/arch/i386/kernel/Makefile	Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/Makefile	Wed Nov 13 06:09:36 2002
@@ -24,6 +24,7 @@
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND)	+= suspend.o suspend_asm.o
 obj-$(CONFIG_X86_NUMAQ)		+= numaq.o
 obj-$(CONFIG_PROFILING)		+= profile.o
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/entry.S linux-2.5.47-bk2.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.47-bk2/arch/i386/kernel/entry.S	Wed Nov 13 06:08:11 2002
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/entry.S	Wed Nov 13 06:08:52 2002
@@ -743,6 +743,7 @@
 	.long sys_epoll_ctl	/* 255 */
 	.long sys_epoll_wait
  	.long sys_remap_file_pages
+	.long sys_kexec_load
 
 
 	.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/machine_kexec.c linux-2.5.47-bk2.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.47-bk2/arch/i386/kernel/machine_kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/machine_kexec.c	Wed Nov 13 06:08:52 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+	unsigned char curidt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curidt)) = limit;
+	(*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+	/* lidt reads the descriptor, so it is an input operand */
+	__asm__ __volatile__ (
+		"lidt %0\n"
+		: : "m" (curidt));
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+	unsigned char curgdt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curgdt)) = limit;
+	(*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+	/* lgdt reads the descriptor, so it is an input operand */
+	__asm__ __volatile__ (
+		"lgdt %0\n"
+		: : "m" (curgdt));
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+	__asm__ __volatile__ (
+		"\tljmp $"STR(__KERNEL_CS)",$1f\n"
+		"\t1:\n"
+		"\tmovl $"STR(__KERNEL_DS)",%eax\n"
+		"\tmovl %eax,%ds\n"
+		"\tmovl %eax,%es\n"
+		"\tmovl %eax,%fs\n"
+		"\tmovl %eax,%gs\n"
+		"\tmovl %eax,%ss\n"
+		);
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+	/* This code is x86 specific...
+	 * general purpose code must be more careful
+	 * of caches and tlbs...
+	 */
+	pgd_t *pgd;
+	pmd_t *pmd;
+	struct mm_struct *mm = current->mm;
+	spin_lock(&mm->page_table_lock);
+	
+	pgd = pgd_offset(mm, address);
+	pmd = pmd_alloc(mm, pgd, address);
+
+	if (pmd) {
+		pte_t *pte = pte_alloc_map(mm, pmd, address);
+		if (pte) {
+			set_pte(pte, 
+				mk_pte(virt_to_page(phys_to_virt(address)), 
+					PAGE_SHARED));
+			__flush_tlb_one(address);
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+	unsigned long indirection_page, unsigned long reboot_code_buffer,
+	unsigned long start_address);
+
+extern const unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+extern const unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+	unsigned long *indirection_page;
+	void *reboot_code_buffer;
+	relocate_new_kernel_t rnk;
+
+	/* Interrupts aren't acceptable while we reboot */
+	local_irq_disable();
+	reboot_code_buffer = image->reboot_code_buffer;
+	indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+	identity_map_page(virt_to_phys(reboot_code_buffer));
+
+	/* copy it out */
+	memcpy(reboot_code_buffer, relocate_new_kernel, 
+		relocate_new_kernel_size);
+
+	/* The segment registers are funny things, they are
+	 * automatically loaded from a table in memory whenever you
+	 * set them to a specific selector, but this table is never
+	 * accessed again until you set the segment to a different selector.
+	 *
+	 * The more common model is a cache, where the behind
+	 * the scenes work is done, but is also dropped at arbitrary
+	 * times.
+	 *
+	 * I take advantage of this here by force loading the
+	 * segments, before I zap the gdt with an invalid value.
+	 */
+	load_segments();
+	/* The gdt & idt are now invalid.
+	 * If you want to load them you must set up your own idt & gdt.
+	 */
+	set_gdt(phys_to_virt(0),0);
+	set_idt(phys_to_virt(0),0);
+
+	/* now call it */
+	rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+	(*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), 
+		image->start);
+}
+
diff -uNr linux-2.5.47-bk2/arch/i386/kernel/relocate_kernel.S linux-2.5.47-bk2.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.47-bk2/arch/i386/kernel/relocate_kernel.S	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/arch/i386/kernel/relocate_kernel.S	Wed Nov 13 06:08:52 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+	/* Must be relocatable PIC code callable as a C function, that once
+	 * it starts cannot use the previous process's stack.
+	 *
+	 */
+	.globl relocate_new_kernel
+relocate_new_kernel:
+	/* read the arguments and say goodbye to the stack */
+	movl  4(%esp), %ebx /* indirection_page */
+	movl  8(%esp), %ebp /* reboot_code_buffer */
+	movl  12(%esp), %edx /* start address */
+
+	/* zero out flags, and disable interrupts */
+	pushl $0
+	popfl
+
+	/* set a new stack at the end of our page (it grows down)... */
+	lea   4096(%ebp), %esp
+
+	/* store the parameters back on the stack */
+	pushl   %edx /* store the start address */
+
+	/* Set cr0 to a known state:
+	 * 31 0 == Paging disabled
+	 * 18 0 == Alignment check disabled
+	 * 16 0 == Write protect disabled
+	 * 3  0 == No task switch
+	 * 2  0 == Don't do FP software emulation.
+	 * 0  1 == Protected mode enabled
+	 */
+	movl	%cr0, %eax
+	andl	$~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+	orl	$(1<<0), %eax
+	movl	%eax, %cr0
+	jmp 1f
+1:	
+
+	/* Flush the TLB (needed?) */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* Do the copies */
+	cld
+0:	/* top, read another word from the indirection page */
+	movl    %ebx, %ecx
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+	testl	$0x1,   %ecx  /* is it a destination page */
+	jz	1f
+	movl	%ecx,	%edi
+	andl	$0xfffff000, %edi
+	jmp     0b
+1:
+	testl	$0x2,	%ecx  /* is it an indirection page */
+	jz	1f
+	movl	%ecx,	%ebx
+	andl	$0xfffff000, %ebx
+	jmp     0b
+1:
+	testl   $0x4,   %ecx /* is it the done indicator */
+	jz      1f
+	jmp     2f
+1:
+	testl   $0x8,   %ecx /* is it the source indicator */
+	jz      0b	     /* Ignore it otherwise */
+	movl    %ecx,   %esi /* For every source page do a copy */
+	andl    $0xfffff000, %esi
+
+	movl    $1024, %ecx
+	rep ; movsl
+	jmp     0b
+
+2:
+
+	/* To be certain of avoiding problems with self modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+	
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+	
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+	ret
+relocate_new_kernel_end:
+
+	.globl relocate_new_kernel_size
+relocate_new_kernel_size:	
+	.long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.47-bk2/include/asm-i386/kexec.h linux-2.5.47-bk2.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.47-bk2/include/asm-i386/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/include/asm-i386/kexec.h	Wed Nov 13 06:08:52 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return,
+ * i.e. the maximum page that is mapped directly into kernel memory,
+ * so that kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET) 
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE	4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.47-bk2/include/asm-i386/unistd.h linux-2.5.47-bk2.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.47-bk2/include/asm-i386/unistd.h	Tue Nov  5 19:03:51 2002
+++ linux-2.5.47-bk2.x86kexec/include/asm-i386/unistd.h	Wed Nov 13 06:08:52 2002
@@ -262,6 +262,7 @@
 #define __NR_sys_epoll_ctl	255
 #define __NR_sys_epoll_wait	256
 #define __NR_remap_file_pages	257
+#define __NR_sys_kexec_load	258
 
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.47-bk2/include/linux/kexec.h linux-2.5.47-bk2.x86kexec/include/linux/kexec.h
--- linux-2.5.47-bk2/include/linux/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/include/linux/kexec.h	Wed Nov 13 06:08:52 2002
@@ -0,0 +1,45 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/* 
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+	struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.47-bk2/include/linux/reboot.h linux-2.5.47-bk2.x86kexec/include/linux/reboot.h
--- linux-2.5.47-bk2/include/linux/reboot.h	Fri Oct 11 22:22:47 2002
+++ linux-2.5.47-bk2.x86kexec/include/linux/reboot.h	Wed Nov 13 06:08:52 2002
@@ -21,6 +21,7 @@
  * POWER_OFF   Stop OS and remove all power from system, if possible.
  * RESTART2    Restart system using given command string.
  * SW_SUSPEND  Suspend system using Software Suspend if compiled in
+ * KEXEC       Restart the system using a different kernel.
  */
 
 #define	LINUX_REBOOT_CMD_RESTART	0x01234567
@@ -30,6 +31,7 @@
 #define	LINUX_REBOOT_CMD_POWER_OFF	0x4321FEDC
 #define	LINUX_REBOOT_CMD_RESTART2	0xA1B2C3D4
 #define	LINUX_REBOOT_CMD_SW_SUSPEND	0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC		0x45584543
 
 
 #ifdef __KERNEL__
diff -uNr linux-2.5.47-bk2/kernel/Makefile linux-2.5.47-bk2.x86kexec/kernel/Makefile
--- linux-2.5.47-bk2/kernel/Makefile	Wed Nov 13 06:08:13 2002
+++ linux-2.5.47-bk2.x86kexec/kernel/Makefile	Wed Nov 13 06:08:52 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.47-bk2/kernel/kexec.c linux-2.5.47-bk2.x86kexec/kernel/kexec.c
--- linux-2.5.47-bk2/kernel/kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.47-bk2.x86kexec/kernel/kexec.c	Wed Nov 13 06:08:52 2002
@@ -0,0 +1,640 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access.  Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory.  And this page must be identity
+ * mapped.  Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ * 
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set 
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ * 
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+	struct kimage *image;
+	image = kmalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return 0;
+	memset(image, 0, sizeof(*image));
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (image->offset != 0) {
+		image->entry++;
+	}
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		ind_page = (void *)__get_free_page(GFP_KERNEL);
+		if (!ind_page) {
+			return -ENOMEM;
+		}
+		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = 
+			ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	image->offset = 0;
+	return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+	int result;
+	
+	/* Assume the page is bad unless we pass the checks */
+	result = -EADDRNOTAVAIL;
+
+	if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+		goto out;
+	}
+
+	/* NOTE: The caller is responsible for making certain we
+	 * don't attempt to load the new image into invalid or
+	 * reserved areas of RAM.
+	 */
+	result =  0;
+out:
+	return result;
+}
+
+static int kimage_set_destination(
+	struct kimage *image, unsigned long destination) 
+{
+	int result;
+	destination &= PAGE_MASK;
+	result = kimage_verify_destination(destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0) {
+		image->destination = destination;
+	}
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+	page &= PAGE_MASK;
+	result = kimage_verify_destination(image->destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0) {
+		image->destination += PAGE_SIZE;
+	}
+	return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+	int result;
+	result = kimage_add_entry(image, IND_DONE);
+	if (result == 0) {
+		/* Point at the terminating element */
+		image->entry--;
+	}
+	return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+	if (!image)
+		return;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION) {
+				free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+			}
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE) {
+			free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+		}
+	}
+	kfree(image);
+}
+
+static int kimage_is_destination_page(
+	struct kimage *image, unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination;
+	destination = 0;
+	page &= PAGE_MASK;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return 1;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_unused_area(
+	struct kimage *image, unsigned long size, unsigned long align,
+	unsigned long *area)
+{
+	/* Walk through mem_map and find the first chunk of
+	 * unused memory that is at least size bytes long.
+	 */
+	/* Since the kernel plays with PageReserved, mem_map is less
+	 * than ideal for this purpose, but it will give us a correct
+	 * conservative estimate of what we need to do.
+	 */
+	/* For now we take advantage of the fact that all kernel pages
+	 * are marked with PG_reserved to allocate a large
+	 * contiguous area for the reboot code buffer.
+	 */
+	unsigned long addr;
+	unsigned long start, end;
+	unsigned long mask;
+	mask = ((1 << align) -1);
+	start = end = PAGE_SIZE;
+	for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long aligned_start;
+		page = virt_to_page(phys_to_virt(addr));
+		if (PageReserved(page) ||
+			kimage_is_destination_page(image, addr)) {
+			/* The current page is reserved so the start &
+			 * end of the next area must be at least at the
+			 * next page.
+			 */
+			start = end = addr + PAGE_SIZE;
+		}
+		else {
+			/* O.k.  The current page isn't reserved
+			 * so push up the end of the area.
+			 */
+			end = addr;
+		}
+		aligned_start = (start + mask) & ~mask;
+		if (aligned_start > start) {
+			continue;
+		}
+		if (aligned_start > end) {
+			continue;
+		}
+		if (end - aligned_start >= size) {
+			*area = aligned_start;
+			return 0;
+		}
+	}
+	*area = 0;
+	return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+	struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return ptr;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	for_each_kimage_entry(image, ptr, entry) {
+		unsigned long page;
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			/* nop */
+		}
+		else if (entry & IND_DONE) {
+			/* nop */
+		}
+		else {
+			/* SOURCE & INDIRECTION */
+			page = entry & PAGE_MASK;
+			if (page == destination) {
+				return ptr;
+			}
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+	kimage_entry_t *ptr, *cptr, entry;
+	unsigned long buffer, page;
+	unsigned long destination = 0;
+
+	/* Here we implement safeguards to ensure that
+	 * a source page is not copied to its destination
+	 * page before the data on the destination page is
+	 * no longer useful.
+	 *
+	 * To make it work we actually wind up with a
+	 * stronger condition.  For every page considered
+	 * it is either its own destination page or it is
+	 * not a destination page of any page considered.
+	 *
+	 * Invariants 
+	 * 1. buffer is not a destination of a previous page.
+	 * 2. page is not a destination of a previous page.
+	 * 3. destination is not a previous source page.
+	 *
+	 * Result: Either a source page and a destination page 
+	 * are the same or the page is not a destination page.
+	 *
+	 * These checks could be done when we allocate the pages,
+	 * but doing it as a final pass allows us more freedom
+	 * on how we allocate pages.
+	 * 
+	 * Also while the checks are necessary, in practice nothing
+	 * happens.  The destination kernel wants to sit in the
+	 * same physical addresses as the current kernel so we never
+	 * actually allocate a destination page.
+	 *
+	 * BUGS: This is an O(N^2) algorithm.
+	 */
+
+	
+	buffer = __get_free_page(GFP_KERNEL);
+	if (!buffer) {
+		return -ENOMEM;
+	}
+	buffer = virt_to_phys((void *)buffer);
+	for_each_kimage_entry(image, ptr, entry) {
+		/* Here we check to see if an allocated page conflicts with a previous one */
+		kimage_entry_t *limit;
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_INDIRECTION) {
+			/* Indirection pages must include all of their
+			 * contents in limit checking.
+			 */
+			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+		}
+		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+			continue;
+		}
+		page = entry & PAGE_MASK;
+		limit = ptr;
+
+		/* See if a previous page has the current page as its
+		 * destination.
+		 * i.e. invariant 2
+		 */
+		cptr = kimage_dst_conflict(image, page, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+			*cptr = page | (centry & ~PAGE_MASK);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = cpage;
+		}
+		if (!(entry & IND_SOURCE)) {
+			continue;
+		}
+
+		/* See if a previous page is our destination page.
+		 * If so claim it now.
+		 * i.e. invariant 3
+		 */
+		cptr = kimage_src_conflict(image, destination, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+			*cptr = buffer | (centry & ~PAGE_MASK);
+			*ptr = cpage | ( entry & ~PAGE_MASK);
+			buffer = page;
+		}
+		/* If the buffer is my destination page do the copy now 
+		 * i.e. invariant 3 & 1
+		 */
+		if (buffer == destination) {
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = page;
+		}
+	}
+	free_page((unsigned long)phys_to_virt(buffer));
+	return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+	unsigned long len)
+{
+	unsigned long pos;
+	int result;
+	for(pos = 0; pos < len; pos += PAGE_SIZE) {
+		char *page;
+		result = -ENOMEM;
+		page = (void *)__get_free_page(GFP_KERNEL);
+		if (!page) {
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result) {
+			goto out;
+		}
+	}
+	result = 0;
+ out:
+	return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+	struct kexec_segment *segment)
+{	
+	unsigned long mstart;
+	int result;
+	unsigned long offset;
+	unsigned long offset_end;
+	unsigned char *buf;
+
+	result = 0;
+	buf = segment->buf;
+	mstart = (unsigned long)segment->mem;
+
+	offset_end = segment->memsz;
+
+	result = kimage_set_destination(image, mstart);
+	if (result < 0) {
+		goto out;
+	}
+	for(offset = 0;  offset < segment->memsz; offset += PAGE_SIZE) {
+		char *page;
+		size_t size, leader;
+		page = (char *)__get_free_page(GFP_KERNEL);
+		if (page == 0) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result < 0) {
+			goto out;
+		}
+		if (segment->bufsz < offset) {
+			/* We are past the end zero the whole page */
+			memset(page, 0, PAGE_SIZE);
+			continue;
+		}
+		size = PAGE_SIZE;
+		leader = 0;
+		if ((offset == 0)) {
+			leader = mstart & ~PAGE_MASK;
+		}
+		if (leader) {
+			/* We are on the first page zero the unused portion */
+			memset(page, 0, leader);
+			size -= leader;
+			page += leader;
+		}
+		if (size > (segment->bufsz - offset)) {
+			size = segment->bufsz - offset;
+		}
+		result = copy_from_user(page, buf + offset, size);
+		if (result) {
+			result = (result < 0)?result : -EIO;
+			goto out;
+		}
+		if (size < (PAGE_SIZE - leader)) {
+			/* zero the trailing part of the page */
+			memset(page + size, 0, (PAGE_SIZE - leader) - size);
+		}
+	}
+ out:
+	return result;
+}
+
+
+/* do_kexec executes a new kernel 
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+	struct kexec_segment *arg_segments, struct kimage *image)
+{
+	struct kexec_segment *segments;
+	size_t segment_bytes;
+	int i;
+
+	int result; 
+	unsigned long reboot_code_buffer;
+	kimage_entry_t *end;
+
+	/* Initialize variables */
+	segments = 0;
+
+	segment_bytes = nr_segments * sizeof(*segments);
+	segments = kmalloc(segment_bytes, GFP_KERNEL);
+	if (segments == 0) {
+		result = -ENOMEM;
+		goto out;
+	}
+	if (copy_from_user(segments, arg_segments, segment_bytes)) {
+		result = -EFAULT;	/* nonzero return is a byte count, not an errno */
+		goto out;
+	}
+
+	/* Read in the data from user space */
+	image->start = start;
+	for(i = 0; i < nr_segments; i++) {
+		result = kimage_load_segment(image, &segments[i]);
+		if (result) {
+			goto out;
+		}
+	}
+	
+	/* Terminate early so I can get a place holder. */
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+	end = image->entry;
+
+	/* Usage of the reboot code buffer is subtle.  We first
+	 * find a contiguous area of ram, that is not one
+	 * of our destination pages.  We do not allocate the ram.
+	 *
+	 * The algorithm to make certain we do not have address
+	 * conflicts requires each destination region to have some
+	 * backing store so we allocate arbitrary source pages.
+	 *
+	 * Later in machine_kexec when we copy data to the
+	 * reboot_code_buffer it still may be allocated for other
+	 * purposes, but we do know there are no source or destination
+	 * pages in that area.  And since the rest of the kernel
+	 * is already shutdown those pages are free for use,
+	 * regardless of their page->count values.
+	 *
+	 * The kernel mapping of the reboot code buffer is passed to
+	 * the machine dependent code.  If it needs something else
+	 * it is free to set that up.
+	 */
+	result = kimage_get_unused_area(
+		image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+		&reboot_code_buffer);
+	if (result) 
+		goto out;
+
+	/* Allocating pages we should never need  is silly but the
+	 * code won't work correctly unless we have dummy pages to
+	 * work with. 
+	 */
+	result = kimage_set_destination(image, reboot_code_buffer);
+	if (result) 
+		goto out;
+	result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+	if (result)
+		goto out;
+	image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = kimage_get_off_destination_pages(image);
+	if (result)
+		goto out;
+
+	/* Now hide the extra source pages for the reboot code buffer.
+	 */
+	image->entry = end;
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = 0;
+ out:
+	/* cleanup and exit */
+	if (segments)	kfree(segments);
+	return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ * 
+ * This call breaks up into three pieces.  
+ * - A generic part which loads the new kernel from the current
+ *   address space, and very carefully places the data in the
+ *   allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ *   the devices to shut down.  Preventing on-going dmas, and placing
+ *   the devices in a consistent state so a later kernel can
+ *   reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ *   and then copies the image to its final destination.  And
+ *   jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, 
+	struct kexec_segment *segments, unsigned long flags)
+{
+	/* Am I using too much stack space here? */
+	struct kimage *image, *old_image;
+	int result;
+		
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* In case we need just a little bit of special behavior for
+	 * reboot on panic 
+	 */
+	if (flags != 0)
+		return -EINVAL;
+
+	image = 0;
+	if (nr_segments > 0) {
+		image = kimage_alloc();
+		if (!image) {
+			return -ENOMEM;
+		}
+		result = do_kexec(entry, nr_segments, segments, image);
+		if (result) {
+			kimage_free(image);
+			return result;
+		}
+	}
+
+	old_image = xchg(&kexec_image, image);
+
+	kimage_free(old_image);
+	return 0;
+}
diff -uNr linux-2.5.47-bk2/kernel/sys.c linux-2.5.47-bk2.x86kexec/kernel/sys.c
--- linux-2.5.47-bk2/kernel/sys.c	Wed Nov 13 06:08:13 2002
+++ linux-2.5.47-bk2.x86kexec/kernel/sys.c	Wed Nov 13 06:08:52 2002
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/workqueue.h>
 #include <linux/device.h>
 #include <linux/times.h>
@@ -206,6 +207,7 @@
 cond_syscall(sys_lookup_dcookie)
 cond_syscall(sys_swapon)
 cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
 cond_syscall(sys_init_module)
 cond_syscall(sys_delete_module)
 
@@ -416,6 +418,27 @@
 		machine_restart(buffer);
 		break;
 
+#ifdef CONFIG_KEXEC
+	case LINUX_REBOOT_CMD_KEXEC:
+	{
+		struct kimage *image;
+		if (arg) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		image = xchg(&kexec_image, 0);
+		if (!image) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+		system_running = 0;
+		device_shutdown();
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_kexec(image);
+		break;
+	}
+#endif
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		if (!software_suspend_enabled) {
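
The entry list that relocate_new_kernel walks can be tried out in user
space. What follows is a minimal sketch under stated assumptions, not
the code from the patch: page-aligned static buffers stand in for
physical pages, entries are uintptr_t rather than kimage_entry_t, and
apply_entries() and indirection_example() are hypothetical names. The
IND_* values are the ones from include/linux/kexec.h above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        4096u
#define PAGE_MASK        (~(uintptr_t)(PAGE_SIZE - 1))
#define IND_DESTINATION  0x1
#define IND_INDIRECTION  0x2
#define IND_DONE         0x4
#define IND_SOURCE       0x8

typedef uintptr_t entry_t;

/* Process entries until IND_DONE, in the same order of tests as the
 * assembly loop.  Returns the number of pages copied. */
static unsigned long apply_entries(const entry_t *ptr)
{
	unsigned char *dest = NULL;
	unsigned long copied = 0;

	for (;;) {
		entry_t entry = *ptr++;
		void *page = (void *)(entry & PAGE_MASK);

		if (entry & IND_DESTINATION) {
			dest = page;		/* where the next copy lands */
		} else if (entry & IND_INDIRECTION) {
			ptr = page;		/* continue in another list page */
		} else if (entry & IND_DONE) {
			break;
		} else if (entry & IND_SOURCE) {
			assert(dest != NULL);	/* a destination must come first */
			memcpy(dest, page, PAGE_SIZE);
			dest += PAGE_SIZE;	/* destinations advance implicitly */
			copied++;
		}
		/* entries with no known flag are ignored */
	}
	return copied;
}

static _Alignas(4096) unsigned char src_pages[2][PAGE_SIZE];
static _Alignas(4096) unsigned char dst_pages[2][PAGE_SIZE];
static _Alignas(4096) entry_t second_list[PAGE_SIZE / sizeof(entry_t)];

/* Two source pages, with the list split across an indirection entry.
 * Returns 0 if both pages arrive at consecutive destinations. */
static int indirection_example(void)
{
	entry_t head[3];

	memset(src_pages[0], 0xAA, PAGE_SIZE);
	memset(src_pages[1], 0xBB, PAGE_SIZE);

	head[0] = (entry_t)dst_pages | IND_DESTINATION;
	head[1] = (entry_t)src_pages[0] | IND_SOURCE;
	head[2] = (entry_t)second_list | IND_INDIRECTION;
	second_list[0] = (entry_t)src_pages[1] | IND_SOURCE;
	second_list[1] = IND_DONE;

	if (apply_entries(head) != 2)
		return -1;
	if (dst_pages[0][0] != 0xAA || dst_pages[1][PAGE_SIZE - 1] != 0xBB)
		return -1;
	return 0;
}
```

The flags fit in the low bits only because every address in the list
is page aligned, which is also why the assembly masks each entry with
0xfffff000 before using it.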


* Re: Kexec for v2.5.47-bk2
  2002-11-13 13:26                                             ` Kexec for v2.5.47-bk2 Eric W. Biederman
@ 2002-11-15  9:24                                               ` Suparna Bhattacharya
  2002-11-15 14:14                                                 ` Eric W. Biederman
  2002-11-15 14:37                                                 ` Werner Almesberger
  0 siblings, 2 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-15  9:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Pfiffer, Alan Cox, Werner Almesberger,
	Linux Kernel Mailing List, Martin J. Bligh

On Wed, Nov 13, 2002 at 06:26:29AM -0700, Eric W. Biederman wrote:
> 
> O.k. and now a version that applies cleanly to 
> v2.5.47-bk2 aka ChangeSet@1.845
> 

BTW, results similar to Andy on my SMP system (the same problem
machine we'd talked about earlier). Same problem ?

with 2.5.47-bk2 
+ kexec patch for 2.5.47-bk2 attached in your mail
+ linux-2.5.47.x86kexec-hwfixes
and using
kexec-tools-1.5

Results of kexec kexec_test

[root@llm01 root]# Synchronizing SCSI caches: 
Shutting down devices
Starting new kernel
kexec_test 1.5 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>

What would be best way to pass a parameter or address from the
current kernel to kernel being booted (e.g log buffer address
or crash dump buffer etc) ? Should this be part of the interface,
i.e. could/would it make sense for kexec to support this (rather 
than our having to go and try to fix up kernel parameters ourselves,
or designate a fixed address for this) ? Also thinking
about other arch support for kexec in the future ...

Regards
Suparna


* Re: Kexec for v2.5.47-bk2
  2002-11-15  9:24                                               ` Suparna Bhattacharya
@ 2002-11-15 14:14                                                 ` Eric W. Biederman
  2002-11-15 14:37                                                 ` Werner Almesberger
  1 sibling, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-15 14:14 UTC (permalink / raw)
  To: suparna
  Cc: Andy Pfiffer, Alan Cox, Werner Almesberger,
	Linux Kernel Mailing List, Martin J. Bligh

Suparna Bhattacharya <suparna@in.ibm.com> writes:

> On Wed, Nov 13, 2002 at 06:26:29AM -0700, Eric W. Biederman wrote:
> > 
> > O.k. and now a version that applies cleanly to 
> > v2.5.47-bk2 aka ChangeSet@1.845
> > 
> 
> BTW, results similar to Andy on my SMP system (the same problem
> machine we'd talked about earlier). Same problem ?

Something like that.  The good news is that the image is being
loaded; the bad news is that the BIOS doesn't work, and so the kernel's
initial setup code isn't working.

Hopefully this weekend I can do the work in user space to bypass
the BIOS altogether for booting a kernel.  That should make the whole
thing easier to use.
 
> with 2.5.47-bk2 
> + kexec patch for 2.5.47-bk2 attached in your mail
> + linux-2.5.47.x86kexec-hwfixes
> and using
> kexec-tools-1.5
> 
> Results of kexec kexec_test
> 
> [root@llm01 root]# Synchronizing SCSI caches: 
> Shutting down devices
> Starting new kernel
> kexec_test 1.5 starting...
> eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> idt: 00000000 C0000000
> gdt: 00000000 C0000000
> Switching descriptors.
> Descriptors changed.
> Legacy pic setup.
> In real mode.
> <hang>
> 
> What would be best way to pass a parameter or address from the
> current kernel to kernel being booted (e.g log buffer address
> or crash dump buffer etc) ? Should this be part of the interface,
> i.e. could/would it make sense for kexec to support this (rather 
> than our having to go and try to fix up kernel parameters ourselves,
> or designate a fixed address for this) ? Also thinking
> about other arch support for kexec in the future ...

The current interface says load image X at location Y, and entry
at point Z.  Given that every little situation wants a slightly
different tweak I don't think a specific feature in the kernel is
needed.  The user space binaries can incorporate all of the
interesting logic.
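
As a rough illustration of "load image X at location Y, and entry at
point Z", the sketch below mirrors struct kexec_segment from the patch
and adds a user-space sanity check.  The check itself (segments_ok,
SKETCH_PAGE_SIZE, the page-alignment and no-overlap rules) is
hypothetical and not part of the patch or of kexec-tools; it only shows
the kind of logic that can live in the user space binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages, as on i386 */

/* Mirrors struct kexec_segment from the patch: copy bufsz bytes from buf
 * to physical address mem, where the segment occupies memsz bytes. */
struct kexec_segment {
	void *buf;
	size_t bufsz;
	void *mem;
	size_t memsz;
};

/* Hypothetical check: returns 1 if every segment is page aligned, fits in
 * its destination, overlaps no other segment, and the entry point lies
 * inside one of the segments.  Returns 0 otherwise. */
int segments_ok(const struct kexec_segment *seg, long nr, unsigned long entry)
{
	int entry_inside = 0;
	long i, j;

	assert(seg != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		uintptr_t start = (uintptr_t)seg[i].mem;
		uintptr_t end = start + seg[i].memsz;

		if (end < start)
			return 0;	/* destination wraps around */
		if (seg[i].bufsz > seg[i].memsz || start % SKETCH_PAGE_SIZE)
			return 0;
		for (j = 0; j < i; j++) {
			uintptr_t s = (uintptr_t)seg[j].mem;
			uintptr_t e = s + seg[j].memsz;

			if (start < e && s < end)
				return 0;	/* destinations overlap */
		}
		if (entry >= start && entry < end)
			entry_inside = 1;
	}
	return entry_inside;
}
```

A loader would run a check like this on its segment list before handing
the list and the entry point to the system call.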

Eric


* Re: Kexec for v2.5.47-bk2
  2002-11-15  9:24                                               ` Suparna Bhattacharya
  2002-11-15 14:14                                                 ` Eric W. Biederman
@ 2002-11-15 14:37                                                 ` Werner Almesberger
  2002-11-20  9:44                                                   ` Suparna Bhattacharya
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-15 14:37 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Eric W. Biederman, Andy Pfiffer, Alan Cox,
	Linux Kernel Mailing List, Martin J. Bligh

Suparna Bhattacharya wrote:
> What would be best way to pass a parameter or address from the
> current kernel to kernel being booted (e.g log buffer address
> or crash dump buffer etc) ?

At the moment, perhaps the initrd mechanism might be a useful
interface for this. You'd just leave some space either at the
beginning or at the end of the real initrd (if there's one),
and put your data there.

Afterwards, you can extract it either from the kernel, or even
from user space through /dev/initrd (with "noinitrd")

Advantages:
 - fairly non-intrusive
 - (almost ?) all platforms support this way of handling "some
   object in memory"
 - easy to play with from user space

Drawbacks:
 - needs synchronization with existing uses of initrd
 - a bit hackish

I'd expect that there will be eventually a number of things that
get passed from old to new kernels (e.g. crash data, device scan
results, etc.), so it may be useful to delay designing a "clean"
interface (for this, I expect some TLV structure in the initrd
area would make most sense) until more of those things have
shown up.
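
To make the TLV idea concrete, here is a minimal sketch of what such a
structure in the spare initrd space might look like.  Everything in it
is an assumption for illustration: the header layout, the tag values
(TLV_LOG_BUF, TLV_CRASH_DUMP) and the helper names tlv_put/tlv_find are
invented here and do not come from any existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout: 4-byte type, 4-byte length, then payload.
 * Fields are in host byte order, which is enough when the old and the
 * new kernel run on the same machine. */
struct tlv_hdr {
	uint32_t type;
	uint32_t len;
};

enum { TLV_END = 0, TLV_LOG_BUF = 1, TLV_CRASH_DUMP = 2 };	/* made-up tags */

/* Append one record at *off.  Returns 0 on success, -1 if it does not
 * fit; room for the terminating TLV_END header is always kept. */
int tlv_put(unsigned char *area, size_t size, size_t *off,
	    uint32_t type, const void *data, uint32_t len)
{
	struct tlv_hdr hdr = { type, len };

	assert(area != NULL && off != NULL);
	if (*off > size || size - *off < 2 * sizeof(hdr) ||
	    len > size - *off - 2 * sizeof(hdr))
		return -1;
	memcpy(area + *off, &hdr, sizeof(hdr));
	memcpy(area + *off + sizeof(hdr), data, len);
	*off += sizeof(hdr) + len;
	/* Keep the area terminated after every append. */
	memset(area + *off, 0, sizeof(hdr));
	return 0;
}

/* Find the first record of the given type; returns its payload or NULL. */
const void *tlv_find(const unsigned char *area, size_t size,
		     uint32_t type, uint32_t *len)
{
	size_t off = 0;
	struct tlv_hdr hdr;

	while (size - off >= sizeof(hdr)) {
		memcpy(&hdr, area + off, sizeof(hdr));
		off += sizeof(hdr);
		if (hdr.type == TLV_END || hdr.len > size - off)
			return NULL;
		if (hdr.type == type) {
			*len = hdr.len;
			return area + off;
		}
		off += hdr.len;
	}
	return NULL;
}
```

The old kernel (or the loader) would call tlv_put() for each item it
wants to hand over; the new kernel, or user space reading /dev/initrd,
would call tlv_find() and skip tags it does not know.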

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* [ANNOUNCE] kexec-tools-1.6 released
  2002-11-13  0:48                                           ` Andy Pfiffer
  2002-11-13  4:16                                             ` Eric W. Biederman
  2002-11-13 13:26                                             ` Kexec for v2.5.47-bk2 Eric W. Biederman
@ 2002-11-18  0:07                                             ` Eric W. Biederman
  2002-11-18  5:46                                               ` Eric W. Biederman
  2 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-18  0:07 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh, Dave Hansen


The kernel interface has finally stabilized enough that I managed to put
some work into the user space side of things.

The new release is at:
http://www.xmission.com/~ebiederm/kexec-tools-1.6.tar.gz

The interface is now more like reboot, so you probably want to change
your shutdown scripts or use kexec --force.

And by default it now enters the kernel in 32bit mode so it should avoid
interrupt controller problems, and work for more people, in more strange
situations.

Eric


* Re: [ANNOUNCE] kexec-tools-1.6 released
  2002-11-18  0:07                                             ` [ANNOUNCE] kexec-tools-1.6 released Eric W. Biederman
@ 2002-11-18  5:46                                               ` Eric W. Biederman
  2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-18  5:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Pfiffer, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Linux Kernel Mailing List, Mike Galbraith,
	Martin J. Bligh, Dave Hansen

ebiederm@xmission.com (Eric W. Biederman) writes:

> The kernel interface has finally stabilized enough that I managed to put
> some work into the user space side of things.
> 
> The new release is at:
> http://www.xmission.com/~ebiederm/kexec-tools-1.6.tar.gz

Make that:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.6.tar.gz
And the latest patches can be found at:

http://www.xmission.com/~ebiederm/files/kexec/

The basic breakout is:
linux-2.5.47.x86kexec.diff is the core patch.
linux-2.5.47.x86kexec-hwfixes.diff
       applies on top and has some hardware fixes to the kernel
       shutdown code that make things work better.
       Mostly this is the code to get SMP to shut down properly.

And it looks like .48 is out so I need to do another patch update.

Eric


* [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-18  5:46                                               ` Eric W. Biederman
@ 2002-11-18  8:53                                                 ` Eric W. Biederman
  2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
                                                                     ` (3 more replies)
  0 siblings, 4 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-18  8:53 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Andy Pfiffer, Linus Torvalds, Alan Cox, Werner Almesberger,
	Suparna Bhattacharya, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen,
	Linuxbios

kexec is a set of system calls that allows you to load another kernel
from the currently executing Linux kernel.  The current implementation
has only been tested, and had the kinks worked out, on x86, but the
generic code should work on any architecture.

Could I get some feedback on where this works and where it breaks?
With the maturation of kexec-tools to skip attempting BIOS calls,
I expect a new Linux kernel to load for most people.  Though I
also expect some device drivers will not reinitialize after the reboot.

The patch is archived at:
http://www.xmission.com/~ebiederm/files/kexec/

And is currently kept in two pieces.
The pure system call.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff

And the set of hardware fixes known to help kexec.
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff

A compatible user space is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz
This code boots either a static ELF executable or a bzImage.

As of version 1.6 /sbin/kexec now works much more like /sbin/reboot.
It is recommended that you place /sbin/kexec -e in /etc/init.d/reboot
just before the call to /sbin/reboot.  If you haven't called
/sbin/kexec previously it will fail, and you can then call
/sbin/reboot.  Given the similarity, the plan is now to merge
reboot via kexec into /sbin/reboot.

One bug was fixed in the move to 2.5.48.  Previously I had failed to
clear PAE and PSE in the kernel.  This caused reboot failures when
CONFIG_HIGHMEM_64G was enabled, as the new kernel would fail when
enabling paging, as these bits remained set.  Is %cr4 present on all
386+ intel cpus, or do I need to conditionalize the code that accesses
it?
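
For reference, the control register bits involved can be written out as
plain C masks.  This is only a sketch of the arithmetic done in
relocate_kernel.S: the macro and function names (CR0_PG, kexec_cr0, and
so on) are made up here, while the bit positions are the architectural
ones (PSE is bit 4 and PAE is bit 5 of %cr4).

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions; the macro names are illustrative. */
#define CR0_PE	(1u << 0)	/* protected mode enable */
#define CR0_EM	(1u << 2)	/* FP software emulation */
#define CR0_TS	(1u << 3)	/* task switched */
#define CR0_WP	(1u << 16)	/* write protect */
#define CR0_AM	(1u << 18)	/* alignment check mask */
#define CR0_PG	(1u << 31)	/* paging */
#define CR4_PSE	(1u << 4)	/* page size extensions */
#define CR4_PAE	(1u << 5)	/* physical address extension */

/* Value written to %cr0: paging and friends off, protected mode on. */
uint32_t kexec_cr0(uint32_t cr0)
{
	cr0 &= ~(CR0_PG | CR0_AM | CR0_WP | CR0_TS | CR0_EM);
	cr0 |= CR0_PE;
	assert(!(cr0 & CR0_PG));
	return cr0;
}

/* Value written to %cr4: everything cleared, which drops PSE and PAE. */
uint32_t kexec_cr4(uint32_t cr4)
{
	(void)cr4;
	return 0;
}
```

With a %cr0 value of 0x8005003B going in, kexec_cr0() leaves 0x00000033,
and kexec_cr4() clears PSE and PAE whatever state they were in before.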

As of version 1.6, /sbin/kexec, when presented with a bzImage, by default
avoids all BIOS calls and jumps directly to the kernel's 32 bit entry
point.  The information it would usually get from the BIOS is instead
collected from the current kernel.  Accurately getting things like
the BIOS memory map from the current kernel is a challenge that still
needs to be addressed.  Safe defaults have been provided for the cases
where I do not currently have good code to gather the information from
the running kernel.

In bug reports please include the serial console output of 
kexec kexec_test.  kexec_test exercises most of the interesting code
paths that are needed to load a kernel (mainly BIOS calls) with lots
of debugging print statements, so hangs can easily be detected.   

Eric
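
The patch that follows describes the new image as a list of tagged page
entries (IND_DESTINATION, IND_INDIRECTION, IND_DONE, IND_SOURCE) that
relocate_new_kernel walks with paging off.  The user-space sketch below
models that walk in C so the flow is easier to follow.  It is an
illustration only: relocate_sketch and PG are names invented here, and
the byte array mem stands in for physical memory, so "addresses" are
offsets into it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG 4096UL		/* page size assumed by the sketch */
#define IND_DESTINATION	0x1
#define IND_INDIRECTION	0x2
#define IND_DONE	0x4
#define IND_SOURCE	0x8

typedef unsigned long kimage_entry_t;

/* Walk the entry list that starts at offset ind_page inside mem and
 * perform the page copies it describes.  Returns the number of pages
 * copied.  The list must end with an IND_DONE entry. */
size_t relocate_sketch(unsigned char *mem, unsigned long ind_page)
{
	unsigned long pos = ind_page;	/* next entry to read */
	unsigned long dest = 0;		/* current destination page */
	size_t copied = 0;

	assert(mem != NULL);
	for (;;) {
		kimage_entry_t entry;

		memcpy(&entry, mem + pos, sizeof(entry));
		pos += sizeof(entry);
		if (entry & IND_DESTINATION) {
			/* Following source pages land here, one after another. */
			dest = entry & ~(PG - 1);
		} else if (entry & IND_INDIRECTION) {
			/* Continue reading entries from another page. */
			pos = entry & ~(PG - 1);
		} else if (entry & IND_DONE) {
			return copied;
		} else if (entry & IND_SOURCE) {
			memcpy(mem + dest, mem + (entry & ~(PG - 1)), PG);
			dest += PG;
			copied++;
		}
		/* Entries with no known flag are ignored. */
	}
}
```

A destination entry only sets where the next source page goes; each
source entry then copies one page and advances the destination, which
is why one destination entry can be followed by many source entries.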


 MAINTAINERS                        |    7 
 arch/i386/Kconfig                  |   17 
 arch/i386/kernel/Makefile          |    1 
 arch/i386/kernel/entry.S           |    2 
 arch/i386/kernel/machine_kexec.c   |  142 ++++++++
 arch/i386/kernel/relocate_kernel.S |  107 ++++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    2 
 include/linux/kexec.h              |   45 ++
 include/linux/reboot.h             |    2 
 kernel/Makefile                    |    1 
 kernel/kexec.c                     |  640 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   23 +
 13 files changed, 1012 insertions, 2 deletions

diff -uNr linux-2.5.48/MAINTAINERS linux-2.5.48.x86kexec/MAINTAINERS
--- linux-2.5.48/MAINTAINERS	Mon Nov 11 00:22:33 2002
+++ linux-2.5.48.x86kexec/MAINTAINERS	Sun Nov 17 22:53:09 2002
@@ -968,6 +968,13 @@
 W:	http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S:	Maintained
 
+KEXEC
+P:	Eric Biederman
+M:	ebiederm@xmission.com
+M:	ebiederman@lnxi.com
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+
 LANMEDIA WAN CARD DRIVER
 P:	Andrew Stanley-Jones
 M:	asj@lanmedia.com
diff -uNr linux-2.5.48/arch/i386/Kconfig linux-2.5.48.x86kexec/arch/i386/Kconfig
--- linux-2.5.48/arch/i386/Kconfig	Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/Kconfig	Sun Nov 17 22:53:09 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shut down your
+	  current kernel, and to start another kernel.  It is like a reboot
+	  but it is independent of the system firmware.   And like a reboot
+	  you can start any kernel with it, not just Linux.
+	
+	  The name comes from the similarity to the exec system call.
+	
+	  It is an ongoing process to be certain the hardware in a machine
+	  is properly shut down, so do not be surprised if this code does not
+	  initially work for you.  It may help to enable device hotplugging
+	  support.  As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
 
diff -uNr linux-2.5.48/arch/i386/kernel/Makefile linux-2.5.48.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.48/arch/i386/kernel/Makefile	Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/Makefile	Sun Nov 17 22:53:09 2002
@@ -24,6 +24,7 @@
 obj-$(CONFIG_X86_MPPARSE)	+= mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC)	+= apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC)	+= io_apic.o
+obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND)	+= suspend.o suspend_asm.o
 obj-$(CONFIG_X86_NUMAQ)		+= numaq.o
 obj-$(CONFIG_PROFILING)		+= profile.o
diff -uNr linux-2.5.48/arch/i386/kernel/entry.S linux-2.5.48.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.48/arch/i386/kernel/entry.S	Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/entry.S	Sun Nov 17 22:56:43 2002
@@ -768,7 +768,7 @@
 	.long sys_epoll_wait
  	.long sys_remap_file_pages
  	.long sys_set_tid_address
-
+	.long sys_kexec_load
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long sys_ni_syscall
diff -uNr linux-2.5.48/arch/i386/kernel/machine_kexec.c linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.48/arch/i386/kernel/machine_kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c	Sun Nov 17 22:53:09 2002
@@ -0,0 +1,142 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+	unsigned char curidt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curidt)) = limit;
+	(*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+	__asm__ __volatile__ (
+		"lidt %0\n" 
+		: /* no outputs: lidt only reads the descriptor */ : "m" (curidt)
+		);
+}
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+	unsigned char curgdt[6];
+
+	/* ia32 supports unaligned loads & stores */
+	(*(__u16 *)(curgdt)) = limit;
+	(*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+	__asm__ __volatile__ (
+		"lgdt %0\n" 
+		: /* no outputs: lgdt only reads the descriptor */ : "m" (curgdt)
+		);
+}
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+	__asm__ __volatile__ (
+		"\tljmp $"STR(__KERNEL_CS)",$1f\n"
+		"\t1:\n"
+		"\tmovl $"STR(__KERNEL_DS)",%eax\n"
+		"\tmovl %eax,%ds\n"
+		"\tmovl %eax,%es\n"
+		"\tmovl %eax,%fs\n"
+		"\tmovl %eax,%gs\n"
+		"\tmovl %eax,%ss\n"
+		);
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+	/* This code is x86 specific...
+	 * general purpose code must be more careful
+	 * of caches and tlbs...
+	 */
+	pgd_t *pgd;
+	pmd_t *pmd;
+	struct mm_struct *mm = current->mm;
+	spin_lock(&mm->page_table_lock);
+	
+	pgd = pgd_offset(mm, address);
+	pmd = pmd_alloc(mm, pgd, address);
+
+	if (pmd) {
+		pte_t *pte = pte_alloc_map(mm, pmd, address);
+		if (pte) {
+			set_pte(pte, 
+				mk_pte(virt_to_page(phys_to_virt(address)), 
+					PAGE_SHARED));
+			__flush_tlb_one(address);
+		}
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+	unsigned long indirection_page, unsigned long reboot_code_buffer,
+	unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+	unsigned long *indirection_page;
+	void *reboot_code_buffer;
+	relocate_new_kernel_t rnk;
+
+	/* Interrupts aren't acceptable while we reboot */
+	local_irq_disable();
+	reboot_code_buffer = image->reboot_code_buffer;
+	indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+	identity_map_page(virt_to_phys(reboot_code_buffer));
+
+	/* copy it out */
+	memcpy(reboot_code_buffer, relocate_new_kernel, 
+		relocate_new_kernel_size);
+
+	/* The segment registers are funny things, they are
+	 * automatically loaded from a table in memory whenever you
+	 * set them to a specific selector, but this table is never
+	 * accessed again until you set the segment to a different selector.
+	 *
+	 * The more common model is a cache, where the behind-the-scenes
+	 * work is done, but is also dropped at arbitrary
+	 * times.
+	 *
+	 * I take advantage of this here by force loading the
+	 * segments, before I zap the gdt with an invalid value.
+	 */
+	load_segments();
+	/* The gdt & idt are now invalid.
+	 * If you want to load them you must set up your own idt & gdt.
+	 */
+	set_gdt(phys_to_virt(0),0);
+	set_idt(phys_to_virt(0),0);
+
+	/* now call it */
+	rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+	(*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), 
+		image->start);
+}
+
diff -uNr linux-2.5.48/arch/i386/kernel/relocate_kernel.S linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.48/arch/i386/kernel/relocate_kernel.S	Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S	Sun Nov 17 23:58:29 2002
@@ -0,0 +1,107 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+	/* Must be relocatable PIC code callable as a C function, that once
+	 * it starts can not use the previous process's stack.
+	 *
+	 */
+	.globl relocate_new_kernel
+relocate_new_kernel:
+	/* read the arguments and say goodbye to the stack */
+	movl  4(%esp), %ebx /* indirection_page */
+	movl  8(%esp), %ebp /* reboot_code_buffer */
+	movl  12(%esp), %edx /* start address */
+
+	/* zero out flags, and disable interrupts */
+	pushl $0
+	popfl
+
+	/* set a new stack at the bottom of our page... */
+	lea   4096(%ebp), %esp
+
+	/* store the parameters back on the stack */
+	pushl   %edx /* store the start address */
+
+	/* Set cr0 to a known state:
+	 * 31 0 == Paging disabled
+	 * 18 0 == Alignment check disabled
+	 * 16 0 == Write protect disabled
+	 * 3  0 == No task switch
+	 * 2  0 == Don't do FP software emulation.
+	 * 0  1 == Protected mode enabled
+	 */
+	movl	%cr0, %eax
+	andl	$~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+	orl	$(1<<0), %eax
+	movl	%eax, %cr0
+	
+	/* Set cr4 to a known state:
+	 * Setting everything to zero seems safe.
+	 */
+	movl	%cr4, %eax
+	andl	$0, %eax
+	movl	%eax, %cr4
+	
+	jmp 1f
+1:	
+
+	/* Flush the TLB (needed?) */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+
+	/* Do the copies */
+	cld
+0:	/* top, read another word from the indirection page */
+	movl    %ebx, %ecx
+	movl	(%ebx), %ecx
+	addl	$4, %ebx
+	testl	$0x1,   %ecx  /* is it a destination page */
+	jz	1f
+	movl	%ecx,	%edi
+	andl	$0xfffff000, %edi
+	jmp     0b
+1:
+	testl	$0x2,	%ecx  /* is it an indirection page */
+	jz	1f
+	movl	%ecx,	%ebx
+	andl	$0xfffff000, %ebx
+	jmp     0b
+1:
+	testl   $0x4,   %ecx /* is it the done indicator */
+	jz      1f
+	jmp     2f
+1:
+	testl   $0x8,   %ecx /* is it the source indicator */
+	jz      0b	     /* Ignore it otherwise */
+	movl    %ecx,   %esi /* For every source page do a copy */
+	andl    $0xfffff000, %esi
+
+	movl    $1024, %ecx
+	rep ; movsl
+	jmp     0b
+
+2:
+
+	/* To be certain of avoiding problems with self modifying code
+	 * I need to execute a serializing instruction here.
+	 * So I flush the TLB, it's handy, and not processor dependent.
+	 */
+	xorl	%eax, %eax
+	movl	%eax, %cr3
+	
+	/* set all of the registers to known values */
+	/* leave %esp alone */
+	
+	xorl	%eax, %eax
+	xorl	%ebx, %ebx
+	xorl    %ecx, %ecx
+	xorl    %edx, %edx
+	xorl    %esi, %esi
+	xorl    %edi, %edi
+	xorl    %ebp, %ebp
+	ret
+relocate_new_kernel_end:
+
+	.globl relocate_new_kernel_size
+relocate_new_kernel_size:	
+	.long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.48/include/asm-i386/kexec.h linux-2.5.48.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.48/include/asm-i386/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/include/asm-i386/kexec.h	Sun Nov 17 22:53:09 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET) 
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE	4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.48/include/asm-i386/unistd.h linux-2.5.48.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.48/include/asm-i386/unistd.h	Sun Nov 17 22:51:25 2002
+++ linux-2.5.48.x86kexec/include/asm-i386/unistd.h	Sun Nov 17 22:54:03 2002
@@ -263,7 +263,7 @@
 #define __NR_sys_epoll_wait	256
 #define __NR_remap_file_pages	257
 #define __NR_set_tid_address	258
-
+#define __NR_sys_kexec_load	259
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -uNr linux-2.5.48/include/linux/kexec.h linux-2.5.48.x86kexec/include/linux/kexec.h
--- linux-2.5.48/include/linux/kexec.h	Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/include/linux/kexec.h	Sun Nov 17 22:53:09 2002
@@ -0,0 +1,45 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/* 
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION  0x1
+#define IND_INDIRECTION  0x2
+#define IND_DONE         0x4
+#define IND_SOURCE       0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry, long nr_segments, 
+	struct kexec_segment *segments);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.48/include/linux/reboot.h linux-2.5.48.x86kexec/include/linux/reboot.h
--- linux-2.5.48/include/linux/reboot.h	Fri Oct 11 22:22:47 2002
+++ linux-2.5.48.x86kexec/include/linux/reboot.h	Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
  * POWER_OFF   Stop OS and remove all power from system, if possible.
  * RESTART2    Restart system using given command string.
  * SW_SUSPEND  Suspend system using Software Suspend if compiled in
+ * KEXEC       Restart the system using a different kernel.
  */
 
 #define	LINUX_REBOOT_CMD_RESTART	0x01234567
@@ -30,6 +31,7 @@
 #define	LINUX_REBOOT_CMD_POWER_OFF	0x4321FEDC
 #define	LINUX_REBOOT_CMD_RESTART2	0xA1B2C3D4
 #define	LINUX_REBOOT_CMD_SW_SUSPEND	0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC		0x45584543
 
 
 #ifdef __KERNEL__
diff -uNr linux-2.5.48/kernel/Makefile linux-2.5.48.x86kexec/kernel/Makefile
--- linux-2.5.48/kernel/Makefile	Sun Nov 17 22:51:26 2002
+++ linux-2.5.48.x86kexec/kernel/Makefile	Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
 ifneq ($(CONFIG_IA64),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -uNr linux-2.5.48/kernel/kexec.c linux-2.5.48.x86kexec/kernel/kexec.c
--- linux-2.5.48/kernel/kexec.c	Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/kernel/kexec.c	Sun Nov 17 22:53:09 2002
@@ -0,0 +1,640 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access.  Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory.  And this page must be identity
+ * mapped.  Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ * 
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set 
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ * 
+ */
+
+static struct kimage *kimage_alloc(void)
+{
+	struct kimage *image;
+	image = kmalloc(sizeof(*image), GFP_KERNEL);
+	if (!image)
+		return 0;
+	memset(image, 0, sizeof(*image));
+	image->head = 0;
+	image->entry = &image->head;
+	image->last_entry = &image->head;
+	return image;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+	if (image->offset != 0) {
+		image->entry++;
+	}
+	if (image->entry == image->last_entry) {
+		kimage_entry_t *ind_page;
+		ind_page = (void *)__get_free_page(GFP_KERNEL);
+		if (!ind_page) {
+			return -ENOMEM;
+		}
+		*image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+		image->entry = ind_page;
+		image->last_entry = 
+			ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+	}
+	*image->entry = entry;
+	image->entry++;
+	image->offset = 0;
+	return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+	int result;
+	
+	/* Assume the page is bad unless we pass the checks */
+	result = -EADDRNOTAVAIL;
+
+	if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+		goto out;
+	}
+
+	/* NOTE: The caller is responsible for making certain we
+	 * don't attempt to load the new image into invalid or
+	 * reserved areas of RAM.
+	 */
+	result =  0;
+out:
+	return result;
+}
+
+static int kimage_set_destination(
+	struct kimage *image, unsigned long destination) 
+{
+	int result;
+	destination &= PAGE_MASK;
+	result = kimage_verify_destination(destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, destination | IND_DESTINATION);
+	if (result == 0) {
+		image->destination = destination;
+	}
+	return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+	int result;
+	page &= PAGE_MASK;
+	result = kimage_verify_destination(image->destination);
+	if (result) {
+		return result;
+	}
+	result = kimage_add_entry(image, page | IND_SOURCE);
+	if (result == 0) {
+		image->destination += PAGE_SIZE;
+	}
+	return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+	int result;
+	result = kimage_add_entry(image, IND_DONE);
+	if (result == 0) {
+		/* Point at the terminating element */
+		image->entry--;
+	}
+	return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+	for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+		ptr = (entry & IND_INDIRECTION)? \
+			phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+	kimage_entry_t *ptr, entry;
+	kimage_entry_t ind = 0;
+	if (!image)
+		return;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_INDIRECTION) {
+			/* Free the previous indirection page */
+			if (ind & IND_INDIRECTION) {
+				free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+			}
+			/* Save this indirection page until we are
+			 * done with it.
+			 */
+			ind = entry;
+		}
+		else if (entry & IND_SOURCE) {
+			free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+		}
+	}
+	kfree(image);
+}
+
+static int kimage_is_destination_page(
+	struct kimage *image, unsigned long page)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination;
+	destination = 0;
+	page &= PAGE_MASK;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return 1;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_unused_area(
+	struct kimage *image, unsigned long size, unsigned long align,
+	unsigned long *area)
+{
+	/* Walk through mem_map and find the first chunk of
+	 * unused memory that is at least size bytes long.
+	 */
+	/* Since the kernel plays with PageReserved, mem_map is less
+	 * than ideal for this purpose, but it will give us a correct
+	 * conservative estimate of what we need to do.
+	 */
+	/* For now we take advantage of the fact that all kernel pages
+	 * are marked with PG_reserved to allocate a large
+	 * contiguous area for the reboot code buffer.
+	 */
+	unsigned long addr;
+	unsigned long start, end;
+	unsigned long mask;
+	mask = ((1 << align) -1);
+	start = end = PAGE_SIZE;
+	for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+		struct page *page;
+		unsigned long aligned_start;
+		page = virt_to_page(phys_to_virt(addr));
+		if (PageReserved(page) ||
+			kimage_is_destination_page(image, addr)) {
+			/* The current page is reserved so the start &
+			 * end of the next area must be at least at the
+			 * next page.
+			 */
+			start = end = addr + PAGE_SIZE;
+		}
+		else {
+			/* O.k.  The current page isn't reserved
+			 * so push up the end of the area.
+			 */
+			end = addr;
+		}
+		aligned_start = (start + mask) & ~mask;
+		if (aligned_start > start) {
+			continue;
+		}
+		if (aligned_start > end) {
+			continue;
+		}
+		if (end - aligned_start >= size) {
+			*area = aligned_start;
+			return 0;
+		}
+	}
+	*area = 0;
+	return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+	struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	unsigned long destination = 0;
+	for_each_kimage_entry(image, ptr, entry) {
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_SOURCE) {
+			if (page == destination) {
+				return ptr;
+			}
+			destination += PAGE_SIZE;
+		}
+	}
+	return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+	struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+	kimage_entry_t *ptr, entry;
+	for_each_kimage_entry(image, ptr, entry) {
+		unsigned long page;
+		if (ptr == limit) {
+			return 0;
+		}
+		else if (entry & IND_DESTINATION) {
+			/* nop */
+		}
+		else if (entry & IND_DONE) {
+			/* nop */
+		}
+		else {
+			/* SOURCE & INDIRECTION */
+			page = entry & PAGE_MASK;
+			if (page == destination) {
+				return ptr;
+			}
+		}
+	}
+	return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+	kimage_entry_t *ptr, *cptr, entry;
+	unsigned long buffer, page;
+	unsigned long destination = 0;
+
+	/* Here we implement safeguards to ensure that
+	 * a source page is not copied to its destination
+	 * page before the data on the destination page is
+	 * no longer useful.
+	 *
+	 * To make it work we actually wind up with a
+	 * stronger condition.  For every page considered
+	 * it is either its own destination page or it is
+	 * not a destination page of any page considered.
+	 *
+	 * Invariants 
+	 * 1. buffer is not a destination of a previous page.
+	 * 2. page is not a destination of a previous page.
+	 * 3. destination is not a previous source page.
+	 *
+	 * Result: Either a source page and a destination page 
+	 * are the same or the page is not a destination page.
+	 *
+	 * These checks could be done when we allocate the pages,
+	 * but doing it as a final pass allows us more freedom
+	 * on how we allocate pages.
+	 * 
+	 * Also while the checks are necessary, in practice nothing
+	 * happens.  The destination kernel wants to sit in the
+	 * same physical addresses as the current kernel so we never
+	 * actually allocate a destination page.
+	 *
+	 * BUGS: This is an O(N^2) algorithm.
+	 */
+
+	
+	buffer = __get_free_page(GFP_KERNEL);
+	if (!buffer) {
+		return -ENOMEM;
+	}
+	buffer = virt_to_phys((void *)buffer);
+	for_each_kimage_entry(image, ptr, entry) {
+		/* Here we check to see if an allocated page */
+		kimage_entry_t *limit;
+		if (entry & IND_DESTINATION) {
+			destination = entry & PAGE_MASK;
+		}
+		else if (entry & IND_INDIRECTION) {
+			/* Indirection pages must include all of their
+			 * contents in limit checking.
+			 */
+			limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+		}
+		if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+			continue;
+		}
+		page = entry & PAGE_MASK;
+		limit = ptr;
+
+		/* See if a previous page has the current page as its
+		 * destination.
+		 * i.e. invariant 2
+		 */
+		cptr = kimage_dst_conflict(image, page, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+			*cptr = page | (centry & ~PAGE_MASK);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = cpage;
+		}
+		if (!(entry & IND_SOURCE)) {
+			continue;
+		}
+
+		/* See if a previous page is our destination page.
+		 * If so claim it now.
+		 * i.e. invariant 3
+		 */
+		cptr = kimage_src_conflict(image, destination, limit);
+		if (cptr) {
+			unsigned long cpage;
+ 			kimage_entry_t centry;
+			centry = *cptr;
+			cpage = centry & PAGE_MASK;
+			memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+			memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+			*cptr = buffer | (centry & ~PAGE_MASK);
+			*ptr = cpage | ( entry & ~PAGE_MASK);
+			buffer = page;
+		}
+		/* If the buffer is my destination page do the copy now 
+		 * i.e. invariant 3 & 1
+		 */
+		if (buffer == destination) {
+			memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+			*ptr = buffer | (entry & ~PAGE_MASK);
+			buffer = page;
+		}
+	}
+	free_page((unsigned long)phys_to_virt(buffer));
+	return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+	unsigned long len)
+{
+	unsigned long pos;
+	int result;
+	for(pos = 0; pos < len; pos += PAGE_SIZE) {
+		char *page;
+		result = -ENOMEM;
+		page = (void *)__get_free_page(GFP_KERNEL);
+		if (!page) {
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result) {
+			goto out;
+		}
+	}
+	result = 0;
+ out:
+	return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+	struct kexec_segment *segment)
+{	
+	unsigned long mstart;
+	int result;
+	unsigned long offset;
+	unsigned long offset_end;
+	unsigned char *buf;
+
+	result = 0;
+	buf = segment->buf;
+	mstart = (unsigned long)segment->mem;
+
+	offset_end = segment->memsz;
+
+	result = kimage_set_destination(image, mstart);
+	if (result < 0) {
+		goto out;
+	}
+	for(offset = 0;  offset < segment->memsz; offset += PAGE_SIZE) {
+		char *page;
+		size_t size, leader;
+		page = (char *)__get_free_page(GFP_KERNEL);
+		if (page == 0) {
+			result  = -ENOMEM;
+			goto out;
+		}
+		result = kimage_add_page(image, virt_to_phys(page));
+		if (result < 0) {
+			goto out;
+		}
+		if (segment->bufsz < offset) {
+			/* We are past the end zero the whole page */
+			memset(page, 0, PAGE_SIZE);
+			continue;
+		}
+		size = PAGE_SIZE;
+		leader = 0;
+		if ((offset == 0)) {
+			leader = mstart & ~PAGE_MASK;
+		}
+		if (leader) {
+			/* We are on the first page zero the unused portion */
+			memset(page, 0, leader);
+			size -= leader;
+			page += leader;
+		}
+		if (size > (segment->bufsz - offset)) {
+			size = segment->bufsz - offset;
+		}
+		result = copy_from_user(page, buf + offset, size);
+		if (result) {
+			result = (result < 0)?result : -EIO;
+			goto out;
+		}
+		if (size < (PAGE_SIZE - leader)) {
+			/* zero the trailing part of the page */
+			memset(page + size, 0, (PAGE_SIZE - leader) - size);
+		}
+	}
+ out:
+	return result;
+}
+
+
+/* do_kexec executes a new kernel 
+ */
+static int do_kexec(unsigned long start, unsigned long nr_segments,
+	struct kexec_segment *arg_segments, struct kimage *image)
+{
+	struct kexec_segment *segments;
+	size_t segment_bytes;
+	int i;
+
+	int result; 
+	unsigned long reboot_code_buffer;
+	kimage_entry_t *end;
+
+	/* Initialize variables */
+	segments = 0;
+
+	segment_bytes = nr_segments * sizeof(*segments);
+	segments = kmalloc(segment_bytes, GFP_KERNEL);
+	if (segments == 0) {
+		result = -ENOMEM;
+		goto out;
+	}
+	if (copy_from_user(segments, arg_segments, segment_bytes)) {
+		result = -EFAULT;
+		goto out;
+	}
+
+	/* Read in the data from user space */
+	image->start = start;
+	for(i = 0; i < nr_segments; i++) {
+		result = kimage_load_segment(image, &segments[i]);
+		if (result) {
+			goto out;
+		}
+	}
+	
+	/* Terminate early so I can get a place holder. */
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+	end = image->entry;
+
+	/* Usage of the reboot code buffer is subtle.  We first
+	 * find a contiguous area of ram, that is not one
+	 * of our destination pages.  We do not allocate the ram.
+	 *
+	 * The algorithm to make certain we do not have address
+	 * conflicts requires each destination region to have some
+	 * backing store so we allocate arbitrary source pages.
+	 *
+	 * Later in machine_kexec when we copy data to the
+	 * reboot_code_buffer it still may be allocated for other
+	 * purposes, but we do know there are no source or destination
+	 * pages in that area.  And since the rest of the kernel
+	 * is already shutdown those pages are free for use,
+	 * regardless of their page->count values.
+	 *
+	 * The kernel mapping of the reboot code buffer is passed to
+	 * the machine dependent code.  If it needs something else
+	 * it is free to set that up.
+	 */
+	result = kimage_get_unused_area(
+		image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+		&reboot_code_buffer);
+	if (result) 
+		goto out;
+
+	/* Allocating pages we should never need is silly but the
+	 * code won't work correctly unless we have dummy pages to
+	 * work with. 
+	 */
+	result = kimage_set_destination(image, reboot_code_buffer);
+	if (result) 
+		goto out;
+	result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+	if (result)
+		goto out;
+	image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = kimage_get_off_destination_pages(image);
+	if (result)
+		goto out;
+
+	/* Now hide the extra source pages for the reboot code buffer.
+	 */
+	image->entry = end;
+	result = kimage_terminate(image);
+	if (result)
+		goto out;
+
+	result = 0;
+ out:
+	/* cleanup and exit */
+	if (segments)	kfree(segments);
+	return result;
+}
+
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ * 
+ * This call breaks up into three pieces.  
+ * - A generic part which loads the new kernel from the current
+ *   address space, and very carefully places the data in the
+ *   allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ *   the devices to shut down.  Preventing on-going dmas, and placing
+ *   the devices in a consistent state so a later kernel can
+ *   reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ *   and then copies the image to its final destination, and
+ *   jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, 
+	struct kexec_segment *segments, unsigned long flags)
+{
+	/* Am I using too much stack space here? */
+	struct kimage *image, *old_image;
+	int result;
+		
+	/* We only trust the superuser with rebooting the system. */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* In case we need just a little bit of special behavior for
+	 * reboot on panic 
+	 */
+	if (flags != 0)
+		return -EINVAL;
+
+	image = 0;
+	if (nr_segments > 0) {
+		image = kimage_alloc();
+		if (!image) {
+			return -ENOMEM;
+		}
+		result = do_kexec(entry, nr_segments, segments, image);
+		if (result) {
+			kimage_free(image);
+			return result;
+		}
+	}
+
+	old_image = xchg(&kexec_image, image);
+
+	kimage_free(old_image);
+	return 0;
+}
diff -uNr linux-2.5.48/kernel/sys.c linux-2.5.48.x86kexec/kernel/sys.c
--- linux-2.5.48/kernel/sys.c	Sun Nov 17 22:51:26 2002
+++ linux-2.5.48.x86kexec/kernel/sys.c	Sun Nov 17 22:53:09 2002
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/fs.h>
+#include <linux/kexec.h>
 #include <linux/workqueue.h>
 #include <linux/device.h>
 #include <linux/times.h>
@@ -206,6 +207,7 @@
 cond_syscall(sys_lookup_dcookie)
 cond_syscall(sys_swapon)
 cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
 cond_syscall(sys_init_module)
 cond_syscall(sys_delete_module)
 
@@ -416,6 +418,27 @@
 		machine_restart(buffer);
 		break;
 
+#ifdef CONFIG_KEXEC
+	case LINUX_REBOOT_CMD_KEXEC:
+	{
+		struct kimage *image;
+		if (arg) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		image = xchg(&kexec_image, 0);
+		if (!image) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+		system_running = 0;
+		device_shutdown();
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_kexec(image);
+		break;
+	}
+#endif
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		if (!software_suspend_enabled) {

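For readers following the entry-list walk used by kimage_is_destination_page and kimage_dst_conflict above, here is a minimal user-space sketch of the same idea. It is not the patch's code: the flag values, the 4096-byte page size, the SK_ names and the flat array terminated by a done entry are all assumptions made for illustration, and indirection pages are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for kimage_entry_t: a page address with flag bits in the low bits. */
typedef unsigned long entry_t;

/* Illustrative values only; the real definitions live in the patch's header. */
#define SK_PAGE_SIZE        4096UL
#define SK_PAGE_MASK        (~(SK_PAGE_SIZE - 1))
#define SK_IND_DESTINATION  0x1UL
#define SK_IND_DONE         0x4UL
#define SK_IND_SOURCE       0x8UL

/*
 * Return 1 if 'page' is the destination of some source entry in 'list'.
 * A destination entry sets the current destination address; each source
 * entry that follows is copied there and advances it by one page.
 */
static int is_destination_page(const entry_t *list, unsigned long page)
{
	unsigned long destination = 0;
	const entry_t *ptr;

	assert(list != NULL);
	for (ptr = list; !(*ptr & SK_IND_DONE); ptr++) {
		entry_t entry = *ptr;

		if (entry & SK_IND_DESTINATION) {
			destination = entry & SK_PAGE_MASK;
		} else if (entry & SK_IND_SOURCE) {
			if (page == destination)
				return 1;
			destination += SK_PAGE_SIZE;
		}
	}
	return 0;
}
```

With a list holding one destination entry at 0x100000 followed by two source entries, the pages at 0x100000 and 0x101000 are reported as destinations and 0x102000 is not, since no third source page follows.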
^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
@ 2002-11-19  1:10                                                   ` Andy Pfiffer
  2002-11-19 10:25                                                     ` Eric W. Biederman
  2002-11-20  8:49                                                     ` Suparna Bhattacharya
  2002-11-19  2:15                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Dave Hansen
                                                                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-19  1:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh,
	Dave Hansen, Linuxbios

[-- Attachment #1: Type: text/plain, Size: 1825 bytes --]

On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> kexec is a set of system calls that allows you to load another kernel
> from the currently executing Linux kernel.  The current implementation
> has only been tested, and had the kinks worked out on x86, but the
> generic code should work on any architecture.

Great News, Eric.  For the first time *ever* I got a kexec reboot to
work on my most troublesome machine (see below).

Current .config settings:
# CONFIG_SMP is not set
CONFIG_X86_GOOD_APIC=y
# CONFIG_X86_UP_APIC is not set
CONFIG_KEXEC=y

Oddly, kexec_test still hangs.
# ./kexec-1.7 --force ./kexec_test-1.7
FIXME assuming 6Synchronizing SCSI caches: 4M of ram

Shutting down devices
Starting new kernel
kexec_test 1.7 starting...
eax: 0E1FB007 ebx: 0000111C ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 0000006F 000010A0
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>

Complete kernel boot-up log attached below.  I'm going to try to find my
other 576MB of RAM with the right command-line magic... ;^)

For those looking to replicate:


    1. apply these two patches to 2.5.48 (bk Changeset 1.842)
    http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
    http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
    
    2. compile this:
    http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz
    
    3. my recipe for rebooting:
    a) I have a script that I execute by hand after "init 1" to unmount
    my filesystems and then remount / and /boot read-only.
    b) I have the kexec binary installed in /boot.
    c) ./kexec-1.7 --force --debug "--command-line=ro root=805
    console=ttyS0,9600n8" ./linux-2.5

Thanks, Eric!

Andy


[-- Attachment #2: Type: text/plain, Size: 12206 bytes --]

# ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
FIXME assuming 64M of ram
setup16_end: 00091b1f
FIXME assuming 64M of ram
Synchronizing SCSI caches: 
Shutting down devices
Starting new kernel
Linux version 2.5.48 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #1 Mon Nov 18 15:03:14 PST 2002
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000001000 - 000000000009ffff (usable)
 BIOS-e820: 0000000000100000 - 0000000003ffffff (usable)
63MB LOWMEM available.
hm, page 00000000 reserved twice.
On node 0 totalpages: 16383
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 12287 pages, LIFO batch:2
  HighMem zone: 0 pages, LIFO batch:1
IBM machine detected. Enabling interrupts during APM calls.
IBM machine detected. Disabling SMBus accesses.
Building zonelist for node : 0
Kernel command line: ro root=805 console=ttyS0,9600n8
Initializing CPU#0
Detected 799.717 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1581.05 BogoMIPS
Memory: 60868k/65532k available (2087k kernel code, 4204k reserved, 825k data, 304k init, 0k highmem)
Security Scaffold v1.0.0 initialized
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium III (Coppermine) stepping 0a
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
Linux Plug and Play Support v0.9 (c) Adam Belay
PCI: PCI BIOS revision 2.10 entry at 0xfd5dc, last bus=1
PCI: Using configuration type 1
BIO: pool of 256 setup, 14Kb (56 bytes/bio)
biovec pool[0]:   1 bvecs: 116 entries (12 bytes)
biovec pool[1]:   4 bvecs: 116 entries (48 bytes)
biovec pool[2]:  16 bvecs:  58 entries (192 bytes)
biovec pool[3]:  64 bvecs:  29 entries (768 bytes)
biovec pool[4]: 128 bvecs:  14 entries (1536 bytes)
biovec pool[5]: 256 bvecs:   7 entries (3072 bytes)
block request queues:
 112 requests per read queue
 112 requests per write queue
 8 requests per batch
 enter congestion at 27
 exit congestion at 29
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Discovered peer bus 01
Starting kswapd
aio_setup: sizeof(struct page) = 40
[c3fb2040] eventpoll: successfully initialized.
Journalled Block Device driver loaded
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
udf: registering filesystem
Capability LSM initialized
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 [PCSPP]
pty: 256 Unix98 ptys configured
lp0: using parport0 (polling).
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.
[drm] Initialized radeon 1.7.0 20020828 on minor 0
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/100 Network Driver - version 2.1.24-k2
Copyright (c) 2002 Intel Corporation

e100: eth0: Intel(R) PRO/100+ Server Adapter (PILA8470B)
  Mem:0xfeb7f000  IRQ:11  Speed:0 Mbps  Dx:N/A
  Hardware receive checksums enabled
  cpu cycle saver enabled

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
end_request: I/O error, dev hda, sector 0
SCSI subsystem driver Revision: 1.00
PCI: Enabling device 01:03.0 (0156 -> 0157)
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
        <Adaptec aic7892 Ultra160 SCSI adapter>
        aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
  Vendor: IBM-PSG   Model: ST318436LC    !#  Rev: 3281
  Type:   Direct-Access                      ANSI SCSI revision: 03
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
  Vendor: IBM-PSG   Model: ST318436LC    !#  Rev: 3281
  Type:   Direct-Access                      ANSI SCSI revision: 03
  Vendor: IBM       Model: YGLv3 S2          Rev: 0   
  Type:   Processor                          ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled.  Depth 64
SCSI device sda: drive cache: write through
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 sda10 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
scsi0:A:1:0: Tagged Queuing enabled.  Depth 64
SCSI device sdb: drive cache: write through
SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
 sdb: sdb1
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0,  type 3
Initializing USB Mass Storage driver...
drivers/usb/core/usb.c: registered new driver usb-storage
USB Mass Storage support registered.
mice: PS/2 mouse device common for all mice
input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10 19:48:18 2002 UTC).
request_module[snd-card-0]: not ready
request_module[snd-card-1]: not ready
request_module[snd-card-2]: not ready
request_module[snd-card-3]: not ready
request_module[snd-card-4]: not ready
request_module[snd-card-5]: not ready
request_module[snd-card-6]: not ready
request_module[snd-card-7]: not ready
ALSA device list:
  No soundcards found.
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 4096 bind 4096)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 304k freed
INIT: version 2.82 booting
Running /etc/init.d/boot
Mounting /proc device                                                done
Mounting /dev/ptsblogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
showconsole: Warning: the ioctl TIOCGDEV is not known by the kerAdding 530104k swap on /dev/sda6.  Priority:42 extents:1
nel
Activating swap-devices in /etc/fstab...                             done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
Checking file systems...
fsck 1.26 (3-Feb-2002)
/dev/sda5: clean, 16935/66264 files, 104836/265041 blocks
/dev/sda1: clean, 55/10040 files, 24115/40131 blocks
/dev/sdb1: clean, 11/2223872 files, 78008/4441964 blocks
/dev/sda10: clean, 523256/1198208 files, 2052639/2393677 blocks
/dev/sda9: clean, 51895/263296 files, 310582/526120 blocks
/dev/sda8: clean, 140195/525888 files, 590977/1050241 blocks
/dev/sda7: clean, EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), 2747/131616 fileinternal journal
s, 111363/263056 blocks                                              done
Setting up /lib/modules/2.5.48                                       failed
Mounting local file systems...
kjournald starting.  Commit interval 5 seconds
proc on /proc tyEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,17), pe proc (rw)
deinternal journal
vpts on /dev/ptsEXT3-fs: mounted filesystem with ordered data mode.
 type devpts (rw,mode=0620,gid=5)
/dev/sdb1 on /2nd type ext3 (kjournald starting.  Commit interval 5 seconds
rw)
/dev/sda1 oEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,10), n /boot type extinternal journal
2 (rw)
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda10 on /home type ext3 (rw)
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda9 on /opt type ext3 (rw)
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda8 on /usr type ext3 (rw)
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,7), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda7 on /var type ext3 (rw)                                     done
Restore device permissions                                           done
Activating remaining swap-devices in /etc/fstab...                   done
Setting up the CMOS clock                                            done
Setting up timezone data                                             done
Configuring serial ports...
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Configured serial ports                                              done
Setting up hostname 'joe'                                            done
Setting up loopback interface                                        done
Creating /var/log/boot.msg                                           done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
INIT: Entering runlevel: 5
blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
Master Resource Control: previous runlevel: N, switching to runlevel:5
Starting personal-firewall (initial) [not active]                    unused
Initializing random number generator                                 done
Setting up network interfaces:
    lo                                                               done
    eth0      (DHCP) IP address: 172.20.1.38                         done
Starting syslog services                                             done
Starting hotplugging services [ net pci usb ]                        failed
Starting hardware scan on boote100: eth0 NIC Link is Up 100 Mbps Full duplex
                                                                     done
Starting RPC portmap daemon                                          done
Starting SSH daemon                                                  done
Starting sound driver:  already running                              done
Starting service at daemon                                           done
Initializing SMTP port (sendmail)                                    done
Loading keymap qwerty/us.map.gz                                      done
Loading compose table winkeys shiftctrl latin1.add                   done
Loading console font lat1-16.psfu                                    done
Loading screenmap none                                               done
Setting up console ttys                                              done
Starting service kdm                                                 done
Starting CRON daemon                                                 done
Starting Name Service Cache Daemon                                   done
Starting inetd                                                       done
Starting personal-firewall (final) [not active]                      unused
Master Resource Control: runlevel 5 has been                         reached
Failed services in runlevel 5:                                   hotplug
Skipped services in runlevel 5:  personal-firewall.initial splash personal-firewall.final


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
  2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
@ 2002-11-19  2:15                                                   ` Dave Hansen
  2002-11-19 10:13                                                     ` Eric W. Biederman
  2002-12-02  4:41                                                   ` [ANNOUNCE] kexec-tools-1.8 Eric W. Biederman
  2002-12-02 15:54                                                   ` Eric W. Biederman
  3 siblings, 1 reply; 333+ messages in thread
From: Dave Hansen @ 2002-11-19  2:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya, Jeff Garzik,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh,
	Linuxbios

Eric W. Biederman wrote:
> kexec is a set of system calls that allows you to load another kernel
> from the currently executing Linux kernel.  The current implementation
> has only been tested, and had the kinks worked out on x86, but the
> generic code should work on any architecture.
> 
> Could I get some feedback on where this works and where this breaks.
> With the maturation of kexec-tools to skip attempting bios calls,
> I expect a new Linux kernel to load for most people.  Though I
> also expect some device drivers will not reinitialize after the reboot.

I give it a big thumbs-up.  Between the NUMAQs and the big xSeries 
machines, we have a lot of slow rebooters.  The 16GB Intel boxes take 
about 5 minutes to get back to the bootloader after a reboot, and 
the 4 and 8-quad NUMAQs take closer to 10.

The IBM machines I've tried it on are a 4-way and 8-way PIII.  They 
both have aic7xxx cards and the 8-way has a ServeRAID 4 controller. 
They have a collection of acenic, e1000, pcnet32 and eepro100 net 
cards.  All seem to work just fine.

The NUMAQ is another story, though.  I get nothing after "Starting new 
kernel".  But, I wasn't expecting much.  The NUMAQ is pretty weird 
hardware and god knows what is actually happening.  I'll try it some 
more when I'm more confident in what I'm doing.

What's the deal with "FIXME assuming 64M of ram"?  I was a little 
surprised when my 16GB machine started to OOM as I did a "make -j8 
bzImage" :)  Why is it that you need the memory size at load time?
-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19  2:15                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Dave Hansen
@ 2002-11-19 10:13                                                     ` Eric W. Biederman
  2002-11-19 15:28                                                       ` Martin J. Bligh
  2002-11-19 16:24                                                       ` Dave Hansen
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 10:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh

Dave Hansen <haveblue@us.ibm.com> writes:

> Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel.  The current implementation
> > has only been tested, and had the kinks worked out on x86, but the
> > generic code should work on any architecture.
> > Could I get some feedback on where this works and where this breaks.
> > With the maturation of kexec-tools to skip attempting bios calls,
> > I expect a new Linux kernel to load for most people.  Though I
> > also expect some device drivers will not reinitialize after the reboot.
> 
> I give it a big thumbs-up.  

And you thought I was kidding when I said it was mostly working :)

> Between the NUMAQs and the big xSeries machines, we
> have a lot of slow rebooters.  The 16GB Intel boxes take about 5 minutes to
> get back to the bootloader after a reboot, and the 4 and 8-quad NUMAQs take
> closer to 10.

Wow. 10 minutes is a pain.  That certainly explains your interest...
 
> The IBM machines I've tried it on are a 4-way and 8-way PIII.  They both have
> aic7xxx cards and the 8-way has a ServeRAID 4 controller. They have a collection
> of acenic, e1000, pcnet32 and eepro100 net cards.  All seem to work just fine.
> 
> The NUMAQ is another story, though.  I get nothing after "Starting new kernel".
> But, I wasn't expecting much.  The NUMAQ is pretty weird hardware and god knows
> what is actually happening.  I'll try it some more when I'm more confident in
> what I'm doing.

I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
special handling.   Does kexec_test not print anything, or were you not patient
enough?
 
> What's the deal with "FIXME assuming 64M of ram"?  I was a little surprised when
> my 16GB machine started to OOM as I did a "make -j8 bzImage" :)  Why is it that
> you need the memory size at load time?

Small steps.   When I bypass the BIOS I need to get all of the information
the kernel normally would get from the BIOS from someplace else.  Currently
you can use the "mem=" kernel command line parameter.  Or you can dig through
/proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
There isn't a really good source, so I started with something that would work,
and I will work the user space tools up to something that works well.

I will happily accept patches :)

Eric


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
@ 2002-11-19 10:25                                                     ` Eric W. Biederman
  2002-11-19 17:21                                                       ` Andy Pfiffer
  2002-11-20  8:49                                                     ` Suparna Bhattacharya
  1 sibling, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 10:25 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen

Andy Pfiffer <andyp@osdl.org> writes:

> On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel.  The current implementation
> > has only been tested, and had the kinks worked out, on x86, but the
> > generic code should work on any architecture.
> 
> Great News, Eric.  For the first time *ever* I got a kexec reboot to
> work on my most troublesome machine (see below).

Cool.  I was pretty certain it would get into Linux, but the fact that the
device drivers are not hanging up is a real plus.
 
> Current .config settings:
> # CONFIG_SMP is not set
> CONFIG_X86_GOOD_APIC=y
> # CONFIG_X86_UP_APIC is not set
> CONFIG_KEXEC=y
> 
> Oddly, kexec_test still hangs.
> # ./kexec-1.7 --force ./kexec_test-1.7
[snip...]
> <hang>

Yep.  I really haven't tracked down and fixed the cause of the hang,
I just avoided the issue entirely.  Eventually I will come back
and look into what it takes to improve the odds of having BIOS calls
work.  --real-mode restores the old kexec behavior.

All of the real changes were to the user space code.  The kernel
patch stayed the same.

> Complete kernel boot-up log attached below.  I'm going to try to find my
> other 576MB of RAM with the right command-line magic... ;^)

Or you can write a routine to gather that information dynamically and send
me a patch for /sbin/kexec.  Though it may take another proc file to do
that one properly.

Eric


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 10:13                                                     ` Eric W. Biederman
@ 2002-11-19 15:28                                                       ` Martin J. Bligh
  2002-11-19 17:44                                                         ` Eric W. Biederman
  2002-11-19 16:24                                                       ` Dave Hansen
  1 sibling, 1 reply; 333+ messages in thread
From: Martin J. Bligh @ 2002-11-19 15:28 UTC (permalink / raw)
  To: Eric W. Biederman, Dave Hansen; +Cc: Linux Kernel Mailing List

> I suspect the hardware shutdown and start up logic for NUMAQ cpus 
> needs some special handling.   

Almost certainly ;-) One of the main things I do differently on boot
is to use NMIs rather than the normal INIT/STARTUP sequence to bootstrap
CPUs with .... thus they aren't as thoroughly reset. Things like clearing
down the local APIC state (but NOT the LDR) and clearing down the IO-APICs
will be especially important. I haven't looked at your code yet to see
exactly what it does here though.

> Small steps.   When I bypass the BIOS I need to get all of the information
> the kernel normally would get from the BIOS from someplace else.  Currently
> you can use the "mem=" kernel command line parameter.  Or you can dig through
> /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> There isn't a really good source, so I started with something that would work,
> and I will work the user space tools up to something that works well.
> 
> I will happily accept patches :)

Sounds like we should just export back to you the value we parsed from
the BIOS from the existing boot, no? I'll see if I can make you a patch
to do that ...

M.



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 10:13                                                     ` Eric W. Biederman
  2002-11-19 15:28                                                       ` Martin J. Bligh
@ 2002-11-19 16:24                                                       ` Dave Hansen
  2002-11-19 17:33                                                         ` Linus Torvalds
  2002-11-19 17:42                                                         ` Eric W. Biederman
  1 sibling, 2 replies; 333+ messages in thread
From: Dave Hansen @ 2002-11-19 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh

Eric W. Biederman wrote:
> Dave Hansen <haveblue@us.ibm.com> writes:
>>The NUMAQ is another story, though.  I get nothing after "Starting new kernel".
>>But, I wasn't expecting much.  The NUMAQ is pretty weird hardware and god knows
>>what is actually happening.  I'll try it some more when I'm more confident in
>>what I'm doing.
> 
> I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
> special handling.   Does kexec_test not print anything, or were you not patient
> enough?

Starting new kernel
kexec_test 1.6 starting...
eax: 0E1FB007 ebx: 0000111C ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 0000006F 000010A0
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
Interrupts enabled.
Base memory size: 027E
A20 disabled.
E820 Memory Map.
000000000009FC00 @ 0000000000000000 type: 00000001
00000000EFF00000 @ 0000000000100000 type: 00000001
0000000000180000 @ 00000000FFE80000 type: 00000002
0000000000009000 @ 00000000FEC00000 type: 00000002
0000000100000000 @ 0000000100000000 type: 00000001
E801  Memory size: 003D7400
Mem88 Memory size: FC00
Testing for APM.
APM test done.
Equiptment list: 4426
Sysdesc: F000:E6F5
Video type: VGA
Cursor Position(Row,Column): 0018 0000
Video Mode: 0003
Setting auto repeat rate  done
DASD type: 0300 00FAC53F
EDD:  ok
A20 enabled
Interrupts disabled.
In protected mode.
Halting.

>>What's the deal with "FIXME assuming 64M of ram"?  I was a little surprised when
>>my 16GB machine started to OOM as I did a "make -j8 bzImage" :)  Why is it that
>>you need the memory size at load time?
> 
> Small steps.   When I bypass the BIOS I need to get all of the information
> the kernel normally would get from the BIOS from someplace else.  Currently
> you can use the "mem=" kernel command line parameter.  Or you can dig through
> /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> There isn't a really good source, so I started with something that would work,
> and I will work the user space tools up to something that works well.

I have a couple of ideas.  But, first, is it hard to reconstruct the 
memory map?  Will all 1GB systems have the same memory map?  Do you 
have documentation of the format?  I don't think that any of these 
qualify as the "right thing".  But, as hacks, they should keep me 
happy for a bit.

For now, I can write a quick script to fix it: 
--command-line="`memscript`"

Until it is working, a --hack-mem option might be a good idea.

Perhaps we could just save a copy off when the kernel loads for the 
first time. If we export it somewhere, the kexec executable can just 
copy it.  For now, we can just printk it and paste it into each 
version of kexec that we compile.

-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19 10:25                                                     ` Eric W. Biederman
@ 2002-11-19 17:21                                                       ` Andy Pfiffer
  2002-11-19 17:34                                                         ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-19 17:21 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen

On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > Complete kernel boot-up log attached below.  I'm going to try to find my
> > other 576MB of RAM with the right command-line magic... ;^)
> 
> Or you can write a routine to gather that information dynamically and send
> me a patch for /sbin/kexec.  Though it may take another proc file to do
> that one properly.
> 
> Eric

Just to make sure I understand the problem.  Until we can make all
boot-time BIOS calls work, we need a way to:

    1) capture the initial memory map used by the kernel, and
    2) supply that information to the to-be-run image.
    
On my system, the e820 map looks like this (from full reboot):
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
 BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 0000000027fed140 (usable)
 BIOS-e820: 0000000027fed140 - 0000000027ff0000 (ACPI data)
 BIOS-e820: 0000000027ff0000 - 0000000028000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
639MB LOWMEM available.

And /proc/iomem looks like this:
00000000-0009dbff : System RAM
0009dc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000ca000-000cb7ff : Extension ROM
000cb800-000cffff : Extension ROM
000f0000-000fffff : System ROM
00100000-27fed13f : System RAM
  00100000-00309f9a : Kernel code
  00309f9b-003d873f : Kernel data
27fed140-27feffff : ACPI Tables
27ff0000-27ffffff : reserved
effff000-efffffff : Adaptec AIC-7892P U160/m
  effff000-efffffff : aic7xxx
f0000000-f7ffffff : S3 Inc. Savage 4
fea00000-feafffff : Intel Corp. 82557/8/9 [Ethernet 
  fea00000-feafffff : e100
feb7e000-feb7efff : ServerWorks OSB4/CSB5 USB Contro
feb7f000-feb7ffff : Intel Corp. 82557/8/9 [Ethernet 
  feb7f000-feb7ffff : e100
feb80000-febfffff : S3 Inc. Savage 4
fec00000-ffffffff : reserved

Comparing the two:
Range			e820		/proc/iomem
00000000-0009dbff	usable		System RAM
00100000-27fed140	usable		System RAM

From a sample of 1 system, it looks like we should be able to use any
ranges marked as "System RAM" that are listed in /proc/iomem.  Did I miss
something?

I'll see if I can conjure up something...

Andy





* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 16:24                                                       ` Dave Hansen
@ 2002-11-19 17:33                                                         ` Linus Torvalds
  2002-11-19 17:48                                                           ` Eric W. Biederman
  2002-11-19 17:42                                                         ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Linus Torvalds @ 2002-11-19 17:33 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Andy Pfiffer,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh


On Tue, 19 Nov 2002, Dave Hansen wrote:
> 
> I have a couple of ideas.  But, first, is it hard to reconstruct the 
> memory map?

Hmm.. You shouldn't need to reconstruct it. It's all there in the

	struct e820map e820;

(yeah, we will have modified it to match the setup of the running kernel, 
but on the whole it should all be there, no?)

		Linus



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19 17:21                                                       ` Andy Pfiffer
@ 2002-11-19 17:34                                                         ` Eric W. Biederman
  2002-11-19 18:17                                                           ` Martin J. Bligh
  2002-11-19 19:29                                                           ` Andy Pfiffer
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 17:34 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen

Andy Pfiffer <andyp@osdl.org> writes:

> On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > > Complete kernel boot-up log attached below.  I'm going to try to find my
> > > other 576MB of RAM with the right command-line magic... ;^)
> > 
> > Or you can write a routine to gather that information dynamically and send
> > me a patch for /sbin/kexec.  Though it may take another proc file to do
> > that one properly.
> > 
> > Eric
> 
> Just to make sure I understand the problem.  Until we can make all
> boot-time BIOS calls work, we need a way to:

A small clarification.  BIOS calls will never work 100%, especially in the
interesting cases like kexec on panic.  So entering the kernel in
32-bit mode will continue to be the default mode of operation.  This means the
final solution to problems like this needs to be a good one.
 
>     1) capture the initial memory map used by the kernel, and
>     2) supply that information to the to-be-run image.
>     
> On my system, the e820 map looks like this (from full reboot):
> BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
>  BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
>  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>  BIOS-e820: 0000000000100000 - 0000000027fed140 (usable)
>  BIOS-e820: 0000000027fed140 - 0000000027ff0000 (ACPI data)
>  BIOS-e820: 0000000027ff0000 - 0000000028000000 (reserved)
>  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
> 639MB LOWMEM available.
> 
> And /proc/iomem looks like this:
> 00000000-0009dbff : System RAM
> 0009dc00-0009ffff : reserved
> 000a0000-000bffff : Video RAM area
> 000c0000-000c7fff : Video ROM
> 000ca000-000cb7ff : Extension ROM
> 000cb800-000cffff : Extension ROM
> 000f0000-000fffff : System ROM
> 00100000-27fed13f : System RAM
>   00100000-00309f9a : Kernel code
>   00309f9b-003d873f : Kernel data
> 27fed140-27feffff : ACPI Tables
> 27ff0000-27ffffff : reserved
> effff000-efffffff : Adaptec AIC-7892P U160/m
>   effff000-efffffff : aic7xxx
> f0000000-f7ffffff : S3 Inc. Savage 4
> fea00000-feafffff : Intel Corp. 82557/8/9 [Ethernet 
>   fea00000-feafffff : e100
> feb7e000-feb7efff : ServerWorks OSB4/CSB5 USB Contro
> feb7f000-feb7ffff : Intel Corp. 82557/8/9 [Ethernet 
>   feb7f000-feb7ffff : e100
> feb80000-febfffff : S3 Inc. Savage 4
> fec00000-ffffffff : reserved
> 
> Comparing the two:
> Range			e820		/proc/iomem
> 00000000-0009dbff	usable		System RAM
> 00100000-27fed140	usable		System RAM
> 
> From a sample of 1 system, it looks like we should be able to use any
> ranges marked as "System RAM" that are listed in /proc/iomem.  Did I miss
> something?

Only that /proc/iomem is only useful this way on x86 and that
it doesn't capture the details of the memory map above 4GB.  But
it is much better than only having 4GB of main memory.

Eric


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 16:24                                                       ` Dave Hansen
  2002-11-19 17:33                                                         ` Linus Torvalds
@ 2002-11-19 17:42                                                         ` Eric W. Biederman
  1 sibling, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 17:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh

Dave Hansen <haveblue@us.ibm.com> writes:

> Eric W. Biederman wrote:
> > Dave Hansen <haveblue@us.ibm.com> writes:
> >>The NUMAQ is another story, though.  I get nothing after "Starting new kernel".
> >>But, I wasn't expecting much.  The NUMAQ is pretty weird hardware and god knows
> >>what is actually happening.  I'll try it some more when I'm more confident in
> >>what I'm doing.
> > I suspect the hardware shutdown and start up logic for NUMAQ cpus needs some
> > special handling.  Does kexec_test not print anything, or were you not patient
> > enough?
> 
> Starting new kernel
> kexec_test 1.6 starting...
[snip successful run of kexec_test]
 

Hmm.  So it looks like you can make BIOS calls on the NUMAQ machine.
It is worth a try to see if "kexec --real-mode bzImage...." will start
up your kernel.   Probably not, but at least the basic mechanism of kexec
is working.  I would be very surprised if you couldn't at least start
a uniprocessor kernel.

> I have a couple of ideas.  But, first, is it hard to reconstruct the memory map?

From your kexec_test run, your memory map...

> E820 Memory Map.
> 000000000009FC00 @ 0000000000000000 type: 00000001
> 00000000EFF00000 @ 0000000000100000 type: 00000001
> 0000000000180000 @ 00000000FFE80000 type: 00000002
> 0000000000009000 @ 00000000FEC00000 type: 00000002
> 0000000100000000 @ 0000000100000000 type: 00000001
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The e820 memory map is printed out on boot up.
> 
> Will all 1GB systems have the same memory map? 

Most will have pretty much the same memory map, but in general systems
with the same amount of RAM can have different memory maps.

>  Do you have documentation of the
> format?  I don't think that any of these qualify as the "right thing".  But, as
> hacks, they should keep me happy for a bit.
> 
> For now, I can write a quick script to fix it: --command-line="`memscript`"
> 
> Until it is working a --hack-mem option might be a good idea
> 
> Perhaps we could just save a copy off when the kernel loads for the first
> time. If we export it somewhere, the kexec executable can just copy it.  For
> now, we can just printk it and paste it into each version of kexec that we
> compile.

Yep, essentially that is what needs to happen.

Eric


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 15:28                                                       ` Martin J. Bligh
@ 2002-11-19 17:44                                                         ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 17:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Dave Hansen, Linux Kernel Mailing List

"Martin J. Bligh" <mbligh@aracnet.com> writes:

> > I suspect the hardware shutdown and start up logic for NUMAQ cpus 
> > needs some special handling.   
> 
> Almost certainly ;-) One of the main things I do differently on boot
> is to use NMIs rather than the normal INIT/STARTUP sequence to bootstrap
> CPUs with .... thus they aren't as thoroughly reset. Things like clearing
> down the local APIC state (but NOT the LDR) and clearing down the IO-APICs
> will be especially important. I haven't looked at your code yet to see
> exactly what it does here though.

That part is in my x86kexec-hwfixes.diff.  I have a good first stab
at it that works on most x86 SMPs.  But apparently not on NUMAQ.
 
> > Small steps.   When I bypass the BIOS I need to get all of the information
> > the kernel normally would get from the BIOS from someplace else.  Currently
> > you can use the "mem=" kernel command line parameter.  Or you can dig through
> > /proc/iomem, /proc/meminfo and other places to get the BIOS's memory map.
> > There isn't a really good source, so I started with something that would work,
> > and I will work the user space tools up to something that works well.
> > 
> > I will happily accept patches :)
> 
> Sounds like we should just export back to you the value we parsed from
> the BIOS from the existing boot, no? I'll see if I can make you a patch
> to do that ...

Yep.  But we currently don't export it cleanly...

Eric



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 17:33                                                         ` Linus Torvalds
@ 2002-11-19 17:48                                                           ` Eric W. Biederman
  2002-11-19 17:54                                                             ` Dave Jones
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-19 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Linux Kernel Mailing List, Andy Pfiffer, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh

Linus Torvalds <torvalds@transmeta.com> writes:

> On Tue, 19 Nov 2002, Dave Hansen wrote:
> > 
> > I have a couple of ideas.  But, first, is it hard to reconstruct the 
> > memory map?
> 
> Hmm.. You shouldn't need to reconstruct it. It's all there in the
> 
> 	struct e820map e820;
> 
> (yeah, we will have modified it to match the setup of the running kernel, 
> but on the whole it should all be there, no?)

Yep.  We just need to get that information out to user space.

Eric


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7
  2002-11-19 17:48                                                           ` Eric W. Biederman
@ 2002-11-19 17:54                                                             ` Dave Jones
  0 siblings, 0 replies; 333+ messages in thread
From: Dave Jones @ 2002-11-19 17:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Dave Hansen, Linux Kernel Mailing List,
	Andy Pfiffer, Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh

On Tue, Nov 19, 2002 at 10:48:46AM -0700, Eric W. Biederman wrote:
 > > 	struct e820map e820;
 > > 
 > > (yeah, we will have modified it to match the setup of the running kernel, 
 > > but on the whole it should all be there, no?)
 > 
 > Yep.  We just need to get that information out to user space.

Arjan already did this..
http://www.kernelnewbies.org/kernels/rh80/SOURCES/linux-2.4.0-e820.patch

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19 17:34                                                         ` Eric W. Biederman
@ 2002-11-19 18:17                                                           ` Martin J. Bligh
  2002-11-20  9:19                                                             ` Eric W. Biederman
  2002-11-19 19:29                                                           ` Andy Pfiffer
  1 sibling, 1 reply; 333+ messages in thread
From: Martin J. Bligh @ 2002-11-19 18:17 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Pfiffer
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Dave Hansen

>> Just to make sure I understand the problem.  Until we can make all
>> boot-time BIOS calls work, we need a way to:
> 
> A small clarification.  BIOS calls will never work 100%, especially in the
> interesting cases like kexec on panic.  So entering the kernel in
> 32-bit mode will continue to be the default mode of operation.  This means the
> final solution to problems like this needs to be a good one.

Do we still have the mpstables and other such initdata around as well?
Or did we destroy those on boot? If we're going to do kexec on panic,
perhaps all these should be checksummed for corruption detection 
eventually (not now).

M.



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19 17:34                                                         ` Eric W. Biederman
  2002-11-19 18:17                                                           ` Martin J. Bligh
@ 2002-11-19 19:29                                                           ` Andy Pfiffer
  1 sibling, 0 replies; 333+ messages in thread
From: Andy Pfiffer @ 2002-11-19 19:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Linus Torvalds, Alan Cox,
	Werner Almesberger, Suparna Bhattacharya, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen

[-- Attachment #1: Type: text/plain, Size: 937 bytes --]

On Tue, 2002-11-19 at 09:34, Eric W. Biederman wrote:
> Andy Pfiffer <andyp@osdl.org> writes:
> 
> > On Tue, 2002-11-19 at 02:25, Eric W. Biederman wrote:
> > > > Complete kernel boot-up log attached below.  I'm going to try to find my
> > > > other 576MB of RAM with the right command-line magic... ;^)
> > > 
> > > Or you can write a routine to gather that information dynamically and send
> > > me a patch for /sbin/kexec.  Though it may take another proc file to do
> > > that one properly.
> > > 
> > > Eric

Hmmm...I seem to be having some trouble setting "mem=" (system hangs). 
Maybe multiple "mem=NNNK@0xXXXXXXXX" options won't work.

While I try to figure out what's going on, here's a program ("kargs")
that composes a kernel command line from the contents of
"/proc/cmdline" and "/proc/iomem".  It doesn't do as much error
checking as it should...

Usage (sh quoting): kexec --force "--command-line=`kargs`" bzImage

Andy



[-- Attachment #2: kargs.c --]
[-- Type: text/x-c, Size: 2180 bytes --]

/*
 *	andyp@osdl.org
 *	Tue Nov 19 09:26:22 PST 2002
 *
 *	Compose a kernel command line on stdout from the contents
 *	of /proc/iomem and /proc/cmdline.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>


struct memregion {
	unsigned long		first;
	unsigned long		last;
	struct memregion	*next;
};


int memopt(char *iomem, char *out, int outlen)
{
	FILE	*f;
	struct memregion *list, *tmp;
	char	*pattern;
	regex_t	preg;
	int	cc, rc;
	unsigned long kb;
	char	line[256];

	if ((f = fopen(iomem, "r")) == NULL)
		return -1;

	/* match only the top-level "first-last : System RAM" lines */
	pattern = "^[0-9a-fA-F].*-[0-9a-fA-F].* : System RAM";
	if (regcomp(&preg, pattern, 0)) {
		(void) fclose(f);
		return -1;
	}

	rc = -1;	/* report failure unless every region is emitted */
	list = (struct memregion *) 0;
	while (fgets(line, sizeof(line), f) != NULL) {
		if (regexec(&preg, line, 0, 0, 0) == REG_NOMATCH)
			continue;
		tmp = (struct memregion *) malloc(sizeof(struct memregion));
		if (tmp == (struct memregion *) 0)
			goto out;
		/* first and last are unsigned long, so scan with %lx */
		cc = sscanf(line, "%lx-%lx", &tmp->first, &tmp->last);
		if (cc != 2) {
			free(tmp);
			goto out;
		}
		tmp->next = list;
		list = tmp;
	}

	out[0] = 0;
	for (tmp = list; tmp; tmp = tmp->next) {
		/* /proc/iomem ranges are inclusive, hence the + 1 */
		kb = (tmp->last - tmp->first + 1) >> 10;
		snprintf(line, sizeof(line), "mem=%luK@0x%08lx%s",
			 kb, tmp->first, tmp->next ? " " : "");
		if (strlen(out) + strlen(line) >= (size_t) outlen)
			goto out;	/* would overflow the caller's buffer */
		strcat(out, line);
	}
	rc = 0;

out:
	while (list) {
		tmp = list->next;
		free(list);
		list = tmp;
	}
	regfree(&preg);
	(void) fclose(f);

	return rc;
}


static int lastcmd(char *cmndline, char *out, int outlen)
{
	FILE	*f;
	char	line[256];

	if ((f = fopen(cmndline, "r")) == NULL)
		return -1;
	memset(out, 0, outlen);
	if (fgets(line, sizeof(line), f) != NULL) {
		line[strcspn(line, "\n")] = 0;	/* drop the trailing newline, if any */
		strncpy(out, line, outlen - 1);
	}
	fclose(f);
	return 0;
}


int main(int argc, char **argv)
{
	int	cc;
	char	*name;
	char	memline[256];
	char	curline[256];

	name = "/proc/iomem";
	cc = memopt(name, memline, sizeof(memline));
	if (cc < 0) {
		perror(name);
		exit(1);
	}

	name = "/proc/cmdline";
	cc = lastcmd(name, curline, sizeof(curline));
	if (cc < 0) {
		perror(name);
		exit(1);
	}

	printf("%s %s\n", curline, memline);
	exit(0);
}


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
  2002-11-19 10:25                                                     ` Eric W. Biederman
@ 2002-11-20  8:49                                                     ` Suparna Bhattacharya
  2002-11-20  9:17                                                       ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-20  8:49 UTC (permalink / raw)
  To: Andy Pfiffer
  Cc: Eric W. Biederman, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Werner Almesberger, Jeff Garzik, Matt D. Robinson,
	Rusty Russell, Mike Galbraith, Martin J. Bligh, Dave Hansen,
	Linuxbios

On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > kexec is a set of system calls that allows you to load another kernel
> > from the currently executing Linux kernel.  The current implementation
> > has only been tested, and had the kinks worked out, on x86, but the
> > generic code should work on any architecture.
> 
> Great News, Eric.  For the first time *ever* I got a kexec reboot to
> work on my most troublesome machine (see below).

Same here - preloading the new kernel and issuing kexec -e after 
init 1 works on the troublesome SMP system I'd been sending you 
reports about earlier. Bootimg used to work on this setup, so bypassing 
the BIOS calls had the expected effect.

If I issue the call earlier though, it runs into trouble with aic7xxx
reporting interrupts during setup. Guess you know why we are looking
at that case - eventually we need to be able to transition directly at dump 
time without a chance to go through user-space shutdown ... 

Regards
Suparna

> 
> For those looking to replicate:
> 
> 
>     1. apply these two patches to 2.5.48 (bk Changeset 1.842)
>     http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
>     http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
>     
>     2. compile this:
>     http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz
>     
>     3. my recipe for rebooting:
>     a) I have a script that I execute by hand after "init 1" to unmount
>     my filesystems and then remount / and /boot read-only.
>     b) I have the kexec binary installed in /boot.
>     c) ./kexec-1.7 --force --debug "--command-line=ro root=805
>     console=ttyS0,9600n8" ./linux-2.5
> 
> Thanks, Eric!
> 
> Andy
> 

> # ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
> FIXME assuming 64M of ram
> setup16_end: 00091b1f
> FIXME assuming 64M of ram
> Synchronizing SCSI caches: 
> Shutting down devices
> Starting new kernel
> Linux version 2.5.48 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #1 Mon Nov 18 15:03:14 PST 2002
> Video mode to be used for restore is ffff
> BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000001000 - 000000000009ffff (usable)
>  BIOS-e820: 0000000000100000 - 0000000003ffffff (usable)
> 63MB LOWMEM available.
> hm, page 00000000 reserved twice.
> On node 0 totalpages: 16383
>   DMA zone: 4096 pages, LIFO batch:1
>   Normal zone: 12287 pages, LIFO batch:2
>   HighMem zone: 0 pages, LIFO batch:1
> IBM machine detected. Enabling interrupts during APM calls.
> IBM machine detected. Disabling SMBus accesses.
> Building zonelist for node : 0
> Kernel command line: ro root=805 console=ttyS0,9600n8
> Initializing CPU#0
> Detected 799.717 MHz processor.
> Console: colour VGA+ 80x25
> Calibrating delay loop... 1581.05 BogoMIPS
> Memory: 60868k/65532k available (2087k kernel code, 4204k reserved, 825k data, 304k init, 0k highmem)
> Security Scaffold v1.0.0 initialized
> Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
> Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
> Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
> -> /dev
> -> /dev/console
> -> /root
> CPU: L1 I cache: 16K, L1 D cache: 16K
> CPU: L2 cache: 256K
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#0.
> CPU: Intel Pentium III (Coppermine) stepping 0a
> Enabling fast FPU save and restore... done.
> Enabling unmasked SIMD FPU exception support... done.
> Checking 'hlt' instruction... OK.
> POSIX conformance testing by UNIFIX
> Linux NET4.0 for Linux 2.4
> Based upon Swansea University Computer Society NET3.039
> Initializing RT netlink socket
> mtrr: v2.0 (20020519)
> Linux Plug and Play Support v0.9 (c) Adam Belay
> PCI: PCI BIOS revision 2.10 entry at 0xfd5dc, last bus=1
> PCI: Using configuration type 1
> BIO: pool of 256 setup, 14Kb (56 bytes/bio)
> biovec pool[0]:   1 bvecs: 116 entries (12 bytes)
> biovec pool[1]:   4 bvecs: 116 entries (48 bytes)
> biovec pool[2]:  16 bvecs:  58 entries (192 bytes)
> biovec pool[3]:  64 bvecs:  29 entries (768 bytes)
> biovec pool[4]: 128 bvecs:  14 entries (1536 bytes)
> biovec pool[5]: 256 bvecs:   7 entries (3072 bytes)
> block request queues:
>  112 requests per read queue
>  112 requests per write queue
>  8 requests per batch
>  enter congestion at 27
>  exit congestion at 29
> isapnp: Scanning for PnP cards...
> isapnp: No Plug & Play device found
> drivers/usb/core/usb.c: registered new driver usbfs
> drivers/usb/core/usb.c: registered new driver hub
> PCI: Probing PCI hardware
> PCI: Probing PCI hardware (bus 00)
> PCI: Discovered peer bus 01
> Starting kswapd
> aio_setup: sizeof(struct page) = 40
> [c3fb2040] eventpoll: successfully initialized.
> Journalled Block Device driver loaded
> Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
> udf: registering filesystem
> Capability LSM initialized
> Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
> ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
> parport0: PC-style at 0x378 [PCSPP]
> pty: 256 Unix98 ptys configured
> lp0: using parport0 (polling).
> Linux agpgart interface v0.99 (c) Jeff Hartmann
> agpgart: Maximum main memory to use for agp memory: 27M
> agpgart: unable to determine aperture size.
> agpgart: Maximum main memory to use for agp memory: 27M
> agpgart: unable to determine aperture size.
> [drm] Initialized radeon 1.7.0 20020828 on minor 0
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a National Semiconductor PC87306
> Intel(R) PRO/100 Network Driver - version 2.1.24-k2
> Copyright (c) 2002 Intel Corporation
> 
> e100: eth0: Intel(R) PRO/100+ Server Adapter (PILA8470B)
>   Mem:0xfeb7f000  IRQ:11  Speed:0 Mbps  Dx:N/A
>   Hardware receive checksums enabled
>   cpu cycle saver enabled
> 
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: ATAPI 48X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> end_request: I/O error, dev hda, sector 0
> SCSI subsystem driver Revision: 1.00
> PCI: Enabling device 01:03.0 (0156 -> 0157)
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
>         <Adaptec aic7892 Ultra160 SCSI adapter>
>         aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
> 
> (scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
>   Vendor: IBM-PSG   Model: ST318436LC    !#  Rev: 3281
>   Type:   Direct-Access                      ANSI SCSI revision: 03
> (scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
>   Vendor: IBM-PSG   Model: ST318436LC    !#  Rev: 3281
>   Type:   Direct-Access                      ANSI SCSI revision: 03
>   Vendor: IBM       Model: YGLv3 S2          Rev: 0   
>   Type:   Processor                          ANSI SCSI revision: 02
> scsi0:A:0:0: Tagged Queuing enabled.  Depth 64
> SCSI device sda: drive cache: write through
> SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
>  sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 sda10 >
> Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
> scsi0:A:1:0: Tagged Queuing enabled.  Depth 64
> SCSI device sdb: drive cache: write through
> SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
>  sdb: sdb1
> Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
> Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0,  type 3
> Initializing USB Mass Storage driver...
> drivers/usb/core/usb.c: registered new driver usb-storage
> USB Mass Storage support registered.
> mice: PS/2 mouse device common for all mice
> input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
> serio: i8042 AUX port at 0x60,0x64 irq 12
> input: AT Set 2 keyboard on isa0060/serio0
> serio: i8042 KBD port at 0x60,0x64 irq 1
> Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10 19:48:18 2002 UTC).
> request_module[snd-card-0]: not ready
> request_module[snd-card-1]: not ready
> request_module[snd-card-2]: not ready
> request_module[snd-card-3]: not ready
> request_module[snd-card-4]: not ready
> request_module[snd-card-5]: not ready
> request_module[snd-card-6]: not ready
> request_module[snd-card-7]: not ready
> ALSA device list:
>   No soundcards found.
> NET4: Linux TCP/IP 1.0 for NET4.0
> IP: routing cache hash table of 512 buckets, 4Kbytes
> TCP: Hash tables configured (established 4096 bind 4096)
> NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> VFS: Mounted root (ext3 filesystem) readonly.
> Freeing unused kernel memory: 304k freed
> INIT: version 2.82 booting
> Running /etc/init.d/boot
> Mounting /proc device                                                done
> Mounting /dev/ptsblogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kerAdding 530104k swap on /dev/sda6.  Priority:42 extents:1
> nel
> Activating swap-devices in /etc/fstab...                             done
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
> Checking file systems...
> fsck 1.26 (3-Feb-2002)
> /dev/sda5: clean, 16935/66264 files, 104836/265041 blocks
> /dev/sda1: clean, 55/10040 files, 24115/40131 blocks
> /dev/sdb1: clean, 11/2223872 files, 78008/4441964 blocks
> /dev/sda10: clean, 523256/1198208 files, 2052639/2393677 blocks
> /dev/sda9: clean, 51895/263296 files, 310582/526120 blocks
> /dev/sda8: clean, 140195/525888 files, 590977/1050241 blocks
> /dev/sda7: clean, EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), 2747/131616 fileinternal journal
> s, 111363/263056 blocks                                              done
> Setting up /lib/modules/2.5.48                                       failed
> Mounting local file systems...
> kjournald starting.  Commit interval 5 seconds
> proc on /proc tyEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,17), pe proc (rw)
> deinternal journal
> vpts on /dev/ptsEXT3-fs: mounted filesystem with ordered data mode.
>  type devpts (rw,mode=0620,gid=5)
> /dev/sdb1 on /2nd type ext3 (kjournald starting.  Commit interval 5 seconds
> rw)
> /dev/sda1 oEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,10), n /boot type extinternal journal
> 2 (rw)
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda10 on /home type ext3 (rw)
> kjournald starting.  Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda9 on /opt type ext3 (rw)
> kjournald starting.  Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda8 on /usr type ext3 (rw)
> kjournald starting.  Commit interval 5 seconds
> EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,7), internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> /dev/sda7 on /var type ext3 (rw)                                     done
> Restore device permissions                                           done
> Activating remaining swap-devices in /etc/fstab...                   done
> Setting up the CMOS clock                                            done
> Setting up timezone data                                             done
> Configuring serial ports...
> ttyS0 at 0x03f8 (irq = 4) is a 16550A
> ttyS1 at 0x02f8 (irq = 3) is a 16550A
> Configured serial ports                                              done
> Setting up hostname 'joe'                                            done
> Setting up loopback interface                                        done
> Creating /var/log/boot.msg                                           done
> showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
> INIT: Entering runlevel: 5
> blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
> Master Resource Control: previous runlevel: N, switching to runlevel:5
> Starting personal-firewall (initial) [not active]                    unused
> Initializing random number generator                                 done
> Setting up network interfaces:
>     lo                                                               done
>     eth0      (DHCP) IP address: 172.20.1.38                         done
> Starting syslog services                                             done
> Starting hotplugging services [ net pci usb ]                        failed
> Starting hardware scan on boote100: eth0 NIC Link is Up 100 Mbps Full duplex
>                                                                      done
> Starting RPC portmap daemon                                          done
> Starting SSH daemon                                                  done
> Starting sound driver:  already running                              done
> Starting service at daemon                                           done
> Initializing SMTP port (sendmail)                                    done
> Loading keymap qwerty/us.map.gz                                      done
> Loading compose table winkeys shiftctrl latin1.add                   done
> Loading console font lat1-16.psfu                                    done
> Loading screenmap none                                               done
> Setting up console ttys                                              done
> Starting service kdm                                                 done
> Starting CRON daemon                                                 done
> Starting Name Service Cache Daemon                                   done
> Starting inetd                                                       done
> Starting personal-firewall (final) [not active]                      unused
> Master Resource Control: runlevel 5 has been                         reached
> Failed services in runlevel 5:                                   hotplug
> Skipped services in runlevel 5:  personal-firewall.initial splash personal-firewall.final
> 


-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-20  8:49                                                     ` Suparna Bhattacharya
@ 2002-11-20  9:17                                                       ` Eric W. Biederman
  2002-11-20 11:59                                                         ` Suparna Bhattacharya
  2002-11-20 15:05                                                         ` Werner Almesberger
  0 siblings, 2 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-20  9:17 UTC (permalink / raw)
  To: suparna
  Cc: Andy Pfiffer, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Werner Almesberger, Matt D. Robinson, Rusty Russell,
	Mike Galbraith, Martin J. Bligh, Dave Hansen

Suparna Bhattacharya <suparna@in.ibm.com> writes:

> On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> > On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > > kexec is a set of system calls that allow you to load another kernel
> > > from the currently executing Linux kernel.  The current implementation
> > > has only been tested, and had the kinks worked out on x86, but the
> > > generic code should work on any architecture.
> > 
> > Great News, Eric.  For the first time *ever* I got a kexec reboot to
> > work on my most troublesome machine (see below).
> 
> Same here - preloading the new kernel and issuing kexec -e after
> init 1 works on the troublesome SMP system I'd been telling you
> about earlier. Bootimg used to work on this setup, so bypassing the
> BIOS calls had the expected effect.
> 
> If I issue the call earlier though, it runs into trouble with aic7xxx
> reporting interrupts during setup. Guess you know why we are looking
> at that case - eventually need to be able to transition directly at dump 
> time without a chance to go through user-space shutdown ... 

The needed hooks are there.  You can make certain an appropriate
->shutdown()/reboot_notifier method is present, or you can fix the driver
so it can initialize the device from any random state.  

I really don't know what kinds of failures you hope to recover
from with the kexec on panic code, so I really can't comment on
how well things will work.  There will always be a set of failures
that are non-recoverable, but that doesn't mean there isn't a useful
subset.  Anyway there is certainly plenty of material for you to
experiment with and see what works usefully in practice.

Eric



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-19 18:17                                                           ` Martin J. Bligh
@ 2002-11-20  9:19                                                             ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-20  9:19 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andy Pfiffer, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Dave Hansen

"Martin J. Bligh" <mbligh@aracnet.com> writes:

> >> Just to make sure I understand the problem.  Until we can make all
> >> boot-time BIOS calls work, we need a way to:
> > 
> > A small clarification.  BIOS calls will never work 100%.  Especially in the
> > interesting cases like kexec on panic.  So entering the kernel in
> > 32bit mode will continue to be the default mode of operation.  This means the
> > final solution to problems like this needs to be a good one.
> 
> Do we still have the mpstables and other such initdata around as well?

The kernel explicitly preserves the MP tables, and all of the other
tables we pick up after we are in 32-bit mode, and leaves them right
where they are.  There is no need to do anything to convey them to the
next kernel, as pointers to them are in well-known locations.

Eric
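
For readers unfamiliar with the "well-known locations" mentioned above: on
x86 the Intel MultiProcessor Specification places a 16-byte floating pointer
structure on a 16-byte boundary; it begins with the signature "_MP_" and its
bytes sum to zero modulo 256. The following is only a rough user-space sketch
of that search over a memory snapshot. The function name and the snapshot
buffer are hypothetical, and this is not the kernel's code.

```python
import struct

MP_SIGNATURE = b"_MP_"
MP_FPS_SIZE = 16  # the floating pointer structure is one 16-byte paragraph


def find_mp_floating_pointer(region, base=0):
    """Return (address of structure, config table pointer) or None.

    `region` is a bytes snapshot of a memory area and `base` is the
    physical address it was taken from (assumed 16-byte aligned).
    """
    for off in range(0, len(region) - MP_FPS_SIZE + 1, 16):
        candidate = region[off:off + MP_FPS_SIZE]
        if candidate[:4] != MP_SIGNATURE:
            continue
        # Byte 8 holds the structure length in 16-byte units; it must be 1.
        if candidate[8] != 1:
            continue
        # All 16 bytes, including the checksum byte, must sum to 0 mod 256.
        if sum(candidate) % 256 != 0:
            continue
        # Bytes 4..7: little-endian physical address of the config table.
        (config_ptr,) = struct.unpack_from("<I", candidate, 4)
        return base + off, config_ptr
    return None
```

Because the structure is found by scanning rather than by a value handed
over at boot, a second kernel can locate it the same way the first one did.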



* Re: Kexec for v2.5.47-bk2
  2002-11-15 14:37                                                 ` Werner Almesberger
@ 2002-11-20  9:44                                                   ` Suparna Bhattacharya
  2002-11-20 17:28                                                     ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-20  9:44 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Eric W. Biederman, Andy Pfiffer, Alan Cox,
	Linux Kernel Mailing List, Martin J. Bligh, torvalds

On Fri, Nov 15, 2002 at 11:37:07AM -0300, Werner Almesberger wrote:
> Suparna Bhattacharya wrote:
> > What would be best way to pass a parameter or address from the
> > current kernel to kernel being booted (e.g log buffer address
> > or crash dump buffer etc) ?
> 
> At the moment, perhaps the initrd mechanism might be a useful
> interface for this. You'd just leave some space either at the
> beginning or at the end of the real initrd (if there's one),
> and put your data there.
> 
> Afterwards, you can extract it either from the kernel, or even
> from user space through /dev/initrd (with "noinitrd")
> 
> Advantages:
>  - fairly non-intrusive
>  - (almost ?) all platforms support this way of handling "some
>    object in memory"
>  - easy to play with from user space
> 
> Drawbacks:
>  - needs synchronization with existing uses of initrd
>  - a bit hackish
> 
> I'd expect that there will be eventually a number of things that
> get passed from old to new kernels (e.g. crash data, device scan
> results, etc.), so it may be useful to delay designing a "clean"
> interface (for this, I expect some TLV structure in the initrd
> area would make most sense) until more of those things have
> shown up.
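
To make the "TLV structure in the initrd area" idea concrete, here is a
minimal sketch of a type-length-value encoding, assuming a 16-bit type and a
32-bit length in little-endian order followed by the payload. The record
types and function names are hypothetical; the thread does not define a
format, so this is one possible layout rather than a proposal anyone made.

```python
import struct

# Hypothetical record types; the thread does not define any.
TLV_END = 0
TLV_CRASH_HEADER = 1
TLV_LOG_BUFFER = 2

_HEADER = struct.Struct("<HI")  # 16-bit type, 32-bit length, little-endian


def tlv_pack(records):
    """Serialize (type, payload) pairs and append an end marker."""
    out = bytearray()
    for rtype, payload in records:
        out += _HEADER.pack(rtype, len(payload)) + payload
    out += _HEADER.pack(TLV_END, 0)
    return bytes(out)


def tlv_unpack(blob):
    """Parse records up to the end marker; raise ValueError on truncation."""
    records = []
    off = 0
    while True:
        if off + _HEADER.size > len(blob):
            raise ValueError("truncated TLV header")
        rtype, length = _HEADER.unpack_from(blob, off)
        off += _HEADER.size
        if rtype == TLV_END:
            return records
        if off + length > len(blob):
            raise ValueError("truncated TLV payload")
        records.append((rtype, blob[off:off + length]))
        off += length
```

A reader that does not recognize a type can skip it using the length field,
which is what lets new kinds of data be added later without breaking older
consumers.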

Yes indeed. At the moment, however, I was just looking at something
as simple as a single parameter (or a few) to pass from an old
kernel to the new one. That parameter could be a scalar value or
variable, or denote the address of a control block, or something
requiring more complicated interpretation like you mention.
If the parameter is a pointer to an address block, right now the
code to put it in a place that doesn't get overwritten when the
new kernel loads is left as the responsibility of the caller.
Designing a generic and clean interface for that would require
more thought and is best delayed a bit till we understand all the
needs better. Mcore, for example (as you probably know already),
passes a map of affected pages to the new kernel, and during early
bootmem initialization those pages (from the previous boot) are
marked as reserved, instead of moving them to a contiguous memory
area. It's just the start of the map (the crash header) that's still
passed in at a fixed location (or rather, it's relative to the end of
the current image), and I was looking at a nice way to avoid that.
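
As an illustration of the page-map approach described above, the sketch
below turns a list of page frame numbers into contiguous byte ranges, each
of which would correspond to one reservation during early boot. It assumes
4 KiB pages and an invented input format (a plain list of page frame
numbers); it is not Mcore's actual data layout or code.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on 32-bit x86


def coalesce_pfns(pfns):
    """Turn page frame numbers into sorted (start_addr, size) byte ranges."""
    ranges = []
    for pfn in sorted(set(pfns)):
        addr = pfn * PAGE_SIZE
        if ranges and ranges[-1][0] + ranges[-1][1] == addr:
            # This page directly follows the previous range: extend it.
            start, size = ranges[-1]
            ranges[-1] = (start, size + PAGE_SIZE)
        else:
            ranges.append((addr, PAGE_SIZE))
    return ranges
```

Reserving the pages in place this way avoids copying them into one
contiguous area, at the cost of carrying the map across the reboot.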

One way, of course, is to add one or more kernel parameters and set
them from user space (after extracting the value from the
kernel, possibly via kmem) when loading the image (kexec tools
does all the work of filling up the parameter block). Probably
that's what was intended.

Eric, is that correct? BTW, do you have an option (or plan
to add one) in kexec tools to use the current kernel's parameters
and append additional options to them?

Regards
Suparna

> 
> - Werner
> 
> -- 
>   _________________________________________________________________________
>  / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-20  9:17                                                       ` Eric W. Biederman
@ 2002-11-20 11:59                                                         ` Suparna Bhattacharya
  2002-11-20 15:05                                                         ` Werner Almesberger
  1 sibling, 0 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-20 11:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Pfiffer, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Werner Almesberger, Matt D. Robinson, Rusty Russell,
	Mike Galbraith, Martin J. Bligh, Dave Hansen

On Wed, Nov 20, 2002 at 02:17:04AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <suparna@in.ibm.com> writes:
> 
> > On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
> > > On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
> > > > kexec is a set of system calls that allow you to load another kernel
> > > > from the currently executing Linux kernel.  The current implementation
> > > > has only been tested, and had the kinks worked out on x86, but the
> > > > generic code should work on any architecture.
> > > 
> > > Great News, Eric.  For the first time *ever* I got a kexec reboot to
> > > work on my most troublesome machine (see below).
> > 
> > Same here - preloading the new kernel and issuing kexec -e after
> > init 1 works on the troublesome SMP system I'd been telling you
> > about earlier. Bootimg used to work on this setup, so bypassing the
> > BIOS calls had the expected effect.
> > 
> > If I issue the call earlier though, it runs into trouble with aic7xxx
> > reporting interrupts during setup. Guess you know why we are looking
> > at that case - eventually need to be able to transition directly at dump 
> > time without a chance to go through user-space shutdown ... 
> 
> The needed hooks are there.  You can make certain an appropriate
> ->shutdown()/reboot_notifier method is present, or you can fix the driver
> so it can initialize the device from any random state.  
> 
> I really don't know what kinds of failures you hope to recover
> from with the kexec on panic code, so I really can't comment on
> how well things will work.  There will always be a set of failures
> that are non-recoverable, but that doesn't mean there isn't a useful

I agree. If we can get as far with this as the situations in which
mcore with bootimg worked, that would be a lot (but then we never did
try that on 2.5, and I am not sure whether it was using the current
aic7xxx driver). Handling of more difficult cases can happen bit by
bit after that. Whatever can be covered is useful even if it doesn't
address all kinds of troublesome situations.

> subset.  Anyway there is certainly plenty of material for you to
> experiment with and see what works usefully in practice.

Yes there is, thanks :)

Regards
Suparna

> 
> Eric
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India



* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-20  9:17                                                       ` Eric W. Biederman
  2002-11-20 11:59                                                         ` Suparna Bhattacharya
@ 2002-11-20 15:05                                                         ` Werner Almesberger
  2002-11-20 16:48                                                           ` Eric W. Biederman
  1 sibling, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-11-20 15:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: suparna, Andy Pfiffer, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Matt D. Robinson, Rusty Russell, Mike Galbraith,
	Martin J. Bligh, Dave Hansen

Eric W. Biederman wrote:
> The needed hooks are there.  You can make certain an appropriate
> ->shutdown()/reboot_notifier method is present, or you can fix the driver
> so it can initialize the device from any random state.  

In the case of a crash, you may not be able to use the normal
shutdown, but there may still be pending bus master accesses, e.g.
from an on-going transfer, or free buffers that will eventually
(i.e. there's no use in "waiting for the operation to finish") get
used.

Initializing the device from any state is certainly a good feature,
and it will cure the most visible symptoms, but problems may still
occur if the device decides to scribble over memory after leaving
the original kernel, and before the reset has occurred under the
new kernel. (Or did you mean to initialize before invoking kexec ?)

I see several possible approaches for this:

 0) do as bootimg did, and ignore the problem :-)
 1) try to call the regular device shutdown. In the case of a
    crash, this may hang, or corrupt the system further.
 2) add a new callback that just silences the device, without
    trying to clean things up. This is probably the best
    long-term solution.
 3) if there's a way to just reset some or all devices on the
    PCI bus without knowing what they are, this should have the
    desired effect, while being relatively easy to implement.
    (This probably still leaves things like AGP, multi-level PCI
    bus structures, non-PCI, etc.)

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story!
  2002-11-20 15:05                                                         ` Werner Almesberger
@ 2002-11-20 16:48                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-20 16:48 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: suparna, Andy Pfiffer, Linux Kernel Mailing List, Linus Torvalds,
	Alan Cox, Matt D. Robinson, Rusty Russell, Mike Galbraith,
	Martin J. Bligh, Dave Hansen

Werner Almesberger <wa@almesberger.net> writes:

> Eric W. Biederman wrote:
> > The needed hooks are there.  You can make certain an appropriate
> > ->shutdown()/reboot_notifier method is present, or you can fix the driver
> > so it can initialize the device from any random state.  
> 
> In the case of a crash, you may not be able to use the normal
> shutdown, but there may still be pending bus master accesses, e.g.
> from an on-going transfer, or free buffers that will eventually
> (i.e. there's no use in "waiting for the operation to finish") get
> used.
> 
> Initializing the device from any state is certainly a good feature,
> and it will cure the most visible symptoms, but problems may still
> occur if the device decides to scribble over memory after leaving
> the original kernel, and before the reset has occurred under the
> new kernel. (Or did you mean to initialize before invoking kexec ?)


In this case I suspect the best route is to locate the kexec_on_panic
buffers for kexec where we want to use them.  Then, in most cases where
a device is scribbling on memory, it isn't scribbling on the memory
needed to get the new kernel going, unless the device was improperly
set up.

> I see several possible approaches for this:
> 
>  0) do as bootimg did, and ignore the problem :-)
>  1) try to call the regular device shutdown. In the case of a
>     crash, this may hang, or corrupt the system further.
>  2) add a new callback that just silences the device, without
>     trying to clean things up. This is probably the best
>     long-term solution.

Roughly, that is ->shutdown().  It was separated from the ->remove()
case so that it could be stripped down to a minimal implementation.

Eric


* Re: Kexec for v2.5.47-bk2
  2002-11-20  9:44                                                   ` Suparna Bhattacharya
@ 2002-11-20 17:28                                                     ` Eric W. Biederman
  0 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-11-20 17:28 UTC (permalink / raw)
  To: suparna
  Cc: Werner Almesberger, Andy Pfiffer, Alan Cox,
	Linux Kernel Mailing List, Martin J. Bligh, torvalds

Suparna Bhattacharya <suparna@in.ibm.com> writes:

> Yes indeed. At the moment however I was just looking at something 
> as simple as a single (or more) parameter to pass from an old 
> kernel to the new one. 

Currently we pass all kinds of parameters, the e820 memory map being
one of the significant ones, though the arch-specific locations are
generally not the best ones to use.

> That parameter could be a scalar value/
> variable or denote the address of a control block, or something 
> requiring more complicated interpretation like you mention.
> If the parameter is a pointer to an address block right now the
> code to put it in a place that doesn't get overwritten when the
> new kernel loads is left as the responsibility of the caller.
> Designing a generic and clean interface for that would require
> more thought and is best delayed a bit till we understand all the
> needs better. Mcore for example (as you probably know already)
> passes a map of affected pages to the new kernel and during early 
> bootmem initialization those pages (from the previous boot) are 
> marked as reserved, instead of moving them to a contiguous memory 
> area. Its just the start of the map (crash header) that's still 
> passed in as a fixed location (rather its relative to the end of
> the current image) and I was looking at a nice way to avoid that.

When you can do it, passing tables at a fixed or relatively fixed
address is a powerful way to do things, at least when they are
supposed to have a long lifetime.  I'm not quite certain about
a temporary solution.

> One way of course is to add a kernel parameter(s) and set this 
> through user-space (after extracting it from the
> kernel .. possibly via kmem) when loading the image (kexec tools
> does all the work of filling up the parameter block). Probably
> that's what was intended.
>
 
> Eric, Is that correct ? 

Yes.  Getting the information down to user space and then putting
it in the kernel is a reasonable thing to do.

> BTW, did you have an option (or plan 
> to add one) in kexec tools to use the current kernel's parameters 
> and append additional options to it ?

For command-line arguments that is trivial:
--command-line="`cat /proc/cmdline` extra arguments"
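
The same thing can be done from a script that wants to compute the extra
arguments. The sketch below, with hypothetical helper names, reads the
running kernel's command line and builds the single --command-line=
argument shown above; it assumes nothing about kexec beyond that one
option.

```python
def read_current_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


def build_command_line_arg(current_cmdline, extra_args):
    """Return a --command-line= argument with extra_args appended."""
    # split() also drops the trailing newline and collapses repeated spaces.
    parts = current_cmdline.split() + list(extra_args)
    return "--command-line=" + " ".join(parts)
```

If the result is passed as one argv element (for example through
subprocess.run with a list), no shell quoting is needed even though the
value contains spaces.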

For the rest it would require a little more work, as all of the
kernel's current parameters are not currently preserved.  But my basic
take is that I would rather derive/create the parameters for the new
kernel than just copy them from some fixed location.  Then passing
the current values just becomes a matter of policy, which the user can
control.

For me it is important to be able to boot new kernels, and things
other than Linux.  And especially in those cases the policy needs to
be driven from user space, as there is no real standardization of
parameters or what can be passed.  Nor is there much desire among
the various kernel authors and bootloader authors to come up with a
standard format they all can use.  A good proposal with an unchanging
story and years of history behind it may eventually change some
minds, but I'm not holding my breath.

So beyond the functionality that is currently there, I am not really
enthusiastic about optimizing the "do what I just did" case.  For me
that is not an especially interesting case.

Eric


* [ANNOUNCE] kexec-tools-1.8
  2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
  2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
  2002-11-19  2:15                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Dave Hansen
@ 2002-12-02  4:41                                                   ` Eric W. Biederman
  2002-12-03  2:30                                                     ` Dave Hansen
  2002-12-02 15:54                                                   ` Eric W. Biederman
  3 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-12-02  4:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh,
	Dave Hansen, Klingaman, Aaron L


kexec-tools-1.8 is now available at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.8.tar.gz

Dave Hansen has a patch that allows /proc/iomem to export resources
above 4GB, which is needed on machines with > 4GB of RAM.

Changes:
- /proc/iomem is now parsed so the new kernel's memory map should be correct.
- initrds are now actually read into memory so they should work as well.

That should make kexec quite useable.
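
For readers who have not looked at /proc/iomem, here is a minimal Python
sketch of the kind of parsing described in the change list. It assumes the
usual "start-end : name" layout with hexadecimal addresses and indented
child entries; it is not the kexec-tools code, and the sample addresses
are invented.

```python
def parse_iomem(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, name) for each top-level /proc/iomem entry."""
    ranges = []
    for line in text.splitlines():
        # Indented lines are children of the previous entry; skip them,
        # along with anything that does not look like "start-end : name".
        if line[:1].isspace() or " : " not in line:
            continue
        span, name = line.split(" : ", 1)
        start, end = (int(part, 16) for part in span.split("-"))
        ranges.append((start, end, name.strip()))
    return ranges


def system_ram(text: str) -> list[tuple[int, int]]:
    """Keep only the ranges a new kernel could be told are usable RAM."""
    return [(s, e) for s, e, name in parse_iomem(text) if name == "System RAM"]


SAMPLE = """\
00000000-0009fbff : System RAM
000a0000-000bffff : Video RAM area
00100000-3fffffff : System RAM
  00100000-002fffff : Kernel code
100000000-3ffffffff : System RAM
"""
```

Python integers are not limited to 32 bits, so the last sample entry
(above 4GB) parses the same way as the others.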

The syscall:
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
and the fixes
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
continue to apply to 2.5.50 so I have not updated them.  

The archive is at:
http://www.xmission.com/~ebiederm/files/kexec/

My apologies for not getting this out sooner.  Along with the holidays I have been
battling a cold...

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* [ANNOUNCE] kexec-tools-1.8
  2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
                                                                     ` (2 preceding siblings ...)
  2002-12-02  4:41                                                   ` [ANNOUNCE] kexec-tools-1.8 Eric W. Biederman
@ 2002-12-02 15:54                                                   ` Eric W. Biederman
  3 siblings, 0 replies; 333+ messages in thread
From: Eric W. Biederman @ 2002-12-02 15:54 UTC (permalink / raw)
  To: Linux Kernel Mailing List


kexec-tools-1.8 is now available at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.8.tar.gz

Dave Hansen has a patch that allows /proc/iomem to export resources
above 4GB, which is needed on machines with > 4GB of RAM.

Changes:
- /proc/iomem is now parsed so the new kernel's memory map should be correct.
- initrds are now actually read into memory so they should work as well.

That should make kexec quite useable.

The syscall:
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
and the fixes
http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff
continue to apply to 2.5.50 so I have not updated them.  

The archive is at:
http://www.xmission.com/~ebiederm/files/kexec/

My apologies for not getting this out sooner.  Along with the holidays I have been
battling a cold...

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE] kexec-tools-1.8
  2002-12-02  4:41                                                   ` [ANNOUNCE] kexec-tools-1.8 Eric W. Biederman
@ 2002-12-03  2:30                                                     ` Dave Hansen
  2002-12-03  7:35                                                       ` Eric W. Biederman
  0 siblings, 1 reply; 333+ messages in thread
From: Dave Hansen @ 2002-12-03  2:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh,
	Klingaman, Aaron L

It booted on my first try, even with the 64-bit /proc/iomem changes. 
I tried it on machines with 16GB and 1GB of RAM.  (insert clapping here)

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE] kexec-tools-1.8
  2002-12-03  2:30                                                     ` Dave Hansen
@ 2002-12-03  7:35                                                       ` Eric W. Biederman
  2002-12-13  2:00                                                         ` Dave Hansen
  0 siblings, 1 reply; 333+ messages in thread
From: Eric W. Biederman @ 2002-12-03  7:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Andy Pfiffer, Linus Torvalds,
	Alan Cox, Werner Almesberger, Suparna Bhattacharya,
	Matt D. Robinson, Rusty Russell, Mike Galbraith, Martin J. Bligh,
	Klingaman, Aaron L

Dave Hansen <haveblue@us.ibm.com> writes:

> It booted on my first try, even with the 64-bit /proc/iomem changes. I tried it
> on machines with 16GB and 1GB of RAM.  (insert clapping here)

Thanks.  The code for reading /proc/iomem was modeled after
Andy Pfiffer's work, and your earlier patch.  I just cleaned them
up and integrated them cleanly with my existing code base.

I guess that means I should shake off the bit rot and resubmit
to Linus.

Eric

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [ANNOUNCE] kexec-tools-1.8
  2002-12-03  7:35                                                       ` Eric W. Biederman
@ 2002-12-13  2:00                                                         ` Dave Hansen
  0 siblings, 0 replies; 333+ messages in thread
From: Dave Hansen @ 2002-12-13  2:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Suparna Bhattacharya, Martin J. Bligh

I got around to trying it on a NUMA-Q again.  It makes it well into 
the kernel this time.  I've been getting some strange CPU numbering 
problems, but that was happening to a lesser extent before I threw 
kexec in there.

Right now it's dying in the memory allocator, but that is probably 
just something that didn't get initialized right, or some cross-quad 
memory that isn't set up right.

I would really like to see this go into 2.5.  The fact that it gets 
this far on something as exotic as a NUMA-Q is a tribute to its 
maturity.
-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 23:39 ` Werner Almesberger
@ 2002-11-05 12:45   ` Suparna Bhattacharya
  0 siblings, 0 replies; 333+ messages in thread
From: Suparna Bhattacharya @ 2002-11-05 12:45 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Richard J Moore, Jeff Garzik, linux-kernel, lkcd-devel,
	lkcd-devel-admin, lkcd-general, Rusty Russell, Linus Torvalds,
	Matt D. Robinson

On Thu, Oct 31, 2002 at 08:39:35PM -0300, Werner Almesberger wrote:
> Richard J Moore wrote:
> > and so do many people. In fact netdump, mcore and lkcd are all
> > complementary parts of the same need.
> 
> It's the "complementary" that worries me. Once you have mcore, what
> good are direct dumps to the network or the disk? With mcore,
> the whole issue of accessing stable storage is eliminated.
> 
> I don't know if the approach of having multiple quasi-equivalent
> means of storing a dump is something that Linus dislikes about
> LKCD, but I think it might be worth exploring if LKCD's chance of
> acceptance could be improved by focusing on a single but general
> mechanism.

The very question that's kept me up late some nights :)
It is also one of the reasons for spending so much time integrating
mcore seamlessly into the lkcd framework rather than plugging it in
as is at a high level: precisely to avoid bloat while retaining
flexibility, and to move from something that works today to
improved schemes in the future.

The decision on which dump device implementations (block, net,
memory, and other special types) to include could be separate
from the base dump system, and could change as time passes.

> 
> I think it would be a pity if we ended up not having crash dumps
> in 2.6 only because they're over-featured ...

The dump driver interface is pretty simple, if you look at it,
though it was meant to be powerful enough to do a lot of nice
things in the future.

Regards
Suparna

> 
> - Werner
> 
> -- 
>   _________________________________________________________________________
>  / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 22:04     ` Bernhard Kaindl
@ 2002-11-01  0:33       ` Werner Almesberger
  0 siblings, 0 replies; 333+ messages in thread
From: Werner Almesberger @ 2002-11-01  0:33 UTC (permalink / raw)
  To: Bernhard Kaindl; +Cc: linux-kernel, Linus Torvalds, lkcd-general

Bernhard Kaindl wrote:
> An analogy to doctors, hospitals and patients:

I have a simpler medical analogy:

 - in many cases, all you know is that the patient died
   (e.g. think of a router - it has no console, no user
   interacting with it, etc.)
 - the Oops tells you that the patient died of heart failure
   (NULL pointer dereferenced in this or that function, called
   from ...)
 - but it's only the autopsy (the crash dump) that reveals that
   the patient was poisoned, and that this is not a routine
   case

I view crash dumps as a tool that helps me imagine what the
machine was doing. Without that, I can learn many interesting
things about the code, but I won't necessarily find the actual
bug.

Examples of non-obvious bugs can be found in the various module
unload race discussions. There, usually competent people
suggested incorrect designs, simply because they failed to
imagine some constellations, and no amount of staring at the
source could have helped this lack of imagination.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 22:47 Richard J Moore
@ 2002-10-31 23:39 ` Werner Almesberger
  2002-11-05 12:45   ` Suparna Bhattacharya
  0 siblings, 1 reply; 333+ messages in thread
From: Werner Almesberger @ 2002-10-31 23:39 UTC (permalink / raw)
  To: Richard J Moore
  Cc: Jeff Garzik, linux-kernel, lkcd-devel, lkcd-devel-admin,
	lkcd-general, Rusty Russell, Linus Torvalds, Matt D. Robinson

Richard J Moore wrote:
> and so do many people. In fact netdump, mcore and lkcd are all
> complementary parts of the same need.

It's the "complementary" that worries me. Once you have mcore, what
good are direct dumps to the network or the disk? With mcore,
the whole issue of accessing stable storage is eliminated.

I don't know if the approach of having multiple quasi-equivalent
means of storing a dump is something that Linus dislikes about
LKCD, but I think it might be worth exploring if LKCD's chance of
acceptance could be improved by focusing on a single but general
mechanism.

I think it would be a pity if we ended up not having crash dumps
in 2.6 only because they're over-featured ...

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina         wa@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
@ 2002-10-31 22:47 Richard J Moore
  2002-10-31 23:39 ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: Richard J Moore @ 2002-10-31 22:47 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Jeff Garzik, linux-kernel, lkcd-devel, lkcd-devel-admin,
	lkcd-general, Rusty Russell, Linus Torvalds, Matt D. Robinson


> I'm not so convinced about this. I like the Mission Critical
> approach:

and so do many people. In fact netdump, mcore and lkcd are all
complementary parts of the same need. That's why we are working with
mcrit's blessing to merge mcore into lkcd. That's a big piece of work,
which we hope to make progress with during 2003 - Suparna's the expert :-)

Richard


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 21:08   ` Benjamin LaHaise
@ 2002-10-31 22:04     ` Bernhard Kaindl
  2002-11-01  0:33       ` Werner Almesberger
  0 siblings, 1 reply; 333+ messages in thread
From: Bernhard Kaindl @ 2002-10-31 22:04 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, lkcd-general

On Thu, 31 Oct 2002, Benjamin LaHaise wrote:
> On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> > And imnsho, debugging the kernel on a source level is the way to do it.
> >
> > Which is why it's not going to be me who merges it.
> >
> > Read my emails.
>
> That is one of the reasons that crash dumps are useful.  Quite a few
> problems that customers hit are not easy to reproduce, but when they
> provide a dump file that can be loaded into gdb with the original
> kernel debugging info and the backtrace command issued and various
> bits of internal structures examined, usually a good hypothesis can
> be made for the cause.  Feed that back into a code audit and you end
> up fixing problems that are decidedly challenging.
>
> 		-ben

I could not have said it better. I have two examples for it: one that
really happened, and one that is just an analogy to paint a picture.

[ I'm not an expert, I'm just writing about my experience ]
[ in order to try to make linux even better than it is    ]

About debugging at source level:

Dump analysis does not mean that you are not debugging on a source level.
With a vmlinux compiled with -g (which could be stripped before making
the image), crash analysis tools can operate at source level (depending
on the compiler's reorderings, of course; the assumption that -O2 maps
source:binary 1:1 is of course not from this world).

An analogy to doctors, hospitals and patients:

dump analysis says you don't need to have a living patient
in order to cure a disease. It says you may have slept on the
other side of the world while the disease murdered your fellow
at home. But as you don't like that it happens again to another
fellow, you want to have a remote lab which gives you every info
you need to have in order to know what might have murdered him.

The dump tools are this remote lab. If you don't have it, you
may need to fly over to the site where the disease is, monitor
the patient and try to find out what's happening, and you can't
find out what's up without at least one more dead patient at
the end.

But the hospital may not want even a single dead
patient more than necessary (ideally 0) and would choose a doctor
who has the remote lab where he can quickly check what's up
and find a cure *before* the next patient gets ill.

Back to the computer world, this would mean that an OS having
the remote lab (dump tools) would be favoured over an OS that
does not have it. The same goes for LTT and Dynamic Probes.

Back to crash dump: In some environments like laboratory or blood
bank information systems you need to use computers in order to
efficiently process, store and distribute data, and organize
the handling of blood. In such environments, people's lives
can depend on a fast, efficient and stable organisation.

Of course you need to be able to recover and continue such
organisation even with the laboratory information system being
down for a reboot or maintenance.

But you simply cannot go there, halt all the distributed information
retrieval and automated job control with the laboratory apparatuses,
block all the users (maybe thousands) for debugging the kernel and
check what is going on while the whole hospital is waiting for you.

Of course you can do this, but only once, or only at a time
when every use of the system can be organized to bypass it and
use paper, in-house mail and phone to do the things the system
normally does. A hospital with thousands of patients cannot
wait while you debug.

> Which is why it's not going to be me who merges it.

Sure, but it would help Linux World Domination if the base
kernel supported it as well.

Bernd

PS: Sorry for the extreme example, but it is one
I know from my previous work, and I've tried to describe
it as realistically as possible.


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 20:40 ` Linus Torvalds
  2002-10-31 20:54   ` Patrick Finnegan
@ 2002-10-31 21:08   ` Benjamin LaHaise
  2002-10-31 22:04     ` Bernhard Kaindl
  1 sibling, 1 reply; 333+ messages in thread
From: Benjamin LaHaise @ 2002-10-31 21:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Herrmann, linux-kernel, lkcd-devel, lkcd-devel-admin,
	lkcd-general, Rusty Russell, Matt D. Robinson

On Thu, Oct 31, 2002 at 12:40:28PM -0800, Linus Torvalds wrote:
> And imnsho, debugging the kernel on a source level is the way to do it.
> 
> Which is why it's not going to be me who merges it.
> 
> Read my emails.

That is one of the reasons that crash dumps are useful.  Quite a few 
problems that customers hit are not easy to reproduce, but when they 
provide a dump file that can be loaded into gdb with the original 
kernel debugging info and the backtrace command issued and various 
bits of internal structures examined, usually a good hypothesis can 
be made for the cause.  Feed that back into a code audit and you end 
up fixing problems that are decidedly challenging.

		-ben
-- 
"Do you seek knowledge in time travel?"

^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 20:40 ` Linus Torvalds
@ 2002-10-31 20:54   ` Patrick Finnegan
  2002-10-31 21:08   ` Benjamin LaHaise
  1 sibling, 0 replies; 333+ messages in thread
From: Patrick Finnegan @ 2002-10-31 20:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Herrmann, linux-kernel, lkcd-devel, lkcd-devel-admin,
	lkcd-general, Rusty Russell, Matt D. Robinson

On Thu, 31 Oct 2002, Linus Torvalds wrote:

> On Thu, 31 Oct 2002, Andreas Herrmann wrote:
> >
> > A dump mechanism within the kernel is a base for much easier
> > kernel debugging.
> > IMHO, analyzing a dump is much more effective than guessing
> > a kernel bug solely with help of an oops message.
>
> And imnsho, debugging the kernel on a source level is the way to do it.
>
> Which is why it's not going to be me who merges it.

But LKCD is also useful for tracing crashes back to the hardware that causes
them.  It's really hard to find problems in hardware using source code,
since the source code DOESN'T have anything to do with the problems.

Pat
--
Purdue University ITAP/RCS
Information Technology at Purdue
Research Computing and Storage
http://www-rcd.cc.purdue.edu

http://dilbert.com/comics/dilbert/archive/images/dilbert2040637020924.gif




^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
  2002-10-31 20:22 Andreas Herrmann
@ 2002-10-31 20:40 ` Linus Torvalds
  2002-10-31 20:54   ` Patrick Finnegan
  2002-10-31 21:08   ` Benjamin LaHaise
  0 siblings, 2 replies; 333+ messages in thread
From: Linus Torvalds @ 2002-10-31 20:40 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: linux-kernel, lkcd-devel, lkcd-devel-admin, lkcd-general,
	Rusty Russell, Matt D. Robinson


On Thu, 31 Oct 2002, Andreas Herrmann wrote:
> 
> A dump mechanism within the kernel is a base for much easier
> kernel debugging.
> IMHO, analyzing a dump is much more effective than guessing
> a kernel bug solely with help of an oops message.

And imnsho, debugging the kernel on a source level is the way to do it.

Which is why it's not going to be me who merges it.

Read my emails.

		Linus


^ permalink raw reply	[flat|nested] 333+ messages in thread

* Re: [lkcd-devel] Re: What's left over.
@ 2002-10-31 20:22 Andreas Herrmann
  2002-10-31 20:40 ` Linus Torvalds
  0 siblings, 1 reply; 333+ messages in thread
From: Andreas Herrmann @ 2002-10-31 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, lkcd-devel, lkcd-devel-admin, lkcd-general,
	Rusty Russell, Matt D. Robinson


      Linus Torvalds <torvalds@transmeta.com>
      Sent by: lkcd-devel-admin@lists.sourceforge.net
      10/31/02 04:46 PM

On Wed, 30 Oct 2002, Matt D. Robinson wrote:

  > People have to realize that my kernel is not for random new
  > features. The stuff I consider important are things that people
  > use on their own, or stuff that is the base for other work.

A dump mechanism within the kernel is a base for much easier
kernel debugging.
IMHO, analyzing a dump is much more effective than guessing
a kernel bug solely with help of an oops message.
Using lkcd/lcrash, I've debugged plenty of problems in
kernel modules that were otherwise quite hard to pin down.
It is hard to understand why developers do not want the
aid of dump/dump-analysis for kernel development.


Regards,

Andreas


^ permalink raw reply	[flat|nested] 333+ messages in thread

* RE: [lkcd-devel] Re: What's left over.
@ 2002-10-31 18:17 Deepak Kumar Gupta, Noida
  0 siblings, 0 replies; 333+ messages in thread
From: Deepak Kumar Gupta, Noida @ 2002-10-31 18:17 UTC (permalink / raw)
  To: Chris Friesen, Linus Torvalds
  Cc: Matt D. Robinson, Rusty Russell, linux-kernel, lkcd-general, lkcd-devel

> Linus Torvalds wrote:
> 
> > 	In particular when it comes to this project, I'm told about
> > 	"netdump", which doesn't try to dump to a disk, but over the net.
> > 	And quite frankly, my immediate reaction is to say "Hell, I
> > 	_never_ want the dump touching my disk, but over the network
> > 	sounds like a great idea".
> > 
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
> 
> How do you deal with netdump when your network driver is what caused the
> crash?
> 
> Ideally I would like to see a dump framework that can have a number of
> possible dump targets.  We should be able to dump to any combination of
> network, serial, disk, flash, unused ram that isn't wiped over restarts,
> etc...
This is what LKCD with its generic interface is: it has the capability
to include various dump targets in a very clean way. Originally LKCD was
meant for saving dumps only to disk, but its generic interface has made
it possible to have a number of dump targets.
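
To make the idea of pluggable dump targets concrete, here is a small
Python sketch. It is purely hypothetical: the names DumpTarget,
BufferTarget and dump_to_all are invented for illustration, and the real
LKCD interface is in-kernel C code that this does not attempt to reproduce.

```python
from typing import Iterable, Protocol


class DumpTarget(Protocol):
    """Hypothetical interface: anything that can store a dump image."""

    name: str

    def write(self, image: bytes) -> bool:
        ...


class BufferTarget:
    """Stand-in target that keeps the image in memory.

    Setting broken=True simulates a device that cannot be used,
    such as a network card whose driver caused the crash.
    """

    def __init__(self, name: str, broken: bool = False) -> None:
        self.name = name
        self.broken = broken
        self.stored = b""

    def write(self, image: bytes) -> bool:
        if self.broken:
            return False
        self.stored = image
        return True


def dump_to_all(image: bytes, targets: Iterable[DumpTarget]) -> list[str]:
    """Offer the image to every configured target; return those that stored it."""
    return [target.name for target in targets if target.write(image)]
```

With several targets configured, a failure of one (say, the network) does
not prevent the dump from reaching the others.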

Regards
Deepak.

^ permalink raw reply	[flat|nested] 333+ messages in thread

end of thread, other threads:[~2002-12-13  1:53 UTC | newest]

Thread overview: 333+ messages
2002-10-31  2:07 What's left over Rusty Russell
2002-10-31  2:31 ` Linus Torvalds
2002-10-31  2:43   ` Alexander Viro
2002-10-31 16:36     ` Oliver Xymoron
2002-10-31 17:04       ` Stephen Frost
2002-10-31 17:38       ` Linus Torvalds
2002-10-31 18:00         ` Oliver Xymoron
2002-11-06 20:52           ` Florian Weimer
2002-10-31 22:57     ` Pavel Machek
2002-10-31 22:28       ` Xavier Bestel
2002-10-31 23:08         ` Pavel Machek
2002-11-01  9:55         ` Miquel van Smoorenburg
2002-10-31  3:00   ` Rusty Russell
2002-10-31  3:19     ` tridge
2002-10-31  6:21       ` Chris Wedgwood
2002-11-05  3:38         ` Andreas Gruenbacher
2002-10-31  3:22     ` Christoph Hellwig
2002-10-31  3:31       ` tridge
2002-10-31 10:15     ` Joe Thornber
2002-10-31 14:26       ` Jeff Garzik
2002-10-31 14:55         ` Alan Cox
2002-10-31 21:14       ` Rusty Russell
2002-11-01  8:20         ` Joe Thornber
2002-10-31 11:03     ` Geert Uytterhoeven
2002-10-31 21:17       ` James Simmons
2002-10-31  3:06   ` Rik van Riel
2002-10-31  3:19     ` Stephen Frost
2002-10-31 21:09       ` john stultz
2002-10-31 21:49         ` Werner Almesberger
2002-10-31 22:32           ` john stultz
2002-10-31 22:54             ` Werner Almesberger
2002-11-01  0:54               ` john stultz
2002-11-01  1:31                 ` Werner Almesberger
2002-11-05  3:58                 ` Andreas Gruenbacher
2002-10-31  6:22     ` Chris Wedgwood
2002-10-31  6:48       ` Dax Kelson
2002-10-31  6:56         ` Chris Wedgwood
2002-10-31 14:31           ` Jeff Garzik
2002-10-31 18:12             ` Chris Wedgwood
2002-10-31 18:49               ` Linus Torvalds
2002-10-31 19:43                 ` Chris Wedgwood
2002-11-01 15:25                   ` Linus Torvalds
2002-11-01 15:35                     ` bert hubert
2002-11-01 15:50                     ` Gerald Britton
2002-11-01 18:17                       ` Matt Porter
2002-11-01 16:15                     ` Michael Clark
2002-11-01 16:16                     ` Erik Andersen
2002-11-01 20:43                     ` romieu
2002-10-31 18:28           ` Nicholas Wourms
2002-10-31 18:58             ` Alexander Viro
2002-10-31 19:14               ` Nicholas Wourms
2002-10-31 19:20             ` Alan Cox
2002-10-31 19:17               ` Nicholas Wourms
2002-10-31 20:45               ` Jeff Garzik
2002-11-01  6:00               ` James Morris
2002-10-31  7:10         ` Alexander Viro
2002-10-31  7:21           ` Dax Kelson
2002-10-31  7:42             ` Alexander Viro
2002-10-31 16:24               ` Stephen Wille Padnos
2002-10-31 16:44                 ` Alexander Viro
2002-10-31 17:11                   ` Stephen Frost
2002-10-31 17:30                     ` Alexander Viro
2002-10-31 17:39                       ` Linus Torvalds
2002-10-31 17:36                   ` Richard Gooch
2002-11-02 17:35               ` LA Walsh
2002-11-02 20:44                 ` Chris Wedgwood
2002-10-31 22:53           ` Pavel Machek
2002-10-31  9:44     ` Lech Szychowski
2002-10-31  3:14   ` Karim Yaghmour
2002-10-31 16:00     ` LTT for inclusion into 2.5 bob
2002-10-31 16:19       ` Is your idea good? [was: Re: LTT for inclusion into 2.5] Larry McVoy
2002-10-31 16:38         ` Cort Dougan
2002-10-31 16:47         ` bob
2002-10-31 17:35         ` Karim Yaghmour
2002-10-31  3:21   ` What's left over Stephen Lord
2002-10-31  3:59   ` Andreas Dilger
2002-10-31  4:20   ` Patrick Finnegan
2002-10-31  4:25     ` Christoph Hellwig
2002-10-31  4:31       ` Patrick Finnegan
2002-10-31  5:13   ` Dax Kelson
2002-10-31  6:07   ` [PATCH] kexec for 2.5.45 Eric W. Biederman
2002-10-31  6:25   ` What's left over Matt D. Robinson
2002-10-31 15:46     ` Linus Torvalds
2002-10-31 17:10       ` Patrick Finnegan
2002-10-31 17:13       ` Michael Shuey
2002-10-31 19:04         ` Alan Cox
2002-10-31 19:42           ` Michael Shuey
2002-11-01 22:25           ` Pavel Machek
2002-11-02 13:30             ` Michael Shuey
2002-10-31 17:18       ` Matt D. Robinson
2002-10-31 17:25         ` Linus Torvalds
2002-10-31 17:54           ` Matt D. Robinson
2002-10-31 17:54             ` Linus Torvalds
2002-10-31 18:21               ` Patrick Finnegan
2002-10-31 18:31               ` John Alvord
2002-11-02 23:44             ` Horst von Brand
2002-11-03  1:14               ` Matt D. Robinson
2002-10-31 18:10           ` Chris Friesen
2002-10-31 18:22             ` Linus Torvalds
2002-10-31 20:59               ` Dave Anderson
2002-10-31 21:49                 ` Oliver Xymoron
2002-11-01  1:25                 ` [lkcd-devel] " Matt D. Robinson
2002-11-01  6:34               ` Bill Davidsen
2002-11-01 13:26                 ` Alan Cox
2002-11-01 19:00                   ` Joel Becker
2002-11-01 19:18                     ` Linus Torvalds
2002-11-01 20:06                       ` Steven King
2002-11-02  5:17                         ` Bill Davidsen
2002-11-02  5:36                           ` Zwane Mwaikambo
2002-11-03 14:08                             ` Bill Davidsen
2002-11-02 15:29                           ` Alan Cox
2002-11-03  1:24                             ` [lkcd-general] " Matt D. Robinson
2002-11-03  1:49                               ` Alan Cox
2002-11-03  9:34                                 ` [lkcd-devel] " Matt D. Robinson
2002-11-03 14:33                                 ` Bill Davidsen
2002-11-03 15:34                                   ` Bernd Eckenfels
2002-11-03 16:32                                   ` Alan Cox
2002-11-03 17:08                                     ` [lkcd-devel] " Matt D. Robinson
2002-11-05 18:07                                     ` Bill Davidsen
2002-11-03  3:10                               ` Christoph Hellwig
2002-11-01 20:21                       ` David Lang
2002-11-01 22:25                         ` Werner Almesberger
2002-11-01 22:42                           ` Karim Yaghmour
2002-11-01 22:54                             ` Werner Almesberger
2002-11-01 23:10                               ` Karim Yaghmour
2002-11-01 20:22                       ` [lkcd-devel] " Matt D. Robinson
2002-11-02 13:02                         ` Kai Henningsen
2002-11-01 20:37                       ` Hugh Dickins
2002-11-02 18:23                         ` Geert Uytterhoeven
2002-11-03  2:25                         ` Horst von Brand
2002-11-04 16:18                           ` Hugh Dickins
2002-11-03 13:48                   ` Bill Davidsen
2002-11-03 14:26                     ` yodaiken
2002-11-05 17:09                       ` Bill Davidsen
2002-11-05 17:36                         ` yodaiken
2002-11-04  2:44                     ` [lkcd-general] " Jennie Haywood
2002-11-04 14:45                       ` Henning P. Schmiedehausen
2002-11-04 15:29                         ` Alan Cox
2002-11-04 15:27                           ` Henning P. Schmiedehausen
2002-11-04 15:38                             ` Patrick Finnegan
2002-11-04 16:51                               ` Henning P. Schmiedehausen
2002-11-05  4:57                         ` Werner Almesberger
2002-10-31 18:50             ` Alan Cox
2002-10-31 21:33             ` Rusty Russell
2002-11-01  1:19               ` [lkcd-devel] " Matt D. Robinson
2002-11-01  2:59                 ` Rusty Russell
2002-10-31 18:15           ` Andrew Morton
2002-10-31 19:58             ` Bernhard Kaindl
2002-11-02  0:49             ` What's left over. - Dave's crash code supports a gdb interface for LKCD crash dumps Piet Delaney
2002-10-31 18:16           ` What's left over Oliver Xymoron
2002-10-31 18:26             ` Linus Torvalds
2002-10-31 18:49           ` Rik van Riel
2002-10-31 21:02           ` Jeff Garzik
2002-10-31 22:37             ` Werner Almesberger
2002-11-05 11:42               ` [lkcd-devel] " Suparna Bhattacharya
2002-11-05 18:00                 ` Werner Almesberger
2002-11-05 18:36                   ` Alan Cox
2002-11-05 19:19                     ` Werner Almesberger
2002-11-05 20:10                       ` Alan Cox
2002-11-05 23:25                         ` Werner Almesberger
2002-11-06  0:21                       ` Andy Pfiffer
2002-11-06  1:10                         ` Werner Almesberger
2002-11-06  1:37                           ` Alexander Viro
2002-11-06  2:05                             ` Werner Almesberger
2002-11-07  6:04                               ` Eric W. Biederman
2002-11-07 12:17                                 ` Werner Almesberger
2002-11-06  4:07                             ` Eric W. Biederman
2002-11-06  4:47                               ` Eric W. Biederman
2002-11-06 19:24                               ` Rob Landley
2002-11-10 18:35                         ` Pavel Machek
2002-11-06  2:48                     ` Eric W. Biederman
2002-11-06  4:29                     ` Eric W. Biederman
2002-11-06  6:25                       ` Linus Torvalds
2002-11-06  6:38                         ` Suparna Bhattacharya
2002-11-06  7:48                         ` Eric W. Biederman
2002-11-06  9:11                           ` Suparna Bhattacharya
2002-11-06 22:05                           ` Michal Jaegermann
2002-11-06 16:13                         ` Eric W. Biederman
2002-11-07  8:50                         ` Eric W. Biederman
2002-11-07 15:44                           ` Linus Torvalds
2002-11-09 23:05                             ` Eric W. Biederman
2002-11-09 23:33                               ` Linus Torvalds
2002-11-10  1:37                                 ` Eric W. Biederman
2002-11-10  2:12                                   ` Alan Cox
2002-11-10  2:16                                     ` Eric W. Biederman
2002-11-10  3:03                                       ` Werner Almesberger
2002-11-10  3:23                                         ` Eric W. Biederman
2002-11-10 14:30                                       ` Alan Cox
2002-11-10 16:56                                         ` Eric W. Biederman
2002-11-10  3:17                                   ` Linus Torvalds
2002-11-10  4:26                                     ` Eric W. Biederman
2002-11-10 18:07                                     ` Kexec 2.5.46-b6 Eric W. Biederman
2002-11-11 18:03                                     ` [lkcd-devel] Re: What's left over Eric W. Biederman
2002-11-11 18:15                                     ` Kexec for v2.5.47 Eric W. Biederman
2002-11-11 22:52                                       ` Kexec for v2.5.47 (test feedback) Andy Pfiffer
2002-11-12  7:22                                         ` Eric W. Biederman
2002-11-13  0:48                                           ` Andy Pfiffer
2002-11-13  4:16                                             ` Eric W. Biederman
2002-11-13 13:26                                             ` Kexec for v2.5.47-bk2 Eric W. Biederman
2002-11-15  9:24                                               ` Suparna Bhattacharya
2002-11-15 14:14                                                 ` Eric W. Biederman
2002-11-15 14:37                                                 ` Werner Almesberger
2002-11-20  9:44                                                   ` Suparna Bhattacharya
2002-11-20 17:28                                                     ` Eric W. Biederman
2002-11-18  0:07                                             ` [ANNOUNCE] kexec-tools-1.6 released Eric W. Biederman
2002-11-18  5:46                                               ` Eric W. Biederman
2002-11-18  8:53                                                 ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Eric W. Biederman
2002-11-19  1:10                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 -- Success Story! Andy Pfiffer
2002-11-19 10:25                                                     ` Eric W. Biederman
2002-11-19 17:21                                                       ` Andy Pfiffer
2002-11-19 17:34                                                         ` Eric W. Biederman
2002-11-19 18:17                                                           ` Martin J. Bligh
2002-11-20  9:19                                                             ` Eric W. Biederman
2002-11-19 19:29                                                           ` Andy Pfiffer
2002-11-20  8:49                                                     ` Suparna Bhattacharya
2002-11-20  9:17                                                       ` Eric W. Biederman
2002-11-20 11:59                                                         ` Suparna Bhattacharya
2002-11-20 15:05                                                         ` Werner Almesberger
2002-11-20 16:48                                                           ` Eric W. Biederman
2002-11-19  2:15                                                   ` [ANNOUNCE][CFT] kexec for v2.5.48 && kexec-tools-1.7 Dave Hansen
2002-11-19 10:13                                                     ` Eric W. Biederman
2002-11-19 15:28                                                       ` Martin J. Bligh
2002-11-19 17:44                                                         ` Eric W. Biederman
2002-11-19 16:24                                                       ` Dave Hansen
2002-11-19 17:33                                                         ` Linus Torvalds
2002-11-19 17:48                                                           ` Eric W. Biederman
2002-11-19 17:54                                                             ` Dave Jones
2002-11-19 17:42                                                         ` Eric W. Biederman
2002-12-02  4:41                                                   ` [ANNOUNCE] kexec-tools-1.8 Eric W. Biederman
2002-12-03  2:30                                                     ` Dave Hansen
2002-12-03  7:35                                                       ` Eric W. Biederman
2002-12-13  2:00                                                         ` Dave Hansen
2002-12-02 15:54                                                   ` Eric W. Biederman
2002-11-09 23:39                               ` [lkcd-devel] Re: What's left over Randy.Dunlap
2002-11-10  2:58                                 ` Eric W. Biederman
2002-11-10 14:35                                   ` Alan Cox
2002-11-10 18:13                                     ` Eric W. Biederman
2002-11-10  1:31                               ` Werner Almesberger
2002-11-10  3:10                                 ` Eric W. Biederman
2002-11-10  3:30                                   ` Werner Almesberger
2002-11-10  3:49                                     ` Eric W. Biederman
2002-11-10  3:49                                   ` Linus Torvalds
2002-11-10  2:08                               ` Alan Cox
2002-11-10  2:18                                 ` Eric W. Biederman
2002-11-10 14:31                                   ` Alan Cox
2002-11-07 15:48                           ` Linus Torvalds
2002-11-07 19:32                           ` kexec (was: [lkcd-devel] Re: What's left over.) Andy Pfiffer
2002-11-07 22:13                             ` Andy Pfiffer
2002-11-07 22:56                               ` Werner Almesberger
2002-11-11 17:03                             ` Bill Davidsen
     [not found]                             ` <200211080536.31287.landley@trommello.org>
2002-11-11 17:58                               ` Andy Pfiffer
2002-11-11 18:25                                 ` Eric W. Biederman
2002-11-08 18:01                           ` [lkcd-devel] Re: What's left over Alan Cox
2002-11-09 21:21                   ` Pavel Machek
2002-11-11 16:27                     ` Eric W. Biederman
2002-11-01  1:35             ` Matt D. Robinson
2002-11-01  2:06               ` Jeff Garzik
2002-11-01  3:46                 ` Matt D. Robinson
2002-11-01  4:45                   ` Linus Torvalds
2002-11-01  4:57                     ` Patrick Finnegan
2002-11-01  9:18                       ` Henning P. Schmiedehausen
2002-11-01 14:55                         ` Patrick Finnegan
2002-11-01 15:16                           ` Alexander Viro
2002-11-01 15:27                             ` Patrick Finnegan
2002-11-01 16:16                             ` Patrick Finnegan
2002-11-01 16:32                               ` Larry McVoy
2002-11-01 16:44                                 ` Linux without Linus was " Brian Jackson
2002-11-01 16:58                                   ` Paul Fulghum
2002-11-01 19:14                                 ` Shawn
2002-11-01 19:36                                   ` Shawn
2002-11-01 17:56                               ` Nicolas Pitre
2002-11-01 18:23                               ` Shane R. Stixrud
2002-11-01 19:18                                 ` John Alvord
2002-11-04  2:13                               ` Rob Landley
2002-11-04 14:58                                 ` Patrick Finnegan
2002-11-04 12:59                                   ` Rob Landley
2002-11-01 15:32                           ` Richard B. Johnson
2002-11-01 13:30             ` Alan Cox
2002-11-01 22:28               ` Rusty Russell
2002-11-01  6:27           ` Bill Davidsen
2002-11-01  6:36             ` Linus Torvalds
2002-11-01  7:00               ` [lkcd-devel] " Castor Fu
2002-11-01  8:23               ` Craig I. Hagan
2002-11-01 14:03                 ` Patrick Finnegan
2002-11-02  4:57                 ` Bill Davidsen
2002-11-01 13:28               ` Alan Cox
2002-11-02  5:00                 ` Bill Davidsen
2002-11-02 15:30                   ` Alan Cox
2002-11-02 18:55                   ` Arnaldo Carvalho de Melo
2002-11-02 19:19                     ` romieu
2002-11-02 19:21                       ` Arnaldo Carvalho de Melo
2002-11-02 19:32                         ` romieu
2002-11-02 19:42                           ` Arnaldo Carvalho de Melo
2002-11-02 20:23                             ` romieu
2002-11-02 20:31                     ` Alan Cox
2002-11-02 20:12                       ` Arnaldo Carvalho de Melo
2002-11-01  9:20             ` Henning P. Schmiedehausen
2002-11-01 13:29             ` Alan Cox
2002-10-31 22:20         ` Shawn
2002-10-31 23:14           ` [lkcd-general] " Bernhard Kaindl
2002-11-01  2:01           ` Matt D. Robinson
2002-11-02 10:36             ` Brad Hards
2002-11-02 19:28               ` [lkcd-devel] " Matt D. Robinson
2002-10-31 17:55       ` [lkcd-general] " Dave Craft
2002-10-31 18:45         ` Patrick Mochel
2002-10-31 19:16           ` Stephen Hemminger
2002-10-31 19:57             ` george anzinger
2002-10-31 20:48               ` Stephen Hemminger
2002-10-31 19:33       ` [lkcd-devel] " Castor Fu
2002-10-31  7:46   ` Ville Herva
2002-10-31  9:23     ` Geert Uytterhoeven
2002-10-31  9:39       ` Ville Herva
2002-10-31 10:16   ` Trever L. Adams
2002-10-31 18:08     ` Nicholas Wourms
2002-10-31 13:36   ` mbs
2002-10-31 14:21   ` Chris Friesen
2002-10-31 14:52   ` Suparna Bhattacharya
2002-10-31 16:37   ` Henning P. Schmiedehausen
2002-11-01  0:52   ` James Simmons
2002-11-01 10:24   ` What's left over. (Fbdev rewrite) Helge Hafting
2002-11-05 17:29 ` kexec (was: Re: What's left over.) Werner Almesberger
2002-11-05 18:10   ` Benjamin LaHaise
2002-11-05 19:06   ` Martin J. Bligh
2002-10-31 18:17 [lkcd-devel] Re: What's left over Deepak Kumar Gupta, Noida
2002-10-31 20:22 Andreas Herrmann
2002-10-31 20:40 ` Linus Torvalds
2002-10-31 20:54   ` Patrick Finnegan
2002-10-31 21:08   ` Benjamin LaHaise
2002-10-31 22:04     ` Bernhard Kaindl
2002-11-01  0:33       ` Werner Almesberger
2002-10-31 22:47 Richard J Moore
2002-10-31 23:39 ` Werner Almesberger
2002-11-05 12:45   ` Suparna Bhattacharya
