* new execve/kernel_thread design
From: Al Viro @ 2012-10-16 22:35 UTC
  To: linux-kernel
  Cc: linus-arch, Linus Torvalds, Catalin Marinas, Haavard Skinnemoen,
	Mike Frysinger, Jesper Nilsson, David Howells, Tony Luck,
	Benjamin Herrenschmidt, Hirokazu Takata, Geert Uytterhoeven,
	Michal Simek, Jonas Bonn, James E.J. Bottomley, Richard Kuo,
	Martin Schwidefsky, Lennox Wu, David S. Miller, Paul Mundt,
	Chris Zankel, Chris Metcalf, Yoshinori Sato, Guan Xuetao

[apologies for enormous Cc; I've talked to some of you in private mail
and after being politely asked to explain WTF was all that thing for
and how was it supposed to work, well...]

	Below is an attempt to describe how kernel threads work now.  I'm
going to put a cleaned up variant into Documentation/something, so any
questions, suggestions of improvements, etc. are very welcome.

	1.  Basic rules for process lifetime.
Except for the initial process (init_task, eventual idle thread on the boot
CPU) all processes are created by do_fork().  There are three classes of
those: kernel threads, userland processes and idle threads to be.  There are
a few low-level operations involved:
	* a kernel thread can spawn a new kernel thread; the primitive
doing that is kernel_thread().
	* a userland process can spawn a new userland process; that's
done by sys_fork()/sys_vfork()/sys_clone()/sys_clone2().
	* a kernel thread can become a userland process.  The primitive
is kernel_execve().
	* a kernel thread can spawn a future idle thread; that's done
by fork_idle().  Result is *not* scheduled until the secondary CPU gets
initialized and its state is heavily overwritten in the process.
Under no circumstances can a userland process become a kernel thread or
spawn one.  And kernel threads never do fork(2) et al.

Note that kernel_thread() and kernel_execve() are really very low-level.
In particular, any process, be it a userland one or a kernel thread, can
ask a dedicated kernel thread (kthreadd) to spawn a kernel thread and
have a given function executed in it.  Or to stop a thread that had been
spawned that way.  Or ask to spawn it tied to given CPU, etc.  The public
interfaces are in linux/kthread.h and the code implementing them is in
kernel/kthread.c; kernel_thread() is what it uses internally.  Another
group of related public APIs deals with spawning a userland process from
kernel - call_usermodehelper() and friends in linux/kmod.h and kernel/kmod.c.
These two groups cover everything kernel-thread-related we care about in
the kernel and I'm not going to deal with them here.  What I'm going to
describe is the primitives used to implement those mechanisms.

Historically the situation used to be different - kernel_thread() used to
be a fairly widely used public API until 2006 or so.  Some out-of-tree
code might still be using it; the proper fix is to switch to use of
kthread_run() and be done with that.  kthread_run() has calling conventions
and rules for callback similar to what kernel_thread() used to have, so
conversion tends to be trivial.

The rules for kernel_thread() callbacks (all 6 of them ;-) have changed,
though.  What we currently have is
	kernel_thread(fn, arg)
where arg is void * and fn is an int-returning function with void * as
argument.  New kernel thread is created and fn(arg) is called in it.
It should either never return (run forever, as kthreadd does, or call something
that would terminate the thread - do_exit() or, in one case, panic())
or return 0.  Returning 0 is done only after kernel_execve() has returned 0;
the thread will then proceed into the userland context created by that execve.
Note that some architectures still have kernel_execve() itself switch
to userland upon success; that's fine - this is just another case of
callback never returning to caller.  In other words, this switchover
to new model isn't a flagday affair - all callbacks are already in the
form that works both for converted and unconverted architectures.

	2. How should kernel_thread() and kernel_execve() work for
converted architecture?

Recall how the fork() works.  We have the syscall call do_fork(), passing
it the pointer to struct pt_regs created on syscall entry and holding the
userland state of caller.  do_fork() has new task_struct and new kernel
stack allocated; then it calls copy_thread(), which sets the arch-dependent
things for new process.  Then it makes the new task_struct visible to
scheduler and once it's picked for execution, it'll be woken up and proceed
to return to userland, restoring the userland state copied from the parent.
The work of copy_thread() is to arrange the things up for that.

It copies pt_regs to wherever the child would expect to find them on return
from syscall (usually on child's kernel stack) and sets things up so that
when the scheduler finally does switch_to() into the newborn, it will be
woken up in the code that will drive it to userland.  Normally switch_to()
wakes the next process up in the place where it has given the CPU last time,
i.e. in the same switch_to().  We could, in principle, set the things up
for newborn so that they would look that way.  No architecture goes to
such pains, though - no point faking a fairly deep call chain, especially
since changes in scheduler might require modifying all such fakers.  What's
done instead is a much shorter call chain - we act as if we had given CPU
up in the very beginning of ret_from_fork(), called from the syscall entry
glue.  Since we won't be going through the parts of schedule() done after
switch_to(), ret_from_fork() starts with calling schedule_tail() to mop
up.  Then it's off to the normal return from syscall.

Old implementation of kernel_thread() had been rather convoluted.  In the
best case, it filled struct pt_regs according to its arguments and passed
them to do_fork().  The goal was to fool the code doing return from
syscall into *not* leaving the kernel mode, so that newborn would
(after emptying its kernel stack) end up in a helper function with the
right values in registers.  Running in kernel mode.  The helper took
fn and arg, and called the former passing it the latter.  Then it called
do_exit(), assuming it got that far.  Contortions came from the "fool
the return from syscall into leaving us in kernel mode" part.

New implementation is much simpler.  Generic kernel_thread() still does
do_fork(), but instead of filling pt_regs it passes fn and arg in a couple
of arguments that are blindly passed to copy_thread() and passes NULL as
pt_regs pointer.  In that case copy_thread() should still arrange the things
up for switch_to(), but instead of ret_from_fork() we want to wake up in
a slightly different function.  The name (just as in case of ret_from_fork)
is entirely up to the architecture; I've called it ret_from_kernel_thread
in most of the cases.  What it does is almost identical to what ret_from_fork()
does; it calls schedule_tail() to mop up, then it does fn(arg), using the
information left to it by copy_thread(), then it's off to return from syscall.
Note the difference between that and the old one: instead of
	* schedule_tail() finishes the things for scheduler
	* return to userland, fooled into leaving us in kernel mode; registers
are set from what we'd left in pt_regs.
	* we are in helper() (and still in kernel mode, with empty kernel
stack), which calls fn(arg)
	* fn(arg) either never returns or does successful kernel_execve(),
which does magic to switch to user mode and jump into the image we got from
the binary loaded by kernel_execve()
we have
	* schedule_tail() finishes the things for scheduler
	* fn(arg) is called
	* fn(arg) either never returns or does successful kernel_execve(),
which doesn't have to do any magic - it has pt_regs on kernel stack in
the right position, so filling them up as usual and returning 0 to caller
is just fine
	* we proceed to return to userland.  Nobody needs to be fooled,
everything happens as on normal return from execve(2) - registers are
set as needed by the contents of pt_regs, as filled by do_execve() and
we are off to user mode at the entry point of new binary.
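
The new sequence above can be modelled in plain userland C.  This is only a
sketch under loud assumptions: struct pt_regs_model, struct task_model and
every *_model function are hypothetical stand-ins invented for illustration,
not kernel code, and the "return to userland" step is reduced to handing back
the pt_regs that the normal return-from-syscall path would restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for an architecture's pt_regs. */
struct pt_regs_model {
	unsigned long pc;
	unsigned long sp;
	bool user_mode;
};

/* One "task": pt_regs in the usual spot plus what copy_thread() left behind. */
struct task_model {
	struct pt_regs_model regs;
	int (*fn)(void *);
	void *arg;
	bool tail_done;		/* set once the schedule_tail() stand-in has run */
};

static struct task_model *current_model;

static void task_model_init(struct task_model *t, int (*fn)(void *), void *arg)
{
	t->regs.pc = 0;
	t->regs.sp = 0;
	t->regs.user_mode = false;	/* cleared regs: not a userland context */
	t->fn = fn;
	t->arg = arg;
	t->tail_done = false;
}

/* Models schedule_tail(): only records that the mop-up happened. */
static void schedule_tail_model(struct task_model *t)
{
	t->tail_done = true;
}

/* Models a generic kernel_execve(): fill pt_regs as usual and return 0. */
static int kernel_execve_model(unsigned long entry, unsigned long user_sp)
{
	struct pt_regs_model *regs = &current_model->regs;

	regs->pc = entry;
	regs->sp = user_sp;
	regs->user_mode = true;
	return 0;
}

/* Example payload: becomes a "userland process", then returns 0. */
static int exec_payload_model(void *arg)
{
	const unsigned long *entry = arg;

	return kernel_execve_model(*entry, 0x7fff0000UL);
}

/*
 * Models ret_from_kernel_thread: mop up, call the payload and, if it
 * returns, hand back the pt_regs that return from syscall would restore.
 */
static const struct pt_regs_model *
ret_from_kernel_thread_model(struct task_model *t)
{
	int ret;

	current_model = t;
	schedule_tail_model(t);
	ret = t->fn(t->arg);
	assert(ret == 0);	/* callbacks either never return or return 0 */
	(void)ret;
	return &t->regs;
}
```

Nothing in the model has to be fooled: the payload fills pt_regs through the
execve stand-in, returns 0, and the caller simply uses what is in pt_regs.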

The new variant is obviously nowhere near as hairy.  Moreover, kernel_execve()
can be completely generic as well.  Even better, we don't have to cope with
clone(2) or execve(2) done with non-empty kernel stack (which was a fairly
common way to do aforementioned black magic in kernel_execve()), so sys_execve()
doesn't need anything convoluted to find the pt_regs to pass to do_execve().
In other words, on converted platforms we can switch sys_execve() to
completely generic version as well.

	3.  Gory details.
As I mentioned above, there's a couple of do_fork() arguments that are passed
to copy_thread() as-is, without even looking at them.  Those we use to pass
fn and arg.  It's the second and the third argument resp.; for userland
clone2(2) we'd be passing userland stack pointer and stack size in them.
Only ia64 has clone2() wired, so usually copy_thread() instance names them
'usp' and 'unused' resp. - fork()/vfork()/clone() pass userland stack pointer
to do_fork(), but pass 0 as the 3rd argument.  In any case, for copy_thread()
it's something like
	if (unlikely(!regs)) {
		set the things up, expecting (unsigned long)fn in argument 2
and (unsigned long)arg in argument 3
	} else {
		copy *regs to child, etc.
	}
How to set the things up depends mostly on the way switch_to() is implemented.
In any case, we need to clean the child's pt_regs - it wouldn't do to leak
random kernel data into userland registers if the child eventually becomes
a userland process.  If switch_to() restores callee-saved registers of the
process it switches to before jumping to the place where that process should
be woken up (i.e. if the processes appear to sleep at the very end of
switch_to() and not in the middle), it's probably the best to pass fn and
arg in a pair of callee-saved registers; then ret_from_kernel_thread() will
find them already loaded into those registers by switch_to().  If switch_to()
is something like
	save callee-saved registers of last process
	save stack pointer
	save l as wakeup location
	restore stack pointer of the next process
	jump to wakeup location of the next process
l:	restore callee-saved registers of the next process
(i.e. the wakeup location is in the middle of switch_to), it's probably best
to save fn and arg in child's pt_regs and read them explicitly in
ret_from_kernel_thread(), since switch_to() won't get to restoring callee-saved
registers when it switches to newborn.  Stack pointer is almost certainly
switched before the jump; if it isn't, we are going to notice that as soon
as we look at ret_from_fork() - it would be in the same situation and it
would have to set the stack pointer itself, so we can just duplicate that.
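
As an illustration of the copy_thread() split described above, here is a
userland model of the first variant (fn and arg parked where switch_to()
would reload callee-saved registers).  Everything in it is an assumption made
for the sketch: the *_model types, the saved_fn/saved_arg slots, the wake_at
enum and sample_payload() are hypothetical, and uintptr_t stands in for the
kernel's unsigned long.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduced pt_regs; real layouts are architecture-specific. */
struct pt_regs_model {
	unsigned long gpr[4];
	unsigned long sp;
	unsigned long pc;
	int user_mode;
};

/* Where the newborn will appear to have given up the CPU. */
enum wake_at_model { WAKE_RET_FROM_FORK, WAKE_RET_FROM_KERNEL_THREAD };

struct thread_model {
	struct pt_regs_model childregs;
	uintptr_t saved_fn;	/* stand-ins for two callee-saved register slots */
	uintptr_t saved_arg;
	enum wake_at_model wake_at;
};

/* Example payload used to show that fn/arg survive the trip. */
static int sample_payload(void *arg)
{
	*(int *)arg = 1;
	return 0;
}

/*
 * Model of copy_thread(): with regs == NULL the two pass-through arguments
 * carry fn and arg; otherwise they are the userland stack pointer and an
 * unused value, and the parent's registers are copied to the child.
 */
static int copy_thread_model(uintptr_t usp, uintptr_t arg,
			     struct thread_model *child,
			     const struct pt_regs_model *regs)
{
	assert(child != NULL);
	if (!regs) {
		/* Kernel thread: clean pt_regs so nothing leaks to userland. */
		memset(&child->childregs, 0, sizeof(child->childregs));
		child->saved_fn = usp;
		child->saved_arg = arg;
		child->wake_at = WAKE_RET_FROM_KERNEL_THREAD;
	} else {
		/* Userland child: copy the state, then set the stack pointer. */
		child->childregs = *regs;
		child->childregs.sp = usp;
		child->saved_fn = 0;
		child->saved_arg = 0;
		child->wake_at = WAKE_RET_FROM_FORK;
	}
	return 0;
}
```

For the second variant (wakeup location in the middle of switch_to()), the
same model would store fn and arg inside childregs instead of the separate
slots; that part is left out here.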

In any case, ret_from_fork() and ret_from_kernel_thread() will be very
similar.  So much so that on a predicated architecture it might make sense
to merge them and just make the call of payload predicated on "is it
a kernel thread" or something equivalent.  I'd done that for arm (and
fucked up the call setup in case of thumb-mode kernel, which rmk had fun
to debug) and ia64.  Probably the same could be done for parisc as well.

One note about clearing the child's pt_regs: we won't be using it to
fool the return from syscall into anything, so most of the convolutions
go away.  It's probably a good idea to set it so that user_mode(child_regs)
would be false - more robust that way.  The only case when they end up
anywhere near return from syscall codepath is if we'd done successful
do_execve(), so we can count on at least start_thread() having been done.
Usually that's enough; in one case (ppc64) I ended up with a lovely
detection job finding out why it wasn't.  Turned out that there was a
field (childregs->softe) that was set to 1 by kernel_execve() black
magic; return from syscall logic relied on it being non-zero.  Something
similar might easily be true elsewhere; that's a potential pitfall that
might be useful to keep in mind debugging that stuff.

	4. What is done?
I've done the conversions for almost all architectures, but quite a few
are completely untested.

I'm fairly sure about alpha, x86 and um.  Tested and I understand the
architecture well enough.  arm, mips and c6x had been tested by architecture
maintainers.  This stuff also works.  alpha, arm, x86 and um are fully
converted in mainline by now.

Next group: m68k, ppc, s390, arm64, parisc.  I'm reasonably sure those
are OK, but I'd like the maintainers to take a look.

sparc: Dave said he'll look it through.  I'm still in one piece and not
charred, so either it's OK or he didn't have time to read it yet.  Works
here, anyway.

Next comes the pile of embedded architectures where the best that can be said
about what I have is that it might serve as a starting point for producing
something that works.  I've no hardware, no emulated setups and my
knowledge of architecture comes from architecture manuals and nothing
else.  Those are
	avr32
	blackfin
	cris
	frv
	h8300
	m32r
	microblaze
	mn10300
	score
	sh
	unicore32
Maintainers are Cc'd.  My (very, _very_ tentative) patchsets are in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal arch-$ARCH

Nearly in the same state: ia64.  The only difference is that I've tested
it under ski(1) and it seems to work.  Accuracy of ski(1) for the purposes
of finding bugs in asm glue is not inspiring, though.

Not even a tentative patchset: hexagon, openrisc, tile, xtensa.

I would very much appreciate ACKs/testing/fixes/outright replacements/etc.
for this stuff.  Right now all infrastructure is in the mainline and
per-architecture bits are entirely independent from each other.  As soon
as maintainer in question is OK with what's in such per-architecture branch,
I'll be quite happy to put it into never-rebased mode, so that it would be
safe to pull.  There are some fun things that'll become possible once
all architectures are converted, but let's handle that stuff first, OK?

* Re: new execve/kernel_thread design
From: Al Viro @ 2012-10-19 15:55 UTC
  To: linux-arch

[Sorry; forgot about that typo in Cc...  Repost to linux-arch alone]

On Tue, Oct 16, 2012 at 11:35:08PM +0100, Al Viro wrote:
> 	1.  Basic rules for process lifetime.
> Except for the initial process (init_task, eventual idle thread on the boot
> CPU) all processes are created by do_fork().  There are three classes of
> those: kernel threads, userland processes and idle threads to be.  There are
> few low-level operations involved:
> 	* a kernel thread can spawn a new kernel thread; the primitive
> doing that is kernel_thread().
> 	* a userland process can spawn a new userland process; that's
> done by sys_fork()/sys_vfork()/sys_clone()/sys_clone2().
> 	* a kernel thread can become a userland process.  The primitive
> is kernel_execve().
> 	* a kernel thread can spawn a future idle thread; that's done
> by fork_idle().  Result is *not* scheduled until the secondary CPU gets
> initialized and its state is heavily overwritten in process.

Minor correction: while the first two cases go through do_fork() to
copy_process() to copy_thread(), fork_idle() calls copy_process() directly.

> 	4. What is done?
> I've done the conversions for almost all architectures, but quite a few
> are completely untested.
> 
> I'm fairly sure about alpha, x86 and um.  Tested and I understand the
> architecture well enough.  arm, mips and c6x had been tested by architecture
> maintainers.  This stuff also works.  alpha, arm, x86 and um are fully
> converted in mainline by now.

arm64 fixed and tested by maintainer, put in no-rebase mode.

sparc corrected to avoid branching beyond what ba,pt allows, ACKed by Davem
in that form.  In no-rebase mode.

m68k tested and ACKed on coldfire; I think that along with aranym testing
here that is enough.  In no-rebase mode.

Surprisingly enough, ia64 one seems to work on actual hardware; I have sent
Tony an incremental patch cleaning copy_thread() up, waiting for results of
testing that on SMP box.

Even more surprisingly, unicore32 variant turned out to contain only one
obvious typo.  Fixed and tested by maintainer of unicore32 tree and actually
applied there, I've pulled his branch at that point.

microblaze: some fixes from Michal folded, still breakage with kernel_execve()
side of things.

Since there had been no signs of life from hexagon folks, I'd done (absolutely
blind and untested) tentative patches; see #arch-hexagon.  Same situation
as with most of the embedded architectures - i.e. take with a cartload of salt,
that pair of patches is intended to be a possible starting point for producing
something working.

At that point we have the following situation:
alpha                   done
arm                     done
arm64                   done
avr32                   untested
blackfin                untested
c6x                     done
cris                    untested
frv                     untested, maintainer going to test
h8300                   untested
hexagon                 untested
ia64                    apparently works, needs the final ACK from Tony.
m32r                    untested
m68k                    done
microblaze              partially tested, maintainer hunting breakage down
mips                    done
mn10300                 untested
openrisc                maintainers said to have partially working variant
parisc                  should work, needs testing and ACK
powerpc                 should work, needs testing and ACK
s390                    should work, needs testing and ACK
score                   untested
sh                      untested, maintainers planned reviewing and testing
sparc                   done
tile                    maintainers writing that one
um                      done
unicore32               done
x86                     done
xtensa                  maintainers writing that one

One more thing: AFAICS, just about everything has something along the lines
of
	if (!usp)
		usp = <current userland sp>
	do_fork(flags, usp, ....)
in their sys_clone().  How about taking that into copy_thread()?  After
all, the logic there is
	copy all the state, including userland stack pointer to child
	override userland stack pointer with what the caller passed to
copy_thread()
often enough with "... and if we are about to override it with something
different, do the following extra work".  Turning that into
	copy all the state, including userland stack pointer to child
	if (usp) {
		override the userland stack pointer for child and maybe do
		some extra work
	}
would seem to be a fairly natural thing.  Does anybody see problems with
doing that on their architecture?  Note that with that fork() becomes
simply
#ifndef CONFIG_MMU
	return -EINVAL;
#else
	return do_fork(SIGCHLD, 0, current_pt_regs(), 0, NULL, NULL);
#endif
and similar for vfork().  And these can definitely drop the Cthulhu-awful
kludges for obtaining pt_regs (OK, on everything that doesn't do
kernel_thread() via syscall-from-kernel, but by now only xtensa is still
doing that).  In some cases we need to do a bit of work before that
(gather callee-saved registers so that the child could get them as on alpha,
mips, m68k, openrisc, parisc, ppc and x86, flush userland register windows
on sparc and get psr/wim values on sparc32), but a lot more architectures
lose the asm wrappers for those and the rest can get rid of assorted
ugliness involved in getting that struct pt_regs *.
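
A minimal userland sketch of the proposed "usp == 0 means keep the parent's
stack pointer" convention follows.  The names (struct pt_regs_model,
copy_thread_usp_model(), fork_model(), clone_model()) are hypothetical and
exist only for this illustration; the real copy_thread() takes more arguments
and does more work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced pt_regs: only what the example needs. */
struct pt_regs_model {
	unsigned long sp;
	unsigned long pc;
};

/*
 * Copy all the state, including the userland stack pointer, to the child;
 * override the stack pointer only if the caller passed a non-zero one.
 */
static void copy_thread_usp_model(unsigned long usp,
				  struct pt_regs_model *childregs,
				  const struct pt_regs_model *parent)
{
	assert(childregs != NULL && parent != NULL);
	*childregs = *parent;
	if (usp)
		childregs->sp = usp;
}

/* fork() no longer has to look up the current userland sp: it passes 0. */
static void fork_model(struct pt_regs_model *childregs,
		       const struct pt_regs_model *parent)
{
	copy_thread_usp_model(0, childregs, parent);
}

/* clone() passes whatever the caller gave; 0 behaves exactly like fork(). */
static void clone_model(unsigned long newsp,
			struct pt_regs_model *childregs,
			const struct pt_regs_model *parent)
{
	copy_thread_usp_model(newsp, childregs, parent);
}
```

With that convention the "if (!usp) usp = <current userland sp>" line
disappears from every caller, since the copy already brought the parent's
value over.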

BTW, alpha seems to be doing absolutely pointless work on the way out of
sys_fork() et al. - saving callee-saved registers is needed, all right,
but why bother restoring all of them on the way out in the parent?  All
we need is rp; that's ~0.3Kb of useless reads from memory on each fork()...

The same goes for m68k; there the amount of traffic is less, but still, what
the hell for?  Child needs callee-saved registers restored (and usually will
have that done by switch_to()), but the parent needs only to make sure they
are saved and available for copy_thread() to bring them to child (incidentally,
copying registers is needed only when they are not embedded into task_struct.
At least um is doing a memcpy() for no reason whatsoever; fix will be sent
to rw shortly and ISTR seeing something similar on some of the other
architectures).

Another cross-architecture thing: folks, watch out for what's being done with
thread flags; I've just found a lovely bug on alpha where we have prctl(2)
doing non-atomic modifications of those (as in ti->flags = (ti->flags&~x)|y;),
which is obviously broken; TIF_SIGPENDING can be set asynchronously and even
from an interrupt.  Fix for this one is going to Linus shortly (adding
a separate field for thread-synchronous flags, taking obviously t-s ones
there, including the UAC_... bunch set by that prctl()), but I don't think
that I can audit that for all architectures efficiently; cursory look has
found a braino on frv (fix being discussed with dhowells), but there may bloody
well be more of that fun.
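
To make the thread-flags hazard concrete, here is a deterministic userland
model of it.  It is a sketch only: the bit values, struct thread_info_model,
the "status" field name and the explicit irq_in_window switch (which replays
the interrupt between the read and the write instead of relying on a real
race) are all assumptions made for illustration, and C11 atomics stand in for
the kernel's atomic bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout, chosen only for this model. */
#define TIF_SIGPENDING_MODEL	0x1u
#define UAC_NOPRINT_MODEL	0x2u
#define UAC_MASK_MODEL		0x6u

struct thread_info_model {
	atomic_uint flags;	/* may be modified asynchronously */
	unsigned int status;	/* touched only by the thread itself */
};

static void ti_model_init(struct thread_info_model *ti)
{
	atomic_init(&ti->flags, 0);
	ti->status = 0;
}

/* Stand-in for an interrupt that marks a signal as pending. */
static void irq_sets_sigpending(struct thread_info_model *ti)
{
	atomic_fetch_or(&ti->flags, TIF_SIGPENDING_MODEL);
}

/*
 * Broken pattern: ti->flags = (ti->flags & ~x) | y as separate read and
 * write.  Anything set between the two is overwritten.
 */
static void set_uac_broken(struct thread_info_model *ti, unsigned int y,
			   bool irq_in_window)
{
	unsigned int old = atomic_load(&ti->flags);

	assert((y & ~UAC_MASK_MODEL) == 0);
	if (irq_in_window)
		irq_sets_sigpending(ti);
	atomic_store(&ti->flags, (old & ~UAC_MASK_MODEL) | y);
}

/*
 * Fixed pattern: thread-synchronous bits live in their own field, so the
 * plain read-modify-write never touches the asynchronously set flags.
 */
static void set_uac_fixed(struct thread_info_model *ti, unsigned int y,
			  bool irq_in_window)
{
	unsigned int old = ti->status;

	assert((y & ~UAC_MASK_MODEL) == 0);
	if (irq_in_window)
		irq_sets_sigpending(ti);
	ti->status = (old & ~UAC_MASK_MODEL) | y;
}
```

In the broken variant the pending-signal bit set inside the window is lost by
the write-back; in the fixed one it survives, because the two kinds of flags
no longer share a word.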

