[PATCH 0 of 4] Generic AIO by scheduling stacks

* [PATCH 0 of 4] Generic AIO by scheduling stacks
@ 2007-01-30 20:39 Zach Brown
  2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
                   ` (7 more replies)
  0 siblings, 8 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 12811 bytes --]

This very rough patch series introduces a different way to provide AIO support
for system calls.

Right now to provide AIO support for a system call you have to express your
interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
translate this into some call path in the kernel that passes in an iocb, and
then update your code path implement with either completion-based (EIOCBQUEUED)
or retry-based (EIOCBRETRY) AIO with the iocb.

This patch series changes this by moving the complexity into generic code such
that a system call handler would provide AIO support in exactly the same way
that it supports a synchronous call.  It does this by letting a task have
multiple stacks executing system calls in the kernel.  Stacks are switched in
schedule() as they block and are made runnable.

First, let's introduce the term 'fibril'.  It means small fiber or thread.  It
represents the stack and the bits of data which manage its scheduling under the
task_struct.  There's a 1:1:1 relationship between a fibril, the stack it's
managing, and the thread_struct it's operating under.  They're static for the
fibril's lifetime.  I came to choosing a funny new term after a few iterations
of trying to use existing terms (stack, call, thread, task, path, fiber)
without going insane.  They were just too confusing to use with a clear
conscious.

So, let's illustrate the changes by walking through the execution of an asys
call.  Let's say sys_nanosleep().  Then we'll talk about the trade-offs.

Maybe it'd make sense to walk through the patches in another window while
reading the commentary.  I'm sorry if this seems tedious, but this is
non-trivial stuff.  I want to get the point across.

We start in sys_asys_submit().  It allocates a fibril for this executing
submission syscall and hangs it off the task_struct.  This lets this submission
fibril be scheduled along with the later asys system calls themselves.

sys_asys_submit() has arguments which specify system call numbers and
arguments.  For each of these calls the submission syscall allocates a fibril
and constructs an initial stack.  The fibril has eip pointing to the system
call handler and esp pointing to the new stack such that when the handler is
called its arguments will be on the stack.  The stack is also constructed so
that when the handler returns it jumps to a function which queues the return
code for collection in userspace.

sys_asys_submit() asks the scheduler to put these new fibrils on the simple run
queue in the task_struct.  It's just a list_head for now.  It then calls
schedule().

After we've gotten the run queue lock in the scheduler we notice that there are
fibrils on the task_struct's run queue.  Instead of continuing with schedule(),
then, we switch fibrils.  The old submission fibril will still be in schedule()
and we'll start executing sys_nanosleep() in the context of the submission
task_struct.

The specific switching mechanics of this implementation rely on the notion of
tracking a stack as a full thread_info pointer.  To make the switch we transfer
the non-stack bits of the thread_info from the old fibril's ti to the new
fibril's ti.  We update the book keeping in the task_struct to 
consider the new thread_info as the current thread_info for the task.  Like so:

        *next->ti = *ti;
        *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

        current->thread_info = next->ti;
        current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
        current->fibril = next;
        current->state = next->state;
        current->per_call = next->per_call;

Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
Especially from the perspective of worrying about all the archs.

Did everyone catch that "per_call" thing there?  That's to switch members of
task_struct which are local to a specific call.  link_count, journal_info, that
sort of thing.  More on that as we talk about the costs later.

After the switch we're executing in sys_nanosleep().  Eventually it gets to the
point where it's building its timer which will wake it after the sleep
interval.  Currently it would store a bare task_struct reference for
wake_up_process().  Instead we introduce a helper which returns a cookie that
is given to a specific wake_up_*() variant.  Like so:

-       sl->task = task;
+       sl->wake_target = task_wake_target(task);

It then marks itself as TASK_INTERRUPTIBLE and calls schedule().  schedule()
notices that we have another fibril on the run queue.  It's the submission
fibril that we switched from earlier.  As we switched we saw that it was still
TASK_RUNNING so we put it on the run queue as we switched.  We now switch back to
the submission fibril, leaving the sys_nanosleep() fibril sleeping.  Let's say
the submission task returns to userspace which then immediately calls
sys_asys_await_completion().  It's an easy case :).  It goes to sleep, there
are no running fibrils and the schedule() path really puts the task to sleep.

Eventually the timer fires and the hrtimer code path wakes the fibril:

-       if (task)
-               wake_up_process(task);
+       if (wake_target)
+               wake_up_target(wake_target);

We've doctored try_to_wake_up() to be able to tell if its argument is a
task_struct or one of these fibril targets.  In the fibril case it calls
try_to_wake_up_fibril().  It notices that the target fibril does need to be
woken and sets it TASK_RUNNING.  It notices that it it's current in the task so
it puts the fibril on the task's fibril run queue and wakes the task.  There's
grossness here.  It needs the task to come through schedule() again so that it
can find the new runnable fibril instead of continuing to execute its current
fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
This seems to work, but there are some pretty terrifying interactions between
schedule, wake-up, and the maintenance of fibril->state and task->state that
need to be sorted out.

Remember our task was sleeping in asys_await_completion()?  The task was woken
by the fibril wake-up path, but it's still executing the
asys_await_completion() fibril.  It comes out of schedule() and sees
TIF_NEED_RESCHED and comes back through the top of schedule().  This time it
finds the runnable sys_nanosleep() fibril and switches to it.  sys_nanosleep()
runs to completion and it returns which, because of the way we built its stack,
calls asys_teardown_stack().

asys_teardown_stack() takes the return code and puts it off in a list for
asys_await_completion().  It wakes a wait_queue to notify waiters of pending
completions.  In so doing it wakes up the asys_await_completion() fibril that
was sleeping in our task.

Then it has to tear down the fibril for the call that just completed.  In the
current implementation the fibril struct is actually embedded in an "asys_call"
struct.  asys_teardown_stack() frees the asys_call struct, and so the fibril,
after having cleared current->fibril.  It then calls schedule().  Our
asys_await_completion() fibril is on the run queue so we switch to it.
Switching notices the null current->fibril that we're switching from and takes
that as a queue to mark the previous thread_info for freeing *after* the
switch.

After the switch we're in asys_await_completion().  We find the waiting return
code completion struct in the list that was left by asys_teardown_stack().  We
return it to userspace.

Phew.  OK, so what are the trade-offs here?  I'll start with the benefits for
obvious reasons :).

- With get AIO support for all syscalls.  Every single one.  (Well, please, no
asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
crypto ioctls, open, mmap, getdents, the entire splice API, etc.

- The syscall API does not change just because calls are being issued AIO,
particularly things that reference task_struct.  AIO sys_getpid() does what
you'd expect, signal masking, etc.  You don't have to worry about your AIO call
being secretly handled by some worker threads that get very different results
from current-> references.

- We wouldn't multiply testing and maintenance burden with separate AIO paths.
No is_sync_kiocb() testing and divergence between returning or calling
aio_complete().  No auditing to make sure that EIOCBRETRY only being returned
after any significant references of current->.  No worries about completion
racing from the submission return path and some aio_complete() being called
from another context.  In this scheme if your sync syscall path isn't broken,
your AIO path stands a great chance of working.

- The submission syscall won't block while handling submitted calls.  Not for
metadata IO, not for memory allocation, not for mutex contention, nothing.  

- AIO syscalls which *don't* block see very little overhead.  They'll allocate
stacks and juggle the run queue locks a little, but they'll execute in turn on
the submitting (cache-hot, presumably) processor.  There's room to optimize
this path, too, of course.

- We don't need to duplicate interfaces for each AIO interface we want to
support.  No iocb unions mirroring the syscall API, no magical AIO sys_
variants.

And the costs?  It's not free.

- The 800lb elephant in the room.  It uses a full stack per blocked operation.
I believe this is a reasonable price to pay for the flexibility of having *any*
call pending.  It rules out some loads which would want to keep *millions* of
operations pending, but I humbly submit that a load rarely takes that number of
concurrent ops to saturate a resource.  (think of it this way: we've gotten
this far by having to burn a full *task* to have *any* syscall pending.)  While
not optimal, it opens to door to a lot of functionality without having to
rewrite the kernel as a giant non-blocking state machine.

It should be noted that my very first try was to copy the used part of stacks
in to and out of one full allocated stack.  This uses less memory per blocking
operation at the cpu cost of copying the used regions.  And it's a terrible
idea, for at least two reasons.  First, to actually get the memory overhead
savings you allocate at stack switch time.  If that allocation can't be
satisfied you are in *trouble* because you might not be able to switch over to
a fibril that is trying to free up memory.  Deadlock city.  Second, it means
that *you can't reference on-stack data in the wake-up path*.  This is a
nightmare.  Even our trivial sys_nanosleep() example would have had to take its
hrtimer_sleeper off the stack and allocate it.  Never mind, you know, basically
every single user of <linux/wait.h>.   My current thinking is that it's just
not worth it.

- We would now have some measure of task_struct concurrency.  Read that twice,
it's scary.  As two fibrils execute and block in turn they'll each be
referencing current->.  It means that we need to audit task_struct to make sure
that paths can handle racing as its scheduled away.  The current implementation
*does not* let preemption trigger a fibril switch.  So one only has to worry
about racing with voluntary scheduling of the fibril paths.  This can mean
moving some task_struct members under an accessor that hides them in a struct
in task_struct so they're switched along with the fibril.  I think this is a
manageable burden.

- The fibrils can only execute in the submitter's task_struct.  While I think
this is in fact a feature, it does imply some interesting behaviour.
Submitters will be required to explicitly manage any concurrent between CPUs by
issuing their ops in tasks.  To guarantee forward-progress in syscall handling
paths (releasing i_mutex, say) we'll have to interrupt userspace when a fibril
is ready to run.

- Signals.  I have no idea what behaviour we want.  Help?  My first guess is
that we'll want signal state to be shared by fibrils by keeping it in the
task_struct.  If we want something like individual cancellation,  we'll augment
signal_pending() with some some per-fibril test which will cause it to return
from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
cancellation, I'll argue) as it would have if a signal was pending.

- lockdep and co. will need to be updated to track fibrils instead of tasks.
sysrq-t might want to dump fibril stacks, too.  That kind of thing.  Sorry.

As for the current implementation, it's obviously just a rough sketch.  I'm
sending it out in this state because this is the first point at which a tree
walker using AIO openat(), fstat(), and getdents() actually worked on ext3.
Generally, though, these are paths that I don't have the most experience in.
I'd be thrilled to implement whatever the experts think is the right way to do
this. 

Blah blah blah.  Too much typing.  What do people think?

^ permalink raw reply	[flat|nested] 151+ messages in thread