* [PATCH 0 of 4] Generic AIO by scheduling stacks
@ 2007-01-30 20:39 Zach Brown
  2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
                   ` (7 more replies)
  0 siblings, 8 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

This very rough patch series introduces a different way to provide AIO support
for system calls.

Right now to provide AIO support for a system call you have to express your
interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
translate this into some call path in the kernel that passes in an iocb, and
then update your code path to implement either completion-based (EIOCBQUEUED)
or retry-based (EIOCBRETRY) AIO with the iocb.

This patch series changes this by moving the complexity into generic code such
that a system call handler would provide AIO support in exactly the same way
that it supports a synchronous call.  It does this by letting a task have
multiple stacks executing system calls in the kernel.  Stacks are switched in
schedule() as they block and are made runnable.

First, let's introduce the term 'fibril'.  It means small fiber or thread.  It
represents the stack and the bits of data which manage its scheduling under the
task_struct.  There's a 1:1:1 relationship between a fibril, the stack it's
managing, and the thread_struct it's operating under.  They're static for the
fibril's lifetime.  I ended up choosing a funny new term after a few iterations
of trying to use existing terms (stack, call, thread, task, path, fiber)
without going insane.  They were just too confusing to use with a clear
conscience.

So, let's illustrate the changes by walking through the execution of an asys
call.  Let's say sys_nanosleep().  Then we'll talk about the trade-offs.

Maybe it'd make sense to walk through the patches in another window while
reading the commentary.  I'm sorry if this seems tedious, but this is
non-trivial stuff.  I want to get the point across.

We start in sys_asys_submit().  It allocates a fibril for this executing
submission syscall and hangs it off the task_struct.  This lets this submission
fibril be scheduled along with the later asys system calls themselves.

sys_asys_submit() has arguments which specify system call numbers and
arguments.  For each of these calls the submission syscall allocates a fibril
and constructs an initial stack.  The fibril has eip pointing to the system
call handler and esp pointing to the new stack such that when the handler is
called its arguments will be on the stack.  The stack is also constructed so
that when the handler returns it jumps to a function which queues the return
code for collection in userspace.
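
As a rough userspace illustration of that stack construction (not part of the
patch series), the sketch below uses the POSIX ucontext calls instead of
hand-built eip/esp values.  The model_* names, the trampoline, and the 64KB
stack size are all assumptions made for the example.  The handler runs on its
own stack, and when it returns, control lands in code that records the return
value before resuming the submitter.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define MODEL_STACK_SIZE (64 * 1024)	/* arbitrary size for this model */

/* Hypothetical stand-in for one submitted call and its private stack. */
struct model_call {
	ucontext_t ctx;			/* saved registers, plays the role of eip/esp */
	void *stack;			/* the stack the handler executes on */
	long (*handler)(long);		/* stands in for a syscall handler */
	long arg;
	long ret;			/* "return code queued for collection" */
	int done;
};

static ucontext_t model_submitter;	/* context of the submitting path */
static struct model_call *model_current;

/* First frame on the new stack: call the handler, then record its result. */
static void model_trampoline(void)
{
	struct model_call *call = model_current;

	call->ret = call->handler(call->arg);
	call->done = 1;
	/* Returning from here resumes ctx.uc_link, i.e. the submitter. */
}

static int model_call_init(struct model_call *call,
			   long (*handler)(long), long arg)
{
	call->stack = malloc(MODEL_STACK_SIZE);
	if (!call->stack)
		return -1;
	if (getcontext(&call->ctx) < 0) {
		free(call->stack);
		return -1;
	}
	call->ctx.uc_stack.ss_sp = call->stack;
	call->ctx.uc_stack.ss_size = MODEL_STACK_SIZE;
	call->ctx.uc_link = &model_submitter;
	call->handler = handler;
	call->arg = arg;
	call->ret = 0;
	call->done = 0;
	makecontext(&call->ctx, model_trampoline, 0);
	return 0;
}

/* Switch to the call's stack; we come back here once the handler returns. */
static long model_call_run(struct model_call *call)
{
	model_current = call;
	swapcontext(&model_submitter, &call->ctx);
	free(call->stack);		/* safe: we are back on the original stack */
	call->stack = NULL;
	return call->ret;
}

/* Example handler used to exercise the model. */
static long model_double(long x)
{
	return 2 * x;
}
```

Calling model_call_init(&call, model_double, 21) followed by
model_call_run(&call) executes model_double() on the private stack.  The
kernel version differs in that a blocked handler stays parked on its stack
while other fibrils run, which this single-shot model does not attempt.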

sys_asys_submit() asks the scheduler to put these new fibrils on the simple run
queue in the task_struct.  It's just a list_head for now.  It then calls
schedule().

After we've gotten the run queue lock in the scheduler we notice that there are
fibrils on the task_struct's run queue.  Instead of continuing with schedule(),
then, we switch fibrils.  The old submission fibril will still be in schedule()
and we'll start executing sys_nanosleep() in the context of the submission
task_struct.

The specific switching mechanics of this implementation rely on the notion of
tracking a stack as a full thread_info pointer.  To make the switch we transfer
the non-stack bits of the thread_info from the old fibril's ti to the new
fibril's ti.  We update the bookkeeping in the task_struct to
consider the new thread_info as the current thread_info for the task.  Like so:

        *next->ti = *ti;
        *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

        current->thread_info = next->ti;
        current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
        current->fibril = next;
        current->state = next->state;
        current->per_call = next->per_call;

Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
Especially from the perspective of worrying about all the archs.

Did everyone catch that "per_call" thing there?  That's to switch members of
task_struct which are local to a specific call.  link_count, journal_info, that
sort of thing.  More on that as we talk about the costs later.

After the switch we're executing in sys_nanosleep().  Eventually it gets to the
point where it's building its timer which will wake it after the sleep
interval.  Currently it would store a bare task_struct reference for
wake_up_process().  Instead we introduce a helper which returns a cookie that
is given to a specific wake_up_*() variant.  Like so:

-       sl->task = task;
+       sl->wake_target = task_wake_target(task);

It then marks itself as TASK_INTERRUPTIBLE and calls schedule().  schedule()
notices that we have another fibril on the run queue.  It's the submission
fibril that we switched from earlier.  As we switched we saw that it was still
TASK_RUNNING so we put it on the run queue as we switched.  We now switch back to
the submission fibril, leaving the sys_nanosleep() fibril sleeping.  Let's say
the submission task returns to userspace which then immediately calls
sys_asys_await_completion().  It's an easy case :).  It goes to sleep, there
are no running fibrils and the schedule() path really puts the task to sleep.
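
The switching rule in that walk-through can be modelled in a few lines of
userspace C.  This is only a sketch under stated assumptions: the model_*
types and the singly-linked FIFO are invented here, there is no locking, and
no real stack is switched.  It shows the two behaviours described above: a
fibril that is still running when it is switched away from goes back on the
run queue, and with no runnable fibrils the caller falls through to the
normal schedule() path.

```c
#include <assert.h>
#include <stddef.h>

enum model_state { MODEL_RUNNING, MODEL_SLEEPING };

/* Hypothetical fibril: just a state and a link for the run queue. */
struct model_fibril {
	enum model_state state;
	struct model_fibril *next;
};

/* Hypothetical task: the executing fibril plus a FIFO of runnable ones. */
struct model_task {
	struct model_fibril *current;
	struct model_fibril *head;
	struct model_fibril *tail;
};

static void model_enqueue(struct model_task *tsk, struct model_fibril *f)
{
	f->next = NULL;
	if (tsk->tail)
		tsk->tail->next = f;
	else
		tsk->head = f;
	tsk->tail = f;
}

static struct model_fibril *model_dequeue(struct model_task *tsk)
{
	struct model_fibril *f = tsk->head;

	if (f) {
		tsk->head = f->next;
		if (!tsk->head)
			tsk->tail = NULL;
		f->next = NULL;
	}
	return f;
}

/*
 * Returns the fibril switched to, or NULL when no fibril is runnable and
 * the ordinary task scheduling path would take over.
 */
static struct model_fibril *model_schedule(struct model_task *tsk)
{
	struct model_fibril *prev = tsk->current;
	struct model_fibril *next = model_dequeue(tsk);

	if (!next)
		return NULL;
	/* A still-running fibril is requeued as we switch away from it. */
	if (prev && prev->state == MODEL_RUNNING)
		model_enqueue(tsk, prev);
	tsk->current = next;
	return next;
}
```

With the submission fibril current and the sys_nanosleep() fibril queued, the
first model_schedule() call switches to the sleeper and requeues the
submitter.  Once the sleeper marks itself MODEL_SLEEPING, the next call
switches back and leaves the queue empty.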

Eventually the timer fires and the hrtimer code path wakes the fibril:

-       if (task)
-               wake_up_process(task);
+       if (wake_target)
+               wake_up_target(wake_target);

We've doctored try_to_wake_up() to be able to tell if its argument is a
task_struct or one of these fibril targets.  In the fibril case it calls
try_to_wake_up_fibril().  It notices that the target fibril does need to be
woken and sets it TASK_RUNNING.  It notices that it isn't current in the task so
it puts the fibril on the task's fibril run queue and wakes the task.  There's
grossness here.  It needs the task to come through schedule() again so that it
can find the new runnable fibril instead of continuing to execute its current
fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
This seems to work, but there are some pretty terrifying interactions between
schedule, wake-up, and the maintenance of fibril->state and task->state that
need to be sorted out.
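
How the wake-up path tells a task_struct from a fibril target is not spelled
out above.  One common way to build such a cookie is to tag the low bit of the
pointer, and the userspace sketch below assumes that encoding purely for
illustration; the series may well do it differently.  All model_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { MODEL_SLEEPING, MODEL_RUNNING };

struct model_task {
	int state;
};

struct model_fibril {
	int state;
	struct model_task *task;	/* the task this fibril runs under */
};

/* Assumption: both structs are at least 2-byte aligned, so bit 0 is free. */
#define MODEL_TARGET_FIBRIL ((uintptr_t)1)

/* Build the opaque cookie that a sleeper stores for its eventual waker. */
static void *model_wake_target(struct model_task *tsk,
			       struct model_fibril *fibril)
{
	if (fibril)
		return (void *)((uintptr_t)fibril | MODEL_TARGET_FIBRIL);
	return tsk;
}

/* Decode the cookie and wake either the fibril and its task, or the task. */
static void model_wake_up_target(void *target)
{
	uintptr_t bits = (uintptr_t)target;

	if (bits & MODEL_TARGET_FIBRIL) {
		struct model_fibril *fibril =
			(struct model_fibril *)(bits & ~MODEL_TARGET_FIBRIL);

		fibril->state = MODEL_RUNNING;
		/* The task must pass through schedule() to find the fibril. */
		fibril->task->state = MODEL_RUNNING;
	} else {
		((struct model_task *)target)->state = MODEL_RUNNING;
	}
}
```

The caller never inspects the cookie, which is what lets existing sleepers
swap a task pointer for a wake target with a one-line change.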

Remember our task was sleeping in asys_await_completion()?  The task was woken
by the fibril wake-up path, but it's still executing the
asys_await_completion() fibril.  It comes out of schedule() and sees
TIF_NEED_RESCHED and comes back through the top of schedule().  This time it
finds the runnable sys_nanosleep() fibril and switches to it.  sys_nanosleep()
runs to completion and it returns which, because of the way we built its stack,
calls asys_teardown_stack().

asys_teardown_stack() takes the return code and puts it off in a list for
asys_await_completion().  It wakes a wait_queue to notify waiters of pending
completions.  In so doing it wakes up the asys_await_completion() fibril that
was sleeping in our task.

Then it has to tear down the fibril for the call that just completed.  In the
current implementation the fibril struct is actually embedded in an "asys_call"
struct.  asys_teardown_stack() frees the asys_call struct, and so the fibril,
after having cleared current->fibril.  It then calls schedule().  Our
asys_await_completion() fibril is on the run queue so we switch to it.
Switching notices the null current->fibril that we're switching from and takes
that as a cue to mark the previous thread_info for freeing *after* the
switch.

After the switch we're in asys_await_completion().  We find the waiting return
code completion struct in the list that was left by asys_teardown_stack().  We
return it to userspace.

Phew.  OK, so what are the trade-offs here?  I'll start with the benefits for
obvious reasons :).

- We get AIO support for all syscalls.  Every single one.  (Well, please, no
asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
crypto ioctls, open, mmap, getdents, the entire splice API, etc.

- The syscall API does not change just because calls are being issued AIO,
particularly things that reference task_struct.  AIO sys_getpid() does what
you'd expect, signal masking, etc.  You don't have to worry about your AIO call
being secretly handled by some worker threads that get very different results
from current-> references.

- We wouldn't multiply testing and maintenance burden with separate AIO paths.
No is_sync_kiocb() testing and divergence between returning or calling
aio_complete().  No auditing to make sure that EIOCBRETRY is only returned
after any significant references of current->.  No worries about completion
racing from the submission return path and some aio_complete() being called
from another context.  In this scheme if your sync syscall path isn't broken,
your AIO path stands a great chance of working.

- The submission syscall won't block while handling submitted calls.  Not for
metadata IO, not for memory allocation, not for mutex contention, nothing.  

- AIO syscalls which *don't* block see very little overhead.  They'll allocate
stacks and juggle the run queue locks a little, but they'll execute in turn on
the submitting (cache-hot, presumably) processor.  There's room to optimize
this path, too, of course.

- We don't need to duplicate interfaces for each AIO interface we want to
support.  No iocb unions mirroring the syscall API, no magical AIO sys_
variants.

And the costs?  It's not free.

- The 800lb elephant in the room.  It uses a full stack per blocked operation.
I believe this is a reasonable price to pay for the flexibility of having *any*
call pending.  It rules out some loads which would want to keep *millions* of
operations pending, but I humbly submit that a load rarely takes that number of
concurrent ops to saturate a resource.  (think of it this way: we've gotten
this far by having to burn a full *task* to have *any* syscall pending.)  While
not optimal, it opens the door to a lot of functionality without having to
rewrite the kernel as a giant non-blocking state machine.

It should be noted that my very first try was to copy the used part of stacks
in to and out of one full allocated stack.  This uses less memory per blocking
operation at the cpu cost of copying the used regions.  And it's a terrible
idea, for at least two reasons.  First, to actually get the memory overhead
savings you allocate at stack switch time.  If that allocation can't be
satisfied you are in *trouble* because you might not be able to switch over to
a fibril that is trying to free up memory.  Deadlock city.  Second, it means
that *you can't reference on-stack data in the wake-up path*.  This is a
nightmare.  Even our trivial sys_nanosleep() example would have had to take its
hrtimer_sleeper off the stack and allocate it.  Never mind, you know, basically
every single user of <linux/wait.h>.   My current thinking is that it's just
not worth it.

- We would now have some measure of task_struct concurrency.  Read that twice,
it's scary.  As two fibrils execute and block in turn they'll each be
referencing current->.  It means that we need to audit task_struct to make sure
that paths can handle racing as it's scheduled away.  The current implementation
*does not* let preemption trigger a fibril switch.  So one only has to worry
about racing with voluntary scheduling of the fibril paths.  This can mean
moving some task_struct members under an accessor that hides them in a struct
in task_struct so they're switched along with the fibril.  I think this is a
manageable burden.

- The fibrils can only execute in the submitter's task_struct.  While I think
this is in fact a feature, it does imply some interesting behaviour.
Submitters will be required to explicitly manage any concurrency between CPUs by
issuing their ops in tasks.  To guarantee forward-progress in syscall handling
paths (releasing i_mutex, say) we'll have to interrupt userspace when a fibril
is ready to run.

- Signals.  I have no idea what behaviour we want.  Help?  My first guess is
that we'll want signal state to be shared by fibrils by keeping it in the
task_struct.  If we want something like individual cancellation,  we'll augment
signal_pending() with some per-fibril test which will cause it to return
from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
cancellation, I'll argue) as it would have if a signal was pending.

- lockdep and co. will need to be updated to track fibrils instead of tasks.
sysrq-t might want to dump fibril stacks, too.  That kind of thing.  Sorry.
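
For the Signals item above, the per-fibril test added to signal_pending()
might look something like the following userspace sketch.  The `cancelled`
flag and the model_* names are assumptions invented for the example; nothing
in the series defines them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: signal state stays shared in the task. */
struct model_task {
	bool sigpending;
};

/* Hypothetical: each fibril carries its own cancellation request. */
struct model_fibril {
	bool cancelled;
};

/*
 * True if an interruptible sleep should bail out, either because a signal
 * is pending for the whole task or because this one fibril was cancelled.
 */
static bool model_signal_pending(const struct model_task *tsk,
				 const struct model_fibril *fibril)
{
	return tsk->sigpending || (fibril != NULL && fibril->cancelled);
}
```

A TASK_INTERRUPTIBLE sleeper that already checks for pending signals would
then unwind a cancelled call along the same path it uses for signals.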

As for the current implementation, it's obviously just a rough sketch.  I'm
sending it out in this state because this is the first point at which a tree
walker using AIO openat(), fstat(), and getdents() actually worked on ext3.
Generally, though, these are paths that I don't have the most experience in.
I'd be thrilled to implement whatever the experts think is the right way to do
this. 

Blah blah blah.  Too much typing.  What do people think?

* [PATCH 1 of 4] Introduce per_call_chain()
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
@ 2007-01-30 20:39 ` Zach Brown
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

There are members of task_struct which are only used by a given call chain to
pass arguments up and down the chain itself.  They are logically thread-local
storage.

The patches later in the series want to have multiple calls pending for a given
task, though only one will be executing at a given time.  By putting these
thread-local members of task_struct in a separate storage structure we're able
to trivially swap them in and out as their calls are swapped in and out.
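
A minimal userspace model of that swap, assuming simplified stand-ins for
task_struct and the fibril (the model_* names are invented for the example):
the outgoing call's copy is saved from the task, then the incoming call's copy
is installed, so each call chain only ever sees its own values.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the members this patch moves under per_call_chain(). */
struct model_per_call {
	int link_count;
	int total_link_count;
	void *journal_info;
};

/* Hypothetical: each pending call keeps a parked copy of the storage. */
struct model_fibril {
	struct model_per_call per_call;
};

/* Hypothetical task: the live copy stays inline, as in task_struct. */
struct model_task {
	struct model_per_call per_call;
	struct model_fibril *fibril;	/* the call currently executing */
};

/* Save the outgoing call's storage, then install the incoming call's. */
static void model_switch(struct model_task *tsk, struct model_fibril *next)
{
	if (tsk->fibril)
		tsk->fibril->per_call = tsk->per_call;
	tsk->per_call = next->per_call;
	tsk->fibril = next;
}
```

Because the live copy is a plain struct inside the task, code that never uses
multiple calls pays only for a slightly longer member path and no extra
pointer dereference.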

per_call_chain() doesn't have a terribly great name. It was chosen in the
spirit of per_cpu().

The storage was left inline in task_struct to avoid introducing indirection for
the vast majority of uses which will never have multiple calls executing in a
task.

I chose a few members of task_struct to migrate under per_call_chain() along
with the introduction as an example of what it looks like.  These would be
separate patches in a patch series that was suitable for merging.

diff -r b1128b48dc99 -r 26e278468209 fs/jbd/journal.c
--- a/fs/jbd/journal.c	Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/journal.c	Mon Jan 29 15:36:13 2007 -0800
@@ -471,7 +471,7 @@ int journal_force_commit_nested(journal_
 	tid_t tid;
 
 	spin_lock(&journal->j_state_lock);
-	if (journal->j_running_transaction && !current->journal_info) {
+	if (journal->j_running_transaction && !per_call_chain(journal_info)) {
 		transaction = journal->j_running_transaction;
 		__log_start_commit(journal, transaction->t_tid);
 	} else if (journal->j_committing_transaction)
diff -r b1128b48dc99 -r 26e278468209 fs/jbd/transaction.c
--- a/fs/jbd/transaction.c	Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/transaction.c	Mon Jan 29 15:36:13 2007 -0800
@@ -279,12 +279,12 @@ handle_t *journal_start(journal_t *journ
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
 
-	current->journal_info = handle;
+	per_call_chain(journal_info) = handle;
 
 	err = start_this_handle(journal, handle);
 	if (err < 0) {
 		jbd_free_handle(handle);
-		current->journal_info = NULL;
+		per_call_chain(journal_info) = NULL;
 		handle = ERR_PTR(err);
 	}
 	return handle;
@@ -1368,7 +1368,7 @@ int journal_stop(handle_t *handle)
 		} while (old_handle_count != transaction->t_handle_count);
 	}
 
-	current->journal_info = NULL;
+	per_call_chain(journal_info) = NULL;
 	spin_lock(&journal->j_state_lock);
 	spin_lock(&transaction->t_handle_lock);
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
diff -r b1128b48dc99 -r 26e278468209 fs/namei.c
--- a/fs/namei.c	Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/namei.c	Mon Jan 29 15:36:13 2007 -0800
@@ -628,20 +628,20 @@ static inline int do_follow_link(struct 
 static inline int do_follow_link(struct path *path, struct nameidata *nd)
 {
 	int err = -ELOOP;
-	if (current->link_count >= MAX_NESTED_LINKS)
+	if (per_call_chain(link_count) >= MAX_NESTED_LINKS)
 		goto loop;
-	if (current->total_link_count >= 40)
+	if (per_call_chain(total_link_count) >= 40)
 		goto loop;
 	BUG_ON(nd->depth >= MAX_NESTED_LINKS);
 	cond_resched();
 	err = security_inode_follow_link(path->dentry, nd);
 	if (err)
 		goto loop;
-	current->link_count++;
-	current->total_link_count++;
+	per_call_chain(link_count)++;
+	per_call_chain(total_link_count)++;
 	nd->depth++;
 	err = __do_follow_link(path, nd);
-	current->link_count--;
+	per_call_chain(link_count)--;
 	nd->depth--;
 	return err;
 loop:
@@ -1025,7 +1025,7 @@ int fastcall link_path_walk(const char *
 
 int fastcall path_walk(const char * name, struct nameidata *nd)
 {
-	current->total_link_count = 0;
+	per_call_chain(total_link_count) = 0;
 	return link_path_walk(name, nd);
 }
 
@@ -1153,7 +1153,7 @@ static int fastcall do_path_lookup(int d
 
 		fput_light(file, fput_needed);
 	}
-	current->total_link_count = 0;
+	per_call_chain(total_link_count) = 0;
 	retval = link_path_walk(name, nd);
 out:
 	if (likely(retval == 0)) {
diff -r b1128b48dc99 -r 26e278468209 include/linux/init_task.h
--- a/include/linux/init_task.h	Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/init_task.h	Mon Jan 29 15:36:13 2007 -0800
@@ -88,6 +88,11 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#define INIT_PER_CALL_CHAIN(tsk)					\
+{									\
+	.journal_info	= NULL,						\
+}
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -124,6 +129,7 @@ extern struct group_info init_groups;
 	.keep_capabilities = 0,						\
 	.user		= INIT_USER,					\
 	.comm		= "swapper",					\
+	.per_call	= INIT_PER_CALL_CHAIN(tsk),			\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
 	.files		= &init_files,					\
@@ -135,7 +141,6 @@ extern struct group_info init_groups;
 		.signal = {{0}}},					\
 	.blocked	= {{0}},					\
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),		\
-	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.fs_excl	= ATOMIC_INIT(0),				\
 	.pi_lock	= SPIN_LOCK_UNLOCKED,				\
diff -r b1128b48dc99 -r 26e278468209 include/linux/jbd.h
--- a/include/linux/jbd.h	Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/jbd.h	Mon Jan 29 15:36:13 2007 -0800
@@ -883,7 +883,7 @@ extern void		__wait_on_journal (journal_
 
 static inline handle_t *journal_current_handle(void)
 {
-	return current->journal_info;
+	return per_call_chain(journal_info);
 }
 
 /* The journaling code user interface:
diff -r b1128b48dc99 -r 26e278468209 include/linux/sched.h
--- a/include/linux/sched.h	Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/sched.h	Mon Jan 29 15:36:13 2007 -0800
@@ -784,6 +784,20 @@ static inline void prefetch_stack(struct
 static inline void prefetch_stack(struct task_struct *t) { }
 #endif
 
+/*
+ * Members of this structure are used to pass arguments down call chains
+ * without specific arguments.  Historically they lived on task_struct,
+ * putting them in one place gives us some flexibility.  They're accessed
+ * with per_call_chain(name).
+ */
+struct per_call_chain_storage {
+	int link_count;		/* number of links in one symlink */
+	int total_link_count;	/* total links followed in a lookup */
+	void *journal_info;	/* journalling filesystem info */
+};
+
+#define per_call_chain(foo) current->per_call.foo
+
 struct audit_context;		/* See audit.c */
 struct mempolicy;
 struct pipe_inode_info;
@@ -920,7 +934,7 @@ struct task_struct {
 				       it with task_lock())
 				     - initialized normally by flush_old_exec */
 /* file system info */
-	int link_count, total_link_count;
+	struct per_call_chain_storage per_call;
 #ifdef CONFIG_SYSVIPC
 /* ipc stuff */
 	struct sysv_sem sysvsem;
@@ -993,9 +1007,6 @@ struct task_struct {
 	struct held_lock held_locks[MAX_LOCK_DEPTH];
 	unsigned int lockdep_recursion;
 #endif
-
-/* journalling filesystem info */
-	void *journal_info;
 
 /* VM state */
 	struct reclaim_state *reclaim_state;

* [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
  2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
@ 2007-01-30 20:39 ` Zach Brown
  2007-02-01  8:36   ` Ingo Molnar
  2007-02-04  5:12   ` Davide Libenzi
  2007-01-30 20:39 ` [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct Zach Brown
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

This patch introduces the notion of a 'fibril'.  It's meant to be a lighter
kernel thread.  There can be multiple of them in the process of executing for a
given task_struct, but only one can ever be actively running at a time.  Think
of it as a stack and some metadata for scheduling them inside the task_struct.

This implementation is wildly architecture-specific but isn't put in the right
places.  Since these are not code paths that I have extensive experience with,
I focused more on getting it going and representative of the concept than on
making it right on the first try.  I'm actively interested in feedback from
people who know more about the places this touches.

The fibril struct itself is left stand-alone for clarity.  There is a 1:1
relationship between fibrils and struct thread_info, though, so it might make
more sense to embed the two somehow.

The use of list_head for the run queue is simplistic.  As long as we're not
removing specific fibrils from the list, which seems unlikely, we could be more
clever.  Maybe no more clever than a singly-linked list, though.

Fibril management is under the runqueue lock because that ends up working well
for the wake-up path as well.  In the current patch, though, it makes for some
pretty sloppy code for unlocking the runqueue lock (and re-enabling interrupts
and pre-emption) on the other side of the switch.

The actual mechanics of switching from one stack to another at the end of
schedule_fibril() makes me nervous.  I'm not convinced that blindly copying the
contents of thread_info from the previous to the next stack is safe, even if
done with interrupts disabled.  (NMIs?)  The juggling of current->thread_info
might be racy, etc.

diff -r 26e278468209 -r df7bc026d50e arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/arch/i386/kernel/process.c	Mon Jan 29 15:36:16 2007 -0800
@@ -698,6 +698,28 @@ struct task_struct fastcall * __switch_t
 	return prev_p;
 }
 
+/*
+ * We've just switched the stack and instruction pointer to point to a new
+ * fibril.  We were called from schedule() -> schedule_fibril() with the
+ * runqueue lock held _irq and with preemption disabled.
+ *
+ * We let finish_fibril_switch() unwind the state that was built up by
+ * our callers.  We do that here so that we don't need to ask fibrils to
+ * first execute something analogous to schedule_tail(). Maybe that's
+ * wrong.
+ *
+ * We'd also have to reacquire the kernel lock here.  For now we know the
+ * BUG_ON(lock_depth) prevents us from having to worry about it.
+ */
+void fastcall __switch_to_fibril(struct thread_info *ti)
+{
+	finish_fibril_switch();
+
+	/* free the ti if schedule_fibril() told us that it's done */
+	if (ti->status & TS_FREE_AFTER_SWITCH)
+		free_thread_info(ti);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
 	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/system.h
--- a/include/asm-i386/system.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/system.h	Mon Jan 29 15:36:16 2007 -0800
@@ -31,6 +31,31 @@ extern struct task_struct * FASTCALL(__s
 		      "=a" (last),"=S" (esi),"=D" (edi)			\
 		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
 		      "2" (prev), "d" (next));				\
+} while (0)
+
+struct thread_info;
+void fastcall __switch_to_fibril(struct thread_info *ti);
+
+/*
+ * This is called with the run queue lock held _irq and with preemption
+ * disabled.  __switch_to_fibril drops those.
+ */
+#define switch_to_fibril(prev, next, ti) do {				\
+	unsigned long esi,edi;						\
+	asm volatile("pushfl\n\t"		/* Save flags */	\
+		     "pushl %%ebp\n\t"					\
+		     "movl %%esp,%0\n\t"	/* save ESP */		\
+		     "movl %4,%%esp\n\t"	/* restore ESP */	\
+		     "movl $1f,%1\n\t"		/* save EIP */		\
+		     "pushl %5\n\t"		/* restore EIP */	\
+		     "jmp __switch_to_fibril\n"				\
+		     "1:\t"						\
+		     "popl %%ebp\n\t"					\
+		     "popfl"						\
+		     :"=m" (prev->esp),"=m" (prev->eip),		\
+		      "=S" (esi),"=D" (edi)				\
+		     :"m" (next->esp),"m" (next->eip),			\
+		      "d" (prev), "a" (ti));				\
 } while (0)
 
 #define _set_base(addr,base) do { unsigned long __pr; \
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/thread_info.h	Mon Jan 29 15:36:16 2007 -0800
@@ -91,6 +91,12 @@ static inline struct thread_info *curren
 static inline struct thread_info *current_thread_info(void)
 {
 	return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+}
+
+/* XXX perhaps should be integrated with task_pt_regs(task) */
+static inline struct pt_regs *thread_info_pt_regs(struct thread_info *info)
+{
+	return (struct pt_regs *)(KSTK_TOP(info)-8) - 1;
 }
 
 /* thread information allocation */
@@ -169,6 +175,7 @@ static inline struct thread_info *curren
  */
 #define TS_USEDFPU		0x0001	/* FPU was used by this task this quantum (SMP) */
 #define TS_POLLING		0x0002	/* True if in idle loop and not sleeping */
+#define TS_FREE_AFTER_SWITCH	0x0004	/* free ti in __switch_to_fibril() */
 
 #define tsk_is_polling(t) ((t)->thread_info->status & TS_POLLING)
 
diff -r 26e278468209 -r df7bc026d50e include/linux/init_task.h
--- a/include/linux/init_task.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/init_task.h	Mon Jan 29 15:36:16 2007 -0800
@@ -111,6 +111,8 @@ extern struct group_info init_groups;
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
+	.fibril		= NULL,						\
+	.runnable_fibrils = LIST_HEAD_INIT(tsk.runnable_fibrils),	\
 	.run_list	= LIST_HEAD_INIT(tsk.run_list),			\
 	.ioprio		= 0,						\
 	.time_slice	= HZ,						\
diff -r 26e278468209 -r df7bc026d50e include/linux/sched.h
--- a/include/linux/sched.h	Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/sched.h	Mon Jan 29 15:36:16 2007 -0800
@@ -812,6 +812,38 @@ enum sleep_type {
 
 struct prio_array;
 
+/*
+ * A 'fibril' is a very small fiber.  It's used here to mean a small thread.
+ *
+ * (Choosing a weird new name avoided yet more overloading of 'task', 'call',
+ * 'thread', 'stack', 'fib{er,re}', etc).
+ *
+ * This structure is used by the scheduler to track multiple executing stacks
+ * inside a task_struct.
+ *
+ * Only one fibril executes for a given task_struct at a time.  When it
+ * blocks, however, another fibril has the chance to execute while it sleeps.
+ * This means that call chains executing in fibrils can see concurrent
+ * current-> accesses at blocking points.  "per_call_chain()" members are
+ * switched along with the fibril, so they remain local.  Preemption *will not*
+ * trigger a fibril switch.
+ *
+ * XXX
+ *  - arch specific
+ */
+struct fibril {
+	struct list_head		run_list;
+	/* -1 unrunnable, 0 runnable, >0 stopped */
+	long				state;
+	unsigned long			eip;
+	unsigned long			esp;
+	struct thread_info		*ti;
+	struct per_call_chain_storage	per_call;
+};
+
+void sched_new_runnable_fibril(struct fibril *fibril);
+void finish_fibril_switch(void);
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -857,6 +889,20 @@ struct task_struct {
 	struct list_head ptrace_list;
 
 	struct mm_struct *mm, *active_mm;
+
+	/*
+	 * The scheduler uses this to determine if the current call is a
+	 * stand-alone task or a fibril.  If it's a fibril then wake-ups
+	 * will target the fibril and a schedule() might result in swapping
+	 * in another runnable fibril.  So to start executing fibrils at all
+	 * one allocates a fibril to represent the running task and then
+	 * puts initialized runnable fibrils in the run list.
+	 *
+	 * The state members of the fibril and runnable_fibrils list are
+	 * managed under the task's run queue lock.
+	 */
+	struct fibril *fibril;
+	struct list_head runnable_fibrils;
 
 /* task state */
 	struct linux_binfmt *binfmt;
diff -r 26e278468209 -r df7bc026d50e kernel/exit.c
--- a/kernel/exit.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/exit.c	Mon Jan 29 15:36:16 2007 -0800
@@ -854,6 +854,13 @@ fastcall NORET_TYPE void do_exit(long co
 {
 	struct task_struct *tsk = current;
 	int group_dead;
+
+	/* 
+	 * XXX this is just a debug helper, this should be waiting for all
+	 * fibrils to return.  Possibly after sending them lots of -KILL
+	 * signals?
+	 */
+	BUG_ON(!list_empty(&current->runnable_fibrils));
 
 	profile_task_exit(tsk);
 
diff -r 26e278468209 -r df7bc026d50e kernel/fork.c
--- a/kernel/fork.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/fork.c	Mon Jan 29 15:36:16 2007 -0800
@@ -1179,6 +1179,9 @@ static struct task_struct *copy_process(
 
 	/* for sys_ioprio_set(IOPRIO_WHO_PGRP) */
 	p->ioprio = current->ioprio;
+
+	p->fibril = NULL;
+	INIT_LIST_HEAD(&p->runnable_fibrils);
 
 	/*
 	 * The task hasn't been attached yet, so its cpus_allowed mask will
diff -r 26e278468209 -r df7bc026d50e kernel/sched.c
--- a/kernel/sched.c	Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/sched.c	Mon Jan 29 15:36:16 2007 -0800
@@ -3407,6 +3407,111 @@ static inline int interactive_sleep(enum
 }
 
 /*
+ * This unwinds the state that was built up by schedule -> schedule_fibril().
+ * The arch-specific switch_to_fibril() path calls here once the new fibril
+ * is executing.
+ */
+void finish_fibril_switch(void)
+{
+	spin_unlock_irq(&this_rq()->lock);
+	preempt_enable_no_resched();
+}
+
+/*
+ * Add a new fibril to the runnable list.  It'll be switched to next time
+ * the caller comes through schedule().
+ */
+void sched_new_runnable_fibril(struct fibril *fibril)
+{
+	struct task_struct *tsk = current;
+	unsigned long flags;
+	struct rq *rq = task_rq_lock(tsk, &flags);
+
+	fibril->state = TASK_RUNNING;
+	BUG_ON(!list_empty(&fibril->run_list));
+	list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+
+	task_rq_unlock(rq, &flags);
+}
+
+/*
+ * This is called from schedule() when we're not being preempted and there is a
+ * fibril waiting in current->runnable_fibrils.
+ *
+ * This is called under the run queue lock to serialize fibril->state and the
+ * runnable_fibrils list with wake-up.  We drop it before switching and the
+ * return path takes that into account.
+ *
+ * We always switch so that a caller can specifically make a single pass
+ * through the runnable fibrils by marking itself _RUNNING and calling
+ * schedule().
+ */
+void schedule_fibril(struct task_struct *tsk)
+{
+	struct thread_info *ti = task_thread_info(tsk);
+	struct fibril *prev;
+	struct fibril *next;
+	struct fibril dummy;
+
+	/*
+	 * XXX We don't deal with the kernel lock yet.  It'd need to be audited
+	 * and lock_depth moved under per_call_chain().
+	 */
+	BUG_ON(tsk->lock_depth >= 0);
+
+	next = list_entry(current->runnable_fibrils.next, struct fibril,
+			 run_list);
+	list_del_init(&next->run_list);
+	BUG_ON(next->state != TASK_RUNNING);
+
+	prev = tsk->fibril;
+	if (prev) {
+		prev->state = tsk->state;
+		prev->per_call = current->per_call;
+		/*
+		 * This catches the case where the caller wants to make a pass
+		 * through runnable fibrils by marking itself _RUNNING and
+		 * calling schedule().  A fibril should not be able to be on
+		 * both tsk->fibril and the runnable_list.
+		 */
+		if (prev->state == TASK_RUNNING) {
+			BUG_ON(!list_empty(&prev->run_list));
+			list_add_tail(&prev->run_list,
+				      &current->runnable_fibrils);
+		}
+	} else {
+		/*
+		 * To free a fibril the calling path can free the structure
+		 * itself, set current->fibril to NULL, and call schedule().
+		 * That causes us to tell __switch_to_fibril() to free the ti
+		 * associated with the fibril once we've switched away from it.
+		 * The dummy is just used to give switch_to_fibril() something
+		 * to save state into.
+		 */
+		prev = &dummy;
+	}
+
+	/* 
+	 * XXX The idea is to copy all but the actual call stack.  Obviously
+	 * this is wildly arch-specific and belongs abstracted out.
+	 */
+	*next->ti = *ti;
+	*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
+
+	current->thread_info = next->ti;
+	current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
+	current->fibril = next;
+	current->state = next->state;
+	current->per_call = next->per_call;
+
+	if (prev == &dummy)
+		ti->status |= TS_FREE_AFTER_SWITCH;
+
+	/* __switch_to_fibril() drops the runqueue lock and enables preempt */
+	switch_to_fibril(prev, next, ti);
+}
+
+/*
  * schedule() is the main scheduler function.
  */
 asmlinkage void __sched schedule(void)
@@ -3468,6 +3573,22 @@ need_resched_nonpreemptible:
 	run_time /= (CURRENT_BONUS(prev) ? : 1);
 
 	spin_lock_irq(&rq->lock);
+
+	/* always switch to a runnable fibril if we aren't being preempted */
+	if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
+		     !list_empty(&prev->runnable_fibrils))) {
+		schedule_fibril(prev);
+		/* 
+		 * finish_fibril_switch() drops the rq lock and enables
+		 * preemption, but the popfl disables interrupts again.  Watch
+		 * me learn how context switch locking works before your very
+		 * eyes!  XXX This will need to be fixed up by throwing
+		 * together something like the prepare_lock_switch() path the
+		 * scheduler does.  Guidance appreciated!
+		 */
+		local_irq_enable();
+		return;
+	}
 
 	switch_count = &prev->nivcsw;
 	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

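For anyone who wants to poke at the switching rule in schedule_fibril() without
booting a kernel, here is a rough user-space model of it.  It is only a sketch:
the fs_* names, the two-state enum and the hand-rolled FIFO are made up for
illustration, and it models none of the locking, thread_info copying or stack
switching.  It only shows the ordering: take the first runnable fibril, save
the task state into the outgoing fibril, and requeue the outgoing fibril at the
tail if it was still running.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states; the kernel uses TASK_RUNNING and friends. */
enum { FS_RUNNING = 0, FS_SLEEPING = 1 };

struct fs_fibril {
	long state;
	struct fs_fibril *next;		/* link in the runnable FIFO */
};

struct fs_task {
	long state;			/* stands in for tsk->state */
	struct fs_fibril *cur;		/* stands in for tsk->fibril */
	struct fs_fibril *head, *tail;	/* stands in for runnable_fibrils */
};

static void fs_enqueue(struct fs_task *t, struct fs_fibril *f)
{
	f->next = NULL;
	if (t->tail)
		t->tail->next = f;
	else
		t->head = f;
	t->tail = f;
}

static struct fs_fibril *fs_dequeue(struct fs_task *t)
{
	struct fs_fibril *f = t->head;

	if (f) {
		t->head = f->next;
		if (!t->head)
			t->tail = NULL;
		f->next = NULL;
	}
	return f;
}

/*
 * Switch to the first runnable fibril.  Returns the new current fibril, or
 * NULL if nothing was runnable (the patch only calls in when the list is
 * non-empty, so that case is this sketch's own addition).
 */
static struct fs_fibril *fs_schedule(struct fs_task *t)
{
	struct fs_fibril *prev = t->cur;
	struct fs_fibril *next = fs_dequeue(t);

	if (!next)
		return NULL;

	if (prev) {
		/* the outgoing fibril remembers the state the task was in */
		prev->state = t->state;
		/* still running means it only yielded: back of the queue */
		if (prev->state == FS_RUNNING)
			fs_enqueue(t, prev);
	}
	/* prev == NULL models the "fibril was freed" case: nothing to save */

	t->cur = next;
	t->state = next->state;
	return next;
}
```

With three running fibrils this gives plain round-robin.  A fibril that went to
sleep before calling in is left off the list until something wakes it.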

* [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
  2007-01-30 20:39 ` [PATCH 1 of 4] Introduce per_call_chain() Zach Brown
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
@ 2007-01-30 20:39 ` Zach Brown
  2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

Now that multiple fibrils can sleep under a single task_struct, waking the
task_struct is no longer enough to wake a specific sleeping code path.

This patch introduces task_wake_target() as a way to refer to a code path that
is about to sleep and will be woken in the future.  Sleepers that used to store
a reference to the current task_struct and have it woken with wake_up_process()
now use this helper to get a wake target cookie, which is later woken with
wake_up_target().
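
To make the cookie idea concrete, here is a small user-space sketch of one way
such a cookie could be encoded.  It assumes that bit 0 of the pointer marks a
fibril; the only grounding for that is try_to_wake_up_fibril() in this patch
masking bit 0 off before using the pointer.  The wt_* names and both structs
are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for task_struct and struct fibril. */
struct wt_fibril;
struct wt_task {
	int pid;
	struct wt_fibril *fibril;	/* current fibril, NULL if none */
};
struct wt_fibril {
	struct wt_task *task;
	long state;
};

/* Assumption: bit 0 set means "this cookie points at a fibril". */
#define WT_FIBRIL_BIT ((uintptr_t)1)

/* Build a cookie for whatever is currently executing under the task. */
static void *wt_wake_target(struct wt_task *t)
{
	if (!t->fibril)
		return t;
	/* both structs hold pointers, so their alignment keeps bit 0 free */
	assert(((uintptr_t)t->fibril & WT_FIBRIL_BIT) == 0);
	return (void *)((uintptr_t)t->fibril | WT_FIBRIL_BIT);
}

static int wt_target_is_fibril(void *target)
{
	return ((uintptr_t)target & WT_FIBRIL_BIT) != 0;
}

static struct wt_fibril *wt_target_to_fibril(void *target)
{
	return (struct wt_fibril *)((uintptr_t)target & ~WT_FIBRIL_BIT);
}

/* Recover the owning task from either kind of cookie. */
static struct wt_task *wt_target_to_task(void *target)
{
	if (wt_target_is_fibril(target))
		return wt_target_to_fibril(target)->task;
	return target;
}
```

The point of the single void * is that existing sleeper structs only need one
field changed, and callers that still need the task (for static_prio or pid,
say) can get it back through a helper like wake_target_to_task().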

Some paths know that waking a task will be sufficient.  Paths working with
kernel threads that never use fibrils fall into this category.  They're changed
to use wake_up_task() instead of wake_up_process().

This is not an exhaustive patch.  It isn't yet clear how signals are going to
interact with fibrils.  Once that is decided, callers of wake_up_state() are
going to need to reflect the desired behaviour.  I add __deprecated to it to
highlight this detail.

The actual act of performing the wake-up is hidden under try_to_wake_up() and
is serialized with the scheduler under the runqueue lock.  This is very
fiddly stuff.  I'm sure I've missed some details.  I've tried to comment
the intent above try_to_wake_up_fibril().
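
As a rough illustration of that intent, the two cases can be modeled in user
space like this.  It is a sketch under assumptions: the fw_* names, the state
values and the boolean return are mine rather than the patch's, there is no
locking at all, and the patch's extra check on tsk->array before setting
TIF_NEED_RESCHED is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state bits, shaped like TASK_* so a mask can select them. */
enum { FW_RUNNING = 0, FW_INTERRUPTIBLE = 1, FW_UNINTERRUPTIBLE = 2 };

struct fw_fibril {
	long state;
	struct fw_fibril *next;		/* link in the runnable FIFO */
	bool queued;
};

struct fw_task {
	struct fw_fibril *cur;		/* fibril the task is executing */
	struct fw_fibril *head, *tail;	/* runnable fibrils, FIFO order */
	bool need_resched;		/* stands in for TIF_NEED_RESCHED */
};

static void fw_enqueue(struct fw_task *t, struct fw_fibril *f)
{
	assert(!f->queued);		/* mirrors the BUG_ON in the patch */
	f->next = NULL;
	f->queued = true;
	if (t->tail)
		t->tail->next = f;
	else
		t->head = f;
	t->tail = f;
}

/*
 * Wake fibril f under task t if its state matches mask.  Returns true if
 * the fibril was moved to the running state (this sketch's own convention).
 */
static bool fw_wake_fibril(struct fw_task *t, struct fw_fibril *f, long mask)
{
	if (!(f->state & mask))
		return false;		/* not sleeping in a wakeable state */

	f->state = FW_RUNNING;

	if (t->cur != f) {
		/* sleeping off to the side: queue it and force a reschedule */
		fw_enqueue(t, f);
		t->need_resched = true;
	}
	/* else: the task is already on this fibril, waking the task suffices */

	return true;
}
```

A second wake of the same fibril falls out at the state check, because
FW_RUNNING matches no mask, so it never gets queued twice.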

diff -r df7bc026d50e -r 4ea674e8825e arch/i386/kernel/ptrace.c
--- a/arch/i386/kernel/ptrace.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/arch/i386/kernel/ptrace.c	Mon Jan 29 15:46:47 2007 -0800
@@ -492,7 +492,7 @@ long arch_ptrace(struct task_struct *chi
 		child->exit_code = data;
 		/* make sure the single step bit is not set. */
 		clear_singlestep(child);
-		wake_up_process(child);
+		wake_up_task(child);
 		ret = 0;
 		break;
 
@@ -508,7 +508,7 @@ long arch_ptrace(struct task_struct *chi
 		child->exit_code = SIGKILL;
 		/* make sure the single step bit is not set. */
 		clear_singlestep(child);
-		wake_up_process(child);
+		wake_up_task(child);
 		break;
 
 	case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */
@@ -526,7 +526,7 @@ long arch_ptrace(struct task_struct *chi
 		set_singlestep(child);
 		child->exit_code = data;
 		/* give it a chance to run. */
-		wake_up_process(child);
+		wake_up_task(child);
 		ret = 0;
 		break;
 
diff -r df7bc026d50e -r 4ea674e8825e drivers/block/loop.c
--- a/drivers/block/loop.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/block/loop.c	Mon Jan 29 15:46:47 2007 -0800
@@ -824,7 +824,7 @@ static int loop_set_fd(struct loop_devic
 		goto out_clr;
 	}
 	lo->lo_state = Lo_bound;
-	wake_up_process(lo->lo_thread);
+	wake_up_task(lo->lo_thread);
 	return 0;
 
 out_clr:
diff -r df7bc026d50e -r 4ea674e8825e drivers/md/dm-io.c
--- a/drivers/md/dm-io.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/md/dm-io.c	Mon Jan 29 15:46:47 2007 -0800
@@ -18,7 +18,7 @@ struct io {
 struct io {
 	unsigned long error;
 	atomic_t count;
-	struct task_struct *sleeper;
+	void *wake_target;
 	io_notify_fn callback;
 	void *context;
 };
@@ -110,8 +110,8 @@ static void dec_count(struct io *io, uns
 		set_bit(region, &io->error);
 
 	if (atomic_dec_and_test(&io->count)) {
-		if (io->sleeper)
-			wake_up_process(io->sleeper);
+		if (io->wake_target)
+			wake_up_target(io->wake_target);
 
 		else {
 			int r = io->error;
@@ -323,7 +323,7 @@ static int sync_io(unsigned int num_regi
 
 	io.error = 0;
 	atomic_set(&io.count, 1); /* see dispatch_io() */
-	io.sleeper = current;
+	io.wake_target = task_wake_target(current);
 
 	dispatch_io(rw, num_regions, where, dp, &io, 1);
 
@@ -358,7 +358,7 @@ static int async_io(unsigned int num_reg
 	io = mempool_alloc(_io_pool, GFP_NOIO);
 	io->error = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
-	io->sleeper = NULL;
+	io->wake_target = NULL;
 	io->callback = fn;
 	io->context = context;
 
diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/qla2xxx/qla_os.c
--- a/drivers/scsi/qla2xxx/qla_os.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/qla2xxx/qla_os.c	Mon Jan 29 15:46:47 2007 -0800
@@ -2403,7 +2403,7 @@ qla2xxx_wake_dpc(scsi_qla_host_t *ha)
 qla2xxx_wake_dpc(scsi_qla_host_t *ha)
 {
 	if (ha->dpc_thread)
-		wake_up_process(ha->dpc_thread);
+		wake_up_task(ha->dpc_thread);
 }
 
 /*
diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/scsi_error.c	Mon Jan 29 15:46:47 2007 -0800
@@ -51,7 +51,7 @@ void scsi_eh_wakeup(struct Scsi_Host *sh
 void scsi_eh_wakeup(struct Scsi_Host *shost)
 {
 	if (shost->host_busy == shost->host_failed) {
-		wake_up_process(shost->ehandler);
+		wake_up_task(shost->ehandler);
 		SCSI_LOG_ERROR_RECOVERY(5,
 				printk("Waking error handler thread\n"));
 	}
diff -r df7bc026d50e -r 4ea674e8825e fs/aio.c
--- a/fs/aio.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/aio.c	Mon Jan 29 15:46:47 2007 -0800
@@ -907,7 +907,7 @@ void fastcall kick_iocb(struct kiocb *io
 	 * single context. */
 	if (is_sync_kiocb(iocb)) {
 		kiocbSetKicked(iocb);
-	        wake_up_process(iocb->ki_obj.tsk);
+	        wake_up_target(iocb->ki_obj.wake_target);
 		return;
 	}
 
@@ -941,7 +941,7 @@ int fastcall aio_complete(struct kiocb *
 		BUG_ON(iocb->ki_users != 1);
 		iocb->ki_user_data = res;
 		iocb->ki_users = 0;
-		wake_up_process(iocb->ki_obj.tsk);
+		wake_up_target(iocb->ki_obj.wake_target);
 		return 1;
 	}
 
@@ -1053,7 +1053,7 @@ struct aio_timeout {
 struct aio_timeout {
 	struct timer_list	timer;
 	int			timed_out;
-	struct task_struct	*p;
+	void			*wake_target;
 };
 
 static void timeout_func(unsigned long data)
@@ -1061,7 +1061,7 @@ static void timeout_func(unsigned long d
 	struct aio_timeout *to = (struct aio_timeout *)data;
 
 	to->timed_out = 1;
-	wake_up_process(to->p);
+	wake_up_target(to->wake_target);
 }
 
 static inline void init_timeout(struct aio_timeout *to)
@@ -1070,7 +1070,7 @@ static inline void init_timeout(struct a
 	to->timer.data = (unsigned long)to;
 	to->timer.function = timeout_func;
 	to->timed_out = 0;
-	to->p = current;
+	to->wake_target = task_wake_target(current);
 }
 
 static inline void set_timeout(long start_jiffies, struct aio_timeout *to,
diff -r df7bc026d50e -r 4ea674e8825e fs/direct-io.c
--- a/fs/direct-io.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/direct-io.c	Mon Jan 29 15:46:47 2007 -0800
@@ -124,7 +124,7 @@ struct dio {
 	spinlock_t bio_lock;		/* protects BIO fields below */
 	unsigned long refcount;		/* direct_io_worker() and bios */
 	struct bio *bio_list;		/* singly linked via bi_private */
-	struct task_struct *waiter;	/* waiting task (NULL if none) */
+	void *wake_target;		/* waiting initiator (NULL if none) */
 
 	/* AIO related stuff */
 	struct kiocb *iocb;		/* kiocb */
@@ -278,8 +278,8 @@ static int dio_bio_end_aio(struct bio *b
 
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	remaining = --dio->refcount;
-	if (remaining == 1 && dio->waiter)
-		wake_up_process(dio->waiter);
+	if (remaining == 1 && dio->wake_target)
+		wake_up_target(dio->wake_target);
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 
 	if (remaining == 0) {
@@ -309,8 +309,8 @@ static int dio_bio_end_io(struct bio *bi
 	spin_lock_irqsave(&dio->bio_lock, flags);
 	bio->bi_private = dio->bio_list;
 	dio->bio_list = bio;
-	if (--dio->refcount == 1 && dio->waiter)
-		wake_up_process(dio->waiter);
+	if (--dio->refcount == 1 && dio->wake_target)
+		wake_up_target(dio->wake_target);
 	spin_unlock_irqrestore(&dio->bio_lock, flags);
 	return 0;
 }
@@ -393,12 +393,12 @@ static struct bio *dio_await_one(struct 
 	 */
 	while (dio->refcount > 1 && dio->bio_list == NULL) {
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-		dio->waiter = current;
+		dio->wake_target = task_wake_target(current);
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
 		io_schedule();
 		/* wake up sets us TASK_RUNNING */
 		spin_lock_irqsave(&dio->bio_lock, flags);
-		dio->waiter = NULL;
+		dio->wake_target = NULL;
 	}
 	if (dio->bio_list) {
 		bio = dio->bio_list;
@@ -990,7 +990,7 @@ direct_io_worker(int rw, struct kiocb *i
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 	dio->bio_list = NULL;
-	dio->waiter = NULL;
+	dio->wake_target = NULL;
 
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
diff -r df7bc026d50e -r 4ea674e8825e fs/jbd/journal.c
--- a/fs/jbd/journal.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/jbd/journal.c	Mon Jan 29 15:46:47 2007 -0800
@@ -94,7 +94,7 @@ static void commit_timeout(unsigned long
 {
 	struct task_struct * p = (struct task_struct *) __data;
 
-	wake_up_process(p);
+	wake_up_task(p);
 }
 
 /*
diff -r df7bc026d50e -r 4ea674e8825e include/linux/aio.h
--- a/include/linux/aio.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/aio.h	Mon Jan 29 15:46:47 2007 -0800
@@ -98,7 +98,7 @@ struct kiocb {
 
 	union {
 		void __user		*user;
-		struct task_struct	*tsk;
+		void 			*wake_target;
 	} ki_obj;
 
 	__u64			ki_user_data;	/* user's data for completion */
@@ -124,7 +124,6 @@ struct kiocb {
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
 #define init_sync_kiocb(x, filp)			\
 	do {						\
-		struct task_struct *tsk = current;	\
 		(x)->ki_flags = 0;			\
 		(x)->ki_users = 1;			\
 		(x)->ki_key = KIOCB_SYNC_KEY;		\
@@ -133,7 +132,7 @@ struct kiocb {
 		(x)->ki_cancel = NULL;			\
 		(x)->ki_retry = NULL;			\
 		(x)->ki_dtor = NULL;			\
-		(x)->ki_obj.tsk = tsk;			\
+		(x)->ki_obj.wake_target = task_wake_target(current);	\
 		(x)->ki_user_data = 0;                  \
 		init_wait((&(x)->ki_wait));             \
 	} while (0)
diff -r df7bc026d50e -r 4ea674e8825e include/linux/freezer.h
--- a/include/linux/freezer.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/freezer.h	Mon Jan 29 15:46:47 2007 -0800
@@ -42,7 +42,7 @@ static inline int thaw_process(struct ta
 {
 	if (frozen(p)) {
 		p->flags &= ~PF_FROZEN;
-		wake_up_process(p);
+		wake_up_task(p);
 		return 1;
 	}
 	return 0;
diff -r df7bc026d50e -r 4ea674e8825e include/linux/hrtimer.h
--- a/include/linux/hrtimer.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/hrtimer.h	Mon Jan 29 15:46:47 2007 -0800
@@ -65,7 +65,7 @@ struct hrtimer {
  */
 struct hrtimer_sleeper {
 	struct hrtimer timer;
-	struct task_struct *task;
+	void *wake_target;
 };
 
 /**
diff -r df7bc026d50e -r 4ea674e8825e include/linux/kthread.h
--- a/include/linux/kthread.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/kthread.h	Mon Jan 29 15:46:47 2007 -0800
@@ -22,7 +22,7 @@ struct task_struct *kthread_create(int (
 	struct task_struct *__k						   \
 		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
 	if (!IS_ERR(__k))						   \
-		wake_up_process(__k);					   \
+		wake_up_task(__k);					   \
 	__k;								   \
 })
 
diff -r df7bc026d50e -r 4ea674e8825e include/linux/module.h
--- a/include/linux/module.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/module.h	Mon Jan 29 15:46:47 2007 -0800
@@ -334,7 +334,7 @@ struct module
 	struct list_head modules_which_use_me;
 
 	/* Who is waiting for us to be unloaded */
-	struct task_struct *waiter;
+	void *wake_target;
 
 	/* Destruction function. */
 	void (*exit)(void);
diff -r df7bc026d50e -r 4ea674e8825e include/linux/mutex.h
--- a/include/linux/mutex.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/mutex.h	Mon Jan 29 15:46:47 2007 -0800
@@ -65,7 +65,7 @@ struct mutex {
  */
 struct mutex_waiter {
 	struct list_head	list;
-	struct task_struct	*task;
+	void			*wake_target;
 #ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
 	void			*magic;
diff -r df7bc026d50e -r 4ea674e8825e include/linux/posix-timers.h
--- a/include/linux/posix-timers.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/posix-timers.h	Mon Jan 29 15:46:47 2007 -0800
@@ -48,6 +48,7 @@ struct k_itimer {
 	int it_sigev_signo;		/* signo word of sigevent struct */
 	sigval_t it_sigev_value;	/* value word of sigevent struct */
 	struct task_struct *it_process;	/* process to send signal to */
+	void *it_wake_target;		/* wake target for nanosleep case */
 	struct sigqueue *sigq;		/* signal queue entry. */
 	union {
 		struct {
diff -r df7bc026d50e -r 4ea674e8825e include/linux/sched.h
--- a/include/linux/sched.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/sched.h	Mon Jan 29 15:46:47 2007 -0800
@@ -1338,8 +1338,14 @@ extern void switch_uid(struct user_struc
 
 extern void do_timer(unsigned long ticks);
 
-extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
-extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+/* 
+ * XXX We need to figure out how signal delivery will wake the fibrils in
+ * a task.  This is marked deprecated so that we get a compile-time warning
+ * to worry about it.
+ */
+extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state)) __deprecated;
+extern int FASTCALL(wake_up_target(void *wake_target));
+extern int FASTCALL(wake_up_task(struct task_struct *task));
 extern void FASTCALL(wake_up_new_task(struct task_struct * tsk,
 						unsigned long clone_flags));
 #ifdef CONFIG_SMP
diff -r df7bc026d50e -r 4ea674e8825e include/linux/sem.h
--- a/include/linux/sem.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/sem.h	Mon Jan 29 15:46:47 2007 -0800
@@ -104,7 +104,7 @@ struct sem_queue {
 struct sem_queue {
 	struct sem_queue *	next;	 /* next entry in the queue */
 	struct sem_queue **	prev;	 /* previous entry in the queue, *(q->prev) == q */
-	struct task_struct*	sleeper; /* this process */
+	void			*wake_target;
 	struct sem_undo *	undo;	 /* undo structure */
 	int    			pid;	 /* process id of requesting process */
 	int    			status;	 /* completion status of operation */
diff -r df7bc026d50e -r 4ea674e8825e include/linux/wait.h
--- a/include/linux/wait.h	Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/wait.h	Mon Jan 29 15:46:47 2007 -0800
@@ -54,13 +54,16 @@ typedef struct __wait_queue_head wait_qu
 typedef struct __wait_queue_head wait_queue_head_t;
 
 struct task_struct;
+/* XXX sigh, wait.h <-> sched.h have some fun ordering */
+void *task_wake_target(struct task_struct *task);
+struct task_struct *wake_target_to_task(void *wake_target);
 
 /*
  * Macros for declaration and initialisaton of the datatypes
  */
 
 #define __WAITQUEUE_INITIALIZER(name, tsk) {				\
-	.private	= tsk,						\
+	.private	= task_wake_target(tsk),			\
 	.func		= default_wake_function,			\
 	.task_list	= { NULL, NULL } }
 
@@ -91,7 +94,7 @@ static inline void init_waitqueue_entry(
 static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
 {
 	q->flags = 0;
-	q->private = p;
+	q->private = task_wake_target(p);
 	q->func = default_wake_function;
 }
 
@@ -389,7 +392,7 @@ int wake_bit_function(wait_queue_t *wait
 
 #define DEFINE_WAIT(name)						\
 	wait_queue_t name = {						\
-		.private	= current,				\
+		.private	= task_wake_target(current),		\
 		.func		= autoremove_wake_function,		\
 		.task_list	= LIST_HEAD_INIT((name).task_list),	\
 	}
@@ -398,7 +401,7 @@ int wake_bit_function(wait_queue_t *wait
 	struct wait_bit_queue name = {					\
 		.key = __WAIT_BIT_KEY_INITIALIZER(word, bit),		\
 		.wait	= {						\
-			.private	= current,			\
+			.private	= task_wake_target(current),	\
 			.func		= wake_bit_function,		\
 			.task_list	=				\
 				LIST_HEAD_INIT((name).wait.task_list),	\
@@ -407,7 +410,7 @@ int wake_bit_function(wait_queue_t *wait
 
 #define init_wait(wait)							\
 	do {								\
-		(wait)->private = current;				\
+		(wait)->private = task_wake_target(current);		\
 		(wait)->func = autoremove_wake_function;		\
 		INIT_LIST_HEAD(&(wait)->task_list);			\
 	} while (0)
diff -r df7bc026d50e -r 4ea674e8825e ipc/mqueue.c
--- a/ipc/mqueue.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/mqueue.c	Mon Jan 29 15:46:47 2007 -0800
@@ -58,7 +58,7 @@
 
 
 struct ext_wait_queue {		/* queue of sleeping tasks */
-	struct task_struct *task;
+	void *wake_target;
 	struct list_head list;
 	struct msg_msg *msg;	/* ptr of loaded message */
 	int state;		/* one of STATE_* values */
@@ -394,10 +394,11 @@ static void wq_add(struct mqueue_inode_i
 {
 	struct ext_wait_queue *walk;
 
-	ewp->task = current;
+	ewp->wake_target = task_wake_target(current);
 
 	list_for_each_entry(walk, &info->e_wait_q[sr].list, list) {
-		if (walk->task->static_prio <= current->static_prio) {
+		if (wake_target_to_task(walk->wake_target)->static_prio 
+		    <= current->static_prio) {
 			list_add_tail(&ewp->list, &walk->list);
 			return;
 		}
@@ -785,7 +786,7 @@ static inline void pipelined_send(struct
 	receiver->msg = message;
 	list_del(&receiver->list);
 	receiver->state = STATE_PENDING;
-	wake_up_process(receiver->task);
+	wake_up_target(receiver->wake_target);
 	smp_wmb();
 	receiver->state = STATE_READY;
 }
@@ -804,7 +805,7 @@ static inline void pipelined_receive(str
 	msg_insert(sender->msg, info);
 	list_del(&sender->list);
 	sender->state = STATE_PENDING;
-	wake_up_process(sender->task);
+	wake_up_target(sender->wake_target);
 	smp_wmb();
 	sender->state = STATE_READY;
 }
@@ -869,7 +870,7 @@ asmlinkage long sys_mq_timedsend(mqd_t m
 			spin_unlock(&info->lock);
 			ret = timeout;
 		} else {
-			wait.task = current;
+			wait.wake_target = task_wake_target(current);
 			wait.msg = (void *) msg_ptr;
 			wait.state = STATE_NONE;
 			ret = wq_sleep(info, SEND, timeout, &wait);
@@ -944,7 +945,7 @@ asmlinkage ssize_t sys_mq_timedreceive(m
 			ret = timeout;
 			msg_ptr = NULL;
 		} else {
-			wait.task = current;
+			wait.wake_target = task_wake_target(current);
 			wait.state = STATE_NONE;
 			ret = wq_sleep(info, RECV, timeout, &wait);
 			msg_ptr = wait.msg;
diff -r df7bc026d50e -r 4ea674e8825e ipc/msg.c
--- a/ipc/msg.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/msg.c	Mon Jan 29 15:46:47 2007 -0800
@@ -46,7 +46,7 @@
  */
 struct msg_receiver {
 	struct list_head	r_list;
-	struct task_struct	*r_tsk;
+	void			*r_wake_target;
 
 	int			r_mode;
 	long			r_msgtype;
@@ -58,7 +58,7 @@ struct msg_receiver {
 /* one msg_sender for each sleeping sender */
 struct msg_sender {
 	struct list_head	list;
-	struct task_struct	*tsk;
+	void			*wake_target;
 };
 
 #define SEARCH_ANY		1
@@ -180,7 +180,7 @@ static int newque (struct ipc_namespace 
 
 static inline void ss_add(struct msg_queue *msq, struct msg_sender *mss)
 {
-	mss->tsk = current;
+	mss->wake_target = task_wake_target(current);
 	current->state = TASK_INTERRUPTIBLE;
 	list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -203,7 +203,7 @@ static void ss_wakeup(struct list_head *
 		tmp = tmp->next;
 		if (kill)
 			mss->list.next = NULL;
-		wake_up_process(mss->tsk);
+		wake_up_target(mss->wake_target);
 	}
 }
 
@@ -218,7 +218,7 @@ static void expunge_all(struct msg_queue
 		msr = list_entry(tmp, struct msg_receiver, r_list);
 		tmp = tmp->next;
 		msr->r_msg = NULL;
-		wake_up_process(msr->r_tsk);
+		wake_up_target(msr->r_wake_target);
 		smp_mb();
 		msr->r_msg = ERR_PTR(res);
 	}
@@ -602,20 +602,21 @@ static inline int pipelined_send(struct 
 		msr = list_entry(tmp, struct msg_receiver, r_list);
 		tmp = tmp->next;
 		if (testmsg(msg, msr->r_msgtype, msr->r_mode) &&
-		    !security_msg_queue_msgrcv(msq, msg, msr->r_tsk,
-					       msr->r_msgtype, msr->r_mode)) {
+		    !security_msg_queue_msgrcv(msq, msg,
+				wake_target_to_task(msr->r_wake_target),
+				msr->r_msgtype, msr->r_mode)) {
 
 			list_del(&msr->r_list);
 			if (msr->r_maxsize < msg->m_ts) {
 				msr->r_msg = NULL;
-				wake_up_process(msr->r_tsk);
+				wake_up_target(msr->r_wake_target);
 				smp_mb();
 				msr->r_msg = ERR_PTR(-E2BIG);
 			} else {
 				msr->r_msg = NULL;
-				msq->q_lrpid = msr->r_tsk->pid;
+				msq->q_lrpid = wake_target_to_task(msr->r_wake_target)->pid;
 				msq->q_rtime = get_seconds();
-				wake_up_process(msr->r_tsk);
+				wake_up_target(msr->r_wake_target);
 				smp_mb();
 				msr->r_msg = msg;
 
@@ -826,7 +827,7 @@ long do_msgrcv(int msqid, long *pmtype, 
 			goto out_unlock;
 		}
 		list_add_tail(&msr_d.r_list, &msq->q_receivers);
-		msr_d.r_tsk = current;
+		msr_d.r_wake_target = task_wake_target(current);
 		msr_d.r_msgtype = msgtyp;
 		msr_d.r_mode = mode;
 		if (msgflg & MSG_NOERROR)
diff -r df7bc026d50e -r 4ea674e8825e ipc/sem.c
--- a/ipc/sem.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/sem.c	Mon Jan 29 15:46:47 2007 -0800
@@ -411,7 +411,7 @@ static void update_queue (struct sem_arr
 		error = try_atomic_semop(sma, q->sops, q->nsops,
 					 q->undo, q->pid);
 
-		/* Does q->sleeper still need to sleep? */
+		/* Does q->wake_target still need to sleep? */
 		if (error <= 0) {
 			struct sem_queue *n;
 			remove_from_queue(sma,q);
@@ -431,7 +431,7 @@ static void update_queue (struct sem_arr
 				n = sma->sem_pending;
 			else
 				n = q->next;
-			wake_up_process(q->sleeper);
+			wake_up_target(q->wake_target);
 			/* hands-off: q will disappear immediately after
 			 * writing q->status.
 			 */
@@ -515,7 +515,7 @@ static void freeary (struct ipc_namespac
 		q->prev = NULL;
 		n = q->next;
 		q->status = IN_WAKEUP;
-		wake_up_process(q->sleeper); /* doesn't sleep */
+		wake_up_target(q->wake_target); /* doesn't sleep */
 		smp_wmb();
 		q->status = -EIDRM;	/* hands-off q */
 		q = n;
@@ -1223,7 +1223,7 @@ retry_undos:
 		prepend_to_queue(sma ,&queue);
 
 	queue.status = -EINTR;
-	queue.sleeper = current;
+	queue.wake_target = task_wake_target(current);
 	current->state = TASK_INTERRUPTIBLE;
 	sem_unlock(sma);
 
diff -r df7bc026d50e -r 4ea674e8825e kernel/exit.c
--- a/kernel/exit.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/exit.c	Mon Jan 29 15:46:47 2007 -0800
@@ -91,7 +91,7 @@ static void __exit_signal(struct task_st
 		 * then notify it:
 		 */
 		if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count) {
-			wake_up_process(sig->group_exit_task);
+			wake_up_task(sig->group_exit_task);
 			sig->group_exit_task = NULL;
 		}
 		if (tsk == sig->curr_target)
diff -r df7bc026d50e -r 4ea674e8825e kernel/hrtimer.c
--- a/kernel/hrtimer.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/hrtimer.c	Mon Jan 29 15:46:47 2007 -0800
@@ -660,11 +660,11 @@ static int hrtimer_wakeup(struct hrtimer
 {
 	struct hrtimer_sleeper *t =
 		container_of(timer, struct hrtimer_sleeper, timer);
-	struct task_struct *task = t->task;
-
-	t->task = NULL;
-	if (task)
-		wake_up_process(task);
+	void *wake_target = t->wake_target;
+
+	t->wake_target = NULL;
+	if (wake_target)
+		wake_up_target(wake_target);
 
 	return HRTIMER_NORESTART;
 }
@@ -672,7 +672,7 @@ void hrtimer_init_sleeper(struct hrtimer
 void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task)
 {
 	sl->timer.function = hrtimer_wakeup;
-	sl->task = task;
+	sl->wake_target = task_wake_target(task);
 }
 
 static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
@@ -688,9 +688,9 @@ static int __sched do_nanosleep(struct h
 		hrtimer_cancel(&t->timer);
 		mode = HRTIMER_ABS;
 
-	} while (t->task && !signal_pending(current));
-
-	return t->task == NULL;
+	} while (t->wake_target && !signal_pending(current));
+
+	return t->wake_target == NULL;
 }
 
 long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
diff -r df7bc026d50e -r 4ea674e8825e kernel/kthread.c
--- a/kernel/kthread.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/kthread.c	Mon Jan 29 15:46:47 2007 -0800
@@ -232,7 +232,7 @@ int kthread_stop(struct task_struct *k)
 
 	/* Now set kthread_should_stop() to true, and wake it up. */
 	kthread_stop_info.k = k;
-	wake_up_process(k);
+	wake_up_task(k);
 	put_task_struct(k);
 
 	/* Once it dies, reset stop ptr, gather result and we're done. */
diff -r df7bc026d50e -r 4ea674e8825e kernel/module.c
--- a/kernel/module.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/module.c	Mon Jan 29 15:46:47 2007 -0800
@@ -508,7 +508,7 @@ static void module_unload_init(struct mo
 	/* Hold reference count during initialization. */
 	local_set(&mod->ref[raw_smp_processor_id()].count, 1);
 	/* Backwards compatibility macros put refcount during init. */
-	mod->waiter = current;
+	mod->wake_target = task_wake_target(current);
 }
 
 /* modules using other modules */
@@ -699,7 +699,7 @@ sys_delete_module(const char __user *nam
 	}
 
 	/* Set this up before setting mod->state */
-	mod->waiter = current;
+	mod->wake_target = task_wake_target(current);
 
 	/* Stop the machine so refcounts can't move and disable module. */
 	ret = try_stop_module(mod, flags, &forced);
@@ -797,7 +797,7 @@ void module_put(struct module *module)
 		local_dec(&module->ref[cpu].count);
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
-			wake_up_process(module->waiter);
+			wake_up_target(module->wake_target);
 		put_cpu();
 	}
 }
diff -r df7bc026d50e -r 4ea674e8825e kernel/mutex-debug.c
--- a/kernel/mutex-debug.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/mutex-debug.c	Mon Jan 29 15:46:47 2007 -0800
@@ -53,6 +53,7 @@ void debug_mutex_free_waiter(struct mute
 	memset(waiter, MUTEX_DEBUG_FREE, sizeof(*waiter));
 }
 
+#warning "this is going to need updating for fibrils"
 void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			    struct thread_info *ti)
 {
@@ -67,12 +68,12 @@ void mutex_remove_waiter(struct mutex *l
 			 struct thread_info *ti)
 {
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
-	DEBUG_LOCKS_WARN_ON(waiter->task != ti->task);
+	DEBUG_LOCKS_WARN_ON(waiter->wake_target != task_wake_target(ti->task));
 	DEBUG_LOCKS_WARN_ON(ti->task->blocked_on != waiter);
 	ti->task->blocked_on = NULL;
 
 	list_del_init(&waiter->list);
-	waiter->task = NULL;
+	waiter->wake_target = NULL;
 }
 
 void debug_mutex_unlock(struct mutex *lock)
diff -r df7bc026d50e -r 4ea674e8825e kernel/mutex.c
--- a/kernel/mutex.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/mutex.c	Mon Jan 29 15:46:47 2007 -0800
@@ -137,7 +137,7 @@ __mutex_lock_common(struct mutex *lock, 
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
-	waiter.task = task;
+	waiter.wake_target = task_wake_target(task);
 
 	for (;;) {
 		/*
@@ -246,7 +246,7 @@ __mutex_unlock_common_slowpath(atomic_t 
 
 		debug_mutex_wake_waiter(lock, waiter);
 
-		wake_up_process(waiter->task);
+		wake_up_target(waiter->wake_target);
 	}
 
 	debug_mutex_clear_owner(lock);
diff -r df7bc026d50e -r 4ea674e8825e kernel/posix-cpu-timers.c
--- a/kernel/posix-cpu-timers.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/posix-cpu-timers.c	Mon Jan 29 15:46:47 2007 -0800
@@ -673,7 +673,7 @@ static void cpu_timer_fire(struct k_itim
 		 * This a special case for clock_nanosleep,
 		 * not a normal timer from sys_timer_create.
 		 */
-		wake_up_process(timer->it_process);
+		wake_up_target(timer->it_wake_target);
 		timer->it.cpu.expires.sched = 0;
 	} else if (timer->it.cpu.incr.sched == 0) {
 		/*
@@ -1423,6 +1423,12 @@ static int do_cpu_nanosleep(const clocki
 	timer.it_overrun = -1;
 	error = posix_cpu_timer_create(&timer);
 	timer.it_process = current;
+	/* 
+	 * XXX This isn't quite right, but the rest of the it_process users
+	 * fall under the currently unresolved question of how signal delivery
+	 * will behave.
+	 */
+	timer.it_wake_target = task_wake_target(current);
 	if (!error) {
 		static struct itimerspec zero_it;
 
diff -r df7bc026d50e -r 4ea674e8825e kernel/ptrace.c
--- a/kernel/ptrace.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/ptrace.c	Mon Jan 29 15:46:47 2007 -0800
@@ -221,7 +221,7 @@ static inline void __ptrace_detach(struc
 	__ptrace_unlink(child);
 	/* .. and wake it up. */
 	if (child->exit_state != EXIT_ZOMBIE)
-		wake_up_process(child);
+		wake_up_task(child);
 }
 
 int ptrace_detach(struct task_struct *child, unsigned int data)
diff -r df7bc026d50e -r 4ea674e8825e kernel/rtmutex.c
--- a/kernel/rtmutex.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/rtmutex.c	Mon Jan 29 15:46:47 2007 -0800
@@ -516,7 +516,8 @@ static void wakeup_next_waiter(struct rt
 	}
 	spin_unlock_irqrestore(&pendowner->pi_lock, flags);
 
-	wake_up_process(pendowner);
+#warning "this looks like it needs expert attention"
+	wake_up_task(pendowner);
 }
 
 /*
@@ -640,7 +641,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 			/* Signal pending? */
 			if (signal_pending(current))
 				ret = -EINTR;
-			if (timeout && !timeout->task)
+			if (timeout && !timeout->wake_target)
 				ret = -ETIMEDOUT;
 			if (ret)
 				break;
diff -r df7bc026d50e -r 4ea674e8825e kernel/sched.c
--- a/kernel/sched.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/sched.c	Mon Jan 29 15:46:47 2007 -0800
@@ -1381,10 +1381,52 @@ static inline int wake_idle(int cpu, str
 }
 #endif
 
+/*
+ * This path wakes a fibril.
+ *
+ * In the common case, a task will be sleeping with multiple pending
+ * sleeping fibrils.  In that case we need to put the fibril on the task's
+ * runnable list and wake the task itself.  We need it to go back through
+ * the scheduler to find the runnable fibril so we set TIF_NEED_RESCHED.
+ *
+ * A derivative of that case is when the fibril that we're waking is already
+ * current on the sleeping task.  In that case we just need to wake the
+ * task itself, it will already be executing the fibril we're waking.  We
+ * do not put it on the runnable list in that case.
+ *
+ * XXX Obviously, there are lots of very scary races here.  We should get
+ * more confidence that they're taken care of.
+ */
+static int try_to_wake_up_fibril(struct task_struct *tsk, void *wake_target,
+				 unsigned int state)
+{
+	struct fibril *fibril = (struct fibril *)
+					((unsigned long)wake_target & ~1UL);
+	long old_state = fibril->state;
+	int ret = 1;
+
+	if (!(old_state & state))
+		goto out;
+
+	ret = 0;
+	fibril->state = TASK_RUNNING;
+
+	if (fibril->ti->task->fibril != fibril) {
+		BUG_ON(!list_empty(&fibril->run_list));
+		list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+		if (!tsk->array)
+			set_ti_thread_flag(task_thread_info(tsk),
+					   TIF_NEED_RESCHED);
+	}
+
+out:
+	return ret;
+}
+
 /***
  * try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
- * @state: the mask of task states that can be woken
+ * @wake_target: the to-be-woken-up sleeper, from task_wake_target()
+ * @state: the mask of states that can be woken
  * @sync: do a synchronous wakeup?
  *
  * Put it on the run-queue if it's not already there. The "current"
@@ -1395,9 +1437,10 @@ static inline int wake_idle(int cpu, str
  *
  * returns failure only if the task is already active.
  */
-static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
+static int try_to_wake_up(void *wake_target, unsigned int state, int sync)
 {
 	int cpu, this_cpu, success = 0;
+	struct task_struct *p = wake_target_to_task(wake_target);
 	unsigned long flags;
 	long old_state;
 	struct rq *rq;
@@ -1408,6 +1451,12 @@ static int try_to_wake_up(struct task_st
 #endif
 
 	rq = task_rq_lock(p, &flags);
+
+	/* See if we're just putting a fibril on its task's runnable list */
+	if (unlikely(((unsigned long)wake_target & 1) &&
+		     try_to_wake_up_fibril(p, wake_target, state)))
+		goto out;
+
 	old_state = p->state;
 	if (!(old_state & state))
 		goto out;
@@ -1555,16 +1604,27 @@ out:
 	return success;
 }
 
-int fastcall wake_up_process(struct task_struct *p)
-{
-	return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
+int fastcall wake_up_task(struct task_struct *task)
+{
+	return try_to_wake_up((void *)task, TASK_STOPPED | TASK_TRACED |
 				 TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
 }
-EXPORT_SYMBOL(wake_up_process);
-
+EXPORT_SYMBOL(wake_up_task);
+
+int fastcall wake_up_target(void *wake_target)
+{
+	return try_to_wake_up(wake_target, TASK_STOPPED | TASK_TRACED |
+				 TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+}
+EXPORT_SYMBOL(wake_up_target);
+
+/* 
+ * XXX We need to figure out how signal delivery will wake the fibrils in
+ * a task.
+ */
 int fastcall wake_up_state(struct task_struct *p, unsigned int state)
 {
-	return try_to_wake_up(p, state, 0);
+	return try_to_wake_up((void *)p, state, 0);
 }
 
 static void task_running_tick(struct rq *rq, struct task_struct *p);
@@ -2041,7 +2101,7 @@ static void sched_migrate_task(struct ta
 
 		get_task_struct(mt);
 		task_rq_unlock(rq, &flags);
-		wake_up_process(mt);
+		wake_up_task(mt);
 		put_task_struct(mt);
 		wait_for_completion(&req.done);
 
@@ -2673,7 +2733,7 @@ redo:
 			}
 			spin_unlock_irqrestore(&busiest->lock, flags);
 			if (active_balance)
-				wake_up_process(busiest->migration_thread);
+				wake_up_task(busiest->migration_thread);
 
 			/*
 			 * We've kicked active balancing, reset the failure
@@ -3781,6 +3841,33 @@ need_resched:
 
 #endif /* CONFIG_PREEMPT */
 
+/*
+ * This is a void * so that it's harder for people to stash it in a small 
+ * scalar without getting warnings.
+ */
+void *task_wake_target(struct task_struct *task)
+{
+	if (task->fibril) {
+		return (void *)((unsigned long)task->fibril | 1);
+	} else {
+		BUG_ON((unsigned long)task & 1);
+		return task;
+	}
+}
+EXPORT_SYMBOL(task_wake_target);
+
+struct task_struct *wake_target_to_task(void *wake_target)
+{
+	if ((unsigned long)wake_target & 1) {
+		struct fibril *fibril;
+		fibril = (struct fibril *) ((unsigned long)wake_target ^ 1);
+		return fibril->ti->task;
+	} else
+		return (struct task_struct *)((unsigned long)wake_target);
+}
+EXPORT_SYMBOL(wake_target_to_task);
+
+
 int default_wake_function(wait_queue_t *curr, unsigned mode, int sync,
 			  void *key)
 {
@@ -5140,7 +5227,7 @@ int set_cpus_allowed(struct task_struct 
 	if (migrate_task(p, any_online_cpu(new_mask), &req)) {
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, &flags);
-		wake_up_process(rq->migration_thread);
+		wake_up_task(rq->migration_thread);
 		wait_for_completion(&req.done);
 		tlb_migrate_finish(p->mm);
 		return 0;
@@ -5462,7 +5549,7 @@ migration_call(struct notifier_block *nf
 
 	case CPU_ONLINE:
 		/* Strictly unneccessary, as first user will wake it. */
-		wake_up_process(cpu_rq(cpu)->migration_thread);
+		wake_up_task(cpu_rq(cpu)->migration_thread);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
diff -r df7bc026d50e -r 4ea674e8825e kernel/signal.c
--- a/kernel/signal.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/signal.c	Mon Jan 29 15:46:47 2007 -0800
@@ -948,7 +948,7 @@ __group_complete_signal(int sig, struct 
 			signal_wake_up(t, 0);
 			t = next_thread(t);
 		} while (t != p);
-		wake_up_process(p->signal->group_exit_task);
+		wake_up_task(p->signal->group_exit_task);
 		return;
 	}
 
diff -r df7bc026d50e -r 4ea674e8825e kernel/softirq.c
--- a/kernel/softirq.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/softirq.c	Mon Jan 29 15:46:47 2007 -0800
@@ -58,7 +58,7 @@ static inline void wakeup_softirqd(void)
 	struct task_struct *tsk = __get_cpu_var(ksoftirqd);
 
 	if (tsk && tsk->state != TASK_RUNNING)
-		wake_up_process(tsk);
+		wake_up_task(tsk);
 }
 
 /*
@@ -583,7 +583,7 @@ static int __cpuinit cpu_callback(struct
   		per_cpu(ksoftirqd, hotcpu) = p;
  		break;
 	case CPU_ONLINE:
-		wake_up_process(per_cpu(ksoftirqd, hotcpu));
+		wake_up_task(per_cpu(ksoftirqd, hotcpu));
 		break;
 #ifdef CONFIG_HOTPLUG_CPU
 	case CPU_UP_CANCELED:
diff -r df7bc026d50e -r 4ea674e8825e kernel/stop_machine.c
--- a/kernel/stop_machine.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/stop_machine.c	Mon Jan 29 15:46:47 2007 -0800
@@ -185,7 +185,7 @@ struct task_struct *__stop_machine_run(i
 	p = kthread_create(do_stop, &smdata, "kstopmachine");
 	if (!IS_ERR(p)) {
 		kthread_bind(p, cpu);
-		wake_up_process(p);
+		wake_up_task(p);
 		wait_for_completion(&smdata.done);
 	}
 	up(&stopmachine_mutex);
diff -r df7bc026d50e -r 4ea674e8825e kernel/timer.c
--- a/kernel/timer.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/timer.c	Mon Jan 29 15:46:47 2007 -0800
@@ -1290,7 +1290,7 @@ asmlinkage long sys_getegid(void)
 
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((struct task_struct *)__data);
+	wake_up_task((struct task_struct *)__data);
 }
 
 /**
diff -r df7bc026d50e -r 4ea674e8825e kernel/workqueue.c
--- a/kernel/workqueue.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/workqueue.c	Mon Jan 29 15:46:47 2007 -0800
@@ -504,14 +504,14 @@ struct workqueue_struct *__create_workqu
 		if (!p)
 			destroy = 1;
 		else
-			wake_up_process(p);
+			wake_up_task(p);
 	} else {
 		list_add(&wq->list, &workqueues);
 		for_each_online_cpu(cpu) {
 			p = create_workqueue_thread(wq, cpu, freezeable);
 			if (p) {
 				kthread_bind(p, cpu);
-				wake_up_process(p);
+				wake_up_task(p);
 			} else
 				destroy = 1;
 		}
@@ -773,7 +773,7 @@ static int __devinit workqueue_cpu_callb
 
 			cwq = per_cpu_ptr(wq->cpu_wq, hotcpu);
 			kthread_bind(cwq->thread, hotcpu);
-			wake_up_process(cwq->thread);
+			wake_up_task(cwq->thread);
 		}
 		mutex_unlock(&workqueue_mutex);
 		break;
diff -r df7bc026d50e -r 4ea674e8825e lib/rwsem.c
--- a/lib/rwsem.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/lib/rwsem.c	Mon Jan 29 15:46:47 2007 -0800
@@ -30,7 +30,7 @@ EXPORT_SYMBOL(__init_rwsem);
 
 struct rwsem_waiter {
 	struct list_head list;
-	struct task_struct *task;
+	void *wake_target;
 	unsigned int flags;
 #define RWSEM_WAITING_FOR_READ	0x00000001
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
@@ -50,7 +50,7 @@ __rwsem_do_wake(struct rw_semaphore *sem
 __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
 {
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
+	void *wake_target;
 	struct list_head *next;
 	signed long oldcount, woken, loop;
 
@@ -75,16 +75,17 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
 		goto readers_only;
 
-	/* We must be careful not to touch 'waiter' after we set ->task = NULL.
-	 * It is an allocated on the waiter's stack and may become invalid at
-	 * any time after that point (due to a wakeup from another source).
+	/* We must be careful not to touch 'waiter' after we set ->wake_target
+	 * = NULL.  It is allocated on the waiter's stack and may become
+	 * invalid at any time after that point (due to a wakeup from another
+	 * source).
 	 */
 	list_del(&waiter->list);
-	tsk = waiter->task;
+	wake_target = waiter->wake_target;
 	smp_mb();
-	waiter->task = NULL;
-	wake_up_process(tsk);
-	put_task_struct(tsk);
+	waiter->wake_target = NULL;
+	wake_up_target(wake_target);
+	put_task_struct(wake_target_to_task(wake_target));
 	goto out;
 
 	/* don't want to wake any writers */
@@ -123,11 +124,11 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	for (; loop > 0; loop--) {
 		waiter = list_entry(next, struct rwsem_waiter, list);
 		next = waiter->list.next;
-		tsk = waiter->task;
+		wake_target = waiter->wake_target;
 		smp_mb();
-		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
+		waiter->wake_target = NULL;
+		wake_up_target(wake_target);
+		put_task_struct(wake_target_to_task(wake_target));
 	}
 
 	sem->wait_list.next = next;
@@ -157,7 +158,7 @@ rwsem_down_failed_common(struct rw_semap
 
 	/* set up my own style of waitqueue */
 	spin_lock_irq(&sem->wait_lock);
-	waiter->task = tsk;
+	waiter->wake_target = task_wake_target(tsk);
 	get_task_struct(tsk);
 
 	list_add_tail(&waiter->list, &sem->wait_list);
@@ -173,7 +174,7 @@ rwsem_down_failed_common(struct rw_semap
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->task)
+		if (!waiter->wake_target)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
diff -r df7bc026d50e -r 4ea674e8825e mm/pdflush.c
--- a/mm/pdflush.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/mm/pdflush.c	Mon Jan 29 15:46:47 2007 -0800
@@ -217,7 +217,7 @@ int pdflush_operation(void (*fn)(unsigne
 			last_empty_jifs = jiffies;
 		pdf->fn = fn;
 		pdf->arg0 = arg0;
-		wake_up_process(pdf->who);
+		wake_up_task(pdf->who);
 		spin_unlock_irqrestore(&pdflush_lock, flags);
 	}
 	return ret;
diff -r df7bc026d50e -r 4ea674e8825e net/core/pktgen.c
--- a/net/core/pktgen.c	Mon Jan 29 15:36:16 2007 -0800
+++ b/net/core/pktgen.c	Mon Jan 29 15:46:47 2007 -0800
@@ -3505,7 +3505,7 @@ static int __init pktgen_create_thread(i
 	pe->proc_fops = &pktgen_thread_fops;
 	pe->data = t;
 
-	wake_up_process(p);
+	wake_up_task(p);
 
 	return 0;
 }
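
The wake_target encoding introduced above can be tried out in userspace. What follows is a minimal sketch, not the patch's code: struct task and struct fibril are cut-down, hypothetical stand-ins for task_struct and the fibril (whose owner the patch reaches through fibril->ti->task), wake_target_is_fibril() is a helper added only for illustration, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct task;

struct fibril {
	struct task *owner;	/* the patch reaches this via fibril->ti->task */
};

struct task {
	struct fibril *fibril;	/* NULL when the task has no fibrils */
};

/* Low bit set: the target is a fibril.  Low bit clear: it is a task. */
static void *task_wake_target(struct task *task)
{
	if (task->fibril)
		return (void *)((uintptr_t)task->fibril | 1);
	/* Both structs are pointer-aligned, so bit 0 is free for the tag. */
	assert(((uintptr_t)task & 1) == 0);
	return task;
}

/* Illustration-only helper: test the tag without dereferencing anything. */
static int wake_target_is_fibril(void *wake_target)
{
	return (int)((uintptr_t)wake_target & 1);
}

/* Strip the tag if present and find the task that owns the target. */
static struct task *wake_target_to_task(void *wake_target)
{
	if (wake_target_is_fibril(wake_target)) {
		struct fibril *fibril =
			(struct fibril *)((uintptr_t)wake_target & ~(uintptr_t)1);
		return fibril->owner;
	}
	return wake_target;
}
```

Because the tag lives in bit 0, a waiter can store a single void * and the waker does not need to know whether it is waking a whole task or one fibril. The cost is that every consumer has to strip the tag before dereferencing.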


* [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
                   ` (2 preceding siblings ...)
  2007-01-30 20:39 ` [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct Zach Brown
@ 2007-01-30 20:39 ` Zach Brown
  2007-01-31  8:58   ` Andi Kleen
                     ` (2 more replies)
  2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 20:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

This finally does something useful with the notion of being able to schedule
stacks as fibrils under a task_struct.  Again, i386-specific and in need of
proper layering with archs.

sys_asys_submit() is added to let userspace submit asynchronous system calls.
It specifies the system call number and arguments.  A fibril is constructed for
each call.  Each starts with a stack which executes the given system call
handler and then returns to a function which records the return code of the
system call handler.  sys_asys_await_completion() then lets userspace collect
these results.

sys_asys_submit() is careful to construct a fibril for the submission syscall
itself so that it can return to userspace if the calls it is dispatching block.
If none of them block, however, they will have all been run hot in this
submitting task on this processor.

It allocates and runs each system call in turn.  It could certainly work in
batches to decrease locking overhead at the cost of increased peak memory
overhead for calls which don't end up blocking.

The complexity of a fully-formed submission and completion interface hasn't
been addressed.  Details like targeting explicit completion contexts, batching,
timeouts, signal delivery, and syscall-free submission and completion (now with
more rings!) can all be hashed out in some giant thread, no doubt.  I didn't
want them to cloud the basic mechanics being presented here.
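
To make the data flow concrete, here is a userspace model of the submit and collect steps described above. It is a sketch under loud assumptions: function pointers stand in for sys_call_table entries, a fixed-size FIFO stands in for the per-task completion list, every model_* name is invented for illustration, and only the case where no call blocks is modelled, so nothing here switches stacks.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ARGS	6	/* assumed cap; the patch has no such limit yet */
#define MODEL_MAX_DONE	16	/* assumed size of the completion FIFO */

/* Stand-in for a system call handler; the patch indexes sys_call_table. */
typedef long (*model_handler_t)(const unsigned long *args);

/* Mirrors struct asys_input, with a handler instead of syscall_nr. */
struct model_input {
	model_handler_t	handler;
	unsigned long	cookie;
	unsigned long	nr_args;
	const unsigned long *args;
};

/* Mirrors struct asys_completion. */
struct model_completion {
	long		return_code;
	unsigned long	cookie;
};

/* Stand-in for the per-task asys_completed list: a small FIFO. */
struct model_context {
	struct model_completion	done[MODEL_MAX_DONE];
	size_t			head;
	size_t			count;
};

/*
 * Run each call in turn and queue its completion.  Like the patch, return
 * the number of calls accepted, or an error if the very first one failed.
 */
static long model_submit(struct model_context *ctx,
			 const struct model_input *inp, unsigned long nr_inp)
{
	unsigned long i;
	long err = 0;

	for (i = 0; i < nr_inp; i++) {
		struct model_completion *comp;

		assert(inp[i].handler != NULL);
		if (inp[i].nr_args > MODEL_MAX_ARGS ||
		    ctx->count == MODEL_MAX_DONE) {
			err = -1;	/* the patch returns -EFAULT or -ENOMEM */
			break;
		}
		comp = &ctx->done[(ctx->head + ctx->count) % MODEL_MAX_DONE];
		comp->cookie = inp[i].cookie;
		comp->return_code = inp[i].handler(inp[i].args);
		ctx->count++;
	}
	return i ? (long)i : err;
}

/* Pop the oldest completion; returns 1 if one was copied out, else 0. */
static long model_await_completion(struct model_context *ctx,
				   struct model_completion *comp)
{
	if (ctx->count == 0)
		return 0;	/* the real call would sleep here instead */
	*comp = ctx->done[ctx->head];
	ctx->head = (ctx->head + 1) % MODEL_MAX_DONE;
	ctx->count--;
	return 1;
}

/* Example handler: adds its two arguments. */
static long model_add(const unsigned long *args)
{
	return (long)(args[0] + args[1]);
}
```

The cookie is the only thing that ties a completion back to its submission, so a caller would normally store a pointer or an index to its own request state there.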

diff -r 4ea674e8825e -r 5bdda0f7bef2 arch/i386/kernel/syscall_table.S
--- a/arch/i386/kernel/syscall_table.S	Mon Jan 29 15:46:47 2007 -0800
+++ b/arch/i386/kernel/syscall_table.S	Mon Jan 29 15:50:10 2007 -0800
@@ -319,3 +319,5 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_asys_submit		/* 320 */
+	.long sys_asys_await_completion
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h	Mon Jan 29 15:46:47 2007 -0800
+++ b/include/asm-i386/unistd.h	Mon Jan 29 15:50:10 2007 -0800
@@ -325,6 +325,8 @@
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_asys_submit	320
+#define __NR_asys_await_completion	321
 
 #ifdef __KERNEL__
 
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/init_task.h
--- a/include/linux/init_task.h	Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/init_task.h	Mon Jan 29 15:50:10 2007 -0800
@@ -148,6 +148,8 @@ extern struct group_info init_groups;
 	.pi_lock	= SPIN_LOCK_UNLOCKED,				\
 	INIT_TRACE_IRQFLAGS						\
 	INIT_LOCKDEP							\
+	.asys_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.asys_wait),	\
+	.asys_completed = LIST_HEAD_INIT(tsk.asys_completed),		\
 }
 
 
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/sched.h
--- a/include/linux/sched.h	Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/sched.h	Mon Jan 29 15:50:10 2007 -0800
@@ -1019,6 +1019,14 @@ struct task_struct {
 
 	/* Protection of the PI data structures: */
 	spinlock_t pi_lock;
+
+	/*
+	 * XXX This is just a dummy that should be in a separately managed
+	 * context.  An explicit context lets asys calls be nested (!) and
+	 * will let us provide the sys_io_*() API on top of asys.
+	 */
+	struct list_head asys_completed;
+	wait_queue_head_t	asys_wait;
 
 #ifdef CONFIG_RT_MUTEXES
 	/* PI waiters blocked on a rt_mutex held by this task */
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/Makefile
--- a/kernel/Makefile	Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/Makefile	Mon Jan 29 15:50:10 2007 -0800
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
-	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o asys.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/exit.c
--- a/kernel/exit.c	Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/exit.c	Mon Jan 29 15:50:10 2007 -0800
@@ -42,6 +42,7 @@
 #include <linux/audit.h> /* for audit_free() */
 #include <linux/resource.h>
 #include <linux/blkdev.h>
+#include <linux/asys.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -926,6 +927,8 @@ fastcall NORET_TYPE void do_exit(long co
 	taskstats_exit(tsk, group_dead);
 
 	exit_mm(tsk);
+
+	asys_task_exiting(tsk);
 
 	if (group_dead)
 		acct_process();
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/fork.c
--- a/kernel/fork.c	Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/fork.c	Mon Jan 29 15:50:10 2007 -0800
@@ -49,6 +49,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/asys.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -987,6 +988,8 @@ static struct task_struct *copy_process(
 		goto fork_out;
 
 	rt_mutex_init_task(p);
+
+	asys_init_task(p);
 
 #ifdef CONFIG_TRACE_IRQFLAGS
 	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/asys.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/asys.h	Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,7 @@
+#ifndef _LINUX_ASYS_H 
+#define _LINUX_ASYS_H 
+
+void asys_task_exiting(struct task_struct *tsk);
+void asys_init_task(struct task_struct *tsk);
+
+#endif
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/asys.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/kernel/asys.c	Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,252 @@
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/asys.h>
+
+/* XXX */
+#include <asm/processor.h>
+
+/*
+ * system call and argument specification given to _submit from userspace
+ */
+struct asys_input {
+	int 		syscall_nr;
+	unsigned long	cookie;
+	unsigned long	nr_args;
+	unsigned long	*args;
+};
+
+/*
+ * system call completion event given to userspace
+ * XXX: compat
+ */
+struct asys_completion {
+	long 		return_code;
+	unsigned long	cookie;
+};
+
+/*
+ * This record of a completed async system call is kept around until it
+ * is collected by userspace.
+ */
+struct asys_result {
+	struct list_head	item;
+	struct asys_completion	comp;
+};
+
+/*
+ * This stack is built-up and handed to the scheduler to first process
+ * the system call.  It stores the progress of the call until the call returns
+ * and this structure is freed.
+ */
+struct asys_call {
+	struct asys_result	*result;
+	struct fibril		fibril;
+};
+
+void asys_init_task(struct task_struct *tsk)
+{
+	INIT_LIST_HEAD(&tsk->asys_completed);
+	init_waitqueue_head(&tsk->asys_wait);
+}
+
+void asys_task_exiting(struct task_struct *tsk)
+{
+	struct asys_result *res, *next;
+
+	list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
+		kfree(res);
+
+	/* 
+	 * XXX this only works if tsk->fibril was allocated by
+	 * sys_asys_submit(), not if it's embedded in an asys_call.  This
+	 * implies that we must forbid sys_exit in asys_submit.
+	 */
+	if (tsk->fibril) {
+		BUG_ON(!list_empty(&tsk->fibril->run_list));
+		kfree(tsk->fibril);
+		tsk->fibril = NULL;
+	}
+}
+
+/*
+ * Initial asys call stacks are constructed such that this is called when
+ * the system call handler returns.  It records the return code from
+ * the handler in a completion event and frees data associated with the
+ * completed asys call.
+ *
+ * XXX we know that the x86 syscall handlers put their return code in eax and
+ * that regparm(3) here will take our rc argument from eax.
+ */
+static void fastcall NORET_TYPE asys_teardown_stack(long rc)
+{
+	struct asys_result *res;
+	struct asys_call *call;
+	struct fibril *fibril;
+
+	fibril = current->fibril;
+	call = container_of(fibril, struct asys_call, fibril);
+	res = call->result;
+	call->result = NULL;
+
+	res->comp.return_code = rc;
+	list_add_tail(&res->item, &current->asys_completed);
+	wake_up(&current->asys_wait);
+
+	/*
+	 * We embedded the fibril in the call so that we could dereference it
+	 * here without adding some tracking to the fibril.  We then free the
+	 * call and fibril because we're done with them.
+	 *
+	 * The ti itself, though, is still in use.  It will only be freed once
+	 * the scheduler switches away from it to another fibril.  It does
+	 * that when it sees current->fibril assigned to NULL.
+	 */
+	current->fibril = NULL;
+	BUG_ON(!list_empty(&fibril->run_list));
+	kfree(call);
+
+	/* 
+	 * XXX This is sloppy.  We "know" this is likely for now as the task
+	 * with fibrils is only going to be in sys_asys_submit() or
+	 * sys_asys_await_completion()
+	 */
+	BUG_ON(list_empty(&current->runnable_fibrils));
+
+	schedule();
+	BUG();
+}
+
+asmlinkage long sys_asys_await_completion(struct asys_completion __user *comp)
+{
+	struct asys_result *res;
+	long ret;
+
+	ret = wait_event_interruptible(current->asys_wait,
+				       !list_empty(&current->asys_completed));
+	if (ret)
+		goto out;
+
+	res = list_entry(current->asys_completed.next, struct asys_result,
+			  item);
+
+	/* XXX compat */
+	ret = copy_to_user(comp, &res->comp, sizeof(struct asys_completion));
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	list_del(&res->item);
+	kfree(res);
+	ret = 1;
+
+out:
+	return ret;
+}
+
+/*
+ * This initializes a newly allocated fibril so that it can be handed to the
+ * scheduler.  The fibril is private to this code path at this point.
+ *
+ * XXX
+ *  - this is arch specific
+ *  - should maybe have a sched helper that uses INIT_PER_CALL_CHAIN
+ */
+extern unsigned long sys_call_table[]; /* XXX */
+static int asys_init_fibril(struct fibril *fibril, struct thread_info *ti, 
+			    struct asys_input *inp)
+{
+	unsigned long *stack_bottom;
+
+	INIT_LIST_HEAD(&fibril->run_list);
+	fibril->ti = ti;
+
+	/* XXX sanity check syscall_nr */
+	fibril->eip = sys_call_table[inp->syscall_nr];
+	/* this mirrors copy_thread()'s use of task_pt_regs() */
+	fibril->esp = (unsigned long)thread_info_pt_regs(ti) -
+			((inp->nr_args + 1) * sizeof(long));
+
+	/* 
+	 * now setup the stack so that our syscall handler gets its arguments
+	 * and we return to asys_teardown_stack.
+	 */
+	stack_bottom = (unsigned long *)fibril->esp;
+	stack_bottom[0] = (unsigned long)asys_teardown_stack;
+	/* XXX compat */
+	if (copy_from_user(&stack_bottom[1], inp->args,
+			   inp->nr_args * sizeof(long)))
+		return -EFAULT;
+
+	return 0;
+}
+
+asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
+				unsigned long nr_inp)
+{
+	struct asys_input inp;
+	struct asys_result *res;
+	struct asys_call *call;
+	struct thread_info *ti;
+	unsigned long i;
+	long err = 0;
+
+	/* Allocate a fibril for the submitter's thread_info */
+	if (current->fibril == NULL) {
+		current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
+		if (current->fibril == NULL)
+			return -ENOMEM;
+
+		INIT_LIST_HEAD(&current->fibril->run_list);
+		current->fibril->state = TASK_RUNNING;
+		current->fibril->ti = current_thread_info();
+	}
+
+	for (i = 0; i < nr_inp; i++) {
+
+		if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
+			err = -EFAULT;
+			break;
+		}
+
+		res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
+		if (res == NULL) {
+			err = -ENOMEM;
+			break;
+		}
+
+		/* XXX kzalloc to init call.fibril.per_cpu, add helper */
+		call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
+		if (call == NULL) {
+			kfree(res);
+			err = -ENOMEM;
+			break;
+		}
+
+		ti = alloc_thread_info(current);
+		if (ti == NULL) {
+			kfree(res);
+			kfree(call);
+			err = -ENOMEM;
+			break;
+		}
+
+		err = asys_init_fibril(&call->fibril, ti, &inp);
+		if (err) {
+			kfree(res);
+			kfree(call);
+			free_thread_info(ti);
+			break;
+		}
+
+		res->comp.cookie = inp.cookie;
+		call->result = res;
+		ti->task = current;
+
+		sched_new_runnable_fibril(&call->fibril);
+		schedule();
+	}
+
+	return i ? i : err;
+}
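
The initial stack that asys_init_fibril() builds can be modelled on a plain array. This is a sketch, assuming the i386 asmlinkage convention the patch relies on (return address at the stack pointer, arguments in the words above it); model_build_frame() and its bounds check are hypothetical and are not part of the patch, which does not yet validate nr_args.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build the first frame of a new stack just below stack_top.  Returns the
 * initial stack pointer, or NULL if the frame does not fit above stack_base.
 */
static unsigned long *model_build_frame(unsigned long *stack_base,
					unsigned long *stack_top,
					unsigned long return_to,
					const unsigned long *args,
					size_t nr_args)
{
	size_t room;
	unsigned long *sp;

	assert(stack_top >= stack_base);
	room = (size_t)(stack_top - stack_base);

	if (nr_args >= room)	/* the frame needs nr_args + 1 words */
		return NULL;

	sp = stack_top - (nr_args + 1);
	sp[0] = return_to;	/* the handler's return lands here */
	memcpy(&sp[1], args, nr_args * sizeof(*args));	/* args follow it */
	return sp;
}
```

In the patch the role of return_to is played by asys_teardown_stack(), so a handler that returns normally falls straight into the code that records its return code.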


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
                   ` (3 preceding siblings ...)
  2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
@ 2007-01-30 21:58 ` Linus Torvalds
  2007-01-30 22:23   ` Linus Torvalds
  2007-01-30 22:40   ` Zach Brown
  2007-01-31  2:04 ` Benjamin Herrenschmidt
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-01-30 21:58 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Tue, 30 Jan 2007, Zach Brown wrote:
>
> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Yee-haa!

I looked at this approach a long time ago, and basically gave up because 
it looked like too much work.

I heartily approve, although I only gave the actual patches a very cursory 
glance. I think the approach is the proper one, but the devil is in the 
details. It might be that the stack allocation overhead or some other 
subtle fundamental problem ends up making this impractical in the end, but 
I would _really_ like for this to basically go in.

One of the biggest issues I see is signalling. You mention it, and I 
think that's going to be one of the big issues.

It won't matter at all for a certain class of calls (a lot of the people 
who want to do AIO really end up doing non-interruptible things, and 
signalling is a non-issue), but not only is it going to matter for some 
others, we will almost certainly want to have a way to not just signal a 
task, but a single "fibril" (and let me say that I'm not convinced about 
your naming, but I don't hate it either ;)

But from a quick overview of the patches, I really don't see anything 
fundamentally wrong. It needs some error checking and some limiting (I 
_really_ don't think we want a regular user starting a thousand fibrils 
concurrently), but it actually looks much less invasive than I thought it 
would be.

So yay! Consider me at least preliminarily very happy.

		Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
@ 2007-01-30 22:23   ` Linus Torvalds
  2007-01-30 22:53     ` Zach Brown
  2007-01-30 22:40   ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-01-30 22:23 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Tue, 30 Jan 2007, Linus Torvalds wrote:
> 
> But from a quick overview of the patches, I really don't see anything 
> fundamentally wrong. It needs some error checking and some limiting (I 
> _really_ don't think we want a regular user starting a thousand fibrils 
> concurrently), but it actually looks much less invasive than I thought it 
> would be.

Side note (and maybe this was obvious to people already): I would suggest 
that the "limiting" not be done the way fork() is limited ("return EAGAIN 
if you go over a limit") but be done as a per-task counting semaphore 
(down() on submit, up() on fibril exit).

So we should limit these to basically have some maximum concurrency 
factor, but rather than consider it an error to go over it, we'd just cap 
the concurrency by default, so that people can freely use asynchronous 
interfaces without having to always worry about what happens if their 
resources run out..

However, that also implies that we should probably add a "flags" parameter 
to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or 
something to tell the kernel to rather return EAGAIN than wait. Sometimes 
you don't want to block just because you already have too much work.

Hmm?

		Linus
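
The counting-semaphore limit suggested here could look roughly like the following userspace sketch. It uses POSIX semaphores in place of the kernel's down()/up(), and every model_* name, including the MODEL_FIBRIL_IMMEDIATE flag value, is an assumption made for illustration; no such interface exists in the posted patches.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>

#define MODEL_FIBRIL_IMMEDIATE	0x1	/* hypothetical flag value */

struct model_limiter {
	sem_t slots;	/* counts free submission slots */
};

/* Allow at most max calls in flight at once. */
static int model_limiter_init(struct model_limiter *lim, unsigned int max)
{
	return sem_init(&lim->slots, 0, max);
}

/* down() on submit: wait for a slot unless the caller asked not to. */
static int model_submit_begin(struct model_limiter *lim, int flags)
{
	if (flags & MODEL_FIBRIL_IMMEDIATE) {
		/* sem_trywait() fails with EAGAIN when no slot is free. */
		if (sem_trywait(&lim->slots) != 0)
			return -errno;
		return 0;
	}
	/* Sleep for a slot, retrying if a signal interrupts the wait. */
	while (sem_wait(&lim->slots) != 0) {
		if (errno != EINTR)
			return -errno;
	}
	return 0;
}

/* up() on fibril exit: give the slot back. */
static void model_fibril_exit(struct model_limiter *lim)
{
	int rc = sem_post(&lim->slots);

	assert(rc == 0);
	(void)rc;
}
```

With this shape the default path simply throttles the submitter, and only callers that pass the immediate flag ever have to handle -EAGAIN.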


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
  2007-01-30 22:23   ` Linus Torvalds
@ 2007-01-30 22:40   ` Zach Brown
  2007-01-30 22:53     ` Linus Torvalds
  1 sibling, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-30 22:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> I looked at this approach a long time ago, and basically gave up  
> because
> it looked like too much work.

Indeed, your mention of it in that thread.. a year ago?.. is what got  
this notion sitting in the back of my head.  I didn't like it at  
first, but it grew on me.

> I heartily approve, although I only gave the actual patches a very  
> cursory
> glance. I think the approach is the proper one, but the devil is in  
> the
> details. It might be that the stack allocation overhead or some other
> subtle fundamental problem ends up making this impractical in the  
> end, but
> I would _really_ like for this to basically go in.

As for efficiency and overhead, I hope to get some time with the team
that works on the Giant Database Software Whose Name We Shall Not
Speak.  That'll give us some non-trivial loads to profile.

> It won't matter at all for a certain class of calls (a lot of the  
> people
> who want to do AIO really end up doing non-interruptible things, and
> signalling is a non-issue), but not only is it going to matter for  
> some
> others, we will almost certainly want to have a way to not just  
> signal a
> task, but a single "fibril" (and let me say that I'm not convinced  
> about
> your naming, but I don't hate it either ;)

Yeah, no doubt.  I'm wildly open to discussion here.  (and yeah, me  
either, but I don't care much about the name.  I got tired of  
qualifying overloaded uses of 'stack' or 'thread', that's all :)).

> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand  
> fibrils
> concurrently), but it actually looks much less invasive than I  
> thought it
> would be.

I think we'll also want to flesh out the submission and completion  
interface so that we don't find ourselves frustrated with it in  
another 5 years.  What's there now is just scaffolding to support the  
interesting kernel-internal part.  No doubt the kevent thread will  
come into play here.

- z


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 22:23   ` Linus Torvalds
@ 2007-01-30 22:53     ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-30 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> So we should limit these to basically have some maximum concurrency
> factor, but rather than consider it an error to go over it, we'd  
> just cap
> the concurrency by default, so that people can freely use asynchronous
> interfaces without having to always worry about what happens if their
> resources run out..

Yeah, call it the socket transmit queue model :).  Maybe tuned by a  
ulimit?

I don't have very strong opinions about the specific mechanics of
limiting concurrent submissions, as long as they're there.  Some  
folks in Oracle complain about having one more thing to have to tune,  
but the alternative seems worse.

> However, that also implies that we should probably add a "flags"  
> parameter
> to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
> something to tell the kernel to rather return EAGAIN than wait.  
> Sometimes
> you don't want to block just because you already have too much work.

EAGAIN or the initial number of submissions completed before the one  
that ran over the limit, perhaps.  Sure.  Nothing too controversial  
here :).   I have this kind of stuff queued up for worrying about  
once the internal mechanics are stronger.

- z


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 22:40   ` Zach Brown
@ 2007-01-30 22:53     ` Linus Torvalds
  2007-01-30 23:45       ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-01-30 22:53 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar



On Tue, 30 Jan 2007, Zach Brown wrote:
>
> I think we'll also want to flesh out the submission and completion interface
> so that we don't find ourselves frustrated with it in another 5 years.  What's
> there now is just scaffolding to support the interesting kernel-internal part.
> No doubt the kevent thread will come into play here.

Actually, the thing I like about kernel micro-threads (and that "fibril" 
name is starting to grow on me) is that I'm hoping we might be able to use 
them for that kevent thing too. And perhaps some other issues (ACPI has 
some "events" that might work with synchronously scheduled threads).

IOW, synchronous threading does have its advantages..

Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just 
because he's basically the normal scheduler maintainer, but also because 
he's historically been involved in things like the async filename lookup 
that the in-kernel web server thing used. EXACTLY the kinds of things 
where fibrils actually give you *much* nicer interfaces.

Ingo - see linux-kernel for the announcement and WIP patches.

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 22:53     ` Linus Torvalds
@ 2007-01-30 23:45       ` Zach Brown
  2007-01-31  2:07         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-30 23:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar

> Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
> because he's basically the normal scheduler maintainer, but also because
> he's historically been involved in things like the async filename lookup
> that the in-kernel web server thing used.

Yeah, that was dumb.  I had him in the cc: initially, then thought it  
was too large and lobbed a bunch off.  My mistake.

Ingo, I'm interested in your reaction to the i386-specific mechanics  
here (the thread_info copies terrify me) and the general notion of  
how to tie this cleanly into the scheduler.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
                   ` (4 preceding siblings ...)
  2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
@ 2007-01-31  2:04 ` Benjamin Herrenschmidt
  2007-01-31  2:46   ` Linus Torvalds
  2007-01-31 17:38   ` Zach Brown
  2007-02-04  5:13 ` Davide Libenzi
  2007-02-09 22:33 ` Linus Torvalds
  7 siblings, 2 replies; 151+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-31  2:04 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

> - We would now have some measure of task_struct concurrency.  Read that twice,
> it's scary.  As two fibrils execute and block in turn they'll each be
> referencing current->.  It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away.  The current implementation
> *does not* let preemption trigger a fibril switch.  So one only has to worry
> about racing with voluntary scheduling of the fibril paths.  This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril.  I think this is a
> manageable burden.

That's the one scaring me in fact ... Maybe it will end up being an easy
one but I don't feel too comfortable... we didn't create fibril-like
things for threads, instead, we share PIDs between tasks. I wonder if
the sane approach would be to actually create task structs (or have a
pool of them pre-created sitting there for performances) and add a way
to share the necessary bits so that syscalls can be run on those
spin-offs.

Ben.



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 23:45       ` Zach Brown
@ 2007-01-31  2:07         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 151+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-31  2:07 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Tue, 2007-01-30 at 15:45 -0800, Zach Brown wrote:
> > Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
> > because he's basically the normal scheduler maintainer, but also because
> > he's historically been involved in things like the async filename lookup
> > that the in-kernel web server thing used.
> 
> Yeah, that was dumb.  I had him in the cc: initially, then thought it  
> was too large and lobbed a bunch off.  My mistake.
> 
> Ingo, I'm interested in your reaction to the i386-specific mechanics  
> here (the thread_info copies terrify me) and the general notion of  
> how to tie this cleanly into the scheduler.

Thread info copies aren't such a big deal, we do that for irq stacks
already afaik

Ben.



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:04 ` Benjamin Herrenschmidt
@ 2007-01-31  2:46   ` Linus Torvalds
  2007-01-31  3:02     ` Linus Torvalds
                       ` (3 more replies)
  2007-01-31 17:38   ` Zach Brown
  1 sibling, 4 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-01-31  2:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:

> > - We would now have some measure of task_struct concurrency.  Read that twice,
> > it's scary.  As two fibrils execute and block in turn they'll each be
> > referencing current->.  It means that we need to audit task_struct to make sure
> > that paths can handle racing as its scheduled away.  The current implementation
> > *does not* let preemption trigger a fibril switch.  So one only has to worry
> > about racing with voluntary scheduling of the fibril paths.  This can mean
> > moving some task_struct members under an accessor that hides them in a struct
> > in task_struct so they're switched along with the fibril.  I think this is a
> > manageable burden.
> 
> That's the one scaring me in fact ... Maybe it will end up being an easy
> one but I don't feel too comfortable...

We actually have almost zero "interesting" data in the task-struct.

All the real meat of a task has long since been split up into structures 
that can be shared for threading anyway (ie signal/files/mm/etc).

Which is why I'm personally very comfy with just re-using task_struct 
as-is.

NOTE! This is with the understanding that we *never* do any preemption. 
The whole point of the microthreading as far as I'm concerned is exactly 
that it is cooperative. It's not preemptive, and it's emphatically *not* 
concurrent (ie you'd never have two fibrils running at the same time on 
separate CPU's).

If you want preemptive or concurrent CPU usage, you use separate threads. 
The point of AIO scheduling is very much inherent in its name: it's for 
filling up CPU's when there's IO.

So the theory (and largely practice) is that you want to use real threads 
to fill your CPU's, but then *within* those threads you use AIO to make 
sure that each thread actually uses the CPU efficiently and doesn't just 
block with nothing to do.

So with the understanding that this is neither CPU-concurrent nor 
preemptive (*within* a fibril group - obviously the thread itself gets 
both preempted and concurrently run with other threads), I don't worry at 
all about sharing "struct task_struct".
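
The cooperative property described above can be modelled in user space. The sketch below is an illustration only and assumes glibc's <ucontext.h>; it is not how the patch series switches stacks, and every name in it (`run_fibrils()`, `fibril_yield()`, `shared_counter`) is made up for the example. Two contexts share one counter, standing in for task_struct, and update it without locking. That is safe only because a switch can happen at the explicit yield and nowhere else.

```c
#include <ucontext.h>

#define NFIBRILS   2
#define STACK_SIZE (64 * 1024)

/* State shared by every fibril; stands in for task_struct. */
static int shared_counter;

static ucontext_t main_ctx;
static ucontext_t fibril_ctx[NFIBRILS];
static char fibril_stack[NFIBRILS][STACK_SIZE];
static int fibril_done[NFIBRILS];
static int current_fibril;

/* The only place a switch can happen: an explicit, voluntary yield. */
static void fibril_yield(void)
{
	swapcontext(&fibril_ctx[current_fibril], &main_ctx);
}

static void fibril_body(void)
{
	int id = current_fibril;

	for (int i = 0; i < 3; i++) {
		/*
		 * Unlocked read-modify-write.  No other fibril can run
		 * between the load and the store, so nothing is lost.
		 */
		int v = shared_counter;
		shared_counter = v + 1;
		fibril_yield();		/* stands in for blocking on IO */
	}
	fibril_done[id] = 1;
	/* Returning resumes main_ctx through uc_link. */
}

/* Runs all fibrils round-robin to completion; returns the counter. */
int run_fibrils(void)
{
	int remaining = NFIBRILS;

	shared_counter = 0;
	for (int i = 0; i < NFIBRILS; i++) {
		fibril_done[i] = 0;
		getcontext(&fibril_ctx[i]);
		fibril_ctx[i].uc_stack.ss_sp = fibril_stack[i];
		fibril_ctx[i].uc_stack.ss_size = STACK_SIZE;
		fibril_ctx[i].uc_link = &main_ctx;
		makecontext(&fibril_ctx[i], fibril_body, 0);
	}

	while (remaining > 0) {
		for (int i = 0; i < NFIBRILS; i++) {
			if (fibril_done[i])
				continue;
			current_fibril = i;
			swapcontext(&main_ctx, &fibril_ctx[i]);
			if (fibril_done[i])
				remaining--;
		}
	}
	return shared_counter;
}
```

If the switch could instead happen between the load and the store, as it can with preemption or with two CPUs, the same code would lose updates. That is the distinction the paragraph above draws.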

Does that mean that we might not have some cases where we'd need to make 
sure we do things differently? Of course not. Something might show up. But 
this actually makes it very clear what the difference between "struct 
thread_struct" and "struct task_struct" is. One is shared between 
fibrils, the other isn't.

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:46   ` Linus Torvalds
@ 2007-01-31  3:02     ` Linus Torvalds
  2007-01-31 10:50       ` Xavier Bestel
  2007-01-31 17:59       ` Zach Brown
  2007-01-31  5:16     ` Benjamin Herrenschmidt
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-01-31  3:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Tue, 30 Jan 2007, Linus Torvalds wrote:
> 
> Does that mean that we might not have some cases where we'd need to make 
> sure we do things differently? Of course not. Something might show up. But 
> this actually makes it very clear what the difference between "struct 
> thread_struct" and "struct task_struct" is. One is shared between 
> fibrils, the other isn't.

Btw, this is also something where we should just disallow certain system 
calls from being done through the asynchronous method. 

Notably, clone/fork(), execve() and exit() are all things that we probably 
simply shouldn't allow as "AIO" events.
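
A deny-list check of this kind is easy to picture. The following is a minimal sketch, assuming a Linux <sys/syscall.h>; `asys_allowed()` and `asys_denied[]` are hypothetical names, and the list holds only the calls named above (vfork is added as a fork variant, which is an assumption).

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Calls that operate on the shared task itself (hypothetical list). */
static const long asys_denied[] = {
	SYS_clone,
	SYS_execve,
	SYS_exit,
#ifdef SYS_fork		/* not defined on every architecture */
	SYS_fork,
#endif
#ifdef SYS_vfork
	SYS_vfork,
#endif
};

/* Returns true if system call number nr may be submitted asynchronously. */
bool asys_allowed(long nr)
{
	for (size_t i = 0; i < sizeof(asys_denied) / sizeof(asys_denied[0]); i++) {
		if (asys_denied[i] == nr)
			return false;
	}
	return true;
}
```

A submission path would call this before building a fibril and fail the request with an error for anything on the list.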

The process handling ones are obvious: they are very much about the shared 
"struct task_struct", so they rather clearly should only be done "natively".

More interesting is the question about "close()", though. Currently we 
have an optimization (fget/fput_light) that basically boils down to "we 
know we are the only owners". That optimization becomes more "interesting" 
with AIO - we need to disable it when fibrils are active (because other 
fibrils or the main thread can do it), but we can still keep it for the 
non-fibril case.

So we can certainly allow close() to happen in a fibril, but at the same 
time, this is an area where just the *existence* of fibrils means that 
certain other decisions that were thread-related may be modified to be 
aware of the micro-threads too.

I suspect there are rather few of those, though. The only one I can think 
of is literally the fget/fput_light() case, but there could be others.
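
To illustrate the fget/fput_light point, here is a simplified user-space model. It is a sketch under stated assumptions, not the kernel's code: the struct layouts, the `active_fibrils` field and the `_sketch` function names are invented, and plain integers replace the atomic counters the kernel would use. The fast path skips the reference count only when the file table has a single user and no fibrils are active.

```c
#include <stdbool.h>

/* Simplified stand-ins for struct file and struct files_struct. */
struct file_sketch {
	int refcount;
};

struct files_sketch {
	int users;		/* tasks sharing this file table */
	int active_fibrils;	/* hypothetical: fibrils running under the task */
};

/*
 * Look up a file for the duration of one system call.  Sets *need_put
 * when a reference was taken and must be dropped afterwards.
 */
struct file_sketch *fget_light_sketch(struct files_sketch *files,
				      struct file_sketch *f, bool *need_put)
{
	/* Sole owner and no fibrils: nobody can close the file under us. */
	if (files->users == 1 && files->active_fibrils == 0) {
		*need_put = false;
		return f;
	}

	/* Otherwise another fibril or thread might close(); pin the file. */
	f->refcount++;
	*need_put = true;
	return f;
}

/* Drop the reference only if fget_light_sketch() actually took one. */
void fput_light_sketch(struct file_sketch *f, bool need_put)
{
	if (need_put)
		f->refcount--;
}
```

The non-fibril case keeps its cheap path; only tasks that actually have fibrils in flight pay for the reference count.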

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:46   ` Linus Torvalds
  2007-01-31  3:02     ` Linus Torvalds
@ 2007-01-31  5:16     ` Benjamin Herrenschmidt
  2007-01-31  5:36     ` Nick Piggin
  2007-01-31 17:47     ` Zach Brown
  3 siblings, 0 replies; 151+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-31  5:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar


> NOTE! This is with the understanding that we *never* do any preemption. 
> The whole point of the microthreading as far as I'm concerned is exactly 
> that it is cooperative. It's not preemptive, and it's emphatically *not* 
> concurrent (ie you'd never have two fibrils running at the same time on 
> separate CPU's).

That makes it indeed much less worrisome...

> If you want preemptive or concurrent CPU usage, you use separate threads. 
> The point of AIO scheduling is very much inherent in its name: it's for 
> filling up CPU's when there's IO.

Ok, I see, that's in fact pretty similar to some task switching hack I
did about 10 years ago on MacOS to have "asynchronous" IO code be
implemented linearly :-)

Makes lots of sense imho.

Ben.



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:46   ` Linus Torvalds
  2007-01-31  3:02     ` Linus Torvalds
  2007-01-31  5:16     ` Benjamin Herrenschmidt
@ 2007-01-31  5:36     ` Nick Piggin
  2007-01-31  5:51       ` Nick Piggin
                         ` (2 more replies)
  2007-01-31 17:47     ` Zach Brown
  3 siblings, 3 replies; 151+ messages in thread
From: Nick Piggin @ 2007-01-31  5:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

Linus Torvalds wrote:
> 
> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
> 
> 
>>>- We would now have some measure of task_struct concurrency.  Read that twice,
>>>it's scary.  As two fibrils execute and block in turn they'll each be
>>>referencing current->.  It means that we need to audit task_struct to make sure
>>>that paths can handle racing as its scheduled away.  The current implementation
>>>*does not* let preemption trigger a fibril switch.  So one only has to worry
>>>about racing with voluntary scheduling of the fibril paths.  This can mean
>>>moving some task_struct members under an accessor that hides them in a struct
>>>in task_struct so they're switched along with the fibril.  I think this is a
>>>manageable burden.
>>
>>That's the one scaring me in fact ... Maybe it will end up being an easy
>>one but I don't feel too comfortable...
> 
> 
> We actually have almost zero "interesting" data in the task-struct.
> 
> All the real meat of a task has long since been split up into structures 
> that can be shared for threading anyway (ie signal/files/mm/etc).
> 
> Which is why I'm personally very comfy with just re-using task_struct 
> as-is.
> 
> NOTE! This is with the understanding that we *never* do any preemption. 
> The whole point of the microthreading as far as I'm concerned is exactly 
> that it is cooperative. It's not preemptive, and it's emphatically *not* 
> concurrent (ie you'd never have two fibrils running at the same time on 
> separate CPU's).

So using stacks to hold state is (IMO) the logical choice to do async
syscalls, especially once you have a look at some of the other AIO
stuff going around.

I always thought that the AIO people didn't do this because they wanted
to avoid context switch overhead.

So now if we introduce the context switch overhead back, why do we need
just another scheduling primitive? What's so bad about using threads? The
upside is that almost everything is already there and working, and also
they don't have any of these preemption or concurrency restrictions.

The only thing I saw in Zach's post against the use of threads is that
some kernel API would change. But surely if this is the showstopper then
there must be some better argument than sys_getpid()?!

Aside from that, I'm glad that someone is looking at this way for AIO,
because I really don't like some aspects in the other approach.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  5:36     ` Nick Piggin
@ 2007-01-31  5:51       ` Nick Piggin
  2007-01-31  6:06       ` Linus Torvalds
  2007-01-31 18:20       ` Zach Brown
  2 siblings, 0 replies; 151+ messages in thread
From: Nick Piggin @ 2007-01-31  5:51 UTC (permalink / raw)
  Cc: Linus Torvalds, Benjamin Herrenschmidt, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar

Nick Piggin wrote:
> Linus Torvalds wrote:
> 
>>
>> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>>
>>
>>>> - We would now have some measure of task_struct concurrency.  Read 
>>>> that twice,
>>>> it's scary.  As two fibrils execute and block in turn they'll each be
>>>> referencing current->.  It means that we need to audit task_struct 
>>>> to make sure
>>>> that paths can handle racing as its scheduled away.  The current 
>>>> implementation
>>>> *does not* let preemption trigger a fibril switch.  So one only has 
>>>> to worry
>>>> about racing with voluntary scheduling of the fibril paths.  This 
>>>> can mean
>>>> moving some task_struct members under an accessor that hides them in 
>>>> a struct
>>>> in task_struct so they're switched along with the fibril.  I think 
>>>> this is a
>>>> manageable burden.
>>>
>>>
>>> That's the one scaring me in fact ... Maybe it will end up being an easy
>>> one but I don't feel too comfortable...
>>
>>
>>
>> We actually have almost zero "interesting" data in the task-struct.
>>
>> All the real meat of a task has long since been split up into 
>> structures that can be shared for threading anyway (ie 
>> signal/files/mm/etc).
>>
>> Which is why I'm personally very comfy with just re-using task_struct 
>> as-is.
>>
>> NOTE! This is with the understanding that we *never* do any 
>> preemption. The whole point of the microthreading as far as I'm 
>> concerned is exactly that it is cooperative. It's not preemptive, and 
>> it's emphatically *not* concurrent (ie you'd never have two fibrils 
>> running at the same time on separate CPU's).
> 
> 
> So using stacks to hold state is (IMO) the logical choice to do async
> syscalls, especially once you have a look at some of the other AIO
> stuff going around.
> 
> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.
> 
> So now if we introduce the context switch overhead back, why do we need
> just another scheduling primitive? What's so bad about using threads? The
> upside is that almost everything is already there and working, and also
> they don't have any of these preemption or concurrency restrictions.

In other words, while I share the appreciation for this clever trick of
using cooperative switching between these little thriblets, I don't
actually feel it is very elegant to then have to change the kernel so
much in order to handle them.

I would be fascinated to see where such a big advantage comes from using
these rather than threads. Maybe we can then improve threads not to suck
so much and everybody wins.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  5:36     ` Nick Piggin
  2007-01-31  5:51       ` Nick Piggin
@ 2007-01-31  6:06       ` Linus Torvalds
  2007-01-31  8:43         ` Ingo Molnar
  2007-01-31 20:13         ` Joel Becker
  2007-01-31 18:20       ` Zach Brown
  2 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-01-31  6:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Benjamin Herrenschmidt, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Wed, 31 Jan 2007, Nick Piggin wrote:
> 
> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.

I don't think that scheduling overhead was ever really the reason, at 
least not the primary one, and at least not on Linux. Sure, we can 
probably make cooperative thread switching a bit faster than even 
VM-sharing thread switching (maybe), but it's not going to be *that* big 
an issue.

Afaik, the bigger issues were about setup costs (but also purely semantic 
- it was hard to do AIO semantics with threads).

And memory costs. The "one stack page per outstanding AIO" may end up 
still being too expensive, but threads were even more so.

[ Of course, that used to also be the claim by all the people who thought 
  we couldn't do native kernel threads for "normal" threading either, and 
  should go with the n*m setup. Shows how much they knew ;^]

But I've certainly _personally_ always wanted to do AIO with threads. I 
wanted to do it with regular threads (ie the "clone()" kind). It didn't 
fly. But I think we can possibly lower both the setup costs and the memory 
costs with the cooperative approach, to the point where maybe this one is 
more palatable and workable.

And maybe it also solves some of the scalability worries (threads have ID 
space and scheduling setup things that essentially go away by just not 
doing them - which is what the fibrils simply wouldn't have).

(Sadly, some of the people who really _use_ AIO are the database people, 
and they really only care about a particularly stupid and trivial case: 
pure reads and writes. A lot of other loads care about much more complex 
things: filename lookups etc, that traditional AIO cannot do at all, and 
that you really want something more thread-like for. But those other loads 
get kind of swamped by the DB needs, which are much tighter and trivial 
enough that you don't "need" a real thread for them).

		Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  6:06       ` Linus Torvalds
@ 2007-01-31  8:43         ` Ingo Molnar
  2007-01-31 20:13         ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-01-31  8:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Benjamin Herrenschmidt, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ulrich Drepper


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ Of course, that used to also be the claim by all the people who 
>   thought we couldn't do native kernel threads for "normal" threading 
>   either, and should go with the n*m setup. Shows how much they knew 
>   ;^]
> 
> But I've certainly _personally_ always wanted to do AIO with threads. 
> I wanted to do it with regular threads (ie the "clone()" kind). It 
> didn't fly. But I think we can possibly lower both the setup costs and 
> the memory costs with the cooperative approach, to the point where 
> maybe this one is more palatable and workable.

as you know i've been involved in all the affected IO and scheduling 
disciplines in question: i hacked on scheduling, 1:1 threading, on Tux, 
i even optimized kaio under TPC workloads, and various other things. So 
i believe i have seen most of these things first-hand and thus i 
(pretend to) have no particular prejudice or other subjective 
preference. In the following, rather long (and boring) analysis i touch 
upon both 1:1 threading, light threads, state machines and AIO.

The first and most important thing is that i think there are only /two/ 
basic, fundamental IO approaches that matter to kernel design:

 1) the most easy to program one.

 2) the fastest one.

1), the most easily programmed model: this is tasks and synchronous IO. 
Samba (and Squid, and even Apache most of the time ...) is still using 
plain old /processes/ and is doing fine! Most of the old LWP arguments 
are not really true anymore, TLB switches are very fast meanwhile (per 
hardware improvements) and our scheduler is still pretty light. 
Furthermore, for certain specialized environments that are able to isolate 
the programmer from the risks and complexities of thread programming, 
threading can be somewhat faster (such as Java or C#). But for the 
general C/C++ based server environment even threading is pure hell. (We 
know it from the kernel that programming a threaded design is /hard/,
needs a lot of discipline and takes a lot more resources than any of the
saner models.)

2) the fastest model: is a pure, minimal state-machine tailored to the 
task/workload we want to do. Tux does that (and i'm keeping it up to date 
for current 2.6 kernels too), and it still beats the pants off anything 
user-space.

But reality is that most people care more about programmability: 300 
MB/sec or 500 MB/sec web throughput limit (or a serving limit of 30K 
requests in parallel or 60K requests in parallel - whatever the 
performance metric you care about is) isn't really making any difference 
in all but the most specialized environments - /because/ the price to 
pay for it is much worse programmability (and hence much worse 
flexibility). [ And this is all about basic economy: if it's 10 times 
harder to program something and what takes 1 month to program in one 
model takes 10 months to program in the other model - but in 10 months 
the hardware already got another 50% faster, likely offsetting the 
advantage of that 'other' model to begin with ... ]

Note that often the speed difference between task based and state based 
designs is a lot smaller. But God is the programming complexity 
difference huge. Not even huge but /HUGE/. All our programming languages 
are procedural and stack based. (Not only is there a basic lack of 
state-based programming languages, it's a product of our brain 
structure: state machines are not really what the human mind is 
structured for, but i digress.)
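
To make the contrast concrete, here is the same small job written both ways. The example is illustrative and not taken from Tux or any kernel code: the record format (one length byte followed by that many payload bytes) and all function names are assumptions. The procedural version keeps its place in the protocol implicitly, in the program counter. The state-machine version has to store that place explicitly so it can stop and resume at any byte boundary.

```c
#include <stddef.h>

/*
 * Procedural version: needs the whole input at once.  "Where we are"
 * in the protocol is simply where we are in the code.
 */
size_t count_records_blocking(const unsigned char *buf, size_t len)
{
	size_t pos = 0, records = 0;

	while (pos < len) {
		size_t n = buf[pos++];		/* length byte */
		if (len - pos < n)
			break;			/* truncated record */
		pos += n;			/* payload */
		records++;
	}
	return records;
}

/* State-machine version: input arrives in arbitrary chunks. */
enum rec_state { WANT_LEN, WANT_PAYLOAD };

struct rec_parser {
	enum rec_state state;
	size_t remaining;	/* payload bytes still expected */
	size_t records;		/* complete records seen so far */
};

void rec_feed(struct rec_parser *p, const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		if (p->state == WANT_LEN) {
			p->remaining = buf[i++];
			if (p->remaining == 0)
				p->records++;	/* empty record */
			else
				p->state = WANT_PAYLOAD;
		} else {
			size_t avail = len - i;
			size_t take = avail < p->remaining ? avail : p->remaining;

			i += take;
			p->remaining -= take;
			if (p->remaining == 0) {
				p->records++;
				p->state = WANT_LEN;
			}
		}
	}
}
```

Even for a two-step protocol the second version needs an enum, a struct and hand-written transitions. Each additional blocking point in the procedural version becomes another explicit state in the other.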

Now where do all these LWP, fibre, fibril, micro-thread or N:M concepts 
fit? Most of the time they are just a /weakening/ of the #1 concept. And 
that's why they will lose out, because #1 is all about programmability 
and they dont offer anything new: because they cannot. Either you go for 
programmability or you go for performance. There is /no/ middle ground 
for us in the kernel! (User-space can do the middle ground thing, but 
the kernel must only be structured around the two most fundamental 
concepts that exist. [NOTE: more about KAIO later.])

Having a 1:1 relationship between user-space and kernel-space context is 
a strong concept, and on modern CPUs we perform /process/ context 
switches in 650 nsecs, we do user thread context switches in 500 nsecs, 
and kernel<->kernel thread context switches in 350 nsecs. That's pretty 
damn fast.

The big problem is, the M:N concepts (and i consider micro-threads or 
'schedulable stacks' to be of that type) tend to operate the following 
way: they build up hype by being "faster" - but the performance 
advantage only comes from weakening #1! (for example by not making the 
contexts generally schedulable, bindable, load-balancable, suspendable, 
migratable, etc.) But then a few years later they grow (by virtue of 
basic pressure from programmers, who /want/ a clean easily programmable 
concept #1) the /very same/ features that they "optimized away" in the 
beginning. With the difference that now all that infrastructure they 
left out initially is replicated in user-space, in a poorer and 
inconsistent way, blowing up cache-size, and slowing the /sane/ things 
down in the end. (and then i havent even begun about the nightmare of 
extending the debug infrastructure to light threads, ABI dependencies, 
etc, etc...)

One often repeated (because pretty much only) performance advantage of 
'light threads' is context-switch performance between user-space 
threads. But reality is, nobody /cares/ about being able to 
context-switch between "light user-space threads"! Why? Because there 
are only two reasons why such a high-performance context-switch would 
occur:

 1) there's contention between those two tasks. Wonderful: now two 
    artificial threads are running on the /same/ CPU and they are even 
    contending each other. Why not run a single context on a single CPU 
    instead and only get contended if /another/ CPU runs a conflicting 
    context?? While this makes for nice "pthread locking benchmarks", 
    it is not really useful for anything real.

 2) there has been an IO event. The thing is, for IO events we enter the 
    kernel no matter what - and we'll do so for the next 10 years at 
    minimum. We want to abstract away the hardware, we want to do 
    reliable resource accounting, we want to share hardware resources, 
    we want to rate-limit, etc., etc. While in /theory/ you could handle 
    IO purely from user-space, in practice you dont want to do that. And 
    if we accept the premise that we'll enter the kernel anyway, there's 
    zero performance difference between scheduling right there in the 
    kernel, or returning back to user-space to schedule there. (in fact
    i submit that the former is faster). Or if we accept the theoretical
    possibility of 'perfect IO hardware' that implements /all/ the
    features that the kernel wants (in a secure and generic way, and
    mind you, such IO hardware does not exist yet), then /at most/ the
    performance advantage of user-space doing the scheduling is the
    overhead of a null syscall entry. Which is a whopping 100 nsecs on
    modern CPUs! That's roughly the latency of a /single/ DRAM access!

Furthermore, 'light thread' concepts can in no way approach the performance 
of #2 state-machines: if you /know/ what the structure of your context 
is, and you can program it in a specialized state-machine way, there's 
just so many shortcuts possible that it's not even funny.

> And maybe it also solves some of the scalability worries (threads have 
> ID space and scheduling setup things that essentially go away by just 
> not doing them - which is what the fibrils simply wouldn't have).

our PID management is very fast - i've never seen it show up in 
profiles. (We could make it even more scalable if the need arises, for 
example right now we still maintain the luxury of having a globally 
increasing and tightly compressed PID/TID space.)

until a few days ago we even had a /global lock/ in the thread/task 
creation and destruction hotpath since 2.5.43 (when oprofile support was 
added, more than 4 years ago, the oprofile exit notifier lock) and 
nobody really noticed or cared!

so i believe there are only two directions that make sense in the long 
run:

1) improve our basic #1 design gradually. If something is a bottleneck,
   if the scheduler has grown too fat, cut some slack. If micro-threads 
   or fibrils offer anything nice for our basic thread model: integrate 
   it into the kernel. (Dont worry about having to enter the kernel for 
   that, we /want/ that isolation! Resource control /needs/ a secure and 
   trustable context. Good debuggability /needs/ a secure and trustable 
   context. Hardware makers will make damn sure that such hardware-based 
   isolation is fast as lightning. SYSENTER will continue to be very 
   fast.) In any case, unless someone can answer the issues i raised 
   above, please dont mess with the basic 1:1 task model we have. It's 
   really what we want to have.

2) enable crazy people to do IO in the #2 way. For this i think the most 
   programmable interface is KAIO - because the 'context' /only/ 
   involves the IO entity itself, which is simple enough for our brain
   to wrap itself around. (and the hardware itself has the concept of
   'in flight IO' anyway, so people are forced to learn about it and to
   deal with it anyway, even in the synchronous IO model. So there's
   quite some mindshare to build upon.) This also happens to realize
   /most/ of the performance advantages that a state-machine like Tux
   tends to have. (In that sense i dont see epoll to be conceptually
   different from KAIO, although internally within the kernel it's quite
   a bit different.)

now, KAIO 'under the hood' (within the kernel) is /very/ hard to 
program in a fully state-driven manner - but we are crazy folks who 
/might/ be able to pull it off. Initially we should just simulate the 
really hard bits via a pool of in-kernel synchronous threads. Some IO 
disciplines are already state-driven internally (networking most 
notably, but also timers), so there KAIO can be implemented 'natively'.

And it will be /faster/ than micro-threads based AIO!

but frankly, KAIO is the most i can see a normal, sane human programmer 
to be able to deal with. Any more exposure to state-machines will just 
drive them crazy, and they'll lynch us (or more realistically: ignore 
us) if we try anything more. And with KAIO they still have the /option/ 
to make all their user-space functionality state-driven. Also, abstract, 
fully managed programming environments like Java can hide KAIO 
complexities by implementing their synchronous primitives using KAIO.

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
@ 2007-01-31  8:58   ` Andi Kleen
  2007-01-31 17:15     ` Zach Brown
  2007-02-01 20:26   ` bert hubert
  2007-02-04  5:12   ` Davide Libenzi
  2 siblings, 1 reply; 151+ messages in thread
From: Andi Kleen @ 2007-01-31  8:58 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

Zach Brown <zach.brown@oracle.com> writes:

> This finally does something useful with the notion of being able to schedule
> stacks as fibrils under a task_struct.  Again, i386-specific and in need of
> proper layering with archs.
> 
> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments.  A fibril is constructed for
> each call.  Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler.  sys_asys_await_completion() then lets userspace collect
> these results.

Do you have any numbers how this compares cycle wise to just doing
clone+syscall+exit in user space? 

If the difference is not too big might it be easier to fix
clone+syscall to be more efficient than teach all the rest
of the kernel about fibrils? 
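
One rough way to get a first user-space number for this comparison is
sketched below.  It is an illustration only, not a measurement from this
thread: it uses fork() instead of a bare clone() with a shared VM, so it
overstates what a tuned clone+syscall+exit path would cost, and the
function name avg_spawn_syscall_exit_ns() is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Average wall-clock cost of "create a child, run one trivial syscall
 * in it, exit, reap it".  Returns -1 on error.  fork() is an assumption
 * of this sketch; the thread talks about clone().
 */
static long long avg_spawn_syscall_exit_ns(int iterations)
{
	long long start;
	int i;

	if (iterations <= 0)
		return -1;

	start = now_ns();
	for (i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0)
			return -1;
		if (pid == 0) {
			(void)getppid();	/* the "syscall" step */
			_exit(0);		/* the "exit" step */
		}
		if (waitpid(pid, NULL, 0) < 0)
			return -1;
	}
	return (now_ns() - start) / iterations;
}
```

Calling avg_spawn_syscall_exit_ns(1000) gives a per-operation figure in
nanoseconds that could be set against a fibril submission path once
numbers for that exist.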

-Andi

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  3:02     ` Linus Torvalds
@ 2007-01-31 10:50       ` Xavier Bestel
  2007-01-31 19:28         ` Zach Brown
  2007-01-31 17:59       ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Xavier Bestel @ 2007-01-31 10:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
> Btw, this is also something where we should just disallow certain system 
> calls from being done through the asynchronous method. 

Does that mean that doing an AIO-disabled syscall will wait for all in-
flight AIO syscalls to finish ?

	Xav



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-31  8:58   ` Andi Kleen
@ 2007-01-31 17:15     ` Zach Brown
  2007-01-31 17:21       ` Andi Kleen
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-31 17:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:

> Do you have any numbers how this compares cycle wise to just doing
> clone+syscall+exit in user space?

Not yet, no.  Release early, release often, and all that.  I'll throw  
something together.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-31 17:15     ` Zach Brown
@ 2007-01-31 17:21       ` Andi Kleen
  2007-01-31 19:23         ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Andi Kleen @ 2007-01-31 17:21 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

On Wednesday 31 January 2007 18:15, Zach Brown wrote:
> 
> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
> 
> > Do you have any numbers how this compares cycle wise to just doing
> > clone+syscall+exit in user space?
> 
> Not yet, no.  Release early, release often, and all that.  I'll throw  
> something together.

So what was the motivation for doing this then?  Its only point 
is to have smaller startup costs for AIO than clone+fork without
fixing the VFS code to be a state machine, right? 

I'm personally unclear if it's really less work to teach a lot of 
code in the kernel about a new thread abstraction than changing VFS.

Your patches don't look that complicated yet but you openly
admitted you waved away many of the more tricky issues (like 
signals etc.) and I bet there are yet-unknown side effects
of this too that will need more changes.

I would expect a VFS solution to be the fastest of any at least.

I'm not sure the fibrils thing will be that much faster than
a clone+syscall+exit that has been somewhat fast-pathed for this case.

-Andi

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:04 ` Benjamin Herrenschmidt
  2007-01-31  2:46   ` Linus Torvalds
@ 2007-01-31 17:38   ` Zach Brown
  2007-01-31 17:51     ` Benjamin LaHaise
  1 sibling, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-31 17:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

>> - We would now have some measure of task_struct concurrency.  Read  
>> that twice,
>> it's scary.

> That's the one scaring me in fact ... Maybe it will end up being an  
> easy
> one but I don't feel too comfortable...

Indeed, that was my first reaction too.  I dismissed the idea for a  
good six months after initially realizing that it implied sharing  
journal_info, etc.

But when I finally sat down and started digging through the  
task_struct members and, after quickly dismissing involuntary  
preemption of the fibrils, it didn't seem so bad.  I haven't done an  
exhaustive audit yet (and I won't advocate merging until I have) but  
I haven't seen any train wrecks.

> we didn't create fibril-like
> things for threads, instead, we share PIDs between tasks. I wonder if
> the sane approach would be to actually create task structs (or have a
> pool of them pre-created sitting there for performances) and add a way
> to share the necessary bits so that syscalls can be run on those
> spin-offs.

Maybe, if it comes to that.  I have some hopes that sharing by  
default and explicitly marking the bits that we shouldn't share will  
be good enough.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  2:46   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2007-01-31  5:36     ` Nick Piggin
@ 2007-01-31 17:47     ` Zach Brown
  3 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-31 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

> Does that mean that we might not have some cases where we'd need to  
> make
> sure we do things differently? Of course not. Something might show up.

Might, and has.  For a good time, take journal_info out of  
per_call_chain() in the patch set and watch it helpfully and loudly  
wet itself.  There really really are bits of thread_struct which are  
strictly thread-local-storage, of a sort, for a kernel call path.    
Sharing them, even only through cooperative scheduling, is fatal.   
link_count is another obvious one.

They're also the only ones I've bothered to discover so far :).

> But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Indeed.

Right now the per-fibril uses of task_struct members are left inline  
in task_struct and are copied on fibril switches.

We *could* put them in thread_info, at the cost of stack pressure, or  
hang them off task_struct in their own struct to avoid the copies, at  
the cost of indirection.  I didn't like imposing a cost on paths that  
don't use fibrils, though, so I left them inline.
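
The "inline and copied on switch" layout can be pictured with the small
user-space model below.  It is a sketch under assumptions, not code from
the patch set: the struct and function names are hypothetical, and only
the two members mentioned in this thread (journal_info, link_count) are
modelled.

```c
#include <stddef.h>

/* Hypothetical grouping of the per-call-chain members named above. */
struct demo_chain_state {
	void *journal_info;
	int link_count;
};

struct demo_fibril {
	struct demo_chain_state saved;	/* parked copy while not running */
};

struct demo_task {
	struct demo_chain_state chain;	/* inline: no extra indirection */
	struct demo_fibril *running;
};

/*
 * Switch fibrils: park the outgoing fibril's view of the per-call-chain
 * fields and load the incoming one.  Code that never uses fibrils keeps
 * reading t->chain directly and pays nothing extra.
 */
static void demo_fibril_switch(struct demo_task *t, struct demo_fibril *next)
{
	if (t->running)
		t->running->saved = t->chain;
	t->chain = next->saved;
	t->running = next;
}
```

The alternative layouts would replace the copy with a pointer swap, at
the price of one more dereference on every access to these fields.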

(I think you know all this.  I'm clarifying for the peanut gallery, I  
hope.)

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31 17:38   ` Zach Brown
@ 2007-01-31 17:51     ` Benjamin LaHaise
  2007-01-31 19:25       ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Benjamin LaHaise @ 2007-01-31 17:51 UTC (permalink / raw)
  To: Zach Brown
  Cc: Benjamin Herrenschmidt, linux-kernel, linux-aio,
	Suparna Bhattacharya, Linus Torvalds

On Wed, Jan 31, 2007 at 09:38:11AM -0800, Zach Brown wrote:
> Indeed, that was my first reaction too.  I dismissed the idea for a  
> good six months after initially realizing that it implied sharing  
> journal_info, etc.
> 
> But when I finally sat down and started digging through the  
> task_struct members and, after quickly dismissing involuntary  
> preemption of the fibrils, it didn't seem so bad.  I haven't done an  
> exhaustive audit yet (and I won't advocate merging until I have) but  
> I haven't seen any train wrecks.

I'm still of the opinion that you cannot do this without creating actual 
threads.  That said, there is room for setting up the task_struct beforehand 
without linking it into the system lists.  The reason I don't think this 
approach works (and I looked at it a few times) is that many things end 
up requiring special handling: things like permissions, signals, FPU state, 
segment registers....  The list ends up looking exactly the way task_struct 
is, making the actual savings very small.

What the fibrils approach is useful for is the launching of the thread 
initially, as you don't have to retain things like the current FPU state, 
change segment registers, etc.  Changing the stack is cheap, the rest of 
the work is not.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  3:02     ` Linus Torvalds
  2007-01-31 10:50       ` Xavier Bestel
@ 2007-01-31 17:59       ` Zach Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-31 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

> Btw, this is also something where we should just disallow certain  
> system
> calls from being done through the asynchronous method.

Yeah.  Maybe just a bitmap built from __NR_ constants?  I don't know  
if we can do it in a way that doesn't require arch maintainer's  
attention.

It seems like it would be nice to avoid putting a test in the  
handlers themselves, and leave it up to the aio syscall submission  
processing.
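
A minimal sketch of that bitmap check, done at submission time rather
than in the handlers, might look like the following.  Everything here is
assumed for illustration: the DEMO_NR_ values stand in for real __NR_
constants, the function names are invented, and -ENOSYS is just one of
the plausible error codes.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical call numbers; real code would use __NR_ constants. */
#define DEMO_NR_READ	0u
#define DEMO_NR_WRITE	1u
#define DEMO_NR_CLOSE	3u
#define DEMO_NR_MAX	64u

#define DEMO_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORD(nr)		((nr) / DEMO_BITS_PER_WORD)
#define DEMO_BIT(nr)		(1UL << ((nr) % DEMO_BITS_PER_WORD))

/* One bit per system call number; a set bit means "may be submitted". */
static unsigned long demo_asys_allowed[
	(DEMO_NR_MAX + DEMO_BITS_PER_WORD - 1) / DEMO_BITS_PER_WORD];

/* Mark a call number as safe for asynchronous submission. */
static void demo_asys_allow(unsigned int nr)
{
	if (nr < DEMO_NR_MAX)
		demo_asys_allowed[DEMO_WORD(nr)] |= DEMO_BIT(nr);
}

/* Submission-time test: 0 if the call may run async, -ENOSYS if not. */
static int demo_asys_check(unsigned int nr)
{
	if (nr >= DEMO_NR_MAX)
		return -ENOSYS;
	if (!(demo_asys_allowed[DEMO_WORD(nr)] & DEMO_BIT(nr)))
		return -ENOSYS;
	return 0;
}
```

With only read and write allowed, a submission naming DEMO_NR_CLOSE is
rejected before any fibril is built, and the handlers stay untouched.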

> More interesting is the question about "close()", though. Currently we
> have an optimization (fget/fput_light) that basically boils down to  
> "we
> know we are the only owners". That optimization becomes more  
> "interesting"
> with AIO - we need to disable it when fibrils are active (because  
> other
> fibrils or the main thread can do it), but we can still keep it for  
> the
> non-fibril case.

I'll take a look, thanks for pointing it out.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  5:36     ` Nick Piggin
  2007-01-31  5:51       ` Nick Piggin
  2007-01-31  6:06       ` Linus Torvalds
@ 2007-01-31 18:20       ` Zach Brown
  2 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-31 18:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Benjamin Herrenschmidt,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar

> The only thing I saw in Zach's post against the use of threads is that
> some kernel API would change. But surely if this is the showstopper  
> then
> there must be some better argument than sys_getpid()?!

Haha, yeah, that's the silly example I keep throwing around :).  I  
guess it does leave a little too much of the exercise up to the reader.

Perhaps a less goofy example are the uses of current->ioprio and  
current->io_context?

If you create and destroy threads around each operation then you're  
going to be creating and destroying an io_context around each op  
instead of getting a reference on a pre-existing context in  
additional ops.  ioprio is inherited it seems, though, so that's not  
so bad.

If you have a pool of threads and you want to update the ioprio for  
future IOs, you now have to sync up the pool's ioprio with new  
desired priority.

It's all solvable, sure.  Get an io_context ref in copy_process().   
Share ioprio instead of inheriting it.  Have a fun conversation with  
Jens about the change in behaviour this implies.  Broadcasting to  
threads to update ioprio isn't exactly rocket science.

But with the fibril model the user doesn't have to know to worry about  
the inconsistencies and we kernel developers don't have to worry  
about pro-actively stamping them out.  A series of sync write and  
ioprio setting calls behaves exactly the same as that series of calls  
issued sequentially as "async" calls.  That's worth aiming for, I think.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-31 17:21       ` Andi Kleen
@ 2007-01-31 19:23         ` Zach Brown
  2007-02-01 11:13           ` Suparna Bhattacharya
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-31 19:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


On Jan 31, 2007, at 9:21 AM, Andi Kleen wrote:

> On Wednesday 31 January 2007 18:15, Zach Brown wrote:
>>
>> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
>>
>>> Do you have any numbers how this compares cycle wise to just doing
>>> clone+syscall+exit in user space?
>>
>> Not yet, no.  Release early, release often, and all that.  I'll throw
>> something together.
>
> So what was the motivation for doing this then?

Most fundamentally?  Providing AIO system call functionality at a  
much lower maintenance cost.  The hope is that the cost of adopting  
these fibril things will be lower than the cost of having to touch a  
code path that wants AIO support.

I simply don't believe that it's cheap to update code paths to  
support non-blocking state machines.  As just one example of a  
looming cost, consider the retry-based buffered fs AIO patches that  
exist today.  Their requirement to maintain these precisely balanced  
code paths that know to only return -EIOCBRETRY once they're at a  
point where retries won't access current-> seems.. unsustainable to  
me.  This stems from the retries being handled off in the aio kernel  
threads which have their own task_struct.  fs/aio.c goes to the  
trouble of migrating ->mm from the submitting task_struct, but  
nothing else.  Continually adjusting this finely balanced  
relationship between paths that return -EIOCBRETRY and the fields of  
task_struct that fs/aio.c knows to share with the submitting context  
seems unacceptably fragile.

Even with those buffered IO patches we still only get non-blocking  
behaviour at a few specific blocking points in the buffered IO path.   
It's nothing like the guarantee of non-blocking submission returns  
that the fibril-based submission guarantees.

>   Its only point
> is to have smaller startup costs for AIO than clone+fork without
> fixing the VFS code to be a state machine, right?

Smaller startup costs and fewer behavioural differences.  Did that  
message to Nick about ioprio and io_context resonate with you at all?

> I'm personally unclear if it's really less work to teach a lot of
> code in the kernel about a new thread abstraction than changing VFS.

Why are we limiting the scope of moving to a state machine just to  
the VFS?  If you look no further than some hypothetical AIO iscsi/aoe/ 
nbd/whatever target you obviously include networking.  Probably splice 
() if you're aggressive :).

Let's be clear.  I would be thrilled if AIO was implemented by native  
non-blocking handler implementations.  I don't think it will happen.   
Not because we don't think it sounds great on paper, but because it's  
a hugely complex endeavor that would take development and maintenance  
effort away from the task of keeping basic functionality working.

So the hope with fibrils is that we lower the barrier to getting AIO  
syscall support across the board at an acceptable cost.

It doesn't *stop* us from migrating very important paths (storage,  
networking) to wildly optimized AIO implementations.  But it also  
doesn't force AIO support to wait for that.

> Your patches don't look that complicated yet but you openly
> admitted you waved away many of the more tricky issues (like
> signals etc.) and I bet there are yet-unknown side effects
> of this too that will need more changes.

To quibble, "waved away" implies that they've been dismissed.  That's  
not right.  It's a work in progress, so yes, there will be more  
fiddly details discovered and addressed over time.  The hope is that  
when it's said and done it'll still be worth merging.  If at some  
point it gets to be too much, well, at least we'll have this work to  
reference as a decisive attempt.

> I'm not sure the fibrils thing will be that much faster than
> a possibly somewhat fast pathed for this case clone+syscall+exit.

I'll try and get some numbers for you sooner rather than later.

Thanks for being diligent, this is exactly the kind of hard look I  
want this work to get.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31 17:51     ` Benjamin LaHaise
@ 2007-01-31 19:25       ` Zach Brown
  2007-01-31 20:05         ` Benjamin LaHaise
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-01-31 19:25 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Benjamin Herrenschmidt, linux-kernel, linux-aio,
	Suparna Bhattacharya, Linus Torvalds

> without linking it into the system lists.  The reason I don't think  
> this
> approach works (and I looked at it a few times) is that many things  
> end
> up requiring special handling: things like permissions, signals,  
> FPU state,
> segment registers....

Can you share a specific example of the special handling required?

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31 10:50       ` Xavier Bestel
@ 2007-01-31 19:28         ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-31 19:28 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: Linus Torvalds, Benjamin Herrenschmidt,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar


On Jan 31, 2007, at 2:50 AM, Xavier Bestel wrote:

> On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
>> Btw, this is also something where we should just disallow certain  
>> system
>> calls from being done through the asynchronous method.
>
> Does that mean that doing an AIO-disabled syscall will wait for all  
> in-
> flight AIO syscalls to finish ?

That seems unlikely.  I imagine we'd just return EINVAL or ENOSYS or  
something to that effect.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31 19:25       ` Zach Brown
@ 2007-01-31 20:05         ` Benjamin LaHaise
  2007-01-31 20:41           ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Benjamin LaHaise @ 2007-01-31 20:05 UTC (permalink / raw)
  To: Zach Brown
  Cc: Benjamin Herrenschmidt, linux-kernel, linux-aio,
	Suparna Bhattacharya, Linus Torvalds

On Wed, Jan 31, 2007 at 11:25:30AM -0800, Zach Brown wrote:
> >without linking it into the system lists.  The reason I don't think  
> >this
> >approach works (and I looked at it a few times) is that many things  
> >end
> >up requiring special handling: things like permissions, signals,  
> >FPU state,
> >segment registers....
> 
> Can you share a specific example of the special handling required?

Take FPU state: memory copies and RAID xor functions use MMX/SSE and 
require that the full task state be saved and restored.

Task priority is another.  POSIX AIO lets you specify request priority, and 
it really is needed for realtime workloads where things like keepalive 
must be processed at a higher priority.  This is especially important on 
embedded systems which don't have a surplus of CPU cycles.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31  6:06       ` Linus Torvalds
  2007-01-31  8:43         ` Ingo Molnar
@ 2007-01-31 20:13         ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2007-01-31 20:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Benjamin Herrenschmidt, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar

On Tue, Jan 30, 2007 at 10:06:48PM -0800, Linus Torvalds wrote:
> (Sadly, some of the people who really _use_ AIO are the database people, 
> and they really only care about a particularly stupid and trivial case: 
> pure reads and writes. A lot of other loads care about much more complex 
> things: filename lookups etc, that traditional AIO cannot do at all, and 
> that you really want something more thread-like for. But those other loads 
> get kind of swamped by the DB needs, which are much tighter and trivial 
> enough that you don't "need" a real thread for them).

	While certainly not an exhaustive list, DB people love their
reads and writes, but are pining for network reads and writes.  They
also are very excited about async poll(), connect(), and accept().  One
of the big problems today is that you can either sleep for your I/O in
io_getevents() or for your connect()/accept() in poll()/epoll(), but
there is nowhere you can sleep for all of them at once.  That's why the
aio list continually proposes aio_poll() or returning aio events
via epoll().
	Fibril based syscalls would allow async connect()/accept() and
the rest of networking, plus one stop shopping for completions.

Joel


-- 

"Born under a bad sign.
 I been down since I began to crawl.
 If it wasn't for bad luck,
 I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-31 20:05         ` Benjamin LaHaise
@ 2007-01-31 20:41           ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-01-31 20:41 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Benjamin Herrenschmidt, linux-kernel, linux-aio,
	Suparna Bhattacharya, Linus Torvalds

> Take FPU state: memory copies and RAID xor functions use MMX/SSE and
> require that the full task state be saved and restored.

Sure, that much is obvious.  I was hoping to see what FPU state  
juggling actually requires.  I'm operating under the assumption that  
it won't be *terrible*.

> Task priority is another.  POSIX AIO lets you specify request  
> priority, and
> it really is needed for realtime workloads where things like keepalive
> must be processed at a higher priority.

Yeah.  A first-pass approximation might be to have threads with asys  
system calls grouped by priority.  Leaving all that priority handling  
to the *task* scheduler, instead of the dirt-stupid fibril  
"scheduler", would be great.  If we can get away with it.  I don't  
have a good feeling for what portion of the world actually cares  
about this, or to what degree.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
@ 2007-02-01  8:36   ` Ingo Molnar
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 20:07     ` Linus Torvalds
  2007-02-04  5:12   ` Davide Libenzi
  1 sibling, 2 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-01  8:36 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


* Zach Brown <zach.brown@oracle.com> wrote:

> This patch introduces the notion of a 'fibril'.  It's meant to be a 
> lighter kernel thread. [...]

as per my other email, i dont really like this concept. This is the 
killer:

> [...]  There can be multiple of them in the process of executing for a 
> given task_struct, but only one can ever be actively running at a 
> time. [...]

there's almost no scheduling cost from being able to arbitrarily 
schedule a kernel thread - but there are /huge/ benefits in it.

would it be hard to redo your AIO patches based on a pool of plain 
simple kernel threads?

We could even extend the scheduling properties of kernel threads so that 
they could also be 'companion threads' of any given user-space task. 
(i.e. they'd always schedule on the same CPU as that user-space task)

I bet most of the real benefit would come from co-scheduling them on the 
same CPU. But this should be a performance property, not a basic design 
property. (And i also think that having a limited per-CPU pool of AIO 
threads works better than having a per-user-thread pool - but again this 
is a detail that can be easily changed, not a fundamental design 
property.)

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-31 19:23         ` Zach Brown
@ 2007-02-01 11:13           ` Suparna Bhattacharya
  2007-02-01 19:50             ` Trond Myklebust
  2007-02-01 22:18             ` Zach Brown
  0 siblings, 2 replies; 151+ messages in thread
From: Suparna Bhattacharya @ 2007-02-01 11:13 UTC (permalink / raw)
  To: Zach Brown
  Cc: Andi Kleen, linux-kernel, linux-aio, Benjamin LaHaise, Linus Torvalds

On Wed, Jan 31, 2007 at 11:23:39AM -0800, Zach Brown wrote:
> On Jan 31, 2007, at 9:21 AM, Andi Kleen wrote:
> 
> >On Wednesday 31 January 2007 18:15, Zach Brown wrote:
> >>
> >>On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
> >>
> >>>Do you have any numbers how this compares cycle wise to just doing
> >>>clone+syscall+exit in user space?
> >>
> >>Not yet, no.  Release early, release often, and all that.  I'll throw
> >>something together.
> >
> >So what was the motivation for doing this then?
> 
> Most fundamentally?  Providing AIO system call functionality at a  
> much lower maintenance cost.  The hope is that the cost of adopting  
> these fibril things will be lower than the cost of having to touch a  
> code path that wants AIO support.
> 
> I simply don't believe that it's cheap to update code paths to  
> support non-blocking state machines.  As just one example of a  
> looming cost, consider the retry-based buffered fs AIO patches that  
> exist today.  Their requirement to maintain these precisely balanced  
> code paths that know to only return -EIOCBRETRY once they're at a  
> point where retries won't access current-> seems.. unsustainable to  
> me.  This stems from the retries being handled off in the aio kernel  
> threads which have their own task_struct.  fs/aio.c goes to the  
> trouble of migrating ->mm from the submitting task_struct, but  
> nothing else.  Continually adjusting this finely balanced  
> relationship between paths that return -EIOCBRETRY and the fields of  
> task_struct that fs/aio.c knows to share with the submitting context  
> seems unacceptably fragile.

Wooo ...hold on ... I think this is swinging out of perspective :)

I have said some of this before, but let me try again.

As you already discovered when going down the fibril path, there are
two kinds of accesses to current-> state, (1) common state
for a given call chain (e.g. journal info etc), and (2) for 
various validations against the caller's process (uid, ulimit etc). 

(1) is not an issue when it comes to execution in background threads
(the VFS already uses background writeback for example).

As for (2), such checks need to happen upfront at the time of IO submission,
so again are not an issue.

This is aside from access to the caller's address space, a familiar
concept which the AIO threads use. If there is any state that
relates to address space access, then it belongs in the ->mm struct, rather
than in current (and we should fix that if we find anything which isn't
already there).

I don't see any other reason why IO paths should be assuming that they are
running in the original caller's context, midway through doing the IO. If
that were the case background writeouts and readaheads could be fragile as
well (or ptrace). The reason it isn't is because of this conceptual division of
responsibility.
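
The division of responsibility can be modelled in a few lines of
user-space C.  This is only a sketch of the idea: the structs, the
single size-limit check and the function names are all invented for the
example and do not correspond to real kernel interfaces.

```c
#include <errno.h>

/* Hypothetical stand-in for the caller state consulted at submission. */
struct demo_caller {
	unsigned int uid;
	long long fsize_limit;	/* assumed non-negative */
};

/* Everything the background context is allowed to look at. */
struct demo_request {
	unsigned int uid;
	long long offset;
	long long len;
};

/*
 * Category (2): validate against the caller up front, then snapshot
 * what the IO needs.  The request is filled in only on success.
 */
static int demo_submit(const struct demo_caller *caller,
		       long long offset, long long len,
		       struct demo_request *req)
{
	if (offset < 0 || len < 0)
		return -EINVAL;
	if (len > caller->fsize_limit - offset)
		return -EFBIG;

	req->uid = caller->uid;
	req->offset = offset;
	req->len = len;
	return 0;
}

/*
 * Runs later, possibly in another thread: it takes the request alone,
 * so it cannot depend on the submitter's "current" state by accident.
 */
static long long demo_complete(const struct demo_request *req)
{
	return req->len;	/* pretend the whole range was transferred */
}
```

Because demo_complete() has no caller argument at all, any new
dependency on the submitter's state would show up as an interface
change rather than as a silent access.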

Sure, having explicit separation of submission checks as an interface
would likely make this clearer and cleaner, but I'm just
pointing out that usage of current-> state isn't and shouldn't be arbitrary
when it comes to filesystem IO paths. We should be concerned in any case
if that starts happening.

Of course, this is fundamentally connected to the way filesystem IO is
designed to work, and may not necessarily apply to all syscalls.

When you want to make any and every syscall asynchronous, then indeed
the challenge is magnified and that is where it could get scary. But that
isn't the problem the current AIO code is trying to tackle.

> 
> Even with those buffered IO patches we still only get non-blocking  
> behaviour at a few specific blocking points in the buffered IO path.   
> It's nothing like the guarantee of non-blocking submission returns  
> that the fibril-based submission guarantees.

This one is a better reason, and why I have thought of fibrils (or the
equivalent alternative of enhancing kernel threads to become even lighter)
as an interesting fallback option to implement AIO for cases which we
haven't (maybe some of which are too complicated) gotten around to
supporting natively. Especially if it could be coupled with some clever
tricks to keep stack space minimal (I have wondered whether any of
the ideas from similar user-level efforts like Capriccio, or laio would help).

> 
> >  Its only point
> >is to have smaller startup costs for AIO than clone+fork without
> >fixing the VFS code to be a state machine, right?
> 
> Smaller startup costs and fewer behavioural differences.  Did that  
> message to Nick about ioprio and io_context resonate with you at all?
> 
> >I'm personally unclear if it's really less work to teach a lot of
> >code in the kernel about a new thread abstraction than changing VFS.
> 
> Why are we limiting the scope of moving to a state machine just to  
> the VFS?  If you look no further than some hypothetical AIO iscsi/aoe/ 
> nbd/whatever target you obviously include networking.  Probably splice 
> () if you're aggressive :).
> 
> Let's be clear.  I would be thrilled if AIO was implemented by native  
> non-blocking handler implementations.  I don't think it will happen.   
> Not because we don't think it sounds great on paper, but because it's  
> a hugely complex endeavor that would take development and maintenance  
> effort away from the task of keeping basic functionality working.
> 
> So the hope with fibrils is that we lower the barrier to getting AIO  
> syscall support across the board at an acceptable cost.
> 
> It doesn't *stop* us from migrating very important paths (storage,  
> networking) to wildly optimized AIO implementations.  But it also  
> doesn't force AIO support to wait for that.
> 
> >Your patches don't look that complicated yet but you openly
> >admitted you waved away many of the more tricky issues (like
> >signals etc.) and I bet there are yet-unknown side effects
> >of this too that will need more changes.
> 
> To quibble, "waved away" implies that they've been dismissed.  That's  
> not right.  It's a work in progress, so yes, there will be more  
> fiddly details discovered and addressed over time.  The hope is that  
> when it's said and done it'll still be worth merging.  If at some  
> point it gets to be too much, well, at least we'll have this work to  
> reference as a decisive attempt.
> 
> >I'm not sure the fibrils thing will be that much faster than
> >a possibly somewhat fast pathed for this case clone+syscall+exit.
> 
> I'll try and get some numbers for you sooner rather than later.
> 
> Thanks for being diligent, this is exactly the kind of hard look I  
> want this work to get.

BTW, I like the way you are approaching this with a cautiously
critical eye cognizant of lurking details/issues, despite the obvious
(and justified) excitement/eureka feeling.  AIO _is_ hard !

Regards
Suparna

> 
> - z
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <aart@kvack.org>

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01  8:36   ` Ingo Molnar
@ 2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
                         ` (2 more replies)
  2007-02-01 20:07     ` Linus Torvalds
  1 sibling, 3 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-01 13:02 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds


* Ingo Molnar <mingo@elte.hu> wrote:

> * Zach Brown <zach.brown@oracle.com> wrote:
> 
> > This patch introduces the notion of a 'fibril'.  It's meant to be a 
> > lighter kernel thread. [...]
> 
> as per my other email, i dont really like this concept. This is the 
> killer:

let me clarify this: i very much like your AIO patchset in general, in 
the sense that it 'completes' the AIO implementation: finally everything 
can be done via it, greatly increasing its utility and hopefully its 
penetration. This is the most important step, by far.

what i dont really like is /the particular/ concept above - the 
introduction of 'fibrils' as a hard distinction from kernel threads. They 
are /almost/ kernel threads, but still by being different they create 
a lot of duplication and miss out on a good deal of features that kernel 
threads have naturally.

It kind of hurts to say this because i'm usually quite concept-happy - 
one can easily get addicted to the introduction of new core kernel 
concepts :-) But i really, really think we dont want to do fibrils but 
we want to do kernel threads, and i havent really seen a discussion 
about why they shouldnt be done via kernel threads.

Nor have i seen a discussion that whatever threading concept we use for 
AIO within the kernel, it is really a fallback thing, not the primary 
goal of "native" KAIO design. The primary goal of KAIO design is to 
arrive at a state machine - and for one of the most important IO 
disciplines, networking, that is reality already. (For filesystem events 
i doubt we will ever be able to build an IO state machine - but there 
are lots of crazy folks out there so it's not fundamentally impossible, 
just very, very hard.)

so my suggestions center around the notion of extending kernel threads 
to support the features you find important in fibrils:

> would it be hard to redo your AIO patches based on a pool of plain 
> simple kernel threads?
> 
> We could even extend the scheduling properties of kernel threads so 
> that they could also be 'companion threads' of any given user-space 
> task. (i.e. they'd always schedule on the same CPu as that user-space 
> task)
> 
> I bet most of the real benefit would come from co-scheduling them on 
> the same CPU. But this should be a performance property, not a basic 
> design property. (And i also think that having a limited per-CPU pool 
> of AIO threads works better than having a per-user-thread pool - but 
> again this is a detail that can be easily changed, not a fundamental 
> design property.)

but i'm willing to be convinced of the opposite as well, as always. (I'm 
real good at quickly changing my mind, especially when i'm embarrassingly 
wrong about something. So please fire away and dont hold back.)

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
@ 2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 13:52         ` Ingo Molnar
  2007-02-02 13:23         ` Andi Kleen
  2007-02-01 21:52       ` Zach Brown
  2007-02-02 13:22       ` Andi Kleen
  2 siblings, 2 replies; 151+ messages in thread
From: Christoph Hellwig @ 2007-02-01 13:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 1080 bytes --]

On Thu, Feb 01, 2007 at 02:02:34PM +0100, Ingo Molnar wrote:
> what i dont really like is /the particular/ concept above - the 
> introduction of 'fibrils' as a hard distinction from kernel threads. They 
> are /almost/ kernel threads, but still by being different they create 
> a lot of duplication and miss out on a good deal of features that kernel 
> threads have naturally.
> 
> It kind of hurts to say this because i'm usually quite concept-happy - 
> one can easily get addicted to the introduction of new core kernel 
> concepts :-) But i really, really think we dont want to do fibrils but 
> we want to do kernel threads, and i havent really seen a discussion 
> about why they shouldnt be done via kernel threads.

I tend to agree.  Note that there is one thing we should be doing one
day (not only if we want to use it for aio), which is to make kernel threads
more lightweight.  There is a lot of baggage we keep around in task_struct
and co that only makes sense for threads that have a user space part and
isn't or shouldn't be needed for a purely kernel-resident thread.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:19       ` Christoph Hellwig
@ 2007-02-01 13:52         ` Ingo Molnar
  2007-02-01 17:13           ` Mark Lord
  2007-02-02 13:23         ` Andi Kleen
  1 sibling, 1 reply; 151+ messages in thread
From: Ingo Molnar @ 2007-02-01 13:52 UTC (permalink / raw)
  To: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds


* Christoph Hellwig <hch@infradead.org> wrote:

> I tend to agree.  Note that there is one thing we should be doing one 
> day (not only if we want to use it for aio), which is to make kernel 
> threads more lightweight.  There is a lot of baggage we keep around in 
> task_struct and co that only makes sense for threads that have a user 
> space part and isn't or shouldn't be needed for a purely 
> kernel-resident thread.

yeah. I'm totally open to such efforts. I'd also be most happy if this 
was primarily driven via the KAIO effort: i.e. to implement it via 
kernel threads and then to benchmark the hell out of it. I volunteer to 
fix whatever fat kernel thread handling has left.

and if people agree with me that 'native' state-machine driven KAIO is 
what we want to ultimately achieve (it is certainly the best performing 
implementation) then i dont see the point in fibrils as an interim 
mechanism anyway. Lets just hide AIO complexities from userspace via 
kernel threads, and optimize this via two methods: by making kernel 
threads faster, and by simultaneously and gradually converting as much 
KAIO code as possible to a native state machine - which would not need 
any kind of kernel thread help anyway.

(plus as i mentioned previously, co-scheduling kernel threads with 
related user space threads on the same CPU might be something useful too 
- not just for KAIO, and we could add that too.)

also, we context-switch kernel threads in 350 nsecs on current hardware 
and the -rt kernel is certainly happy with that and runs all hardirqs 
and softirqs in separate kernel thread contexts. There's not /that/ much 
fat left to cut off - and if there's something more to optimize there 
then there are a good number of projects interested in that, not just 
the KAIO effort :)

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:52         ` Ingo Molnar
@ 2007-02-01 17:13           ` Mark Lord
  2007-02-01 18:02             ` Ingo Molnar
  0 siblings, 1 reply; 151+ messages in thread
From: Mark Lord @ 2007-02-01 17:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

Ingo Molnar wrote:
>
> also, we context-switch kernel threads in 350 nsecs on current hardware 
> and the -rt kernel is certainly happy with that and runs all hardirqs 

Ingo, how relevant is that "350 nsecs on current hardware" claim?

I don't mean that in a bad way, but my own experience suggests that
most people doing real hard RT (or tight soft RT) are not doing it
on x86 architectures.  But rather on lowly 1GHz (or less) ARM based
processors and the like.

For RT issues, those are the platforms I care more about,
as those are the ones that get embedded into real-time devices.

??

Cheers

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 17:13           ` Mark Lord
@ 2007-02-01 18:02             ` Ingo Molnar
  0 siblings, 0 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-01 18:02 UTC (permalink / raw)
  To: Mark Lord
  Cc: Christoph Hellwig, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds


* Mark Lord <lkml@rtr.ca> wrote:

> >also, we context-switch kernel threads in 350 nsecs on current 
> >hardware and the -rt kernel is certainly happy with that and runs all 
> >hardirqs
> 
> Ingo, how relevant is that "350 nsecs on current hardware" claim?
> 
> I don't mean that in a bad way, but my own experience suggests that 
> most people doing real hard RT (or tight soft RT) are not doing it on 
> x86 architectures.  But rather on lowly 1GHz (or less) ARM based 
> processors and the like.

it's not relevant to those embedded boards, but it's relevant to the AIO 
discussion, which centers around performance.

> For RT issues, those are the platforms I care more about, as those are 
> the ones that get embedded into real-time devices.

yeah. Nevertheless if you want to use -rt on your desktop (under Fedora 
4/5/6) you can track an rpmized+distroized full kernel package quite 
easily, via 3 easy commands:

   cd /etc/yum.repos.d
   wget http://people.redhat.com/~mingo/realtime-preempt/rt.repo

   yum install kernel-rt.x86_64   # on x86_64
   yum install kernel-rt          # on i686

which is closely tracking latest upstream -git. (for example, the 
current kernel-rt-2.6.20-rc7.1.rt3.0109.i686.rpm is based on 
2.6.20-rc7-git1, so if you want to run a kernel rpm that has all of 
Linus' latest commits from yesterday, this might be for you.)

it's rumored to be a quite smooth kernel ;-) So in this sense, because 
this also runs on all my testboxes by default, it matters on modern 
hardware too, at least to me. Today's commodity hardware is tomorrow's 
embedded hardware. If a kernel is good on today's colorful desktop 
hardware then it will be perfect for tomorrow's embedded hardware.

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-02-01 11:13           ` Suparna Bhattacharya
@ 2007-02-01 19:50             ` Trond Myklebust
  2007-02-02  7:19               ` Suparna Bhattacharya
  2007-02-01 22:18             ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Trond Myklebust @ 2007-02-01 19:50 UTC (permalink / raw)
  To: suparna
  Cc: Zach Brown, Andi Kleen, linux-kernel, linux-aio,
	Benjamin LaHaise, Linus Torvalds

On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> Wooo ...hold on ... I think this is swinging out of perspective :)
> 
> I have said some of this before, but let me try again.
> 
> As you already discovered when going down the fibril path, there are
> two kinds of accesses to current-> state, (1) common state
> for a given call chain (e.g. journal info etc), and (2) for 
> various validations against the caller's process (uid, ulimit etc). 
> 
> (1) is not an issue when it comes to execution in background threads
> (the VFS already uses background writeback for example).
> 
> As for (2), such checks need to happen upfront at the time of IO submission,
> so again are not an issue.

Wrong! These checks can and do occur well after the time of I/O
submission in the case of remote filesystems with asynchronous writeback
support.

Consider, for instance, the cases where the server reboots and loses all
state. Then there is the case of failover and/or migration events, where
the entire filesystem gets moved from one server to another, and again
you may have to recover state, etc...

> I don't see any other reason why IO paths should be assuming that they are
> running in the original caller's context, midway through doing the IO. If
> that were the case background writeouts and readaheads could be fragile as
> well (or ptrace). The reason it isn't is because of this conceptual division of
> responsibility.

The problem with this is that the security context is getting
progressively more heavy as we add more and more features. In addition
to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
keyrings to carry around. Then there is all the context needed to
support selinux,...
In the end, you end up recreating most of struct task_struct...

Cheers
  Trond


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01  8:36   ` Ingo Molnar
  2007-02-01 13:02     ` Ingo Molnar
@ 2007-02-01 20:07     ` Linus Torvalds
  2007-02-02 10:49       ` Ingo Molnar
  1 sibling, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-01 20:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Thu, 1 Feb 2007, Ingo Molnar wrote:
> 
> there's almost no scheduling cost from being able to arbitrarily 
> schedule a kernel thread - but there are /huge/ benefits in it.

That's a singularly *stupid* argument.

Of course scheduling is fast. That's the whole *point* of fibrils. They 
still schedule. Nobody claimed anything else. 

Bringing up RT kernels and scheduling latency is idiotic. It's like saying 
"we should do this because the sky is blue". Sure, that's true, but what 
the *hell* does Rayleigh scattering have to do with anything?

The cost has _never_ been scheduling. That was never the point. Why do you 
even bring it up? Only to make an argument that makes no sense?

The cost of AIO is

 - maintenance. It's a separate code-path, and it's one that simply doesn't 
   fit into anything else AT ALL. It works (mostly) for simple things, ie 
   reads and writes, but even there, it's really adding a lot of crud that 
   we could do without.

 - setup and teardown costs: both in CPU and in memory. These are the big 
   costs. It's especially true since a lot of AIO actually ends up cached. 
   The user program just wants the data - 99% of the time it's likely to 
   be there, and the whole point of AIO is to get at it cheaply, but not 
   block if it's not there.

So your scheduling arguments are inane. They totally miss the point. They 
have nothing to do with *anything*.

Ingo: everybody *agrees* that scheduling is cheap. Scheduling isn't the 
issue. Scheduling isn't even needed in the perfect path where the AIO 
didn't need to do any real IO (and that _is_ the path we actually would 
like to optimize most).

So instead of talking about totally irrelevant things, please keep your 
eyes on the ball.

So I claim that the ball is here:

 - cached data (and that is *especially* true of some of the more 
   interesting things we can do with a more generic AIO thing: path 
   lookup, inode filling (stat/fstat) etc usually has hit-rates in the 99% 
   range, but missing even just 1% of the time can be deadly, if the miss 
   costs you a hundred msec of not doing anything else!)

   Do the math. A "stat()" system call generally takes on the order of a 
   couple of microseconds. But if it misses even just 1% of the time (and 
   takes 100 msec when it does that, because there is other IO also 
   competing for the disk arm), ON AVERAGE it takes 1ms. 

   So what you should aim for is improving that number. The cached case 
   should hopefully still be in the microseconds, and the uncached case 
   should be nonblocking for the caller.

 - setup/teardown costs. Both memory and CPU. This is where the current 
   threads simply don't work. The setup cost of doing a clone/exit is 
   actually much higher than the cost of doing the whole operation, most 
   of the time. Remember: caches still work.

 - maintenance. Clearly AIO will always have some special code, but if we 
   can move the special code *away* from filesystems and networking and 
   all the thousands of device drivers, and into core kernel code, we've 
   done something good. And if we can extend it from just pure read/write 
   into just about *anything*, then people will be happy.

So stop blathering about scheduling costs, RT kernels and interrupts. 
Interrupts generally happen a few thousand times a second. This is 
something you want to do a *million* times a second, without any IO 
happening at all except for when it has to.

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion  system calls
  2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
  2007-01-31  8:58   ` Andi Kleen
@ 2007-02-01 20:26   ` bert hubert
  2007-02-01 21:29     ` Zach Brown
  2007-02-04  5:12   ` Davide Libenzi
  2 siblings, 1 reply; 151+ messages in thread
From: bert hubert @ 2007-02-01 20:26 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

On Tue, Jan 30, 2007 at 01:39:45PM -0700, Zach Brown wrote:

> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments.  A fibril is constructed for
> each call.  Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler.  sys_asys_await_completion() then lets userspace collect
> these results.

Zach,

Do you have any userspace code that can be used to get started experimenting
with your fibril based AIO stuff?

I want to try it on from a userspace perspective.

I'm confident I could hack it up from scratch, but I'm sure you'll have some
test code lying around.

I scanned the discussion so far, but I might've missed any references to
userspace code so far.

Thanks!

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion  system calls
  2007-02-01 20:26   ` bert hubert
@ 2007-02-01 21:29     ` Zach Brown
  2007-02-02  7:12       ` bert hubert
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-01 21:29 UTC (permalink / raw)
  To: bert hubert
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

> Do you have any userspace code that can be used to get started  
> experimenting
> with your fibril based AIO stuff?

I only have a goofy little test app so far:

	http://www.zabbo.net/~zab/aio-walk-tree.c

It's not to be taken too seriously :)

> I want to try it on from a userspace perspective.

Frankly, I'm not sure it's ready for that yet.  You're welcome to give  
it a try, but it's early enough that you're sure to hit problems  
almost immediately.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
@ 2007-02-01 21:52       ` Zach Brown
  2007-02-01 22:23         ` Benjamin LaHaise
  2007-02-02 13:22       ` Andi Kleen
  2 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-01 21:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

> let me clarify this: i very much like your AIO patchset in general, in
> the sense that it 'completes' the AIO implementation: finally  
> everything
> can be done via it, greatly increasing its utility and hopefully its
> penetration. This is the most important step, by far.

We violently agree on this :).

> what i dont really like is /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction from kernel threads.
> They are /almost/ kernel threads, but still by being different they
> create a lot of duplication and miss out on a good deal of features
> that kernel threads have naturally.

I might quibble with some of the details, but I understand your  
fundamental concern.  I do.  I don't get up each morning *thrilled*  
by the idea of having to update lockdep, sysrq-t, etc, to understand  
these fibril things :).  The current fibril switch isn't nearly as  
clever as the lock-free task scheduling switch.  It'd be nice if we  
didn't have to do that work to optimize the hell out of it, sure.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-)

:)

> so my suggestions center around the notion of extending kernel threads
> to support the features you find important in fibrils:
>
>> would it be hard to redo your AIO patches based on a pool of plain
>> simple kernel threads?

It'd certainly be doable to throw together a credible attempt to  
service "asys" system call submission with full-on kernel threads.   
That seems like reasonable due diligence to me.  If full-on threads  
are almost as cheap, great.  If fibrils are so much cheaper that they  
seem to warrant investing in, great.

I am concerned about the change in behaviour if we fall back to full  
kernel threads, though.  I really, really, want aio syscalls to  
behave just like sync ones.

Would your strategy be to update the syscall implementations to share  
data in task_struct so that there isn't as significant a change in  
behaviour?  (sharing current->ioprio, instead of just inheriting it,  
for example.).  We'd be betting that there would be few of these and  
that they'd be pretty reasonable to share?

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-02-01 11:13           ` Suparna Bhattacharya
  2007-02-01 19:50             ` Trond Myklebust
@ 2007-02-01 22:18             ` Zach Brown
  2007-02-02  3:35               ` Suparna Bhattacharya
  1 sibling, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-01 22:18 UTC (permalink / raw)
  To: suparna
  Cc: Andi Kleen, linux-kernel, linux-aio, Benjamin LaHaise, Linus Torvalds

> Wooo ...hold on ... I think this is swinging out of perspective :)

I'm sorry, but I don't.  I think using the EIOCBRETRY method in  
complicated code paths requires too much maintenance cost to justify  
its benefits.  We can agree to disagree on that judgement :).

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 21:52       ` Zach Brown
@ 2007-02-01 22:23         ` Benjamin LaHaise
  2007-02-01 22:37           ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Benjamin LaHaise @ 2007-02-01 22:23 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, linux-kernel, linux-aio, Suparna Bhattacharya,
	Linus Torvalds

On Thu, Feb 01, 2007 at 01:52:13PM -0800, Zach Brown wrote:
> >let me clarify this: i very much like your AIO patchset in general, in
> >the sense that it 'completes' the AIO implementation: finally  
> >everything
> >can be done via it, greatly increasing its utility and hopefully its
> >penetration. This is the most important step, by far.
> 
> We violently agree on this :).

There is also the old kernel_thread based method that should probably be 
compared, especially if pre-created threads are thrown into the mix.  Also, 
since the old days, a lot of thread scaling issues have been fixed that 
could even make userland threads more viable.

> Would your strategy be to update the syscall implementations to share  
> data in task_struct so that there isn't as significant a change in  
> behaviour?  (sharing current->ioprio, instead of just inheriting it,  
> for example.).  We'd be betting that there would be few of these and  
> that they'd be pretty reasonable to share?

Priorities cannot be shared, as they have to adapt to the per-request 
priority when we get down to the nitty gritty of POSIX AIO, as otherwise 
realtime issues like keepalive transmits will be handled incorrectly.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 22:23         ` Benjamin LaHaise
@ 2007-02-01 22:37           ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-01 22:37 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ingo Molnar, linux-kernel, linux-aio, Suparna Bhattacharya,
	Linus Torvalds

> Priorities cannot be shared, as they have to adapt to the per-request
> priority when we get down to the nitty gritty of POSIX AIO, as  
> otherwise
> realtime issues like keepalive transmits will be handled incorrectly.

Well, maybe not *blind* sharing.  But something more than the  
disconnect threads currently have with current->ioprio.

Today an existing kernel thread would most certainly ignore a  
sys_ioprio_set() in the submitter and then handle an aio syscall with  
an old current->ioprio.

Something smarter than that is all I'm on about.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-02-01 22:18             ` Zach Brown
@ 2007-02-02  3:35               ` Suparna Bhattacharya
  0 siblings, 0 replies; 151+ messages in thread
From: Suparna Bhattacharya @ 2007-02-02  3:35 UTC (permalink / raw)
  To: Zach Brown
  Cc: Andi Kleen, linux-kernel, linux-aio, Benjamin LaHaise, Linus Torvalds

On Thu, Feb 01, 2007 at 02:18:55PM -0800, Zach Brown wrote:
> >Wooo ...hold on ... I think this is swinging out of perspective :)
> 
> I'm sorry, but I don't.  I think using the EIOCBRETRY method in  
> complicated code paths requires too much maintenance cost to justify  
> its benefits.  We can agree to disagree on that judgement :).

I don't disagree about limitations of the EIOCBRETRY approach, nor do I
recommend it for all sorts of complicated code paths. It is only good as
an approximation for specific blocking points involving idempotent 
behaviour, and I was trying to emphasise that that is *all* it was ever
intended for. I certainly do not see it as a viable path to make all syscalls
asynchronous, or to address all blocking points in filesystem IO.

And I do like the general direction of your approach, only need to think
deeper about the details like how to reduce stack per IO request so this
can scale better. So we don't disagree as much as you think :)

The point where we seem to disagree is that I think there is goodness in
maintaining conceptual clarity about which parts of the operation assume
that they are executing in the original submitter's context. For the IO paths
this is what allows things like readahead and writeback to work and to cluster
operations which may end up going to/from multiple submitters. This means that
if there is some context that is needed thereafter it could be associated
with the IO request (as an argument or in some other manner), so that this
division is still maintained.

Regards
Suparna

> 
> - z
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion  system calls
  2007-02-01 21:29     ` Zach Brown
@ 2007-02-02  7:12       ` bert hubert
  0 siblings, 0 replies; 151+ messages in thread
From: bert hubert @ 2007-02-02  7:12 UTC (permalink / raw)
  To: Zach Brown
  Cc: linux-kernel, linux-aio, Suparna Bhattacharya, Benjamin LaHaise,
	Linus Torvalds

On Thu, Feb 01, 2007 at 01:29:41PM -0800, Zach Brown wrote:

> >I want to try it on from a userspace perspective.
> 
> Frankly, I'm not sure it's ready for that yet.  You're welcome to give  
> it a try, but it's early enough that you're sure to hit problems  
> almost immediately.

I'm counting on it - what I want to taste is if the concept is a match for
the things I want to do.

Thanks!

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-02-01 19:50             ` Trond Myklebust
@ 2007-02-02  7:19               ` Suparna Bhattacharya
  2007-02-02  7:45                 ` Andi Kleen
  0 siblings, 1 reply; 151+ messages in thread
From: Suparna Bhattacharya @ 2007-02-02  7:19 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Zach Brown, Andi Kleen, linux-kernel, linux-aio,
	Benjamin LaHaise, Linus Torvalds

On Thu, Feb 01, 2007 at 11:50:06AM -0800, Trond Myklebust wrote:
> On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
> > 
> > I have said some of this before, but let me try again.
> > 
> > As you already discovered when going down the fibril path, there are
> > two kinds of accesses to current-> state, (1) common state
> > for a given call chain (e.g. journal info etc), and (2) for 
> > various validations against the caller's process (uid, ulimit etc). 
> > 
> > (1) is not an issue when it comes to execution in background threads
> > (the VFS already uses background writeback for example).
> > 
> > As for (2), such checks need to happen upfront at the time of IO submission,
> > so again are not an issue.
> 
> Wrong! These checks can and do occur well after the time of I/O
> submission in the case of remote filesystems with asynchronous writeback
> support.
> 
> Consider, for instance, the cases where the server reboots and loses all
> state. Then there is the case of failover and/or migration events, where
> the entire filesystem gets moved from one server to another, and again
> you may have to recover state, etc...
> 
> > I don't see any other reason why IO paths should be assuming that they are
> > running in the original caller's context, midway through doing the IO. If
> > that were the case background writeouts and readaheads could be fragile as
> > well (or ptrace). The reason it isn't is because of this conceptual division of
> > responsibility.
> 
> The problem with this is that the security context is getting
> progressively more heavy as we add more and more features. In addition
> to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
> keyrings to carry around. Then there is all the context needed to
> support selinux,...

Isn't that kind of information supposed to be captured in nfs_open_context ?
Which is associated with the open file instance ...

I know this has been a traditional issue with network filesystems, and I
haven't kept up with the latest code and decisions in that respect, but how
would you do background writeback if there is an assumption of running in
the context of the original submitter ?

Regards
Suparna

> In the end, you end up recreating most of struct task_struct...
> 
> Cheers
>   Trond
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-02-02  7:19               ` Suparna Bhattacharya
@ 2007-02-02  7:45                 ` Andi Kleen
  0 siblings, 0 replies; 151+ messages in thread
From: Andi Kleen @ 2007-02-02  7:45 UTC (permalink / raw)
  To: suparna
  Cc: Trond Myklebust, Zach Brown, linux-kernel, linux-aio,
	Benjamin LaHaise, Linus Torvalds


> Isn't that kind of information supposed to be captured in nfs_open_context ?
> Which is associated with the open file instance ...

Or a refcounted struct cred. Which would be needed for strict POSIX thread
semantics likely anyways. But there never was enough incentive to go down
that path and it would be likely somewhat slow.

> 
> I know this has been a traditional issue with network filesystems, and I
> haven't kept up with the latest code and decisions in that respect, but how
> would you do background writeback if there is an assumption of running in
> the context of the original submitter ?

AFAIK (Trond will hopefully correct me if I'm wrong) in the special case of 
NFS there isn't much problem because the server does the (passive) authentication
and there is no background writeback from server to client. The client just 
does the usual checks at open time and then forgets about it. The server
threads don't have their own credentials but just check those of others.

I can't think of any cases where you would need to do authentication
in the client for every read() or write().

Overall the arguments for reusing current don't seem to be strong to me.

-Andi

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 20:07     ` Linus Torvalds
@ 2007-02-02 10:49       ` Ingo Molnar
  2007-02-02 15:56         ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Ingo Molnar @ 2007-02-02 10:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So stop blathering about scheduling costs, RT kernels and interrupts. 
> Interrupts generally happen a few thousand times a second. This is 
> something you want to do a *million* times a second, without any IO 
> happening at all except for when it has to.

we might be talking past each other.

i never suggested every aio op should create/destroy a kernel thread!

My only suggestion was to have a couple of transparent kernel threads 
(not fibrils) attached to a user context that does asynchronous 
syscalls! Those kernel threads would be 'switched in' if the current 
user-space thread blocks - so instead of having to 'create' any of them 
- the fast path would be to /switch/ them to under the current 
user-space, so that user-space processing can continue under that other 
thread!

That means that in the 'switch kernel context' fastpath it simply needs 
to copy the blocked thread's user-space ptregs (~64 bytes) to its own 
kernel stack, and then it can do a return-from-syscall without 
user-space noticing the switch! Never would we really see the cost of 
kernel thread creation. We would never see that cost in the fully cached 
case (no other thread is needed then), nor would we see it in the 
blocking-IO case, due to pooling. (there are some other details related 
to things like the FPU context, but you get the idea.)

Let me quote Zach's reply to my suggestions:

| It'd certainly be doable to throw together a credible attempt to 
| service "asys" system call submission with full-on kernel threads. 
| That seems like reasonable due diligence to me.  If full-on threads 
| are almost as cheap, great. If fibrils are so much cheaper that they 
| seem to warrant investing in, great.

that's all i wanted to see being considered!

Please ignore my points about scheduling costs - i only talked about 
them at length because the only fundamental difference between kernel 
threads and fibrils is their /scheduling/ properties. /Not/ the 
setup/teardown costs - those are not relevant /precisely/ because they 
can be pooled and because they happen relatively rarely, compared to the 
cached case. The 'switch to the blocked thread's ptregs' operation also 
involves a context-switch under this design. That's why i was talking 
about scheduling so much: the /only/ true difference between fibrils and 
kernel threads is their /scheduling/.

I believe this is the point where your argument fails:

> - setup/teardown costs. Both memory and CPU. This is where the current
>   threads simply don't work. The setup cost of doing a clone/exit is 
>   actually much higher than the cost of doing the whole operation, 
>   most of the time.

you are comparing apples to oranges - i never said we should 
create/destroy a kernel thread for every async op. That would be insane!

what we need to support asynchronous system-calls is the ability to pick 
up an /already created/ kernel thread from a pool of per-task kernel 
threads and to switch it to under the current user-space and return to 
the user-space stack with that new kernel thread running. (The other, 
blocked kernel thread stays blocked and is returned into the pool of 
'pending' AIO kernel threads.) And this only needs to happen in the 
'cachemiss' case anyway. In the 'cached' case no other kernel thread 
would be involved at all, the current one just falls straight through 
the system-call.
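
To make the cached/cachemiss split concrete, here is a rough userspace model of the hand-off, for illustration only. It is not kernel code and not from any patch: struct pool_thread, struct thread_pool, syscall_return() and syscall_complete() are made-up names, and falling back to ordinary blocking when the pool is empty is an assumption on top of what is described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the saved user-space registers (~64 bytes on i386). */
struct user_regs {
	uint32_t r[16];
};

/* One pre-created kernel thread in the per-task pool (hypothetical). */
struct pool_thread {
	struct user_regs regs;
	int blocked;
	struct pool_thread *next;
};

struct thread_pool {
	struct pool_thread *idle;	/* ready to take over user space */
	struct pool_thread *pending;	/* blocked inside a system call */
};

/*
 * Returns the thread that goes back to user space after the call.
 * 'cachemiss' stands in for "the system call had to block".
 */
static struct pool_thread *syscall_return(struct thread_pool *pool,
					  struct pool_thread *cur,
					  int cachemiss)
{
	struct pool_thread *next;

	if (!cachemiss)
		return cur;		/* cached: falls straight through */

	next = pool->idle;
	if (next == NULL)
		return cur;		/* assumption: pool empty, block as usual */
	pool->idle = next->next;

	next->regs = cur->regs;		/* copy the blocked thread's ptregs */
	next->blocked = 0;

	cur->blocked = 1;		/* park the blocked thread */
	cur->next = pool->pending;
	pool->pending = cur;

	return next;
}

/* The blocked call finished: move the thread back to the idle pool. */
static void syscall_complete(struct thread_pool *pool, struct pool_thread *t)
{
	struct pool_thread **pp = &pool->pending;

	while (*pp != NULL && *pp != t)
		pp = &(*pp)->next;
	assert(*pp == t);		/* must have been pending */
	*pp = t->next;

	t->blocked = 0;
	t->next = pool->idle;
	pool->idle = t;
}
```

In this model nothing is created or destroyed on either path; the only per-miss work is the register copy and two list operations.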

my argument is that the whole notion of cutting this at the kernel stack 
and thread info level and making fibrils in essence a separate 
scheduling entity is wrong, wrong, wrong. Why not use plain kernel 
threads for this?

[ finally, i think you totally ignored my main argument, state machines.
  The networking stack is a full and very nice state machine. It's
  kicked from user-space, and zillions of small contexts (sockets) are
  living on without any of the originating tasks having to be involved.
  So i'm still holding to the fundamental notion that within the kernel
  this form of AIO is a nice but /secondary/ mechanism. If a subsystem
  is able to pull it off, it can implement asynchronicity via a state
  machine - and it will outperform any thread based AIO. Or not. We'll
  see. For something like the VFS i doubt we'll see (and i doubt we
  /want/ to see) a 'native' state-machine implementation.

  this is btw. quite close to the Tux model of doing asynchronous block
  IO and asynchronous VFS events such as asynchronous open(). Tux uses a
  pool of kernel threads to pass blocking work to, while not holding up
  the 'main' thread. But the main Tux speedup comes from having a native
  state machine for all the networking IO. ]

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:02     ` Ingo Molnar
  2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 21:52       ` Zach Brown
@ 2007-02-02 13:22       ` Andi Kleen
  2 siblings, 0 replies; 151+ messages in thread
From: Andi Kleen @ 2007-02-02 13:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

Ingo Molnar <mingo@elte.hu> writes:

> and for one of the most important IO 
> disciplines, networking, that is reality already.

Not 100% -- a few things in TCP/IP at least are blocking still.
Mostly relatively obscure things though.

Also the sockets model is currently incompatible with direct zero-copy RX/TX, 
which needs fixing.

-Andi

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-01 13:19       ` Christoph Hellwig
  2007-02-01 13:52         ` Ingo Molnar
@ 2007-02-02 13:23         ` Andi Kleen
  1 sibling, 0 replies; 151+ messages in thread
From: Andi Kleen @ 2007-02-02 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Linus Torvalds

Christoph Hellwig <hch@infradead.org> writes:
> 
> I tend to agree.  Note that one thing we should be doing one day (not
> only if we want to use it for aio) is to make kernel threads more
> lightweight.  There is a lot of baggage we keep around in task_struct
> and co that only makes sense for threads that have a user space part and
> isn't or shouldn't be needed for a purely kernel-resident thread.

I suspect you will get a lot of this for free from the current namespace
efforts.

-Andi

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 10:49       ` Ingo Molnar
@ 2007-02-02 15:56         ` Linus Torvalds
  2007-02-02 19:59           ` Alan
  2007-02-02 22:21           ` Ingo Molnar
  0 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> My only suggestion was to have a couple of transparent kernel threads 
> (not fibrils) attached to a user context that does asynchronous 
> syscalls! Those kernel threads would be 'switched in' if the current 
> user-space thread blocks - so instead of having to 'create' any of them 
> - the fast path would be to /switch/ them to under the current 
> user-space, so that user-space processing can continue under that other 
> thread!

But in that case, you really do end up with "fibrils" anyway. 

Because those fibrils are what would be the state for the blocked system 
calls when they aren't scheduled.

We may have a few hundred thousand system calls a second (maybe that's not 
actually reasonable, but it should be what we *aim* for), and 99% of them 
will hopefully hit the cache and never need any separate IO, but even if 
it's just 1%, we're talking about thousands of threads.

I do _not_ think that it's reasonable to have thousands of threads state 
around just "in case". Especially if all those threadlets are then 
involved in signals etc - something that they are totally uninterested in. 

I think it's a lot more reasonable to have just the kernel stack page for 
"this was where I was when I blocked". IOW, a fibril-like thing. You need 
some data structure to set up the state *before* you start doing any 
threads at all, because hopefully the operation will be totally 
synchronous, and no secondary thread is ever really needed!

What I like about fibrils is that they should be able to handle the cached 
case well: the case where no "real" scheduling (just the fibril stack 
switches) takes place.

Now, most traditional DB loads would tend to use AIO only when they "know" 
that real IO will take place (the AIO call itself will probably be 
O_DIRECT most of the time). So I suspect that a lot of those users will 
never really have the cached case, but one of my hopes is to be able to do 
exactly the things that we have *not* done well: asynchronous file opens 
and pathname lookups, which is very common in a file server.

If done *really* right, a perfectly normal app could do things like 
asynchronous stat() calls to fill in the readdir results. In other words, 
what *I* would like to see is the ability to have something *really* 
simple like "ls" use this, without it actually being a performance hit 
for the common case where everything is cached.

Have you done "ls -l" on a big uncached directory where the inodes 
are all over the disk lately? You can hear the disk whirr. THAT is the 
kind of "normal user" thing I'd like to be able to fix, and the db case is 
actually secondary. The DB case is much much more limited (ok, so somebody 
pointed out that they want slightly more than just read/write, but 
still.. We're talking "special code".)

> [ finally, i think you totally ignored my main argument, state machines.

I ignored your argument, because it's not really relevant. The fact that 
networking (and TCP in particular) has state machines is because it is a 
packetized environment. Nothing else is. Think pathname lookup etc. They 
are all *fundamentally* environments with a call stack.

So the state machine argument is totally bogus - it results in a 
programming model that simply doesn't match the *normal* setup. You want 
the kernel programming model to appear "linear" even when it isn't, 
because it's too damn hard to think nonlinearly.

Yes, we could do pathname lookup with that kind of insane setup too. But 
it would be HORRID!

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 15:56         ` Linus Torvalds
@ 2007-02-02 19:59           ` Alan
  2007-02-02 20:14             ` Linus Torvalds
  2007-02-05 16:44             ` Zach Brown
  2007-02-02 22:21           ` Ingo Molnar
  1 sibling, 2 replies; 151+ messages in thread
From: Alan @ 2007-02-02 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

This one got shelved while I sorted other things out as it warranted a
longer look. Some comments follow, but firstly can we please bury this
"fibril" name. The constructs Zach is using appear to be identical to
co-routines, and they've been called that in computer science literature
for fifty years. They are one of the great and somehow forgotten ideas.
(and I admit I've used them extensively in past things where it's
wonderful for multi-player gaming so I'm a convert already).

The stuff however isn't as free as you make out. Current kernel logic
knows about various things being "safe" but with fibrils you have to
address additional questions such as "What happens if I issue an I/O and
change priority". You also have an 800lb gorilla hiding behind a tree
waiting for you in privilege and permission checking.

Right now current->*u/gid is safe across a syscall start to end, with an
asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
this, in fact we'd be able to do some of the utterly moronic poxix thread
uid handling in kernel space if we did, just that it isn't free. We have
locking rules defined by the magic serializing construct called
"the syscall" and you break those.

I'd expect the odd other gorilla waiting to mug you as well and the ones
nobody has thought of will be the worst 8)

The number of co-routines and stacks can be dealt with two ways - you use
small stacks allocated when you create a fibril, or you grab a page, use
separate IRQ stacks and either fail creation with -ENOBUFS etc which
drops work on user space, or block (for which cases ??) which also means
an overhead on co-routine exits. That can be tunable, for embedded easily
tuned right down.

Traditional co-routines have clear notions of being able to create a
co-routine, stack them and fire up specific ones. In part this is done
because many things expressed in this way know what to fire up next. It's
also a very clean way to express driver problems with a lot of state.

Essentially a co-routine is simply making "%esp" roughly the same as
the C++ world's "this".

You get some other funny things from co-routines which are very powerful,
very dangerous, or plain insane depending upon your view of life. One big
one is the ability for real men (and women) to do stuff like this,
because you don't need to keep the context attached to the same task.

	send_reset_command(dev);
	wait_for_irq_event(dev->irq);
	/* co-routine continues in IRQ context here */
	clean_up_reset_command(dev);
	exit_irq_event();
	/* co-routine continues out of IRQ context here */
	send_identify_command(dev);

Notice we just dealt with all the IRQ stack problems the moment an IRQ is
a co-routine transfer 8)

Ditto with timers, although for the kernel that might not be smart as we
have a lot of timers.

Less insanely you can create a context, start doing stuff in it and then
pass it to someone else, local variables, state and all. This one is
actually rather useful for avoiding a lot of the 'D' state crap in the
kernel.

For example we have driver code that sleeps uninterruptibly because it's
too hard to undo the mess and get out of the current state if it is
interrupted. In the world of sending other people co-routines you just do
this

	coroutine_set(MUST_COMPLETE);

and in exit

	foreach(coroutine)
		if(coroutine->flags & MUST_COMPLETE)
			inherit_coroutine(init, coroutine);

and obviously you don't pass any over that will then not do the right
thing before accessing user space (well unless implementing
'read_for_someone_else()' or other strange syscalls - like ptrace...)
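
Spelled out as plain C, and purely as a sketch, the exit-time hand-over could look like this. struct task, struct coroutine and exit_task() are hypothetical stand-ins; only MUST_COMPLETE and inherit_coroutine() come from the fragment above, and what happens to the coroutines that are not handed over is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MUST_COMPLETE 0x1	/* flag name taken from the fragment above */

struct coroutine {
	unsigned int flags;
	struct coroutine *next;
};

/* Hypothetical stand-in for a task that owns a list of coroutines. */
struct task {
	struct coroutine *coroutines;
};

/* Hand one coroutine over to a new owner (e.g. init). */
static void inherit_coroutine(struct task *new_owner, struct coroutine *co)
{
	assert(new_owner != NULL && co != NULL);
	co->next = new_owner->coroutines;
	new_owner->coroutines = co;
}

/*
 * On exit, pass every MUST_COMPLETE coroutine to 'init' and unlink the
 * rest (a real implementation would cancel and free those).
 * Returns the number of coroutines handed over.
 */
static int exit_task(struct task *init, struct task *dying)
{
	struct coroutine *co = dying->coroutines;
	struct coroutine *next;
	int handed_over = 0;

	dying->coroutines = NULL;
	for (; co != NULL; co = next) {
		next = co->next;	/* save before the link is rewritten */
		if (co->flags & MUST_COMPLETE) {
			inherit_coroutine(init, co);
			handed_over++;
		} else {
			co->next = NULL;
		}
	}
	return handed_over;
}
```

The point of the sketch is only that the exiting task never has to sleep in 'D' state waiting for the flagged work; it walks the list once and leaves.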

Other questions really relate to the scheduling - Zach, do you intend
schedule_fibrils() to be a call that code would make, or just called from
schedule()?


Linus will now tell me I'm out of my tree...


Alan (who used to use Co-routines in real languages on 36bit
computers with 9bit bytes before learning C)

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 19:59           ` Alan
@ 2007-02-02 20:14             ` Linus Torvalds
  2007-02-02 20:58               ` Davide Libenzi
  2007-02-02 21:30               ` Alan
  2007-02-05 16:44             ` Zach Brown
  1 sibling, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 20:14 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Alan wrote:
>
> This one got shelved while I sorted other things out as it warranted a
> longer look. Some comments follow, but firstly can we please bury this
> "fibril" name. The constructs Zach is using appear to be identical to
> co-routines, and they've been called that in computer science literature
> for fifty years. They are one of the great and somehow forgotten ideas.
> (and I admit I've used them extensively in past things where it's
> wonderful for multi-player gaming so I'm a convert already).

Well, they are indeed coroutines, but they are coroutines in the same 
sense any "CPU scheduler" ends up being a coroutine.

They are NOT the generic co-routine that some languages support natively. 
So I think trying to call them coroutines would be even more misleading 
than calling them fibrils.

In other words the whole *point* of the fibril is that you can do 
"coroutine-like stuff" while using a "normal functional linear programming 
paradigm".

Wouldn't you agree?

(I love the concept of coroutines, but I absolutely detest what the code 
ends up looking like. There's a good reason why people program mostly in 
linear flow: that's how people think consciously - even if it's obviously 
not how the brain actually works).

And we *definitely* don't want to have a coroutine programming interface 
in the kernel. Not in C.

> The stuff however isn't as free as you make out. Current kernel logic
> knows about various things being "safe" but with fibrils you have to
> address additional questions such as "What happens if I issue an I/O and
> change priority". You also have an 800lb gorilla hiding behind a tree
> waiting for you in privilege and permission checking.

This is why I think it should be 100% clear that things happen in process 
context. That just answers everything. If you want to synchronize with 
async events and change IO priority, you should do exactly that:

	wait_for_async();
	ioprio(newpriority);

and that "solves" that problem. Leave it to user space.

> Right now current->*u/gid is safe across a syscall start to end, with an
> asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
> this, in fact we'd be able to do some of the utterly moronic poxix thread
> uid handling in kernel space if we did, just that it isn't free. We have
> locking rules defined by the magic serializing construct called
> "the syscall" and you break those.

I agree. As mentioned, we probably will have fallout. 

> The number of co-routines and stacks can be dealt with two ways - you use
> small stacks allocated when you create a fibril, or you grab a page, use
> separate IRQ stacks and either fail creation with -ENOBUFS etc which
> drops work on user space, or block (for which cases ??) which also means
> an overhead on co-routine exits. That can be tunable, for embedded easily
> tuned right down.

Right. It should be possible to just say "use a max parallelism factor of 
5", and if somebody submits a hundred AIO calls and they all block, when 
it hits #6, it will just do it synchronously.
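
As a toy model of that fallback (illustration only; submit(), complete_one(), MAX_PARALLEL and the would_block flag are hypothetical names, not anything in Zach's patches): a call that does not block finishes inline, a blocking call takes an async slot if one is free, and once the limit is reached the caller simply waits synchronously.

```c
#include <assert.h>

#define MAX_PARALLEL 5	/* hypothetical per-task parallelism limit */

enum how { DONE_INLINE, QUEUED_ASYNC, DONE_SYNC };

struct submitter {
	int in_flight;	/* blocked calls currently holding a slot */
};

/*
 * Decide how one submitted call is handled.  'would_block' stands in
 * for "the call missed the cache and has to wait for IO".
 */
static enum how submit(struct submitter *s, int would_block)
{
	if (!would_block)
		return DONE_INLINE;	/* cached: no extra stack needed */
	if (s->in_flight < MAX_PARALLEL) {
		s->in_flight++;		/* park it, the caller keeps going */
		return QUEUED_ASYNC;
	}
	return DONE_SYNC;		/* limit hit: the caller waits right here */
}

/* Called when a parked call finishes and frees its slot. */
static void complete_one(struct submitter *s)
{
	assert(s->in_flight > 0);
	s->in_flight--;
}
```

With a hundred blocking submissions and nothing completing, the first five would come back QUEUED_ASYNC and every later one DONE_SYNC until a slot is released.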

Basically, what I'm hoping can come out of this (and this is a simplistic 
example, but perhaps exactly *because* of that it hopefully also shows 
that we can actually make *simple* interfaces for complex asynchronous 
things):

	struct one_entry *prev = NULL;
	struct dirent *de;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(..);

		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();
	.. Ta-daa! All done ..

and it *should* allow us to do all the stat lookup asynchronously.

Done right, this should basically be no slower than doing it with a real 
stat() if everything was cached. That would kind of be the holy grail 
here.

> You get some other funny things from co-routines which are very powerful,
> very dangerous, or plain insane

You forgot "very hard to think about". 

We DO NOT want coroutines in general. It's clever, but it's
 (a) impossible to do without language support that C doesn't have, or 
     some really really horrid macro constructs that really only work for 
     very specific and simple cases.
 (b) very non-intuitive unless you've worked with coroutines a lot (and 
     almost nobody has)

> Linus will now tell me I'm out of my tree...

I don't think you're wrong in theory, I just think that in practice, 
within the confines of (a) existing code, (b) existing languages, and (c) 
existing developers, we really REALLY don't want to expose coroutines as 
such.

But if you wanted to point out that what we want to do is get the 
ADVANTAGES of coroutines, without actually having to program them as such, 
then yes, I agree 100%. But we shouldn't call them coroutines, because the 
whole point is that as far as the user interface is concerned, they don't 
look like that. In the kernel, they just look like normal linear 
programming.

		Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:14             ` Linus Torvalds
@ 2007-02-02 20:58               ` Davide Libenzi
  2007-02-02 21:09                 ` Linus Torvalds
  2007-02-02 21:30               ` Alan
  1 sibling, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-02 20:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1807 bytes --]

On Fri, 2 Feb 2007, Linus Torvalds wrote:

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
> 
> You forgot "very hard to think about". 
> 
> We DO NOT want coroutines in general. It's clever, but it's
>  (a) impossible to do without language support that C doesn't have, or 
>      some really really horrid macro constructs that really only work for 
>      very specific and simple cases.
>  (b) very non-intuitive unless you've worked with coroutines a lot (and 
>      almost nobody has)

Actually, coroutines are not too bad to program once you have a 
total-coverage async scheduler to run them. The attached (very sketchy) 
example uses libpcl ( http://www.xmailserver.org/libpcl.html ) and epoll 
as scheduler (but here you can really use anything). You can implement 
coroutines in many ways, from C preprocessor macros up to anything, but in 
the libpcl case they are simply switched stacks. Like fibrils are supposed 
to be. The problem is that in order to make a real-life example of 
coroutine-based application work, you need everything that can put you at 
sleep (syscalls or any external library call you have no control on) 
implemented in an async way. And what I ended up doing is exactly what Zab 
did inside the kernel. In my case a dynamic pool of (userspace) threads 
servicing any non-native potentially pre-emptive call, and signaling the 
result to a pollable fd (pipe in my case) that is integrated in the epoll 
(poll/select whatever) scheduler.
I personally find Zab's idea a really good one, since it allows for generic 
kernel async implementation, w/out the burden of dirtying kernel code 
paths with AIO knowledge. Be it fibrils or real kthreads, it is IMO 
definitely worth a very close look.
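
A stripped-down sketch of the "blocking call in a helper thread, result signalled through a pipe" arrangement described above. It is not the actual code: it starts one thread per job instead of keeping a dynamic pool, and struct job, job_submit(), job_wait() and blocking_call() are names made up for the example. The read end of the pipe is what would be registered with epoll (or poll/select).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

struct job {
	long (*fn)(void *);	/* the potentially blocking call */
	void *arg;
	long result;
	int done_fd;		/* write end of the completion pipe */
};

/* Stand-in for a call that might sleep; here it only doubles its input. */
static long blocking_call(void *arg)
{
	return 2 * *(long *)arg;
}

/* Helper thread: run the call, then announce completion on the pipe. */
static void *job_thread(void *p)
{
	struct job *j = p;
	ssize_t n;

	j->result = j->fn(j->arg);
	/* a pointer-sized write is well below PIPE_BUF, so it is atomic */
	n = write(j->done_fd, &j, sizeof(j));
	assert(n == (ssize_t)sizeof(j));
	return NULL;
}

static int job_submit(struct job *j, pthread_t *tid)
{
	return pthread_create(tid, NULL, job_thread, j);
}

/*
 * Scheduler side: wait for the pipe to become readable and pick up one
 * finished job.  Returns NULL on timeout or error.
 */
static struct job *job_wait(int read_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
	struct job *j = NULL;

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return NULL;
	if (read(read_fd, &j, sizeof(j)) != (ssize_t)sizeof(j))
		return NULL;
	return j;
}
```

A coroutine that needs the blocking call would submit the job and yield, and the scheduler would switch back to it once job_wait() hands the job back.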




- Davide


[-- Attachment #2: Type: TEXT/x-csrc, Size: 2147 bytes --]


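#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <pcl.h>		/* libpcl: co_create(), co_call(), co_resume(), co_exit() */

/*
 * Assumed to be defined elsewhere in the full program; the sketch uses
 * them without declaring them.  The stack size is an arbitrary example.
 */
#define STACKSIZE (64 * 1024)
extern int kdpfd;			/* epoll descriptor */
extern int maxfds;			/* capacity of events[] */
extern struct epoll_event *events;	/* buffer passed to epoll_wait() */

struct eph_conn {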
	int sfd;
	unsigned int events, revents;
	coroutine_t co;
};



int eph_new_conn(int sfd, void *func) {
	struct eph_conn *conn;
	struct epoll_event ev;

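	conn = (struct eph_conn *) malloc(sizeof(struct eph_conn));
	if (conn == NULL)
		return -1;

	conn->sfd = sfd;
	/* start with no events of interest and none reported */
	conn->events = conn->revents = 0;
	conn->co = co_create(func, conn, NULL, STACKSIZE);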

	ev.events = 0;
	ev.data.ptr = conn;
	epoll_ctl(kdpfd, EPOLL_CTL_ADD, sfd, &ev);

	co_call(conn->co);

	return 0;
}

void eph_exit_conn(struct eph_conn *conn) {
	struct epoll_event ev;

	epoll_ctl(kdpfd, EPOLL_CTL_DEL, conn->sfd, &ev);
	co_exit();
}

int eph_connect(struct eph_conn *conn, const struct sockaddr *serv_addr, socklen_t addrlen) {

	if (connect(conn->sfd, serv_addr, addrlen) == -1) {
		if (errno != EWOULDBLOCK && errno != EINPROGRESS)
			return -1;
		co_resume();
		if (conn->revents & (EPOLLERR | EPOLLHUP))
			return -1;
	}
	return 0;
}

int eph_read(struct eph_conn *conn, void *buf, int nbyte) {
	int n;

	while ((n = read(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return n;
}

int eph_write(struct eph_conn *conn, void const *buf, int nbyte) {
	int n;

	while ((n = write(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return n;
}

int eph_accept(struct eph_conn *conn, struct sockaddr *addr, int *addrlen) {
	int sfd;

	while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		co_resume();
	}
	return sfd;
}

int eph_scheduler(int loop, long timeout) {
	int i, nfds;
	struct eph_conn *conn;
	struct epoll_event *cevents;

	do {
		nfds = epoll_wait(kdpfd, events, maxfds, timeout);

		for (i = 0, cevents = events; i < nfds; i++, cevents++) {
			conn = cevents->data.ptr;
			conn->revents = cevents->events;
			if (conn->revents & conn->events)
				co_call(conn->co);
		}
	} while (loop);

	return 0;
}


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:58               ` Davide Libenzi
@ 2007-02-02 21:09                 ` Linus Torvalds
  0 siblings, 0 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 21:09 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Alan, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Davide Libenzi wrote:
> 
> Actually, coroutines are not too bad to program once you have a 
> total-coverage async scheduler to run them.

No, no, I don't disagree at all. In fact, I agree emphatically.

It's just that you need the scheduler to run them, in order to not "see" 
them as coroutines. Then, you can program everything *as*if* it was just a 
regular declarative linear language with multiple threads.

And that gets us the same programming interface as we always have, and 
people can forget about the fact that in a very real sense, they are using 
coroutines with the scheduler just keeping track of it all for them.

After all, that's what we do between processes *anyway*. You can 
technically see the kernel as one big program that uses coroutines and the 
scheduler just keeping track of every coroutine instance. It's just that I 
doubt that any kernel programmer really thinks in those terms. You *think* 
in terms of "threads".

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30               ` Alan
@ 2007-02-02 21:30                 ` Linus Torvalds
  2007-02-02 22:42                   ` Ingo Molnar
  2007-02-02 22:48                   ` Alan
  0 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 21:30 UTC (permalink / raw)
  To: Alan
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Fri, 2 Feb 2007, Alan wrote:
> 
> The brown and sticky will hit the rotating air impeller pretty hard if you
> are not very careful about how that ends up scheduled

Why do you think that?

With cooperative scheduling (like the example Zach posted), there is 
absolutely no "brown and sticky" wrt any CPU usage. Which is why 
cooperative scheduling is a *good* thing. If you want to blow up your 
1024-node CPU cluster, you'd do it with "real threads".

Also, with sane default limits of fibrils per process (say, in the 5-10 
range), it also ends up being good for IO. No "insane" IO bombs, but an 
easy way for users to just get a reasonable amount of IO parallelism 
without having to use threading (which is hard).

So, best of both worlds.

Yes, *of*course* you want to have limits on outstanding work. And yes, a 
database server would set those limits much higher ("Only a thousand 
outstanding IO requests? Can we raise that to ten thousand, please?") than 
a regular process ("default: 5, and the super-user can raise it for you if 
you're good").
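
A minimal sketch of such a per-process limit, purely for illustration. The
names fibril_budget, fibril_try_start() and fibril_finish() are hypothetical
and do not come from the patch series; the default of 5 is the figure used
above, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define FIBRIL_DEFAULT_LIMIT 5	/* the "default: 5" figure from the text */

/* Hypothetical per-process accounting of outstanding async calls. */
struct fibril_budget {
	unsigned int in_flight;	/* async calls currently outstanding */
	unsigned int limit;	/* raised by a privileged caller if needed */
};

static void fibril_budget_init(struct fibril_budget *b)
{
	b->in_flight = 0;
	b->limit = FIBRIL_DEFAULT_LIMIT;
}

/* Returns true if another async call may start, false if the limit is hit. */
static bool fibril_try_start(struct fibril_budget *b)
{
	if (b->in_flight >= b->limit)
		return false;
	b->in_flight++;
	return true;
}

/* Called when an async call completes and its slot is released. */
static void fibril_finish(struct fibril_budget *b)
{
	assert(b->in_flight > 0);
	b->in_flight--;
}
```

A caller that gets false back would either run the call synchronously or
wait for an outstanding one to finish; which of the two is a policy choice
the thread does not settle.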

But there really shouldn't be any downsides.

(Of course, there will be downsides. I'm sure there will be. But I don't 
see any really serious and obvious ones).

> Other minor evil. If we use fibrils we need to be careful we
> know in advance how many fibrils an operation needs so we don't deadlock
> on them in critical places like writeout paths when we either hit the per
> task limit or we have no page for another stack.

Since we'd only create fibrils on a system call entry level, and system 
calls are independent, how would you do that anyway?

Once a fibril has been created, it will *never* depend on any other fibril 
resources ever again. At least not in any way that any normal non-fibril 
call wouldn't already do as far as I can see.

		Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 20:14             ` Linus Torvalds
  2007-02-02 20:58               ` Davide Libenzi
@ 2007-02-02 21:30               ` Alan
  2007-02-02 21:30                 ` Linus Torvalds
  1 sibling, 1 reply; 151+ messages in thread
From: Alan @ 2007-02-02 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> They are NOT the generic co-routine that some languages support natively. 
> So I think trying to call them coroutines would be even more misleading 
> than calling them fibrils.

It's actually pretty damned close to the Honeywell B co-routine package, with
a kernel twist to be honest.

> ends up looking like. There's a good reason why people program mostly in 
> linear flow: that's how people think consciously - even if it's obviously 
> not how the brain actually works).

The IRQ example below is an example of how it linearizes - so it cuts
both ways like most tools, admittedly one of the blades is at the handle
end in this case ...

> Basically, what I'm hoping can come out of this (and this is a simplistic 
> example, but perhaps exactly *because* of that it hopefully also shows 
> that we can actually make *simple* interfaces for complex asynchronous 
> things):
> 
> 	struct one_entry *prev = NULL;
> 	struct dirent *de;
> 
> 	while ((de = readdir(dir)) != NULL) {
> 		struct one_entry *entry = malloc(..);
> 
> 		/* Add it to the list, fill in the name */
> 		entry->next = prev;
> 		prev = entry;
> 		strcpy(entry->name, de->d_name);
> 
> 		/* Do the stat lookup async */
> 		async_stat(de->d_name, &entry->stat_buf);
> 	}
> 	wait_for_async();

The brown and sticky will hit the rotating air impeller pretty hard if you
are not very careful about how that ends up scheduled. It's one thing to
exploit the ability to pull all the easy lookups out in advance, and
another having created all the parallelism to turn it into sane disk
scheduling and wakeups without scaling hit. But you do at least have the
opportunity to exploit it I guess.

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
> 
> You forgot "very hard to think about". 

I'm not sure handing a fibril off to another task is that hard to think
about. It's not easy to turn it around as an async_exit() keeping the
other fibrils around because of the mass of rules and behaviours tied to
process exit but it's perhaps not impossible.

Other minor evil. If we use fibrils we need to be careful we
know in advance how many fibrils an operation needs so we don't deadlock
on them in critical places like writeout paths when we either hit the per
task limit or we have no page for another stack.

Alan

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 15:56         ` Linus Torvalds
  2007-02-02 19:59           ` Alan
@ 2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
                               ` (2 more replies)
  1 sibling, 3 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-02 22:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> > 
> > My only suggestion was to have a couple of transparent kernel threads 
> > (not fibrils) attached to a user context that does asynchronous 
> > syscalls! Those kernel threads would be 'switched in' if the current 
> > user-space thread blocks - so instead of having to 'create' any of them 
> > - the fast path would be to /switch/ them to under the current 
> > user-space, so that user-space processing can continue under that other 
> > thread!
> 
> But in that case, you really do end up with "fibrils" anyway.
> 
> Because those fibrils are what would be the state for the blocked 
> system calls when they aren't scheduled.
> 
> We may have a few hundred thousand system calls a second (maybe that's 
> not actually reasonable, but it should be what we *aim* for), and 99% 
> of them will hopefully hit the cache and never need any separate IO, 
> but even if it's just 1%, we're talking about thousands of threads.
> 
> I do _not_ think that it's reasonable to have thousands of threads 
> state around just "in case". Especially if all those threadlets are 
> then involved in signals etc - something that they are totally 
> uninterested in.
> 
> I think it's a lot more reasonable to have just the kernel stack page 
> for "this was where I was when I blocked". IOW, a fibril-like thing. 

ok, i think i noticed another misunderstanding. The kernel thread based 
scheme i'm suggesting would /not/ 'switch' to another kernel thread in 
the cached case, by default. It would just execute in the original 
context (as if it were a synchronous syscall), and the switch to a 
kernel thread from the pool would only occur /if/ the context is about 
to block. (this 'switch' thing would be done by the scheduler) 
User-space gets back an -EAIO error code immediately and transparently - 
but already running under the new kernel thread.

i.e. in the fully cached case there would be no scheduling at all - in 
fact no thread pool is needed at all.

regarding cost:

the biggest memory resource cost of a kernel thread (assuming it has no 
real user-space context) /is/ its kernel stack page, which is 4K or 8K. 
The task struct takes ~1.5K. Once we have a ready kernel thread around, 
it's quite cheap to 'flip' it to under any arbitrary user-space context: 
change its thread_info->task pointer to the user-space context's task 
struct, copy the mm pointer, the fs pointer to the "worker thread", 
switch the thread_info, update ptregs - done. Hm?

Note: such a 'flip' would only occur when the original context blocks, 
/not/ on every async syscall.

regarding CPU resource costs, i dont think there should be significant 
signal overhead, because the original task is still only one instance, 
and the kernel thread that is now running with the blocked kernel stack 
is not part of the signal set. (Although it might make sense to make 
such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesnt even have to be per-task, it can be 
a natural per-CPU thing - and its size will grow/shrink [with a low 
update frequency] depending on how much AIO parallelism there is in the 
workload. (But it can also be strictly per-user-context - to make sure 
that a proper ->mm ->fs, etc. is set up and that when the async system 
calls execute they have all the right context info.)

and note the immediate scheduling benefits: if an app (say like 
OpenOffice) is single-threaded but has certain common ops coded as async 
syscalls, then if any of those syscalls blocks then it could utilize 
/more than one/ CPU. I.e. we could 'spread' a single-threaded app's 
processing to multiple cores/hardware-threads /without/ having to 
multi-thread the app in an intrusive way. I.e. this would be a 
finegrained threading of syscalls, executed as coroutines in essence. 
With fibrils all sorts of scheduling limitations occur and no 
parallelism is possible.

in fact an app could also /trigger/ the execution of a syscall in a 
different context - to create parallelism artificially - without any 
blocking event. So we could do:

  cookie1 = sys_async(sys_read, params);
  cookie2 = sys_async(sys_write, params);

  [ ... calculation loop ... ]

  wait_on_async_syscall(cookie1);
  wait_on_async_syscall(cookie2);

or something like that. Without user-space having to create threads 
itself, etc. So basically, we'd make kernel threads more useful, and 
we'd make threading safer - by only letting syscalls thread.
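
For illustration only, the submit/wait pattern above can be imitated in
user space with pthreads. This is a sketch under stated assumptions, not the
proposed kernel interface: async_submit() and async_wait() are hypothetical
names, every call gets its own thread (the proposal would only switch to
another thread when the call blocks), and sum_to() stands in for a syscall
that takes a while.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names throughout; the cookie is simply a pointer here. */
typedef long (*async_fn_t)(void *arg);

struct async_call {
	pthread_t thread;
	async_fn_t fn;
	void *arg;
	long result;
};

static void *async_trampoline(void *p)
{
	struct async_call *call = p;

	call->result = call->fn(call->arg);
	return NULL;
}

/* Start fn(arg) in another context; returns a cookie, or NULL on failure. */
static struct async_call *async_submit(async_fn_t fn, void *arg)
{
	struct async_call *call = malloc(sizeof(*call));

	if (!call)
		return NULL;
	call->fn = fn;
	call->arg = arg;
	call->result = 0;
	if (pthread_create(&call->thread, NULL, async_trampoline, call) != 0) {
		free(call);
		return NULL;
	}
	return call;
}

/* Block until the call behind the cookie has finished; returns its result. */
static long async_wait(struct async_call *call)
{
	long result;
	int err = pthread_join(call->thread, NULL);

	assert(err == 0);
	(void)err;
	result = call->result;
	free(call);
	return result;
}

/* Stand-in workload: sum of 1..n, where arg points to n. */
static long sum_to(void *arg)
{
	long n = *(long *)arg;
	long i, sum = 0;

	for (i = 1; i <= n; i++)
		sum += i;
	return sum;
}
```

The caller submits both calls, runs its calculation loop, and then calls
async_wait() on each cookie, which mirrors the ordering in the example above.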

> What I like about fibrils is that they should be able to handle the 
> cached case well: the case where no "real" scheduling (just the fibril 
> stack switches) takes place.

the cached case (when a system call would not block at all) would not 
necessitate any switch to another kernel thread at all - the task just 
executes its system call as if it were synchronous!

that's the nice thing: we can do this switch-to-another-kernel-thread 
magic thing right in the scheduler when we block - and the switched-to 
thread will magically return to user-space (with a -EAIO return code) as 
if nothing happened (while the original task blocks). I.e. under this 
scheme i'm suggesting we have /zero/ setup cost in the cached case. The 
optimistic case just falls through and switches to nothing else. Any 
switching cost only occurs in the slowpath - and even that cost is very 
low.

once a kernel thread that ran off with the original stack finishes the 
async syscall and wants to return the return code, this can be gathered 
via a special return-code ringbuffer that notifies finished syscalls. (A 
magic cookie is associated to every async syscall.)
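
A rough sketch of what such a return-code ring could look like, assuming one
producer and one consumer and leaving out all locking and memory-ordering
concerns. The layout and the names (completion_ring, ring_push(),
ring_pop()) are hypothetical; nothing like this is specified in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8	/* power of two, so unsigned index wraparound is safe */

struct completion {
	unsigned long cookie;	/* identifies the async syscall */
	long ret;		/* its return code */
};

struct completion_ring {
	struct completion slots[RING_SIZE];
	unsigned int head;	/* total entries pushed (producer side) */
	unsigned int tail;	/* total entries popped (consumer side) */
};

/* Producer: record a finished syscall. Returns false if the ring is full. */
static bool ring_push(struct completion_ring *r, unsigned long cookie, long ret)
{
	assert(r != NULL);
	if (r->head - r->tail == RING_SIZE)
		return false;
	r->slots[r->head % RING_SIZE].cookie = cookie;
	r->slots[r->head % RING_SIZE].ret = ret;
	r->head++;
	return true;
}

/* Consumer: fetch the oldest completion. Returns false if the ring is empty. */
static bool ring_pop(struct completion_ring *r, struct completion *out)
{
	assert(r != NULL && out != NULL);
	if (r->head == r->tail)
		return false;
	*out = r->slots[r->tail % RING_SIZE];
	r->tail++;
	return true;
}
```

A full ring is the interesting case: the producer has to either block or
drop into some overflow path, which is one more reason to bound the number
of outstanding async calls.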

> So the state machine argument is totally bogus - it results in a 
> programming model that simply doesn't match the *normal* setup. You 
> want the kernel programming model to appear "linear" even when it 
> isn't, because it's too damn hard to think nonlinearly.
>
> Yes, we could do pathname lookup with that kind of insane setup too. 
> But it would be HORRID!

yeah, but i guess not nearly as horrid as writing a new OS from scratch 
;-)

seriously, i very much think and agree that programming state machines 
is hard and not desired in most of the kernel. But it can be done, and 
sometimes (definitely not in the common case) it's /cleaner/ than 
functional programming. I've programmed an HTTP and an FTP in-kernel 
server via a state machine and it worked better than i initially 
expected. It needs different thinking but there /are/ people around with 
that kind of thinking, so we just cannot exclude the possibility. [ It's 
just that such people usually dedicate their brain to mental 
fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state 
> machines is because it is a packetized environment.

rough ballpark figures: for things like webserving or fileserving (or 
mailserving), networking sockets are the reason for context-blocking 
events in 90% of the cases (mostly due to networking latency). 9% of the 
blocking happens due to plain block IO, and 1% happens due to VFS 
metadata (inode, directory, etc.) blocking.

( in Tux i had to handle /all/ of these sources of blocking because even
  1% kills your performance if you do a hundred thousand requests per
  second - but in terms of design weight, networking is pretty damn
  important. )

and interestingly, modern IO frameworks tend to gravitate towards a 
packetized environment as well. I.e. i dont think state machines are 
/that/ unimportant.

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30                 ` Linus Torvalds
@ 2007-02-02 22:42                   ` Ingo Molnar
  2007-02-02 23:01                     ` Linus Torvalds
  2007-02-02 22:48                   ` Alan
  1 sibling, 1 reply; 151+ messages in thread
From: Ingo Molnar @ 2007-02-02 22:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> With cooperative scheduling (like the example Zach posted), there is 
> absolutely no "brown and sticky" wrt any CPU usage. Which is why 
> cooperative scheduling is a *good* thing. If you want to blow up your 
> 1024-node CPU cluster, you'd do it with "real threads".

i'm not worried about the 1024-node cluster case.

i also fully agree that in some cases /not/ going parallel and having a 
cooperative relationship between execution contexts can be good.

but if the application /has/ identified fundamental parallelism, we 
/must not/ shut that parallelism off by /designing/ this interface to 
use the fibril thing which is a limited cooperative, single-CPU entity. 
I cannot over-emphasise how wrong that feels to me. Cooperativeness 
isnt bad, but it should be an /optional/ thing, not hardcoded into the 
design!

If the application tells us: "gee, you can execute this syscall in 
parallel!" (which AIO /is/ about after all), and if we have idle 
cores/hardware-threads nearby, it would be the worst thing to not 
execute that in parallel if the syscall blocks or if the app asks for 
that syscall to be executed in parallel right away, even in the cached 
case.

if we were in the 1.2 days i might agree that fibrils are perhaps easier 
on the kernel, but today the Linux kernel doesnt even use this 
cooperativeness anywhere. We have all the hard work done already. The 
full kernel is threaded. We can execute arbitrary number of kernel 
contexts off a single user context, we can execute parallel syscalls and 
we scale very well doing so.

all that is needed is this new facility and some scheduler hacking to 
enable "transparent, kernel-side threading". That enables AIO, 
coroutines and more. It brings threading to a whole new level, because 
it makes it readily and gradually accessible to single-threaded apps 
too.

[ and if we are worried about the 1024 CPU cluster (or about memory use) 
  then we could limit such threads to only overlap in a limited number, 
  etc. Just like we'd have to do with fibrils anyway. But with fibrils 
  we /force/ single-threadedness, which, i'm quite sure, is just about 
  the worst thing we can do. ]

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 21:30                 ` Linus Torvalds
  2007-02-02 22:42                   ` Ingo Molnar
@ 2007-02-02 22:48                   ` Alan
  1 sibling, 0 replies; 151+ messages in thread
From: Alan @ 2007-02-02 22:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> > The brown and sticky will hit the rotating air impeller pretty hard if you
> > are not very careful about how that ends up scheduled
> 
> Why do you think that?
> 
> With cooperative scheduling (like the example Zach posted), there is 
> absolutely no "brown and sticky" wrt any CPU usage. Which is why 
> cooperative scheduling is a *good* thing. If you want to blow up your 
> 1024-node CPU cluster, you'd do it with "real threads".

You end up with a lot more things running asynchronously. In the current
world we see a series of requests for attributes and hopefully we do
readahead and all is neatly ordered. If fibrils are not ordered the same
way then we could make it worse as we might not pick the right readahead
for example.

> Since we'd only create fibrils on a system call entry level, and system 
> calls are independent, how would you do that anyway?

If we stick to that limit it ought to be ok. We've been busy slapping
people who call sys_*, except for internal magic like kernel_thread

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
@ 2007-02-02 22:49             ` Linus Torvalds
  2007-02-02 23:55               ` Ingo Molnar
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-05 17:02             ` Zach Brown
  2 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 22:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> Note: such a 'flip' would only occur when the original context blocks, 
> /not/ on every async syscall.

Right.

So can you take a look at Zach's fibril idea again? Because that's exactly 
what it does. It basically sets a flag, saying "flip to this when you 
block or yield". Of course, it's a bit bigger than just a flag, since it 
needs to describe what to flip to, but that's the basic idea.

Now, if you want to make fibrils *also* then actually use a separate 
thread, that's an extension. But you were arguing as if they should use 
threads to begin with, and that sounds stupid. Now you seem to retract it, 
since you say "only if you need to block".

THAT'S THE POINT. That's what makes fibrils cooperative. The "only if you 
block" is really what makes a fibril be something else than a regular 
thread. 
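
As a user-space analogy only (this is not the code from Zach's patches), the
"flip to this when you block or yield" behaviour can be sketched with the
glibc ucontext calls. The fibril here runs on its own stack and gives up the
CPU only at an explicit yield point; nothing ever preempts it. The names
fibril_body(), run_demo() and the trace array are made up for the example.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, fibril_ctx;
static char fibril_stack[STACK_SIZE];

/* Records the order in which the two contexts actually ran. */
static int trace[8];
static int trace_len;

static void record(int step)
{
	assert(trace_len < 8);
	trace[trace_len++] = step;
}

/* Runs on fibril_stack; "blocks" by switching back to main explicitly. */
static void fibril_body(void)
{
	record(1);
	swapcontext(&fibril_ctx, &main_ctx);	/* the only yield point */
	record(3);
	/* Returning resumes the context named in uc_link (main_ctx). */
}

static void run_demo(void)
{
	trace_len = 0;

	getcontext(&fibril_ctx);
	fibril_ctx.uc_stack.ss_sp = fibril_stack;
	fibril_ctx.uc_stack.ss_size = sizeof(fibril_stack);
	fibril_ctx.uc_link = &main_ctx;
	makecontext(&fibril_ctx, fibril_body, 0);

	swapcontext(&main_ctx, &fibril_ctx);	/* run fibril until it yields */
	record(2);
	swapcontext(&main_ctx, &fibril_ctx);	/* resume it after the yield */
	record(4);
}
```

Calling run_demo() leaves the steps recorded in the order 1, 2, 3, 4: the
two stacks interleave, but only one of them is ever running at a time.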

		Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:42                   ` Ingo Molnar
@ 2007-02-02 23:01                     ` Linus Torvalds
  2007-02-02 23:17                       ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 23:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> but if the application /has/ identified fundamental parallelism, we 
> /must not/ shut that parallelism off by /designing/ this interface to 
> use the fibril thing which is a limited cooperative, single-CPU entity. 

Right. We should for example encourage people to use some kind of 
parallelizing construct. 

I know! We could even call them "threads", so to give people the idea that 
they are independent smaller entities in a thicker "rope", and we could 
call that bigger entity a "task" or "process", since it "processes" data.

Or is that just too far out?

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:01                     ` Linus Torvalds
@ 2007-02-02 23:17                       ` Linus Torvalds
  2007-02-03  0:04                         ` Alan
  2007-02-03  0:23                         ` bert hubert
  0 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-02 23:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan, Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Fri, 2 Feb 2007, Linus Torvalds wrote: 
> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> > 
> > but if the application /has/ identified fundamental parallelism, we 
> > /must not/ shut that parallelism off by /designing/ this interface to 
> > use the fibril thing which is a limited cooperative, single-CPU entity. 
> 
> Right. We should for example encourage people to use some kind of 
> parallelizing construct. 
> 
> I know! We could even call them "threads", so to give people the idea that 
> they are independent smaller entities in a thicker "rope", and we could 
> call that bigger entity a "task" or "process", since it "processes" data.
> 
> Or is that just too far out?

So the above was obviously tongue-in-cheek, but you should really think 
about the context here.

We're discussing doing *single* system calls. There is absolutely zero 
point to try to parallelize the work over multiple CPU's or threads. We're 
literally talking about doing things where the actual CPU cost is in the 
hundreds of nanoseconds, and where traditionally a rather noticeable part 
of the cost is not the code itself, but the high cost of taking a system 
call trap, and saving all the register state.

When parallelising "real work", I absolutely agree with you: we should use 
threads. But you need to look at what it is we parallelize here, and ask 
yourself why we're doing what we're doing, and why people aren't *already* 
just using a separate thread for it.

			Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
@ 2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
                                 ` (2 more replies)
  2007-02-05 17:02             ` Zach Brown
  2 siblings, 3 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-02 23:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, 2 Feb 2007, Ingo Molnar wrote:

> in fact an app could also /trigger/ the execution of a syscall in a 
> different context - to create parallelism artificially - without any 
> blocking event. So we could do:
> 
>   cookie1 = sys_async(sys_read, params);
>   cookie2 = sys_async(sys_write, params);
> 
>   [ ... calculation loop ... ]
> 
>   wait_on_async_syscall(cookie1);
>   wait_on_async_syscall(cookie2);
> 
> or something like that. Without user-space having to create threads 
> itself, etc. So basically, we'd make kernel threads more useful, and 
> we'd make threading safer - by only letting syscalls thread.

Since I still think that the many-thousands potential async operations 
coming from network sockets are better handled with a classical event 
mechanism [1], and since smooth integration of new async syscall into the 
standard POSIX infrastructure is IMO a huge win, I think we need to have a 
"bridge" that makes async completions detectable through a pollable 
(by means of select/poll/epoll, whatever) device.
In that way you can handle async operations with the best mechanism that 
is fit for them, and gather them in a single async scheduler.
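
One way to picture such a bridge in user space, as a sketch rather than a
proposal: a worker thread announces each completion by writing a small
record to a pipe, and the event loop treats the read end like any other
pollable fd. The record layout and the names done_msg, job, job_thread()
and wait_completion() are assumptions made for the example; the doubling in
job_thread() stands in for the real blocking operation.

```c
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Completion record written to the pipe; smaller than PIPE_BUF, so each
 * write is atomic and records from several workers do not interleave. */
struct done_msg {
	unsigned long cookie;
	long result;
};

struct job {
	int notify_fd;		/* write end of the completion pipe */
	unsigned long cookie;	/* identifies the request */
	long input;
};

/* Worker: do the (stand-in) operation, then signal completion. */
static void *job_thread(void *p)
{
	struct job *job = p;
	struct done_msg msg = { job->cookie, job->input * 2 };
	ssize_t n = write(job->notify_fd, &msg, sizeof(msg));

	assert(n == (ssize_t)sizeof(msg));
	(void)n;
	return NULL;
}

/* Event-loop side: wait up to timeout_ms for one completion on fd.
 * Returns 0 and fills *out on success, -1 on timeout or error. */
static int wait_completion(int fd, int timeout_ms, struct done_msg *out)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;
	if (!(pfd.revents & POLLIN))
		return -1;
	if (read(fd, out, sizeof(*out)) != (ssize_t)sizeof(*out))
		return -1;
	return 0;
}
```

Because the read end is an ordinary fd, it can sit in the same
select/poll/epoll set as the network sockets, so a single loop dispatches
both kinds of event.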



[1] Unless you really want to have thousands of kthreads/fibrils lingering 
    on the system.



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:49             ` Linus Torvalds
@ 2007-02-02 23:55               ` Ingo Molnar
  2007-02-03  0:56                 ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Ingo Molnar @ 2007-02-02 23:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> THAT'S THE POINT. That's what makes fibrils cooperative. The "only if 
> you block" is really what makes a fibril be something else than a 
> regular thread.

Well, in my picture, 'only if you block' is a pure thread utilization 
decision: bounce a piece of work to another thread if this thread cannot 
complete it. (if the kernel is lucky enough that the user context told 
it "it's fine to do that".)

it is 'incidental parallelism' instead of 'intentional parallelism', but 
the random and unpredictable nature of it doesnt change anything about 
the fundamental fact: we start a new thread of execution in essence.

Typically it will be rare in a workload as it will be driven by 
cachemisses, but for example in DB workloads the 'cachemiss' will be the 
/common case/ - because the DB manages the cache itself.

And how to run a thread of execution is a fundamental /scheduling/ 
decision: it is the acceptance of, and the adaptation to, the cost of work 
migration - if no forced wait happens then often it's cheaper to execute 
all work locally and serially.

[ in fact, such a mechanism doesnt even always have to be driven from
  the scheduler itself: such a 'bounce current work to another thread'
  event could occur when we detect that a pagecache page is missing and
  that we have to do a ->readpage, etc. Tux does that since 1999: the
  cutoff for 'bounce work' was when a soft cache (the pagecache or the
  dentry cache) was missed - not when we went into the IO path. This has
  the advantage that the Tux cachemiss threads could do /all/ the IO
  preparation and IO completion on the same CPU and in one go - while
  the user context was able to continue executing. ]

But this is also a function of hardware: for example on a Transputer i'd 
bounce off all such work immediately (even if it's a sys_time() 
syscall), all the time, even if fully cached, no exceptions, because the 
hardware is such that another CPU can pick it up in the next cycle.

while we definitely dont want to bounce short-lived cached syscalls to 
another thread, for longer ones or ones which we /expect/ to block we 
might want to do it like that straight away. [Especially on a multi-core 
CPU that has a shared L2 cache (and doubly so on a HT/SMT CPU that has a 
shared L1 cache).]

i dont see anything here that mandates (or even strongly supports) the 
notion of cooperative scheduling. The moment a context sees a 'cache 
miss', it is totally fair to potentially distribute it to other CPUs. It 
wont run for a long time and it will be totally cache-cold when the 'IO 
done' event occurs - hence we should schedule it where the IO event 
occurred. Which might easily be the same CPU where the user context is 
running right now (we prefer last-run CPUs on wakeups), but not 
necessarily - it's a general scheduling decision.

> > Note: such a 'flip' would only occur when the original context 
> > blocks, /not/ on every async syscall.
> 
> Right.
> 
> So can you take a look at Zach's fibril idea again? Because that's 
> exactly what it does. It basically sets a flag, saying "flip to this 
> when you block or yield". Of course, it's a bit bigger than just a 
> flag, since it needs to describe what to flip to, but that's the basic 
> idea.

i know Zach's code ... i really do. Even if i didnt look at the code 
(which i did), Jonathan Corbet did a very nice writeup about fibrils on 
LWN.net two days ago, which i've read as well:

  http://lwn.net/Articles/219954/

So there's no misunderstanding on my side i think.

> Now, if you want to make fibrils *also* then actually use a separate 
> thread, that's an extension.

oh please, Linus. I /did/ suggest this as an extension to Zach's idea! 
Look at the Subject line - i'm reacting to the specific fibril code of 
Zach. I wrote this:

| as per my other email, i dont really like this concept. This is the 
| killer:
|
| > [...]  There can be multiple of them in the process of executing for 
| > a given task_struct, but only one can ever be actively running at a 
| > time. [...]
|
| there's almost no scheduling cost from being able to arbitrarily 
| schedule a kernel thread - but there are /huge/ benefits in it.
|
| would it be hard to redo your AIO patches based on a pool of plain 
| simple kernel threads?

see http://lkml.org/lkml/2007/2/1/40.

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
@ 2007-02-03  0:02               ` Davide Libenzi
  2007-02-05 17:12               ` Zach Brown
  2007-02-05 21:36               ` bert hubert
  2 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-03  0:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, 2 Feb 2007, Davide Libenzi wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> 
> > in fact an app could also /trigger/ the execution of a syscall in a 
> > different context - to create parallelism artificially - without any 
> > blocking event. So we could do:
> > 
> >   cookie1 = sys_async(sys_read, params);
> >   cookie2 = sys_async(sys_write, params);
> > 
> >   [ ... calculation loop ... ]
> > 
> >   wait_on_async_syscall(cookie1);
> >   wait_on_async_syscall(cookie2);
> > 
> > or something like that. Without user-space having to create threads 
> > itself, etc. So basically, we'd make kernel threads more useful, and 
> > we'd make threading safer - by only letting syscalls thread.
> 
> Since I still think that the many-thousands potential async operations 
> coming from network sockets are better handled with a classical event 
> mechanism [1], and since smooth integration of new async syscall into the 
> standard POSIX infrastructure is IMO a huge win, I think we need to have a 
> "bridge" that makes async completions detectable through a pollable 
> (by means of select/poll/epoll, whatever) device.
> In that way you can handle async operations with the best mechanism that 
> is fit for them, and gather them in a single async scheduler.

To clarify further, below are the API and the use case of my userspace 
implementation. The guasi_fd() gives you back a pollable (POLLIN) fd to be 
integrated in your prefered event retrieval interface. Once it fd is 
signaled, you can fetch your completed requests using guasi_fetch() and 
schedule work based on that.
The GUASI implementation uses pthreads, but it is clear that an in-kernel 
async syscall implementation can take wiser decisions, and optimize the 
heck out of it (locks, queues, ...).




- Davide



/*
 * Example of async pread using GUASI
 */
static long guasi_wrap__pread(void *priv, long const *params) {

        return (long) pread((int) params[0], (void *) params[1],
                            (size_t) params[2], (off_t) params[3]);
}
 
guasi_req_t guasi__pread(guasi_t hctx, void *priv, void *asid, int prio,
                         int fd, void *buf, size_t size, off_t off) {

        return guasi_submit(hctx, priv, asid, prio, guasi_wrap__pread, 4,
                            (long) fd, (long) buf, (long) size, (long) off);
}


---
/*
 *  guasi by Davide Libenzi (generic userspace async syscall implementation)
 *  Copyright (C) 2003  Davide Libenzi
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 *
 *  Davide Libenzi <davidel@xmailserver.org>
 *
 */

#if !defined(_GUASI_H)
#define _GUASI_H


#define GUASI_MAX_PARAMS 16

#define GUASI_STATUS_PENDING 1
#define GUASI_STATUS_ACTIVE 2
#define GUASI_STATUS_COMPLETE 3


typedef long (*guasi_syscall_t)(void *, long const *);

typedef struct s_guasi { } *guasi_t;
typedef struct s_guasi_req { } *guasi_req_t;

struct guasi_reqinfo {
	void *priv;   /* Call private data. Passed to guasi_submit */
	void *asid;   /* Async request ID. Passed to guasi_submit */
	long result;  /* Return code of "proc" passed to guasi_submit */
	long error;   /* errno */
	int status;   /* GUASI_STATUS_* */
};



guasi_t guasi_create(int min_threads, int max_threads, int max_priority);
void guasi_free(guasi_t hctx);
int guasi_fd(guasi_t hctx);
guasi_req_t guasi_submit(guasi_t hctx, void *priv, void *asid, int prio,
			 guasi_syscall_t proc, int nparams, ...);
int guasi_fetch(guasi_t hctx, guasi_req_t *reqs, int nreqs);
int guasi_req_info(guasi_req_t hreq, struct guasi_reqinfo *rinf);
void guasi_req_free(guasi_req_t hreq);

#endif



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:17                       ` Linus Torvalds
@ 2007-02-03  0:04                         ` Alan
  2007-02-03  0:23                         ` bert hubert
  1 sibling, 0 replies; 151+ messages in thread
From: Alan @ 2007-02-03  0:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> When parallelising "real work", I absolutely agree with you: we should use 
> threads. But you need to look at what it is we parallelize here, and ask 
> yourself why we're doing what we're doing, and why people aren't *already* 
> just using a separate thread for it.

Because it's a pain in the arse and because it's very hard to self tune. If
you've got async_anything then the thread/fibril/synchronous/whatever
decision can be made kernel side based upon expected cost and other
tradeoffs, even if it's as dumb as per syscall or per syscall/filp type
guessing.

Alan

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:17                       ` Linus Torvalds
  2007-02-03  0:04                         ` Alan
@ 2007-02-03  0:23                         ` bert hubert
  1 sibling, 0 replies; 151+ messages in thread
From: bert hubert @ 2007-02-03  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Alan, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Fri, Feb 02, 2007 at 03:17:57PM -0800, Linus Torvalds wrote:

> threads. But you need to look at what it is we parallelize here, and ask 
> yourself why we're doing what we're doing, and why people aren't *already* 
> just using a separate thread for it.

Partially this is for the bad reason that creating "i/o threads" (or even
processes) has a bad stigma to it, and additionally has always felt crummy.

On the first reason, the 'pain' of creating threads is actually rather
minor, so this feeling may have been wrong. The main thing is that you don't
wantonly create a thousand i/o threads, whereas you conceivably might want
to have a thousand outstanding i/o requests. At least I know I want to have
that ability.

Secondly, the actual mechanics of i/o processes aren't trivial, and feel
wasteful with lots of additional copying, or in the case of threads,
queueing and posting.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:55               ` Ingo Molnar
@ 2007-02-03  0:56                 ` Linus Torvalds
  2007-02-03  7:15                   ` Suparna Bhattacharya
  2007-02-03  8:23                   ` Ingo Molnar
  0 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-03  0:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Sat, 3 Feb 2007, Ingo Molnar wrote:
> 
> Well, in my picture, 'only if you block' is a pure thread utilization 
> decision: bounce a piece of work to another thread if this thread cannot 
> complete it. (if the kernel is lucky enough that the user context told 
> it "it's fine to do that".)

Sure, you can do it that way too. But at that point, your argument that we 
shouldn't do it with fibrils is wrong: you'd still need basically the 
exact same setup that Zach does in his fibril stuff, and the exact same 
hook in the scheduler, testing the exact same value ("do we have a pending 
queue of work").

So at that point, you really are arguing about a rather small detail in 
the implementation, I think.

Which is fair enough. 

But I actually think the *bigger* argument and problems are elsewhere, 
namely in the interface details. Notably, I think the *real* issues end up 
being how we handle synchronization, and how we handle signalling. Those are in 
many ways (I think) more important than whether we actually can schedule 
these trivial things on multiple CPU's concurrently or not.

For example, I think serialization is potentially a much more expensive 
issue. Could we, for example, allow users to serialize with these things 
*without* having to go through the expense of doing a system call? Again, 
I'm thinking of the case of no IO happening, in which case there also 
won't be any actual threading taking place, in which case it's a total 
waste of time to do a system call at all.

And trying to do that actually has implications for the interfaces (like 
possibly returning a zero cookie for the async() system call if it was 
doable totally synchronously?)
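
A rough userspace sketch of what the "zero cookie" convention could look like from the caller's side. Nothing here is a proposed ABI: struct sketch_op, sketch_async_try() and the cookie counter are invented for illustration, and "would_block" simply fakes the kernel's decision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation: either finishes at once or would have to block */
struct sketch_op {
	int would_block;
	long value;
};

static unsigned long sketch_next_cookie = 1;

/*
 * Returns 0 if the operation completed synchronously (result stored in
 * *res), otherwise a non-zero cookie that a later completion would carry.
 */
static unsigned long sketch_async_try(const struct sketch_op *op, long *res)
{
	assert(op != NULL && res != NULL);

	if (!op->would_block) {
		*res = op->value;
		return 0;
	}
	return sketch_next_cookie++;
}
```

The point of the convention is that a caller who gets 0 back has nothing to remember and nothing to wait for.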

Signal handling is similar: I actually think that a "async()" system call 
should be interruptible within the context of the caller, since we would 
want to *try* to execute it synchronously. That automatically means that 
we have semantic meaning for fibrils and signal handling.

Finally, can we actually get POSIX aio semantics with this? Can we 
implement the current aio_xyzzy() system calls using this same feature? 
And most importantly - does it perform well enough that we really can do 
that?

THOSE are to me bigger questions than what happens inside the kernel, and 
whether we actually end up using another thread if we end up doing it 
non-synchronously.

					Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  0:56                 ` Linus Torvalds
@ 2007-02-03  7:15                   ` Suparna Bhattacharya
  2007-02-03  8:23                   ` Ingo Molnar
  1 sibling, 0 replies; 151+ messages in thread
From: Suparna Bhattacharya @ 2007-02-03  7:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Zach Brown, linux-kernel, linux-aio, Benjamin LaHaise

On Fri, Feb 02, 2007 at 04:56:22PM -0800, Linus Torvalds wrote:
> 
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> > 
> > Well, in my picture, 'only if you block' is a pure thread utilization 
> > decision: bounce a piece of work to another thread if this thread cannot 
> > complete it. (if the kernel is lucky enough that the user context told 
> > it "it's fine to do that".)
> 
> Sure, you can do it that way too. But at that point, your argument that we 
> shouldn't do it with fibrils is wrong: you'd still need basically the 
> exact same setup that Zach does in his fibril stuff, and the exact same 
> hook in the scheduler, testing the exact same value ("do we have a pending 
> queue of work").
> 
> So at that point, you really are arguing about a rather small detail in 
> the implementation, I think.
> 
> Which is fair enough. 
> 
> But I actually think the *bigger* argument and problems are elsewhere, 
> namely in the interface details. Notably, I think the *real* issues end up 
> being how we handle synchronization, and how we handle signalling. Those are in 
> many ways (I think) more important than whether we actually can schedule 
> these trivial things on multiple CPU's concurrently or not.
> 
> For example, I think serialization is potentially a much more expensive 
> issue. Could we, for example, allow users to serialize with these things 
> *without* having to go through the expense of doing a system call? Again, 
> I'm thinking of the case of no IO happening, in which case there also 
> won't be any actual threading taking place, in which case it's a total 
> waste of time to do a system call at all.
> 
> And trying to do that actually has implications for the interfaces (like 
> possibly returning a zero cookie for the async() system call if it was 
> doable totally synchronously?)

This would be useful - the application wouldn't have to set up state
just to handle completions for operations that complete synchronously.
I know the Samba folks would like that.

The laio_syscall implementation (Lazy asynchronous IO) seems to have
experimented with such an interface:
http://www.usenix.org/events/usenix04/tech/general/elmeleegy.html

Regards
Suparna

> 
> Signal handling is similar: I actually think that a "async()" system call 
> should be interruptible within the context of the caller, since we would 
> want to *try* to execute it synchronously. That automatically means that 
> we have semantic meaning for fibrils and signal handling.
> 
> Finally, can we actually get POSIX aio semantics with this? Can we 
> implement the current aio_xyzzy() system calls using this same feature? 
> And most importantly - does it perform well enough that we really can do 
> that?
> 
> THOSE are to me bigger questions than what happens inside the kernel, and 
> whether we actually end up using another thread if we end up doing it 
> non-synchronously.
> 
> 					Linus
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  0:56                 ` Linus Torvalds
  2007-02-03  7:15                   ` Suparna Bhattacharya
@ 2007-02-03  8:23                   ` Ingo Molnar
  2007-02-03  9:25                     ` Matt Mackall
  2007-02-05 17:44                     ` Zach Brown
  1 sibling, 2 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-03  8:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> > 
> > Well, in my picture, 'only if you block' is a pure thread 
> > utilization decision: bounce a piece of work to another thread if 
> > this thread cannot complete it. (if the kernel is lucky enough that 
> > the user context told it "it's fine to do that".)
> 
> Sure, you can do it that way too. But at that point, your argument 
> that we shouldn't do it with fibrils is wrong: you'd still need 
> basically the exact same setup that Zach does in his fibril stuff, and 
> the exact same hook in the scheduler, testing the exact same value 
> ("do we have a pending queue of work").

did i ever lose a single word of complaint about those bits? Those are 
not an issue to me. They can be applied to kernel threads just as much.

As i babbled in the very first email about this topic:

| 1) improve our basic #1 design gradually. If something is a
|    bottleneck, if the scheduler has grown too fat, cut some slack. If 
|    micro-threads or fibrils offer anything nice for our basic thread 
|    model: integrate it into the kernel.

i should have said explicitly that to flip user-space from one kernel 
thread to another one (upon blocking or per request) is a nice thing and 
we should integrate that into the kernel's thread model.

But really, being a scheduler guy i was much more concerned about the 
duplication and problems caused by the fibril concept itself - which 
duplication and complexity makes up 80% of Zach's submitted patchset.
For example this bit:

   [PATCH 3 of 4] Teach paths to wake a specific void * target

would totally go away if we used kernel threads for this. In the fibril 
approach this is where the mess starts. Either a 'normal' wakeup has to 
wake up all fibrils, or we have to make damn sure that a wakeup that in 
reality goes to a fibril is never woken via wake_up/wake_up_process.

( Furthermore, i tried to include user-space micro-threads in the 
  argument as well, which Evgeniy Polyakov raised not so long ago related 
  to the kevent patchset. All these micro-thread things are of a similar 
  genre. )

i totally agree that the API /should/ be the main focus - but i didnt 
pick the topic and most of the patchset's current size is due to the IMO 
avoidable fibril concept.

regarding the API, i dont really agree with the current form and design 
of Zach's interface.

fundamentally, the basic entity of this thing should be a /system call/, 
not the artificial fibril thing:

  +struct asys_call {
  +       struct asys_result      *result;
  +       struct fibril           fibril;
  +};

i.e. the basic entity should be something that represents a system call, 
with its up to 6 arguments, the later return code, state, flags and two 
list entries:

  struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
  };

(64 bytes on 32-bit, 128 bytes on 64-bit)
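
The size arithmetic can be checked in userspace: the struct is 1 + 6 + 1 + 1 + 1 + 2 + 2 + 2 = 16 machine words, so 16 * 4 = 64 bytes and 16 * 8 = 128 bytes. The sketch below assumes a list_head is two pointers and that pointers are as wide as unsigned long (true for Linux ILP32 and LP64); the helper name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head */
struct list_head {
	struct list_head *next, *prev;
};

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

/* Compile-time check: 16 words, no hidden padding */
_Static_assert(sizeof(struct async_syscall) == 16 * sizeof(unsigned long),
	       "async_syscall should be exactly 16 machine words");

/* Hypothetical helper: size of one entry in machine words */
static size_t async_syscall_words(void)
{
	return sizeof(struct async_syscall) / sizeof(unsigned long);
}
```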

furthermore, i think this API should be fundamentally vectored and 
fundamentally async, and hence could solve another issue as well: 
submitting many little pieces of work of different IO domains in one go.

[ detail: there should be no traditional signals used at all (Zach's 
  stuff doesnt use them, and correctly so), only if the async syscall 
  that is performed generates a signal. ]

The normal and most optimal workflow should be a user-space ring-buffer 
of these constant-size struct async_syscall entries:

  struct async_syscall ringbuffer[1024];

  LIST_HEAD(submitted);
  LIST_HEAD(pending);
  LIST_HEAD(completed);

the 3 list heads are both known to the kernel and to user-space, and are 
actively managed by both. The kernel drives the execution of the async 
system calls based on the 'submitted' list head (until it empties it) 
and moves them over to the 'pending' list. User-space can complete async 
syscalls based on the 'completed' list. (but a syscall can optionally be 
marked as 'autocomplete' as well via the 'flags' field, in that case 
it's not moved to the 'completed' list but simply removed from the 
'pending' list. This can be useful for system calls that have some 
implicit notification effect.)
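
The list movement described above, as a single-threaded userspace sketch. It says nothing about locking or about how kernel and user-space would actually share the heads. ASYNC_AUTOCOMPLETE, struct async_queues and the function names are invented for illustration, and the struct is cut down to the fields the sketch needs.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

#define ASYNC_AUTOCOMPLETE 0x1UL	/* hypothetical flag value */

struct async_syscall {
	long err;
	unsigned long flags;
	struct list_head list;
};

struct async_queues {
	struct list_head submitted, pending, completed;
};

static void queues_init(struct async_queues *q)
{
	list_init(&q->submitted);
	list_init(&q->pending);
	list_init(&q->completed);
}

/* User side: queue a call for execution */
static void submit_call(struct async_queues *q, struct async_syscall *c)
{
	list_add_tail(&c->list, &q->submitted);
}

/* "Kernel" side: drain 'submitted' into 'pending', return how many moved */
static int start_submitted(struct async_queues *q)
{
	int n = 0;

	while (!list_is_empty(&q->submitted)) {
		struct list_head *e = q->submitted.next;

		list_del(e);
		list_add_tail(e, &q->pending);
		n++;
	}
	return n;
}

/* "Kernel" side: a pending call finished with return code err */
static void finish_call(struct async_queues *q, struct async_syscall *c,
			long err)
{
	assert(q != NULL && c != NULL);

	c->err = err;
	list_del(&c->list);
	/* autocomplete entries just leave 'pending', nobody collects them */
	if (!(c->flags & ASYNC_AUTOCOMPLETE))
		list_add_tail(&c->list, &q->completed);
}
```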

( Note: optionally, a helper kernel-thread, when it finishes processing 
  a syscall, could also asynchronously check the 'submitted' list and 
  pick up new work. That would allow the submission of new syscalls 
  without any entry into the kernel. So for example on an SMT system, 
  this could result in one CPU running in pure user-space, submitting 
  async syscalls via the ringbuffer, while another CPU would in essence 
  be running in pure kernel-space, executing those entries. )

another crucial bit is the waiting on pending work. But because every 
pending syscall entity is either already completed or has a real kernel 
thread associated with it, that bit is mostly trivial: user-space can 
wait on 'any' pending syscall to complete, or it could wait for a 
specific list of syscalls to complete (using the ->wait_list). It could 
also wait on 'a minimum number of N syscalls to complete' - to create 
batching of execution. And of course it can periodically check the 
'completed' list head if it has a constant and highly parallel flow of 
workload - that way the 'waiting' does not actually have to happen most 
of the time.

Looks like we can hit many birds with this single stone: AIO, vectored 
syscalls, finegrained system-call parallelism. Hm?

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  8:23                   ` Ingo Molnar
@ 2007-02-03  9:25                     ` Matt Mackall
  2007-02-03 10:03                       ` Ingo Molnar
  2007-02-05 17:44                     ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Matt Mackall @ 2007-02-03  9:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> The normal and most optimal workflow should be a user-space ring-buffer 
> of these constant-size struct async_syscall entries:
> 
>   struct async_syscall ringbuffer[1024];
> 
>   LIST_HEAD(submitted);
>   LIST_HEAD(pending);
>   LIST_HEAD(completed);

It's wrong to call this a ring buffer as things won't be completed in
any particular order. So you'll need a fourth list head for which
buffer elements are free. At which point, you might as well leave it
entirely up to the application to manage the allocation of
async_syscall structs. It may know it only needs two, or ten thousand,
or five per client...

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  9:25                     ` Matt Mackall
@ 2007-02-03 10:03                       ` Ingo Molnar
  0 siblings, 0 replies; 151+ messages in thread
From: Ingo Molnar @ 2007-02-03 10:03 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Linus Torvalds, Zach Brown, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise


* Matt Mackall <mpm@selenic.com> wrote:

> On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> > The normal and most optimal workflow should be a user-space ring-buffer 
> > of these constant-size struct async_syscall entries:
> > 
> >   struct async_syscall ringbuffer[1024];
> > 
> >   LIST_HEAD(submitted);
> >   LIST_HEAD(pending);
> >   LIST_HEAD(completed);
> 
> It's wrong to call this a ring buffer as things won't be completed in
> any particular order. [...]

yeah, i realized this when i sent the mail. I wanted to say 'array of 
elements' - and it's clear from these list heads that it's fully out of 
order. (it should be an array so that the pages of those entries can be 
pinned and that completion can be manipulated from any context, 
anytime.)

(the queueing i described closely resembles Tux's "Tux syscall request" 
handling scheme.)

> [...] So you'll need a fourth list head for which buffer elements are 
> free. At which point, you might as well leave it entirely up to the 
> application to manage the allocation of async_syscall structs. It may 
> know it only needs two, or ten thousand, or five per client...

sure - it should be variable but still the array should be compact, and 
should be registered with the kernel. That way security checks can be 
done once, the pages can be pinned, accessed anytime, etc.
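
One way the "which elements are free" bookkeeping could live next to a compact, registered array is a small stack of unused indices. This is a userspace sketch only: NR_SLOTS, struct slot_pool and the function names are made up, and the registration with the kernel is not modelled.

```c
#include <assert.h>

#define NR_SLOTS 4	/* illustrative size of the registered array */

struct slot_pool {
	int free_idx[NR_SLOTS];	/* stack of unused indices into the array */
	int nr_free;
};

static void pool_init(struct slot_pool *p)
{
	int i;

	/* push indices in reverse so slot 0 is handed out first */
	for (i = 0; i < NR_SLOTS; i++)
		p->free_idx[i] = NR_SLOTS - 1 - i;
	p->nr_free = NR_SLOTS;
}

/* Returns a free array index, or -1 if every slot is in flight */
static int slot_get(struct slot_pool *p)
{
	if (p->nr_free == 0)
		return -1;
	return p->free_idx[--p->nr_free];
}

/* Hand an index back once its syscall has been reaped */
static void slot_put(struct slot_pool *p, int idx)
{
	assert(idx >= 0 && idx < NR_SLOTS && p->nr_free < NR_SLOTS);
	p->free_idx[p->nr_free++] = idx;
}
```

Completion order does not matter here: whichever entry is reaped first is simply the next index handed out.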

	Ingo

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-01-30 20:39 ` [PATCH 2 of 4] Introduce i386 fibril scheduling Zach Brown
  2007-02-01  8:36   ` Ingo Molnar
@ 2007-02-04  5:12   ` Davide Libenzi
  2007-02-05 17:54     ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-04  5:12 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

On Tue, 30 Jan 2007, Zach Brown wrote:

> +	/* 
> +	 * XXX The idea is to copy all but the actual call stack.  Obviously
> +	 * this is wildly arch-specific and belongs abstracted out.
> +	 */
> +	*next->ti = *ti;
> +	*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

arch copy_thread_info()?



> +	current->per_call = next->per_call;

Pointer instead of structure copy? percall_clone()/percall_free()?



> +	/* always switch to a runnable fibril if we aren't being preempted */
> +	if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
> +		     !list_empty(&prev->runnable_fibrils))) {
> +		schedule_fibril(prev);
> +		/* 
> +		 * finish_fibril_switch() drops the rq lock and enables
> +		 * premption, but the popfl disables interrupts again.  Watch
> +		 * me learn how context switch locking works before your very
> +		 * eyes!  XXX This will need to be fixed up by throwing
> +		 * together something like the prepare_lock_switch() path the
> +		 * scheduler does.  Guidance appreciated!
> +		 */
> +		local_irq_enable();
> +		return;
> +	}

Yes, please (prepare/finish) ... ;)



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls
  2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
  2007-01-31  8:58   ` Andi Kleen
  2007-02-01 20:26   ` bert hubert
@ 2007-02-04  5:12   ` Davide Libenzi
  2 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-04  5:12 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

On Tue, 30 Jan 2007, Zach Brown wrote:

> +void asys_task_exiting(struct task_struct *tsk)
> +{
> +	struct asys_result *res, *next;
> +
> +	list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
> +		kfree(res);
> +
> +	/* 
> +	 * XXX this only works if tsk->fibril was allocated by
> +	 * sys_asys_submit(), not if its embedded in an asys_call.  This
> +	 * implies that we must forbid sys_exit in asys_submit.
> +	 */
> +	if (tsk->fibril) {
> +		BUG_ON(!list_empty(&tsk->fibril->run_list));
> +		kfree(tsk->fibril);
> +		tsk->fibril = NULL;
> +	}
> +}

What happens to lingering fibrils? Better keep track of both runnable and 
sleepers, and do proper cleanup.



> +asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
> +				unsigned long nr_inp)
> +{
> +	struct asys_input inp;
> +	struct asys_result *res;
> +	struct asys_call *call;
> +	struct thread_info *ti;
> +	unsigned long i;
> +	long err = 0;
> +
> +	/* Allocate a fibril for the submitter's thread_info */
> +	if (current->fibril == NULL) {
> +		current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
> +		if (current->fibril == NULL)
> +			return -ENOMEM;
> +
> +		INIT_LIST_HEAD(&current->fibril->run_list);
> +		current->fibril->state = TASK_RUNNING;
> +		current->fibril->ti = current_thread_info();
> +	}

Why do we need the "special" submission fibril?




> +	for (i = 0; i < nr_inp; i++) {
> +
> +		if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
> +		if (res == NULL) {
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		/* XXX kzalloc to init call.fibril.per_cpu, add helper */
> +		call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
> +		if (call == NULL) {
> +			kfree(res);
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		ti = alloc_thread_info(tsk);
> +		if (ti == NULL) {
> +			kfree(res);
> +			kfree(call);
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		err = asys_init_fibril(&call->fibril, ti, &inp);
> +		if (err) {
> +			kfree(res);
> +			kfree(call);
> +			free_thread_info(ti);
> +			break;
> +		}
> +
> +		res->comp.cookie = inp.cookie;
> +		call->result = res;
> +		ti->task = current;
> +
> +		sched_new_runnable_fibril(&call->fibril);
> +		schedule();
> +	}
> +
> +	return i ? i : err;
> +}

Streamline error path (kfree(NULL) is OK):

	err =  -ENOMEM;
	a = alloc();
	b = alloc();
	c = alloc();
	if (!a || !b || !c)
		goto error;
	...
error:
	kfree(c);
	kfree(b);
	kfree(a);
	return err;
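
The same pattern as a compilable userspace sketch, with malloc()/free() standing in for the kernel allocators (free(NULL) is a no-op just like kfree(NULL)). The fail_mask argument and all names are invented so that the error path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct three_bufs {
	void *a, *b, *c;
};

/* Hypothetical test hook: pretend the allocation failed when 'fail' is set */
static void *maybe_alloc(size_t n, int fail)
{
	return fail ? NULL : malloc(n);
}

/* Allocate everything up front, check once, unwind in one place */
static int setup_three(struct three_bufs *out, int fail_mask)
{
	void *a, *b, *c;
	int err = -ENOMEM;

	a = maybe_alloc(16, fail_mask & 1);
	b = maybe_alloc(16, fail_mask & 2);
	c = maybe_alloc(16, fail_mask & 4);
	if (!a || !b || !c)
		goto error;

	out->a = a;
	out->b = b;
	out->c = c;
	return 0;

error:
	/* free(NULL) is fine, so no per-pointer checks are needed */
	free(c);
	free(b);
	free(a);
	return err;
}

static void release_three(struct three_bufs *t)
{
	free(t->c);
	free(t->b);
	free(t->a);
	t->a = t->b = t->c = NULL;
}
```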


- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
                   ` (5 preceding siblings ...)
  2007-01-31  2:04 ` Benjamin Herrenschmidt
@ 2007-02-04  5:13 ` Davide Libenzi
  2007-02-04 20:00   ` Davide Libenzi
  2007-02-09 22:33 ` Linus Torvalds
  7 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-04  5:13 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Zab, great stuff!
I've found a little time to take a look at the patches and throw some 
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff, 
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)




> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implement with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
> 
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call.  It does this by letting a task have
> multiple stacks executing system calls in the kernel.  Stacks are switched in
> schedule() as they block and are made runnable.

As I said in another email, I think this is a *great* idea. It allows for 
generic async execution, while leaving the core kernel AIO unaware. Of 
course, to be usable, a lot of details need to be polished, and 
performance evaluated.




> We start in sys_asys_submit().  It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct.  This lets this submission
> fibril be scheduled along with the later asys system calls themselves.

Why do you need this "special" fibril for the submission task?



> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer.  To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti.  We update the book keeping in the task_struct to 
> consider the new thread_info as the current thread_info for the task.  Like so:
> 
>         *next->ti = *ti;
>         *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
> 
>         current->thread_info = next->ti;
>         current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
>         current->fibril = next;
>         current->state = next->state;
>         current->per_call = next->per_call;
> 
> Yeah, messy.  I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.

Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1 
relationship, just merging them.



> Did everyone catch that "per_call" thing there?  That's to switch members of
> task_struct which are local to a specific call.  link_count, journal_info, that
> sort of thing.  More on that as we talk about the costs later.

Yes ;) Brutally copying the structure over does not look good IMO. Better 
to keep a pointer and swap it. A clone_per_call() and free_per_call() 
might be needed.
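
Roughly what I mean, as a userspace sketch. The structures are cut-down stand-ins, not the ones in the patch, and switch_fibril() is a made-up name; only link_count and journal_info are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-call state (link_count, journal_info, ...) */
struct per_call {
	int link_count;
	void *journal_info;
};

struct fibril {
	struct per_call *per_call;	/* owned by this fibril */
};

/* Stand-in for the relevant bits of task_struct */
struct task_sketch {
	struct per_call *per_call;	/* per-call state of the running fibril */
	struct fibril *fibril;		/* currently running fibril */
};

/*
 * Switch per-call state by swapping one pointer instead of copying the
 * whole structure. Changes made through t->per_call while a fibril runs
 * land directly in that fibril's own state.
 */
static void switch_fibril(struct task_sketch *t, struct fibril *next)
{
	assert(t != NULL && next != NULL && next->per_call != NULL);

	t->fibril->per_call = t->per_call;
	t->per_call = next->per_call;
	t->fibril = next;
}
```

The cost of the switch no longer grows with the number of members moved under per_call.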




> Eventually the timer fires and the hrtimer code path wakes the fibril:
> 
> -       if (task)
> -               wake_up_process(task);
> +       if (wake_target)
> +               wake_up_target(wake_target);
> 
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets.  In the fibril case it calls
> try_to_wake_up_fibril().  It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING.  It notices that it isn't current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task.  There's
> grossness here.  It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril.  To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.

Fine IMO. Better keep scheduling code localized inside schedule().



> - We get AIO support for all syscalls.  Every single one.  (Well, please, no
> asys sys_exit() :)).  Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.

Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge in the other way around. The 
collector will become the async() events fetcher, instead of the other way 
around. Will work just fine ...



> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete().  No auditing to make sure that EIOCBRETRY is only being returned
> after any significant references of current->.  No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context.  In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.

This is the *big win* of the whole thing IMO.



> - AIO syscalls which *don't* block see very little overhead.  They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor.  There's room to optimize
> this path, too, of course.

Stack allocation can be optimized/cached, as someone else already said.



> - The 800lb elephant in the room.  It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending.  It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource.  (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.)  While
> not optimal, it opens the door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.

This should not be a huge problem IMO. High latency operations like 
network sockets can be handled with standard I/O events interfaces like 
poll/epoll, so I do not expect to have a huge number of fibrils around. 
The number of fibrils will be proportional to the number of active 
connections, not to the total number of connections.



> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack.  This uses less memory per blocking
> operation at the cpu cost of copying the used regions.  And it's a terrible
> idea, for at least two reasons.  First, to actually get the memory overhead
> savings you allocate at stack switch time.  If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory.  Deadlock city.  Second, it means
> that *you can't reference on-stack data in the wake-up path*.  This is a
> nightmare.  Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it.  Never mind, you know, basically
> every single user of <linux/wait.h>.   My current thinking is that it's just
> not worth it.

Agreed. Most definitely not worth it, for the reasons above.



> - We would now have some measure of task_struct concurrency.  Read that twice,
> it's scary.  As two fibrils execute and block in turn they'll each be
> referencing current->.  It means that we need to audit task_struct to make sure
> that paths can handle racing as it's scheduled away.  The current implementation
> *does not* let preemption trigger a fibril switch.  So one only has to worry
> about racing with voluntary scheduling of the fibril paths.  This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril.  I think this is a
> manageable burden.

That seems the correct policy in any case.



> - Signals.  I have no idea what behaviour we want.  Help?  My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct.  If we want something like individual cancellation,  we'll augment
> signal_pending() with some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.

Fibril should IMO use current thread signal policies. I think a signal 
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread 
policies mandate that. I'd keep a list_head of currently scheduled-out 
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is 
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptible-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going 
to work as is (cleared at the first do_signal). Setting a bit in each 
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe 
a sequential signal counter in task_struct, matched by one in the fibril. 
A signal would increment the task_struct counter, and a fibril 
schedule-out would save the task_struct counter to the fibril. The 
signal_pending() for a fibril is a compare of the two. Or something 
similar.
In general, I think it'd make sense to have a fibril-based implementation 
and a kthread-based one, and compare the messiness :) of the two against 
their drawbacks and performance.
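
A minimal userspace sketch of the counter idea above, assuming nothing about 
the real task_struct layout. The model_* types and functions are hypothetical 
stand-ins, not kernel code: a signal bumps a per-task counter, a fibril 
snapshots it at schedule-out, and the per-fibril "signal pending" test is a 
compare of the two, so no list walk is needed at delivery time.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state; not the kernel's task_struct. */
struct model_task {
	unsigned long signal_seq;	/* incremented on every signal delivery */
};

/* Hypothetical stand-in for a fibril hanging off that task. */
struct model_fibril {
	struct model_task *task;
	unsigned long seen_signal_seq;	/* snapshot taken at schedule-out */
};

/* Signal delivery touches only the task, never the individual fibrils. */
static void model_deliver_signal(struct model_task *t)
{
	t->signal_seq++;
}

/* Called when the fibril blocks: remember which signals it has "seen". */
static void model_fibril_schedule_out(struct model_fibril *f)
{
	f->seen_signal_seq = f->task->signal_seq;
}

/* Per-fibril replacement for the shared sigpending flag test. */
static bool model_fibril_signal_pending(const struct model_fibril *f)
{
	return f->seen_signal_seq != f->task->signal_seq;
}
```

Each fibril sees the signal exactly until its own next schedule-out, which 
sidesteps the "cleared at the first do_signal" problem in this toy model. 
Whether that matches the semantics we want is a separate question.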



- Davide


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-04  5:13 ` Davide Libenzi
@ 2007-02-04 20:00   ` Davide Libenzi
  0 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-04 20:00 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

On Sat, 3 Feb 2007, Davide Libenzi wrote:

> > - Signals.  I have no idea what behaviour we want.  Help?  My first guess is
> > that we'll want signal state to be shared by fibrils by keeping it in the
> > task_struct.  If we want something like individual cancellation,  we'll augment
> > signal_pending() with some per-fibril test which will cause it to return
> > from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> > cancellation, I'll argue) as it would have if a signal was pending.
> 
> Fibril should IMO use current thread signal policies. I think a signal 
> should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread 
> policies mandate that. I'd keep a list_head of currently scheduled-out 
> TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is 
> delivered to the thread (wake_target bit #1 set to mean wake-all-interruptible-fibrils?).
> The other thing is signal_pending(). The sigpending flag test is not going 
> to work as is (cleared at the first do_signal). Setting a bit in each 
> fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe 
> a sequential signal counter in task_struct, matched by one in the fibril. 
> A signal would increment the task_struct counter, and a fibril 
> schedule-out would save the task_struct counter to the fibril. The 
> signal_pending() for a fibril is a compare of the two. Or something 
> similar.

Another thing linked to signals that was not talked about, is cancellation 
of an in-flight request. We want to give the ability to cancel an 
in-flight request, with something like async_cancel(cookie). In my 
userspace library I simply disable SA_RESTART of SIGUSR2, and I do a 
pthread_kill() on the thread servicing the request. But this will IMO have 
other implications (linked to signal delivery) in a kernel fibril-based 
implementation, which we need to think about.



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 19:59           ` Alan
  2007-02-02 20:14             ` Linus Torvalds
@ 2007-02-05 16:44             ` Zach Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-05 16:44 UTC (permalink / raw)
  To: Alan
  Cc: Linus Torvalds, Ingo Molnar, linux-kernel, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

> Other questions really relate to the scheduling - Zach do you intend
> schedule_fibrils() to be a call code would make or just from  
> schedule() ?

I'd much rather keep the current sleeping API in as much as is  
possible.  So, yeah, if we can get schedule() to notice and behave  
accordingly I'd prefer that.  In the current code it's keyed off  
finding a stack allocation hanging off of current->.  If the caller  
didn't care about guaranteeing non-blocking submission then we  
wouldn't need that.. we could use a thread_info flag bit, or  
something.  Avoiding that allocation in the cached case would be nice.

> Alan (who used to use Co-routines in real languages on 36bit
> computers with 9bit bytes before learning C)

Yes, don't despair, I'm not co-routine ignorant.  In fact, I'm almost  
positive it was you who introduced them to me at some point in the  
previous millennium ;).

- z


^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 22:21           ` Ingo Molnar
  2007-02-02 22:49             ` Linus Torvalds
  2007-02-02 23:37             ` Davide Libenzi
@ 2007-02-05 17:02             ` Zach Brown
  2007-02-05 18:52               ` Davide Libenzi
  2 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> ok, i think i noticed another misunderstanding. The kernel thread based
> scheme i'm suggesting would /not/ 'switch' to another kernel thread in
> the cached case, by default. It would just execute in the original
> context (as if it were a synchronous syscall), and the switch to a
> kernel thread from the pool would only occur /if/ the context is about
> to block. (this 'switch' thing would be done by the scheduler)

Yeah, this is what I imagined when you described doing this with  
threads instead of these 'fibril' things.

It sounds like you're suggesting that we keep the 1:1 relationship  
between task_struct and thread_info.  That would avoid the risks that  
the current fibril approach brings.  It insists that all of  
task_struct is shared between concurrent fibrils (even if only  
between blocking points).  As I understand what Ingo is suggesting,  
we'd instead only explicitly share the fields that we migrate (copy  
or get a reference) as we move the stack from the submitting  
task_struct to a waiting_task struct as the submission blocks.

We trade initial effort to make things safe in the presence of  
universal sharing for effort to introduce sharing as people notice  
deficient behaviour.  If that's the way we prefer to go, I'm cool  
with that.  I might have gone slightly nuts in preferring *identical*  
sync and async behaviour.

The fast path would look almost identical to the existing fibril  
switch.  We'd just have a few more fields to sync up between the two  
task_structs.

Ingo, am I getting this right?  This sounds pretty straightforward  
to prototype from the current patches.  I can certainly give it a try.

> it's quite cheap to 'flip' it to under any arbitrary user-space context:
> change its thread_info->task pointer to the user-space context's task
> struct, copy the mm pointer, the fs pointer to the "worker thread",
> switch the thread_info, update ptregs - done. Hm?

Or maybe you're talking about having concurrent executing  
thread_info's pointing to the user-space submitting task_struct?   
That really does sound like the current fibril approach, with even  
more sharing of thread_info's that might be executing on other cpus?

Either way, I want to give it a try.  If we can measure it performing  
reasonably in the cached case then I think everyone's happy?

> is not part of the signal set. (Although it might make sense to make
> such async syscalls interruptible, just like any syscall.)

I think we all agree that they have to be interruptible by now,  
right?  If for no other reason than to interrupt pending poll with no  
timeout, say, as the task exits..

> The 'pool' of kernel threads doesnt even have to be per-task, it can be
> a natural per-CPU thing

Yeah, absolutely.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
@ 2007-02-05 17:12               ` Zach Brown
  2007-02-05 18:24                 ` Davide Libenzi
  2007-02-05 21:36               ` bert hubert
  2 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 17:12 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> Since I still think that the many-thousands potential async operations
> coming from network sockets are better handled with a classical event
> mechanism [1], and since smooth integration of new async syscall into the
> standard POSIX infrastructure is IMO a huge win, I think we need to have a
> "bridge" to allow async completions being detectable through a pollable
> (by means of select/poll/epoll whatever) device.

Ugh, I'd rather not if we don't have to.

It seems like you could get this behaviour from issuing a poll/select 
(really?)/epoll as one of the async calls to complete.  (And you  
mention this in a later email? :))

Part of my thinking on this is that we might want it to be really  
trivial to create and wait on groups of ops.. maybe as a context.   
One of the things posix AIO wants is the notion of submitting and  
waiting on a group of ops as a "list".  That sounds like we might be  
able to implement it by issuing ops against a context, created as  
part of the submission, and then waiting for it to drain.

Being able to wait on that with file->poll() obviously requires  
juggling file-> associations which sounds like more weight than we  
might want.  Or it'd be optional and we'd get more moving parts and  
divergent paths to test.

So, sure, it's possible and not terribly difficult, but I'd rather  
avoid it if people can be convinced to get the same behaviour by  
issuing an async instance of their favourite readiness syscall.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-03  8:23                   ` Ingo Molnar
  2007-02-03  9:25                     ` Matt Mackall
@ 2007-02-05 17:44                     ` Zach Brown
  2007-02-05 19:26                       ` Davide Libenzi
  1 sibling, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, linux-kernel, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> But really, being a scheduler guy i was much more concerned about the
> duplication and problems caused by the fibril concept itself - which
> duplication and complexity makes up 80% of Zach's submitted patchset.
> For example this bit:
>
>    [PATCH 3 of 4] Teach paths to wake a specific void * target
>
> would totally go away if we used kernel threads for this.

Uh, would it?  Are you talking about handing off the *task_struct*  
that it was submitted under to each worker thread that inherits the  
stack?

I guess I hadn't considered going that far.  I had somehow  
constructed a block in my mind that we couldn't release the  
task_struct from the submitting task.  But maybe we can be clever  
enough with the task_struct updating that userspace wouldn't notice a  
significant change.

Hmm.

> i totally agree that the API /should/ be the main focus - but i didnt
> pick the topic and most of the patchset's current size is due to the IMO
> avoidable fibril concept.

I, too, totally agree.  I didn't even approach the subject for  
exactly the reason you allude to -- I wanted to get the hard parts of  
the kernel side right first.

> regarding the API, i dont really agree with the current form and design
> of Zach's interface.

Haha, well, yes, of course.  You couldn't have thought that the  
dirt-stupid sys_asys_wait_for_completion() was anything more than simple  
scaffolding to test the kernel bits.

> fundamentally, the basic entity of this thing should be a /system call/,
> not the artificial fibril thing:
>
>   +struct asys_call {
>   +       struct asys_result      *result;
>   +       struct fibril           fibril;
>   +};

You picked a weird struct to highlight here.  struct asys_input seems  
more related to the stuff you go on to discuss below.  This asys_call  
struct is a relatively irrelevant internal detail of how  
asys_teardown_stack() gets from a fibril to the pre-allocated  
completion state once the call has returned.

> The normal and most optimal workflow should be a user-space ring-buffer
> of these constant-size struct async_syscall entries:
>
>   struct async_syscall ringbuffer[1024];
>
>   LIST_HEAD(submitted);
>   LIST_HEAD(pending);
>   LIST_HEAD(completed);

I strongly disagree here, and I'm hoping you're not as keen on this  
now -- your reply to Matt gives me hope.

As mentioned, that they complete out-of-order leads, at least, to  
having separate submission and completion rings.  I'm not sure a  
submission ring makes any sense given the goal of processing the  
calls in submission and only creating threads if it blocks.  A simple  
copy of an array of these input structs sounds fine to me.

When I think about the completion side I tend to hope we can end up  
with something like what VJ talked about in his net channels work.   
producer/consumer rings with head and tail pointers in different  
cache lines.  AFAIK the kevent work has headed in that direction, but  
I haven't kept up.  Uli has certainly mentioned it in his 'ec' (event  
channels) proposals.
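
To make the completion-side shape concrete, here is a rough userspace sketch 
of a single-producer/single-consumer ring with head and tail kept in separate 
cache lines.  Everything in it is an assumption for illustration: the 64-byte 
line size, the slot count, the entry layout and all names are made up, and it 
is not taken from the net channels or kevent code.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 8u	/* power of two, picked only for illustration */
#define CACHE_LINE 64	/* assumed cache line size */

/* Placeholder completion entry: a return code plus an opaque cookie. */
struct completion {
	long return_code;
	unsigned long cookie;
};

struct completion_ring {
	alignas(CACHE_LINE) atomic_uint head;	/* written only by the producer */
	alignas(CACHE_LINE) atomic_uint tail;	/* written only by the consumer */
	alignas(CACHE_LINE) struct completion slots[RING_SLOTS];
};

static void ring_init(struct completion_ring *r)
{
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct completion_ring *r, struct completion c)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SLOTS)
		return false;
	r->slots[head % RING_SLOTS] = c;
	/* Publish the slot contents before the new head becomes visible. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct completion_ring *r, struct completion *out)
{
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*out = r->slots[tail % RING_SLOTS];
	/* Release the slot back to the producer. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

The point of the split is that the producer and consumer each write their own 
index, so in the common case neither bounces the other's cache line.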

The posix AIO list completion and, sadly, signals on completion need  
to be considered, too.

Honestly, though, I'm not worried about this part.  We'll easily come  
to an agreement.  I'm just not going to distract myself with it until  
we're happy with the scheduler side.

> Looks like we can hit many birds with this single stone: AIO, vectored
> syscalls, finegrained system-call parallelism. Hm?

Hmm, indeed.  Some flags could let userspace tell the kernel not to  
bother with all this threading/concurrency/aio nonsense and just  
issue them serially.  It'll sound nuts in these days of cheap  
syscalls and vsyscall helpers, but some Oracle folks might love this  
for issuing a gettimeofday() pair around syscalls they want to profile.

I hadn't considered that as a potential property of this interface.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-04  5:12   ` Davide Libenzi
@ 2007-02-05 17:54     ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-05 17:54 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Linus Torvalds

>
>
>> +	current->per_call = next->per_call;
>
> Pointer instead of structure copy?

Sure, there are lots of trade-offs there, but the story changes if we  
keep the 1:1 relationship between task_struct and thread_info.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:12               ` Zach Brown
@ 2007-02-05 18:24                 ` Davide Libenzi
  2007-02-05 21:44                   ` David Miller
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 18:24 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Since I still think that the many-thousands potential async operations
> > coming from network sockets are better handled with a classical event
> > mechanism [1], and since smooth integration of new async syscall into the
> > standard POSIX infrastructure is IMO a huge win, I think we need to have a
> > "bridge" to allow async completions being detectable through a pollable
> > (by means of select/poll/epoll whatever) device.
> 
> Ugh, I'd rather not if we don't have to.
> 
> It seems like you could get this behaviour from issuing a
> poll/select(really?)/epoll as one of the async calls to complete.  (And you
> mention this in a later email? :))

Yes, no need for the above. We can just host a poll/epoll in an async() 
operation, and demultiplex once that gets ready.



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:02             ` Zach Brown
@ 2007-02-05 18:52               ` Davide Libenzi
  2007-02-05 19:20                 ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 18:52 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The 'pool' of kernel threads doesnt even have to be per-task, it can be
> > a natural per-CPU thing
> 
> Yeah, absolutely.

Hmmm, so we issue an async sys_read(): what will a get_file(fd) return for 
a per-CPU kthread executing such a syscall? Unless we teach context_switch() 
to do an inherit-trick for "files" (and even in that case, it won't work if 
we switch from another context). And is that all it takes?
IMO it's got to be either a per-process thread pool or a fibril approach.
Or we need some sort of enter_context()/leave_context() (adopt mm, files, 
...) to let a per-CPU kthread execute the syscall from the 
async() caller context. Hmmm?



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 18:52               ` Davide Libenzi
@ 2007-02-05 19:20                 ` Zach Brown
  2007-02-05 19:38                   ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 19:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> Or we need some sort of enter_context()/leave_context() (adopt mm, files,
> ...) to have a per-CPU kthread to be able to execute the syscall from the
> async() caller context.

I believe that's what Ingo is hoping for, yes.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 17:44                     ` Zach Brown
@ 2007-02-05 19:26                       ` Davide Libenzi
  2007-02-05 19:41                         ` Zach Brown
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 19:26 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The normal and most optimal workflow should be a user-space ring-buffer
> > of these constant-size struct async_syscall entries:
> > 
> >  struct async_syscall ringbuffer[1024];
> > 
> >  LIST_HEAD(submitted);
> >  LIST_HEAD(pending);
> >  LIST_HEAD(completed);
> 
> I strongly disagree here, and I'm hoping you're not as keen on this now --
> your reply to Matt gives me hope.
> 
> As mentioned, that they complete out-of-order leads, at least, to having
> separate submission and completion rings.  I'm not sure a submission ring
> makes any sense given the goal of processing the calls in submission and only
> creating threads if it blocks.  A simple copy of an array of these input
> structs sounds fine to me.

The "result" of one async operation is basically a cookie and a result 
code. Eight or sixteen bytes at most. IMO, before going wacko designing 
complex shared userspace-kernel result buffers, I think it'd be better 
measuring the worth-value of the thing ;)



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:20                 ` Zach Brown
@ 2007-02-05 19:38                   ` Davide Libenzi
  0 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 19:38 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Or we need some sort of enter_context()/leave_context() (adopt mm, files,
> > ...) to have a per-CPU kthread to be able to execute the syscall from the
> > async() caller context.
> 
> I believe that's what Ingo is hoping for, yes.

Ok, but then we should ask ourselves if it's really worth having a 
per-CPU pool (that will require quite a few changes to the current way 
of doing things), or a per-process pool (that would basically work as is). 
What advantage does a per-CPU pool give us?
Setup cost? Not really IMO. Thread creation is pretty cheap, and a typical 
process using async will have a pretty huge lifespan (compared to the pool 
creation cost).
Configurability favours a per-process pool, because it may allow each 
process (eventually) to size its own.
What's the real point in favour of a per-CPU pool that justifies all the 
changes that will have to be done in order to adopt such a concept?



- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:26                       ` Davide Libenzi
@ 2007-02-05 19:41                         ` Zach Brown
  2007-02-05 20:10                           ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 19:41 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> The "result" of one async operation is basically a cookie and a result
> code. Eight or sixteen bytes at most.

s/basically/minimally/

Well, yeah.  The patches I sent had:

struct asys_completion {
         long            return_code;
         unsigned long   cookie;
};

That's as stupid as it gets.

> IMO, before going wacko designing
> complex shared userspace-kernel result buffers, I think it'd be better
> measuring the worth-value of the thing ;)

Obviously, yes.

The potential win is to be able to have one place to wait for  
collection from multiple sources.  Some of them might want more data  
per event.  They can always indirect out via a cookie pointer,  sure,  
but at insanely high message rates (10gige small messages) one might  
not want that.

See also: the kevent thread.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 19:41                         ` Zach Brown
@ 2007-02-05 20:10                           ` Davide Libenzi
  2007-02-05 20:21                             ` Zach Brown
  2007-02-05 20:39                             ` Linus Torvalds
  0 siblings, 2 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 20:10 UTC (permalink / raw)
  To: Zach Brown
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The "result" of one async operation is basically a cookie and a result
> > code. Eight or sixteen bytes at most.
> 
> s/basically/minimally/
> 
> Well, yeah.  The patches I sent had:
> 
> struct asys_completion {
>        long            return_code;
>        unsigned long   cookie;
> };
> 
> That's as stupid as it gets.

No, that's *really* it ;)
The cookie you pass, and the return code of the syscall.
Is there other data transferred? Sure, but that data is transferred during the 
syscall processing, and handled by the syscall (filling up a sys_read 
buffer, just for example).




> > IMO, before going wacko designing
> > complex shared userspace-kernel result buffers, I think it'd be better
> > measuring the worth-value of the thing ;)
> 
> Obviously, yes.
> 
> The potential win is to be able to have one place to wait for collection from
> multiple sources.  Some of them might want more data per event.  They can
> always indirect out via a cookie pointer,  sure, but at insanely high message
> rates (10gige small messages) one might not want that.

Did I miss something? The async() syscall will allow (with few 
restrictions) to execute whatever syscall in an async fashion. A syscall 
returns a result code (long). Plus, you need to pass back the 
userspace-provided cookie of course. A cookie is very likely a direct 
pointer to the userspace session the async syscall applies to, so a
"(my_session *) results[i].cookie" will bring you directly on topic.
Collection of multiple sources? What do you mean? What's wrong with:

int async_wait(struct asys_completion *results, int nresults);

Is saving an 8/16 byte double copy worth going wacko in designing shared 
userspace/kernel buffers, when the syscall that lies behind an 
asys_completion is prolly touching KBs of RAM during its execution?
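
As a sketch of what the consumer side of such an async_wait() could look 
like: the asys_completion layout is the one from Zach's patches, while 
my_session and dispatch_completions() are hypothetical names made up for the 
example, and the results array is assumed to have already been filled in by 
the wait call.

```c
/* Completion record as posted in the patches: result code plus cookie. */
struct asys_completion {
	long return_code;
	unsigned long cookie;
};

/* Hypothetical per-request userspace state; the cookie points at it. */
struct my_session {
	int fd;
	long last_result;
	unsigned int completed;
};

/*
 * Walk the completions returned by a wait call and hand each result
 * straight to the session it belongs to.  The cookie is assumed to be
 * the address of a live struct my_session passed in at submission time.
 */
static void dispatch_completions(const struct asys_completion *results,
				 int nresults)
{
	int i;

	for (i = 0; i < nresults; i++) {
		struct my_session *s = (struct my_session *)results[i].cookie;

		s->last_result = results[i].return_code;
		s->completed++;
	}
}
```

No lookup table is involved: the cast on the cookie is the whole 
demultiplexing step.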




- Davide



^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:10                           ` Davide Libenzi
@ 2007-02-05 20:21                             ` Zach Brown
  2007-02-05 20:42                               ` Linus Torvalds
  2007-02-05 20:39                             ` Linus Torvalds
  1 sibling, 1 reply; 151+ messages in thread
From: Zach Brown @ 2007-02-05 20:21 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

> No, that's *really* it ;)

For syscalls, sure.

The kevent work incorporates Uli's desire to have more data per  
event.  Have you read his OLS stuff?  It's been a while since I did  
so I've lost the details of why he cares to have more.

Let me say it again, maybe a little louder this time:  I'm not  
interested in worrying about this aspect of the API until the  
scheduler mechanics are more solidified.

- z

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:10                           ` Davide Libenzi
  2007-02-05 20:21                             ` Zach Brown
@ 2007-02-05 20:39                             ` Linus Torvalds
  2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:21                               ` Zach Brown
  1 sibling, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-05 20:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise



On Mon, 5 Feb 2007, Davide Libenzi wrote:
>
> No, that's *really* it ;)
>
> The cookie you pass, and the return code of the syscall.
> Is there other data transferred? Sure, but that data is transferred during the 
> syscall processing, and handled by the syscall (filling up a sys_read 
> buffer, just for example).

Indeed. One word is *exactly* what a normal system call returns too.

That said, normally we have a user-space library layer to turn that into 
the "errno + return value" thing, and in the case of async() calls we 
very basically wouldn't have that. So either:

 - we'd need to do it in the kernel (which is actually nasty, since 
   different system calls have slightly different semantics - some don't 
   return any error value at all, and negative numbers are real numbers)

 - we'd have to teach user space about the "negative errno" mechanism, in 
   which case one word really is always enough.

Quite frankly, I much prefer the second alternative. The "negative errno" 
thing has not only worked really really well inside the kernel, it's so 
obviously 100% superior to the standard UNIX "-1 + errno" approach that 
it's not even funny. 

To see why "negative errno" is better, just look at any threaded program, 
or look at any program that does multiple calls and needs to save the 
errno not from the last one, but from some earlier one (eg, it does a 
"close()" in between returning *its* error, and the real operation that 
we care about).
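
A small sketch of what "teaching user space about negative errno" could mean 
in practice, assuming the usual kernel convention that only values in 
[-4095, -1] encode an error.  The helper names and the MODEL_MAX_ERRNO 
constant are made up for the example; system calls that legitimately return 
negative numbers would still need their own handling.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed to mirror the kernel convention: errors live in [-4095, -1]. */
#define MODEL_MAX_ERRNO 4095L

/* True if a raw one-word result encodes an error rather than a value. */
static bool result_is_error(long result)
{
	return result < 0 && result >= -MODEL_MAX_ERRNO;
}

/* Extract the errno value from a raw result, or 0 if it was a success. */
static int result_errno(long result)
{
	return result_is_error(result) ? (int)-result : 0;
}
```

Each result word carries its own error, so nothing is lost if a later call 
(the "close()" in the example above) clobbers a shared errno in between.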

> Did I miss something? The async() syscall will allow (with few 
> restrictions) to execute whatever syscall in an async fashion. A syscall 
> returns a result code (long). Plus, you need to pass back the 
> userspace-provided cookie of course.

HOWEVER, they get returned differently. The cookie gets returned 
immediately, the system call result gets returned in-memory only after the 
async thing has actually completed.

I would actually argue that it's not the kernel that should generate any 
cookie, but that user-space should *pass*in* the cookie it wants to, and 
the kernel should consider it a pointer to a 64-bit entity which is the 
return code.

In other words, the only "cookie" we need is literally the pointer to the 
results. And that's obviously something that the user space has to set up 
anyway.

So how about making the async interface be:

	// returns negative for error
	// zero for "synchronous"
	// positive kernel "wait for me" cookie for success
	long sys_async_submit(
		unsigned long flags,
		long *user_result_ptr,
		long syscall,
		unsigned long *args);

and the "args" thing would literally just fill up the registers.
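
A caller-side sketch of that convention, purely as an illustration: the 
three-way classification follows the comments in the prototype above, while 
the six-slot args array, the enum and both helper names are assumptions made 
up for the example (nothing here calls a real system call).

```c
#include <stddef.h>

#define MODEL_NR_ARGS 6	/* assumed: one slot per syscall argument register */

enum submit_outcome {
	SUBMIT_ERROR,		/* negative return: submission failed */
	SUBMIT_COMPLETED_SYNC,	/* zero: result is already in *user_result_ptr */
	SUBMIT_IN_FLIGHT	/* positive: kernel "wait for me" cookie */
};

/* Map the raw return value of the proposed submit call to its meaning. */
static enum submit_outcome classify_submit(long ret)
{
	if (ret < 0)
		return SUBMIT_ERROR;
	if (ret == 0)
		return SUBMIT_COMPLETED_SYNC;
	return SUBMIT_IN_FLIGHT;
}

/* Fill the args array the way a read(fd, buf, count) would fill registers. */
static void pack_read_args(unsigned long args[MODEL_NR_ARGS],
			   int fd, void *buf, size_t count)
{
	size_t i;

	args[0] = (unsigned long)fd;
	args[1] = (unsigned long)buf;
	args[2] = (unsigned long)count;
	for (i = 3; i < MODEL_NR_ARGS; i++)
		args[i] = 0;	/* unused argument slots */
}
```

The cached case is the middle one: the caller gets zero back and can read the 
result immediately, without ever waiting on a cookie.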

The real downside here is that it's very architecture-specific this way, 
and that means that x86-64 (and other 64-bit ones) would need to have 
emulation layers for the 32-bit ones, but they likely need to do that 
*anyway*, so it's probably not a huge downside. The alternative is to:

 - make a new architecture-independent system call enumeration for the 
   async interface

 - make everything use 64-bit values.
		
Now, making an architecture-independent system call enumeration may 
actually make sense regardless, because it would allow sys_async() to have 
its own system call table and put the limitations and rules for those 
system calls there, instead of depending on the per-architecture system 
call table that tends to have some really architecture-specific details.

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 151+ messages in thread

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:21                             ` Zach Brown
@ 2007-02-05 20:42                               ` Linus Torvalds
  0 siblings, 0 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-05 20:42 UTC (permalink / raw)
  To: Zach Brown
  Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise



On Mon, 5 Feb 2007, Zach Brown wrote:
> 
> For syscalls, sure.
> 
> The kevent work incorporates Uli's desire to have more data per event.  Have
> you read his OLS stuff?  It's been a while since I did so I've lost the
> details of why he cares to have more.

You'd still do that as _arguments_ to the system call, not as the return 
value.

Also, quite frankly, I tend to find Uli over-designs things. The whole 
disease of "make things general" is a CS disease that some people take to 
extremes.

The good thing about generic code is not that it solves some generic 
problem. The good thing about generics is that they mean that you can 
_avoid_ solving other problems AND HAVE LESS CODE. But some people seem to 
think that "generic" means that you have to have tons of code to handle 
all the possible cases, and that *completely* misses the point.

We want less code. The whole (and really, the _only_) point of the 
fibrils, at least as far as I'm concerned, is to *not* have special code 
for aio_read/write/whatever.

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:39                             ` Linus Torvalds
@ 2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06  0:32                                 ` Davide Libenzi
  2007-02-05 21:21                               ` Zach Brown
  1 sibling, 2 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 21:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> Indeed. One word is *exactly* what a normal system call returns too.
> 
> That said, normally we have a user-space library layer to turn that into 
> the "errno + return value" thing, and in the case of async() calls we 
> very basically wouldn't have that. So either:
> 
>  - we'd need to do it in the kernel (which is actually nasty, since 
>    different system calls have slightly different semantics - some don't 
>    return any error value at all, and negative numbers are real numbers)
> 
>  - we'd have to teach user space about the "negative errno" mechanism, in 
>    which case one word really is always enough.
> 
> Quite frankly, I much prefer the second alternative. The "negative errno" 
> thing has not only worked really really well inside the kernel, it's so 
> obviously 100% superior to the standard UNIX "-1 + errno" approach that 
> it's not even funny. 

Currently it's in the syscall wrapper. Couldn't we have it in the 
asys_teardown_stack() stub?



> HOWEVER, they get returned differently. The cookie gets returned 
> immediately, the system call result gets returned in-memory only after the 
> async thing has actually completed.
> 
> I would actually argue that it's not the kernel that should generate any 
> cookie, but that user-space should *pass*in* the cookie it wants to, and 
> the kernel should consider it a pointer to a 64-bit entity which is the 
> return code.

Yes. Let's have userspace "mark" the async operation. IMO the 
cookie should be something transparent to the kernel.
Like you said though, that'd require compat-code (unless we fix the size).



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 20:39                             ` Linus Torvalds
  2007-02-05 21:09                               ` Davide Libenzi
@ 2007-02-05 21:21                               ` Zach Brown
  1 sibling, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-05 21:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ingo Molnar, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

>  - we'd need to do it in the kernel (which is actually nasty, since
>    different system calls have slightly different semantics - some don't
>    return any error value at all, and negative numbers are real numbers)
>
>  - we'd have to teach user space about the "negative errno" mechanism, in
>    which case one word really is always enough.
>
> Quite frankly, I much prefer the second alternative. The "negative errno"
> thing has not only worked really really well inside the kernel, it's so
> obviously 100% superior to the standard UNIX "-1 + errno" approach that
> it's not even funny.

I agree, and I imagine you'd have a hard time finding someone who  
actually *likes* the errno convention :)
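
To make the two conventions concrete, here is a minimal sketch of how user space could decode a single result word. The helper names are hypothetical; the 4095 bound mirrors the kernel's MAX_ERRNO, which is what lets large "negative" values such as mmap addresses pass through as real results.

```c
#include <assert.h>
#include <errno.h>

#define ASYS_MAX_ERRNO 4095 /* bound used by the kernel's IS_ERR_VALUE() */

/* True if a raw result word encodes an error rather than a value. */
static int asys_is_error(long result)
{
    return (unsigned long)result >= (unsigned long)-ASYS_MAX_ERRNO;
}

/* Extract the errno from a result word known to be an error. */
static int asys_errno_of(long result)
{
    assert(asys_is_error(result));
    return (int)-result;
}

/* Convert to the classic "-1 + errno" form, as a libc wrapper would. */
static long asys_to_libc_style(long result)
{
    if (asys_is_error(result)) {
        errno = asys_errno_of(result);
        return -1;
    }
    return result;
}
```

With the one-word form, the caller never touches the shared `errno`, so the result stays valid no matter which thread or fibril picks it up.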

> I would actually argue that it's not the kernel that should generate any
> cookie, but that user-space should *pass*in* the cookie it wants to, and
> the kernel should consider it a pointer to a 64-bit entity which is the
> return code.

Yup.  That's how the current code (and epoll, and fs/aio.c, and..) works.

Cancelation comes into this discussion, I think.  Hopefully it's
reasonable to expect userspace to be able to manage cookies well
enough that they can use them to issue cancels and only hit the ops
they intend to.  It means we have to give them the tools to
differentiate between a racing completion and cancelation so they can
reuse a cookie at the right time, but that doesn't sound fatal.

>  - make everything use 64-bit values.

This would be my preference.

> Now, making an architecture-independent system call enumeration may
> actually make sense regardless, because it would allow sys_async() to have
> its own system call table and put the limitations and rules for those
> system calls there, instead of depending on the per-architecture system
> call table that tends to have some really architecture-specific details.

Maybe, sure.  I don't have a lot of insight into this.  Hopefully  
some arch maintainers can jump in?

- z


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:09                               ` Davide Libenzi
@ 2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06 20:25                                   ` Davide Libenzi
  2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06  0:32                                 ` Davide Libenzi
  1 sibling, 2 replies; 151+ messages in thread
From: Kent Overstreet @ 2007-02-05 21:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> > HOWEVER, they get returned differently. The cookie gets returned
> > immediately, the system call result gets returned in-memory only after the
> > async thing has actually completed.
> >
> > I would actually argue that it's not the kernel that should generate any
> > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > the kernel should consider it a pointer to a 64-bit entity which is the
> > return code.
>
> Yes. Let's have userspace "mark" the async operation. IMO the
> cookie should be something transparent to the kernel.
> Like you said though, that'd require compat-code (unless we fix the size).

You don't need an explicit cookie if you're passing in a pointer to
the return code, it doesn't really save you anything to do so. Say
you've got a bunch of user threads (with or without stacks, it doesn't
matter).

struct asys_ret {
     int ret;
     struct thread *p;
};

struct asys_ret r;
r.p = me;

async_read(fd, buf, nbytes, &r);

Then you just have your async_getevents return the same pointers you
passed in, and your main event loop gets pointers to its threads for
free.

It seems cleaner to do it this way vs. returning structs with the
actual return code and a cookie, as threads get the return code
exactly where they want it.
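
A sketch of the event loop this implies, under stated assumptions: the kernel side is replaced by a small in-memory queue, and `mock_complete()`, `mock_async_getevents()` and the `runnable` flag are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct thread { int id; int runnable; }; /* hypothetical user-level thread */

struct asys_ret {
    int ret;
    struct thread *p;
};

/* Mock completion queue standing in for the kernel side. */
#define MAX_PENDING 16
static struct asys_ret *completed[MAX_PENDING];
static size_t ncompleted;

/* Mock: "complete" an operation by storing its result and queueing it. */
static void mock_complete(struct asys_ret *r, int ret)
{
    assert(ncompleted < MAX_PENDING);
    r->ret = ret;
    completed[ncompleted++] = r;
}

/* Mock async_getevents(): hands back the same pointers that were passed in. */
static size_t mock_async_getevents(struct asys_ret **events, size_t max)
{
    size_t n = 0;

    while (n < max && ncompleted > 0)
        events[n++] = completed[--ncompleted];
    return n;
}

/* One pass of the main loop: wake the thread behind each completion. */
static size_t run_event_loop_once(void)
{
    struct asys_ret *events[MAX_PENDING];
    size_t i, n = mock_async_getevents(events, MAX_PENDING);

    for (i = 0; i < n; i++)
        events[i]->p->runnable = 1; /* result already sits in events[i]->ret */
    return n;
}
```

The loop never looks anything up: each returned pointer leads straight to both the return code and the thread waiting on it.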

Keep in mind that the epoll way (while great for epoll, I do love it)
makes sense because it doesn't have to deal with any sort of return
codes.

My only other point is that you really do want a bulk asys_submit
instead of doing a syscall per async syscall; one of the great wins of
this approach is that heavily IO-driven apps can batch up syscalls.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-02 23:37             ` Davide Libenzi
  2007-02-03  0:02               ` Davide Libenzi
  2007-02-05 17:12               ` Zach Brown
@ 2007-02-05 21:36               ` bert hubert
  2007-02-05 21:57                 ` Linus Torvalds
  2 siblings, 1 reply; 151+ messages in thread
From: bert hubert @ 2007-02-05 21:36 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Linus Torvalds, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Fri, Feb 02, 2007 at 03:37:09PM -0800, Davide Libenzi wrote:

> Since I still think that the many-thousands potential async operations 
> coming from network sockets are better handled with a classical event 
> mechanism [1], and since smooth integration of new async syscalls into the 
> standard POSIX infrastructure is IMO a huge win, I think we need to have a 
> "bridge" to make async completions detectable through a pollable 
> (by means of select/poll/epoll whatever) device.

> [1] Unless you really want to have thousands of kthreads/fibrils lingering 
>     on the system.

From my end as an application developer, yes please. Either make it
perfectly ok to have thousands of outstanding asynchronous system calls (I
work with thousands of separate sockets), or allow me to select/poll/epoll
on the "async fd".

Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
well, the fd might be easier.

In fact, perhaps the communication channel might simply *be* an fd. Queueing
up syscalls sounds remarkably like sending datagrams. 

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 18:24                 ` Davide Libenzi
@ 2007-02-05 21:44                   ` David Miller
  2007-02-06  0:15                     ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: David Miller @ 2007-02-05 21:44 UTC (permalink / raw)
  To: davidel
  Cc: zach.brown, mingo, torvalds, linux-kernel, linux-aio, suparna, bcrl

From: Davide Libenzi <davidel@xmailserver.org>
Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)

> Yes, no need for the above. We can just host a poll/epoll in an async() 
> operation, and demultiplex once that gets ready.

I can hear Evgeniy crying 8,000 miles away.

I strongly encourage a lot of folks commenting in this thread to
familiarize themselves with kevent and how it handles this stuff.  I
see a lot of suggestions for things he has totally implemented and
solved already in kevent.

I'm not talking about Zach's fibrils, I'm talking about the interface
aspects of these discussions.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:36               ` bert hubert
@ 2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
                                     ` (2 more replies)
  0 siblings, 3 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-05 21:57 UTC (permalink / raw)
  To: bert hubert
  Cc: Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Mon, 5 Feb 2007, bert hubert wrote:
> 
> From my end as an application developer, yes please. Either make it
> perfectly ok to have thousands of outstanding asynchronous system calls (I
> work with thousands of separate sockets), or allow me to select/poll/epoll
> on the "async fd".

No can do.

Allocating an fd is actually too expensive, exactly because a lot of these 
operations are supposed to be a few hundred ns, and taking locks is simply 
a bad idea.

But if you want to, we could have a *separate* "convert async cookie to 
fd" so that you can poll for it, or something.

I doubt very many people want to do that. It would tend to simply be nicer 
to do

	async(poll);
	async(waitpid);
	async(.. wait for anything else ..)

followed by a

	wait_for_async();

That's just a much NICER approach, I would argue. And it automatically 
and very naturally solves the "wait for different kinds of events" 
question, in a way that "poll()" never did (except by turning all events 
into file descriptors or signals).

> Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
> well, the fd might be easier.

Again. NO WAY. Signals are just damn expensive. At most, it would be an 
option again, but if you want high performance, signals simply aren't very 
good. They are also a nice way to make your user-space code very racy.

> In fact, perhaps the communication channel might simply *be* an fd. Queueing
> up syscalls sounds remarkably like sending datagrams. 

I'm the first to say that file descriptors are the UNIX way, but so are 
processes, and I think this is MUCH better done as a "process" interface. 
In other words, instead of doing it as a filedescriptor, do it as a 
"micro-fork/exec", and have the "wait()" equivalent. It's just that we 
don't fork a "real process", and we don't exec a "real program", we just 
exec a single system call.

If you think of it in those terms, it all makes sense *without* any file 
descriptors what-so-ever, and the "wait_for_async()" interface also makes 
a ton of sense (it really *is* "waitpid()" for the system call).
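
The analogy can be shown with a deliberately naive toy that really does fork a process per operation, which is exactly the cost the proposal wants to avoid. `toy_async()`, `toy_wait_for_async()` and the two sample operations are hypothetical names; the child's pid plays the role of the cookie and the exit status carries the (8-bit) result.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*toy_op)(void); /* an operation that may block */

/* Two sample operations; the return value must fit in an exit status. */
static int op_poll_timeout(void) { poll(NULL, 0, 50); return 1; }
static int op_immediate(void)    { return 2; }

/* Start an operation; the child's pid is the cookie (-1 on failure). */
static pid_t toy_async(toy_op op)
{
    pid_t pid;

    assert(op != NULL);
    pid = fork();
    if (pid == 0)
        _exit(op() & 0xff); /* child: run the one call, then vanish */
    return pid;
}

/* Wait for any outstanding operation; returns its cookie, or -1 if none. */
static pid_t toy_wait_for_async(int *result)
{
    int status;
    pid_t pid = waitpid(-1, &status, 0);

    if (pid > 0)
        *result = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return pid;
}
```

Whichever operation finishes first is the one the wait call reports, regardless of what kind of event it was waiting on.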

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
@ 2007-02-05 22:07                   ` bert hubert
  2007-02-05 22:15                     ` Zach Brown
  2007-02-05 22:34                   ` Davide Libenzi
  2007-02-06  0:27                   ` Scot McKinley
  2 siblings, 1 reply; 151+ messages in thread
From: bert hubert @ 2007-02-05 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Mon, Feb 05, 2007 at 01:57:15PM -0800, Linus Torvalds wrote:

> I doubt very many people want to do that. It would tend to simply be nicer 
> to do
> 
> 	async(poll);

Yeah - I saw that technique being mentioned later on in the thread, and it
would work, I think.

To make up for the waste of time, some other news. I asked Matt Dillon of
DragonflyBSD why he removed asynchronous system calls from his OS, and he
told me it was because of the problems he had implementing them in the
kernel:

    There were two basic problems:  First, it added a lot of overhead when
    most system calls are either non-blocking anyway (like getpid()). 
    Second and more importantly it was very, very difficult to recode the
    system calls that COULD block to actually be asynchronous in the kernel. 
    I spent some time recoding nanosleep() to operate asynchronously and it
    was a huge mess.

Aside from that, they did not discover any skeletons hidden in the closet,
although from mailing list traffic, I gather the asynchronous system calls
didn't see a lot of use. If I understand it correctly, for a number of years
they emulated asynchronous system calls using threads.

We'd be sidestepping the need to update all syscalls via 'fibrils' of
course.

> If you think of it in those terms, it all makes sense *without* any file 
> descriptors what-so-ever, and the "wait_for_async()" interface also makes 
> a ton of sense (it really *is* "waitpid()" for the system call).

It has me excited in any case. Once anything even remotely testable appears
(Zach tells me not to try the current code), I'll work it into MTasker
(http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
for use with very very large zones that aren't preloaded.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 22:07                   ` bert hubert
@ 2007-02-05 22:15                     ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-05 22:15 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Davide Libenzi, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> It has me excited in any case. Once anything even remotely testable appears
> (Zach tells me not to try the current code), I'll work it into MTasker
> (http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
> for use with very very large zones that aren't preloaded.

I'll be sure to let you know :)

- z


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
@ 2007-02-05 22:34                   ` Davide Libenzi
  2007-02-06  0:27                   ` Scot McKinley
  2 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-05 22:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bert hubert, Ingo Molnar, Zach Brown, Linux Kernel Mailing List,
	linux-aio, Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> On Mon, 5 Feb 2007, bert hubert wrote:
> > 
> > From my end as an application developer, yes please. Either make it
> > perfectly ok to have thousands of outstanding asynchronous system calls (I
> > work with thousands of separate sockets), or allow me to select/poll/epoll
> > on the "async fd".
> 
> No can do.
> 
> Allocating an fd is actually too expensive, exactly because a lot of these 
> operations are supposed to be a few hundred ns, and taking locks is simply 
> a bad idea.
> 
> But if you want to, we could have a *separate* "convert async cookie to 
> fd" so that you can poll for it, or something.
> 
> I doubt very many people want to do that. It would tend to simply be nicer 
> to do
> 
> 	async(poll);
> 	async(waitpid);
> 	async(.. wait for anything else ..)
> 
> followed by a
> 
> 	wait_for_async();
> 
> That's just a much NICER approach, I would argue. And it automatically 
> and very naturally solves the "wait for different kinds of events" 
> question, in a way that "poll()" never did (except by turning all events 
> into file descriptors or signals).

Bert, that was the first suggestion I gave to Zab. But then I realized 
that a multiplexed poll/epoll can be "hosted" in an async op, just like 
Linus showed above. Will work just fine IMO.




- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:44                   ` David Miller
@ 2007-02-06  0:15                     ` Davide Libenzi
  0 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-06  0:15 UTC (permalink / raw)
  To: David Miller
  Cc: zach.brown, Ingo Molnar, torvalds, Linux Kernel Mailing List,
	linux-aio, suparna, bcrl

On Mon, 5 Feb 2007, David Miller wrote:

> From: Davide Libenzi <davidel@xmailserver.org>
> Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)
> 
> > Yes, no need for the above. We can just host a poll/epoll in an async() 
> > operation, and demultiplex once that gets ready.
> 
> I can hear Evgeniy crying 8,000 miles away.
> 
> I strongly encourage a lot of folks commenting in this thread to
> familiarize themselves with kevent and how it handles this stuff.  I
> see a lot of suggestions for things he has totally implemented and
> solved already in kevent.

David, I'm sorry but I only briefly looked at the work Evgeniy did on 
kevent. So excuse me if I say something broken in the next few sentences.
Zab's async syscall interface is a pretty simple one. It accepts the 
syscall number, the parameters for the syscall, and a cookie. It returns a 
syscall result code, and your cookie (that's the meat of it, at least). 
IMO its interface should be optimized for what it does.
Could this submission/retrieval be folded into a "generic" 
submission/retrieval API? Sure it could. But then you end up having 
submission/event structures with 17 members, 3 of which are valid at any 
one time. The API becomes more difficult to use IMO, because suddenly you 
have to know which fields are valid for each event you're submitting/fetching.
IMHO, genericity can be built in userspace, *if* one really wants it and, 
of course, provided that the OS gives you the tools to build it.
The problem before was that it was hard to bridge something like poll/epoll 
with other "by nature" sync operations. Evgeniy's kevent is one attempt, 
Zab's async is another one. IMO the async syscall is a *very powerful* one, 
since it allows for *total coverage* async support w/out "plugs" all over 
the kernel paths.
But as Zab said, the kernel implementation is more important ATM.




- Davide



* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:57                 ` Linus Torvalds
  2007-02-05 22:07                   ` bert hubert
  2007-02-05 22:34                   ` Davide Libenzi
@ 2007-02-06  0:27                   ` Scot McKinley
  2007-02-06  0:48                     ` David Miller
  2007-02-06  0:48                     ` Joel Becker
  2 siblings, 2 replies; 151+ messages in thread
From: Scot McKinley @ 2007-02-06  0:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: bert hubert, Davide Libenzi, Ingo Molnar, Zach Brown,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise


As Joel mentioned earlier, from an Oracle perspective, one of the key 
things we are looking for is a nice clean *common* wait point. We don't 
really care whether this common wait point is the old libaio:async-poll, 
epoll, or "wait_for_async". And if "wait_for_async" has the added 
benefit of scaling, all the better.

However, it is desirable for that common wait-routine to have the 
ability to return explicit completions, instead of requiring a follow-on 
call to some other query/wait for events/completions for each of the 
different types of async submissions done (poll, pid, i/o, ...). 
Obviously not a "must-have", but desirable.

It is also desirable (if possible) to have immediate completions (either 
immediate errs or async submissions that complete synchronously) 
communicated at submission time, instead of via the common wait-routine.

Finally, it is agreed that neg-errno is a much better approach for the 
return code. The threading/concurrency issues associated w/ the current 
unix errno have always been a buggy area for Oracle Networking code.

Regards, -Scot 

Linus Torvalds wrote:

>On Mon, 5 Feb 2007, bert hubert wrote:
>  
>
>>From my end as an application developer, yes please. Either make it
>>perfectly ok to have thousands of outstanding asynchronous system calls (I
>>work with thousands of separate sockets), or allow me to select/poll/epoll
>>on the "async fd".
>>    
>>
>
>No can do.
>
>Allocating an fd is actually too expensive, exactly because a lot of these 
>operations are supposed to be a few hundred ns, and taking locks is simply 
>a bad idea.
>
>But if you want to, we could have a *separate* "convert async cookie to 
>fd" so that you can poll for it, or something.
>
>I doubt very many people want to do that. It would tend to simply be nicer 
>to do
>
>	async(poll);
>	async(waitpid);
>	async(.. wait for anything else ..)
>
>followed by a
>
>	wait_for_async();
>
>That's just a much NICER approach, I would argue. And it automatically 
>and very naturally solves the "wait for different kinds of events" 
>question, in a way that "poll()" never did (except by turning all events 
>into file descriptors or signals).
>
>  
>
>>Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
>>well, the fd might be easier.
>>    
>>
>
>Again. NO WAY. Signals are just damn expensive. At most, it would be an 
>option again, but if you want high performance, signals simply aren't very 
>good. They are also a nice way to make your user-space code very racy.
>
>  
>
>>In fact, perhaps the communication channel might simply *be* an fd. Queueing
>>up syscalls sounds remarkably like sending datagrams. 
>>    
>>
>
>I'm the first to say that file descriptors are the UNIX way, but so are 
>processes, and I think this is MUCH better done as a "process" interface. 
>In other words, instead of doing it as a filedescriptor, do it as a 
>"micro-fork/exec", and have the "wait()" equivalent. It's just that we 
>don't fork a "real process", and we don't exec a "real program", we just 
>exec a single system call.
>
>If you think of it in those terms, it all makes sense *without* any file 
>descriptors what-so-ever, and the "wait_for_async()" interface also makes 
>a ton of sense (it really *is* "waitpid()" for the system call).
>
>		Linus
>



* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:09                               ` Davide Libenzi
  2007-02-05 21:31                                 ` Kent Overstreet
@ 2007-02-06  0:32                                 ` Davide Libenzi
  1 sibling, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-06  0:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, 5 Feb 2007, Davide Libenzi wrote:

> On Mon, 5 Feb 2007, Linus Torvalds wrote:
> 
> > Indeed. One word is *exactly* what a normal system call returns too.
> > 
> > That said, normally we have a user-space library layer to turn that into 
> > the "errno + return value" thing, and in the case of async() calls we 
> > very basically wouldn't have that. So either:
> > 
> >  - we'd need to do it in the kernel (which is actually nasty, since 
> >    different system calls have slightly different semantics - some don't 
> >    return any error value at all, and negative numbers are real numbers)
> > 
> >  - we'd have to teach user space about the "negative errno" mechanism, in 
> >    which case one word really is always enough.
> > 
> > Quite frankly, I much prefer the second alternative. The "negative errno" 
> > thing has not only worked really really well inside the kernel, it's so 
> > obviously 100% superior to the standard UNIX "-1 + errno" approach that 
> > it's not even funny. 
> 
> Currently it's in the syscall wrapper. Couldn't we have it in the 
> asys_teardown_stack() stub?

Eeeek, that was something *really* stupid I said :D



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06  0:27                   ` Scot McKinley
@ 2007-02-06  0:48                     ` David Miller
  2007-02-06  0:48                     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: David Miller @ 2007-02-06  0:48 UTC (permalink / raw)
  To: scot.mckinley
  Cc: torvalds, bert.hubert, davidel, mingo, zach.brown, linux-kernel,
	linux-aio, suparna, bcrl

From: Scot McKinley <scot.mckinley@oracle.com>
Date: Mon, 05 Feb 2007 16:27:44 -0800

> As Joel mentioned earlier, from an Oracle perspective, one of the key 
> things we are looking for is a nice clean *common* wait point.

How much investigation have the Oracle folks (besides Zach :-) done
into Evgeniy's kevent interfaces and how much feedback have they given
to him?

I know it sounds like I'm being a pain in the ass, but it saddens
me that there is this whole large body of work implemented to solve
a problem, the maintainer keeps posting patch sets and the whole
discussion has gone silent.

I'd be quiet if there were some well formulated objections to his work
being posted, but people are posting nothing.  So either it's a
perfect API or people aren't giving it the attention and consideration
it deserves.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06  0:27                   ` Scot McKinley
  2007-02-06  0:48                     ` David Miller
@ 2007-02-06  0:48                     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2007-02-06  0:48 UTC (permalink / raw)
  To: Scot McKinley
  Cc: Linus Torvalds, bert hubert, Davide Libenzi, Ingo Molnar,
	Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On Mon, Feb 05, 2007 at 04:27:44PM -0800, Scot McKinley wrote:
> Finally, it is agreed that neg-errno is a much better approach for the 
> return code. The threading/concurrency issues associated w/ the current 
> unix errno have always been a buggy area for Oracle Networking code.

	As Scot knows, when Oracle started using the current io_submit(2)
and io_getevents(2), -errno was a big win.

Joel

-- 

"Born under a bad sign.
 I been down since I began to crawl.
 If it wasn't for bad luck,
 I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:31                                 ` Kent Overstreet
@ 2007-02-06 20:25                                   ` Davide Libenzi
  2007-02-06 20:46                                   ` Linus Torvalds
  1 sibling, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-06 20:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Mon, 5 Feb 2007, Kent Overstreet wrote:

> > > HOWEVER, they get returned differently. The cookie gets returned
> > > immediately, the system call result gets returned in-memory only after the
> > > async thing has actually completed.
> > >
> > > I would actually argue that it's not the kernel that should generate any
> > > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > > the kernel should consider it a pointer to a 64-bit entity which is the
> > > return code.
> > 
> > Yes. Let's have userspace "mark" the async operation. IMO the
> > cookie should be something transparent to the kernel.
> > Like you said though, that'd require compat-code (unless we fix the size).
> 
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
> 
> struct asys_ret {
>     int ret;
>     struct thread *p;
> };
> 
> struct asys_ret r;
> r.p = me;
> 
> async_read(fd, buf, nbytes, &r);

Hmm, are you working for Symbian? Because that's exactly how they track 
pending async operations (address of a status variable - wrapped in a 
class of course, being them) ;)
That's another way of doing it, IMO no better and no worse than letting 
userspace select an explicit cookie. You still have to have the 
compat code though, either way.



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-05 21:31                                 ` Kent Overstreet
  2007-02-06 20:25                                   ` Davide Libenzi
@ 2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06 21:16                                     ` David Miller
  2007-02-06 22:45                                     ` Kent Overstreet
  1 sibling, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-06 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Mon, 5 Feb 2007, Kent Overstreet wrote:
> 
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
> 
> struct asys_ret {
>     int ret;
>     struct thread *p;
> };
> 
> struct asys_ret r;
> r.p = me;
> 
> async_read(fd, buf, nbytes, &r);

That's horrible. It means that "r" cannot have automatic linkage (since 
the stack will be *gone* by the time we need to fill in "ret"), so now you 
need to track *two* pointers: "me" and "&r".

Wouldn't it be much better to just track one (both in user space and in 
kernel space)?

In kernel space, the "one pointer" would be the fibril pointer (which 
needs to have all the information necessary for completing the operation 
anyway), and in user space, it would be better to have just the cookie be 
a pointer to the place where you expect the return value (since you need 
both anyway).

I think the point here (for *both* the kernel and user space) would be to 
try to keep the interfaces really easy to use. For the kernel, it means 
that we don't ever pass anything new around: the "fibril" pointer is 
basically defined by the current execution thread.

And for user space, it means that we pass the _one_ thing around that we 
need for both identifying the async operation to the kernel (the "cookie") 
for wait or cancel, and the place where we expect the return value to be 
found (which in turn can _easily_ represent a whole "struct aiocb *", 
since the return value obviously has to be embedded in there anyway).

		Linus
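A minimal sketch of the single-pointer idea above, in C. It assumes the cookie is simply the address of a 64-bit result slot embedded in a larger user-space request. The struct my_request layout and the helper names are hypothetical and only illustrate the shape; fake_complete() stands in for whatever the kernel would do on completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space request. Only the idea of embedding the
 * return value comes from the discussion; the field names are made up. */
struct my_request {
	int fd;
	void *buf;
	size_t nbytes;
	int64_t result;		/* the kernel would store the return code here */
};

/* The one pointer handed to the kernel: cookie and result slot at once. */
static int64_t *request_cookie(struct my_request *req)
{
	assert(req != NULL);
	return &req->result;
}

/* Map a cookie back to its enclosing request (container_of by hand). */
static struct my_request *request_from_cookie(int64_t *cookie)
{
	assert(cookie != NULL);
	return (struct my_request *)((char *)cookie -
				     offsetof(struct my_request, result));
}

/* Stand-in for the kernel finishing the operation. */
static void fake_complete(int64_t *cookie, int64_t rc)
{
	*cookie = rc;
}
```

Because the request is recovered from the cookie by pointer arithmetic, user space has only one pointer to track per operation, and the request must live somewhere that outlasts the submitting stack frame.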


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 20:46                                   ` Linus Torvalds
@ 2007-02-06 21:16                                     ` David Miller
  2007-02-06 21:28                                       ` Linus Torvalds
  2007-02-06 22:45                                     ` Kent Overstreet
  1 sibling, 1 reply; 151+ messages in thread
From: David Miller @ 2007-02-06 21:16 UTC (permalink / raw)
  To: torvalds
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 6 Feb 2007 12:46:11 -0800 (PST)

> And for user space, it means that we pass the _one_ thing around that we 
> need for both identifying the async operation to the kernel (the "cookie") 
> for wait or cancel, and the place where we expect the return value to be 
> found (which in turn can _easily_ represent a whole "struct aiocb *", 
> since the return value obviously has to be embedded in there anyway).

I really think that Evgeniy's kevent is a good event notification
mechanism for anything, including AIO.

Events are events, applications want a centralized way to receive and
process them.

It's already implemented, and if there are tangible problems with it,
Evgeniy has been excellent at responding to criticism and implementing
suggested changes to the interfaces.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:16                                     ` David Miller
@ 2007-02-06 21:28                                       ` Linus Torvalds
  2007-02-06 21:31                                         ` David Miller
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-06 21:28 UTC (permalink / raw)
  To: David Miller
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl



On Tue, 6 Feb 2007, David Miller wrote:
> 
> I really think that Evgeniy's kevent is a good event notification
> mechanism for anything, including AIO.
> 
> Events are events, applications want a centralized way to receive and
> process them.

Don't be silly. AIO isn't an event. AIO is an *action*.

The event part is hopefully something that doesn't even *happen*.

Why do people ignore this? Look at a web server: I can pretty much 
guarantee that 99% of all filesystem accesses are cached, and doing them 
as "events" would be a total and utter waste of time.

You want to do them synchronously, as fast as possible, and you do NOT 
want to see them as any kind of asynchronous events.

Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
Maybe the kevent queue works then, but if it needs any more setup than the 
nonblocking case, that's a big no.

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:28                                       ` Linus Torvalds
@ 2007-02-06 21:31                                         ` David Miller
  2007-02-06 21:46                                           ` Eric Dumazet
  2007-02-06 21:50                                           ` Linus Torvalds
  0 siblings, 2 replies; 151+ messages in thread
From: David Miller @ 2007-02-06 21:31 UTC (permalink / raw)
  To: torvalds
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)

> Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
> Maybe the kevent queue works then, but if it needs any more setup than the 
> nonblocking case, that's a big no.

So the idea is to just run it to completion if it won't block and use
a fibril if it would?

kevent could support something like that too.


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:31                                         ` David Miller
@ 2007-02-06 21:46                                           ` Eric Dumazet
  2007-02-06 21:50                                           ` Linus Torvalds
  1 sibling, 0 replies; 151+ messages in thread
From: Eric Dumazet @ 2007-02-06 21:46 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, kent.overstreet, davidel, zach.brown, mingo,
	linux-kernel, linux-aio, suparna, bcrl

David Miller wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)
> 
>> Yeah, in 1% of all cases it will block, and you'll want to wait for them. 
>> Maybe the kevent queue works then, but if it needs any more setup than the 
>> nonblocking case, that's a big no.
> 
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?
> 
> kevent could support something like that too.

It seems to me that kevent was designed to handle many event sources on a 
single endpoint, like epoll (but with different internals). A typical load is 
thousands of socket/pipe providers glued into one queue.

In the fibril case, I guess a thread won't have many fibrils lying around...

Also, kevent needs an fd lookup/fput to retrieve some queued events, and that 
may be a performance hit for the AIO case (fget/fput in a multi-threaded 
program costs some atomic ops).



* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:31                                         ` David Miller
  2007-02-06 21:46                                           ` Eric Dumazet
@ 2007-02-06 21:50                                           ` Linus Torvalds
  2007-02-06 22:28                                             ` Zach Brown
  1 sibling, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-06 21:50 UTC (permalink / raw)
  To: David Miller
  Cc: kent.overstreet, davidel, zach.brown, mingo, linux-kernel,
	linux-aio, suparna, bcrl



On Tue, 6 Feb 2007, David Miller wrote:
> 
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?

That's not how the patches work right now, but yes, I at least personally 
think that it's something we should aim for (ie the interface shouldn't 
_require_ us to always wait for things even if perhaps an early 
implementation might make everything be delayed at first)

			Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 21:50                                           ` Linus Torvalds
@ 2007-02-06 22:28                                             ` Zach Brown
  0 siblings, 0 replies; 151+ messages in thread
From: Zach Brown @ 2007-02-06 22:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, kent.overstreet, davidel, mingo, linux-kernel,
	linux-aio, suparna, bcrl

> That's not how the patches work right now, but yes, I at least  
> personally
> think that it's something we should aim for (ie the interface  
> shouldn't
> _require_ us to always wait for things even if perhaps an early
> implementation might make everything be delayed at first)

I agree that we shouldn't require a separate syscall just to get the  
return code from ops that didn't block.

It doesn't seem like much of a stretch to imagine a setup where we  
can specify completion context as part of the submission itself.

	declare_empty_ring(ring);
	struct submission sub;

	sub.ring = &ring;
	sub.nr = SYS_fstat64;
	sub.args = ...;

	ret = submit(&sub, 1);
	if (ret == 0) {
		wait_for_elements(&ring, 1);
		printf("stat gave %d\n", ring[ring->head].rc);
	}

You get the idea, it's just an outline.

wait_for_elements() could obviously check the ring before falling  
back to kernel sync.  I'm pretty keen on the notion of producer/ 
consumer rings where userspace writes the head as it plucks  
completions and the kernel writes the tail as it adds them.
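A rough user-space model of such a producer/consumer ring, in C. Everything here is an assumption for illustration: the RING_SIZE value, the struct completion layout, and the function names are not from the patches. A real ring shared with the kernel would also need memory barriers and validation of the indices, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical; power of two so indices wrap with a mask */

struct completion {
	void *cookie;
	long rc;
};

struct ring {
	unsigned int head;	/* advanced only by the consumer (userspace) */
	unsigned int tail;	/* advanced only by the producer (the kernel) */
	struct completion entries[RING_SIZE];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_produce(struct ring *r, void *cookie, long rc)
{
	assert(r != NULL);
	if (r->tail - r->head == RING_SIZE)
		return -1;
	r->entries[r->tail & (RING_SIZE - 1)].cookie = cookie;
	r->entries[r->tail & (RING_SIZE - 1)].rc = rc;
	r->tail++;
	return 0;
}

/* Consumer side: returns 1 and fills *out, or 0 if the ring is empty. */
static int ring_consume(struct ring *r, struct completion *out)
{
	assert(r != NULL && out != NULL);
	if (r->head == r->tail)
		return 0;
	*out = r->entries[r->head & (RING_SIZE - 1)];
	r->head++;
	return 1;
}
```

Since each index has exactly one writer, an empty check in ring_consume() is all a wait_for_elements()-style helper would need before deciding to fall back to a blocking call.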

We might want per-call ring pointers, instead of per submission, to  
help submitters wait for a group of ops to complete without having to  
do their own tracking on event completion.  That only makes sense if  
we have the waiting mechanics let you only be woken as the number of  
events in the ring crosses some threshold.  Which I think we want  
anyway.

We'd be trading building up a specific completion state with syscalls  
for some complexity during submission that pins (and kmaps on  
completion) the user pages.  Submission could return failure if  
pinning these new pages would push us over some rlimit.  We'd have to  
be *awfully* careful not to let userspace corrupt (munmap?) the ring  
and confuse the hell out of the kernel.

Maybe not worth it, but if we *really* cared about making the non- 
blocking case almost identical to the sync case and wanted to use the  
same interface for batch submission and async completion then this  
seems like a possibility.

- z


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 20:46                                   ` Linus Torvalds
  2007-02-06 21:16                                     ` David Miller
@ 2007-02-06 22:45                                     ` Kent Overstreet
  2007-02-06 23:04                                       ` Linus Torvalds
  2007-02-06 23:23                                       ` Davide Libenzi
  1 sibling, 2 replies; 151+ messages in thread
From: Kent Overstreet @ 2007-02-06 22:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On 2/6/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 5 Feb 2007, Kent Overstreet wrote:
> >
> > struct asys_ret {
> >     int ret;
> >     struct thread *p;
> > };
> >
> > struct asys_ret r;
> > r.p = me;
> >
> > async_read(fd, buf, nbytes, &r);
>
> That's horrible. It means that "r" cannot have automatic linkage (since
> the stack will be *gone* by the time we need to fill in "ret"), so now you
> need to track *two* pointers: "me" and "&r".

You'd only allocate r on the stack if that stack is going to be around
later; i.e. if you're using user threads. Otherwise, you just allocate
it in some struct containing your aiocb or whatever.

> And for user space, it means that we pass the _one_ thing around that we
> need for both identifying the async operation to the kernel (the "cookie")
> for wait or cancel, and the place where we expect the return value to be
> found (which in turn can _easily_ represent a whole "struct aiocb *",
> since the return value obviously has to be embedded in there anyway).
>
>                 Linus

The "struct aiocb" isn't something you have to or necessarily want to
keep around. It's the way the current aio interface works (which I've
coded to), but I don't really see the point. All it really contains is
the syscall arguments, but once the syscall's in progress there's no
reason the kernel has to refer back to it; similarly for userspace,
it's just another struct that userspace has to keep track of and free
at some later time.

In fact, that's the only sane way you can have a ring for submitted
system calls, as otherwise elements of the ring are getting freed in
essentially random order.

I don't see the point in having a ring for completed events, since
it's at most two pointers per completion; quite a bit less data being
sent back than for submissions.

-----

The trouble with differentiating between calls that block and calls
that don't is you completely lose the ability to batch syscalls
together; this is potentially a major win of an asynchronous
interface.

An app can have a bunch of cheap, fast user space threads servicing
whatever; as they run, they can push their system calls onto a global
stack. When no more can run, it does a giant asys_submit (something
similar to io_submit), then the io_getevents equivalent, running the
user threads that had their syscalls complete.

This doesn't mean you can't run synchronously the syscalls that
wouldn't block, or that you have to allocate a fibril for every
syscall - but for servers that care more about throughput than
latency, this is potentially a big win, in cache effects if nothing
else.

(And this doesn't prevent you from having a different syscall that
submits an asynchronous syscall, but runs it right away if it was able
to without blocking).


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 22:45                                     ` Kent Overstreet
@ 2007-02-06 23:04                                       ` Linus Torvalds
  2007-02-07  1:22                                         ` Kent Overstreet
  2007-02-06 23:23                                       ` Davide Libenzi
  1 sibling, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-06 23:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise



On Tue, 6 Feb 2007, Kent Overstreet wrote:
> 
> The "struct aiocb" isn't something you have to or necessarily want to
> keep around.

Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be 
backwards compatibility. The point is, we'd need to keep that 
compatibility to be useful - otherwise we just end up having to duplicate 
the work (do _both_ fibrils _and_ the in-kernel AIO). 

> I don't see the point in having a ring for completed events, since
> it's at most two pointers per completion; quite a bit less data being
> sent back than for submissions.

I'm certainly personally perfectly happy with the kernel not remembering 
any completed events at all - once it's done, it's done and forgotten. So 
doing

	async(mycookie)
	wait_for_async(mycookie)

could actually return with -ECHILD (or similar error). 

In other words, if you see it as a "process interface" (instead of as a 
"filedescriptor interface"), I'd suggest automatic reaping of the fibril 
children. I do *not* think we want the equivalent of zombies - if only 
because they are just a lot of work to reap, and potentially a lot of 
memory to keep around.

		Linus


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 22:45                                     ` Kent Overstreet
  2007-02-06 23:04                                       ` Linus Torvalds
@ 2007-02-06 23:23                                       ` Davide Libenzi
  2007-02-06 23:39                                         ` Joel Becker
  1 sibling, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-06 23:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Kent Overstreet wrote:

> The trouble with differentiating between calls that block and calls
> that don't is you completely lose the ability to batch syscalls
> together; this is potentially a major win of an asynchronous
> interface.

It doesn't necessarily have to, once you extend the single return code to a 
vector:

struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
	int async_result;
};

int async_submit(struct async_submit *a, int n);

And async_submit() can mark each one ->async_result with -EASYNC (syscall 
has been batched), or another code (syscall completed w/out schedule).
IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.
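A sketch of what the user-space side of this could look like, in C. The struct is the one proposed above; the ASYNC_MAX_ARGS and EASYNC values are placeholders (neither is defined in this thread), and collect_pending() is a hypothetical helper. It shows the post-submit pass over the vector that separates calls which completed inline from those that still have to be retired.

```c
#include <assert.h>
#include <stddef.h>

#define ASYNC_MAX_ARGS 6	/* assumed value, not specified in the proposal */
#define EASYNC 1000		/* placeholder; no such errno exists today */

struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
	int async_result;
};

/*
 * Walk the submitted vector once and record the indices of the entries
 * that went async. Every other entry already carries its final result.
 * Returns the number of indices written to pending[].
 */
static int collect_pending(const struct async_submit *a, int n, int *pending)
{
	int npending = 0;

	assert(a != NULL && pending != NULL && n >= 0);
	for (int i = 0; i < n; i++) {
		if (a[i].async_result == -EASYNC)
			pending[npending++] = i;
	}
	return npending;
}
```

The walk is O(n) in the size of the submission, regardless of how many entries actually went async.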



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:23                                       ` Davide Libenzi
@ 2007-02-06 23:39                                         ` Joel Becker
  2007-02-06 23:56                                           ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2007-02-06 23:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> struct async_submit {
> 	void *cookie;
> 	int sysc_nbr;
> 	int nargs;
> 	long args[ASYNC_MAX_ARGS];
> 	int async_result;
> };
> 
> int async_submit(struct async_submit *a, int n);
> 
> And async_submit() can mark each one ->async_result with -EASYNC (syscall 
> has been batched), or another code (syscall completed w/out schedule).
> IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.

	There are pains here, though.  On every submit, you have to walk
the entire vector just to know what did or did not complete.  I've seen
this in other APIs (eg, async_result would be -EAGAIN for lack of
resources to start this particular fibril).  Userspace submit ends up
always walking the array of submissions twice - once to prep them, and
once to check if they actually went async.  For longer lists of I/Os,
this is expensive.

Joel

-- 

"Too much walking shoes worn thin.
 Too much trippin' and my soul's worn thin.
 Time to catch a ride it leaves today
 Her name is what it means.
 Too much walking shoes worn thin."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:39                                         ` Joel Becker
@ 2007-02-06 23:56                                           ` Davide Libenzi
  2007-02-07  0:06                                             ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-06 23:56 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> > struct async_submit {
> > 	void *cookie;
> > 	int sysc_nbr;
> > 	int nargs;
> > 	long args[ASYNC_MAX_ARGS];
> > 	int async_result;
> > };
> > 
> > int async_submit(struct async_submit *a, int n);
> > 
> > And async_submit() can mark each one ->async_result with -EASYNC (syscall 
> > has been batched), or another code (syscall completed w/out schedule).
> > IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.
> 
> 	There are pains here, though.  On every submit, you have to walk
> the entire vector just to know what did or did not complete.  I've seen
> this in other APIs (eg, async_result would be -EAGAIN for lack of
> resources to start this particular fibril).  Userspace submit ends up
> always walking the array of submissions twice - once to prep them, and
> once to check if they actually went async.  For longer lists of I/Os,
> this is expensive.

Async syscall submissions are a _one time_ thing. It's not like a live fd 
that you can push inside epoll and avoid the multiple O(N) passes.
First of all, the number of syscalls that you'd submit in a vectored way 
is limited. It does not depend on the total number of connections, but on 
the number of syscalls that you are actually able to submit in parallel.
Note that it's not a trivial task to extract a high enough level of 
parallelism that would make you feel pain in having to walk through the 
submission array. Think about the trivial web server case. A remote HTTP 
client asks for one page, and you may think to batch a few ops together (like 
a stat, open, send headers, and sendfile for example), but those cannot be 
vectored since they have to complete in order. The stat could even trigger 
a different response to the HTTP client. You need the open() fd to submit 
the send-headers and sendfile.
IMO there are no scalability problems in a multiple submission/retrieval 
API like the above (or any variation of it).



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:56                                           ` Davide Libenzi
@ 2007-02-07  0:06                                             ` Joel Becker
  2007-02-07  0:23                                               ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: Joel Becker @ 2007-02-07  0:06 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> Async syscall submissions are a _one time_ thing. It's not like a live fd 
> that you can push inside epoll and avoid the multiple O(N) passes.
> First of all, the number of syscalls that you'd submit in a vectored way 
> is limited. It does not depend on the total number of connections, but on 

	I regularly see apps that want to submit 1000 I/Os at once.
Every submit.  But it's all against one or two file descriptors.  So, if
you return to userspace, they have to walk all 1000 async_results every
time, just to see which completed and which didn't.  And *then* go wait
for the ones that didn't.  If they just wait for them all, they aren't
spinning cpu on the -EASYNC operations.
	I'm not saying that "don't return a completion if we can
non-block it" is inherently wrong or not a good idea.  I'm saying that
we need a way to flag them efficiently.

Joel

-- 

Life's Little Instruction Book #80

	"Slow dance"

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:06                                             ` Joel Becker
@ 2007-02-07  0:23                                               ` Davide Libenzi
  2007-02-07  0:44                                                 ` Joel Becker
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-07  0:23 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> > Async syscall submissions are a _one time_ thing. It's not like a live fd 
> > that you can push inside epoll and avoid the multiple O(N) passes.
> > First of all, the number of syscalls that you'd submit in a vectored way 
> > is limited. It does not depend on the total number of connections, but on 
> 
> 	I regularly see apps that want to submit 1000 I/Os at once.
> Every submit.  But it's all against one or two file descriptors.  So, if
> you return to userspace, they have to walk all 1000 async_results every
> time, just to see which completed and which didn't.  And *then* go wait
> for the ones that didn't.  If they just wait for them all, they aren't
> spinning cpu on the -EASYNC operations.
> 	I'm not saying that "don't return a completion if we can
> non-block it" is inherently wrong or not a good idea.  I'm saying that
> we need a way to flag them efficiently.

How many "sessions" do those 1000 *parallel* I/O operations refer to? 
Because, if you batch them in an async fashion, they have to be parallel.
Without the per-async operation status code, you'll need to wait for a result 
*for each* submitted syscall, even the ones that completed synchronously.
Open questions are:

- Is the 1000 *parallel* syscall vectored submission case common?

- Is it more expensive to be forced to wait and fetch a result even 
  for in-cache syscalls, or is it faster to walk the submission array?



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:23                                               ` Davide Libenzi
@ 2007-02-07  0:44                                                 ` Joel Becker
  2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  6:16                                                   ` Michael K. Edwards
  0 siblings, 2 replies; 151+ messages in thread
From: Joel Becker @ 2007-02-07  0:44 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 04:23:52PM -0800, Davide Libenzi wrote:
> How many "sessions" do those 1000 *parallel* I/O operations refer to? 
> Because, if you batch them in an async fashion, they have to be parallel.

	They're independent.  Of course they have to be parallel, that's
what I/O wants.

> Without the per-async operation status code, you'll need to wait for a result 
> *for each* submitted syscall, even the ones that completed synchronously.

	You are right, but it's more efficient in some cases.

> Open questions are:
> 
> - Is the 1000 *parallel* syscall vectored submission case common?

	Sure is for I/O.  It's the majority case.  If you have
1000 blocks to send out, you want them all down at the request queue at
once, where they can merge.

> - Is it more expensive to be forced to wait and fetch a result even 
>   for in-cache syscalls, or is it faster to walk the submission array?

	Not everything is in-cache.  Databases will be doing O_DIRECT
and will expect that 90% of their I/O calls will block.  Why should they
have to iterate this list every time?  If this is the API, they *have*
to.  If there's an efficient way to get "just the ones that didn't
block", then it's not a problem.

Joel


-- 

"The real reason GNU ls is 8-bit-clean is so that they can
 start using ISO-8859-1 option characters."
	- Christopher Davis (ckd@loiosh.kei.com)

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127


* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:44                                                 ` Joel Becker
@ 2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  1:24                                                     ` Kent Overstreet
  2007-02-07  1:30                                                     ` Joel Becker
  2007-02-07  6:16                                                   ` Michael K. Edwards
  1 sibling, 2 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-07  1:15 UTC (permalink / raw)
  To: Joel Becker
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, 6 Feb 2007, Joel Becker wrote:

> > - Is it more expensive to be forced to wait and fetch a result even 
> >   for in-cache syscalls, or is it faster to walk the submission array?
> 
> 	Not everything is in-cache.  Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block.  Why should they
> have to iterate this list every time?  If this is the API, they *have*
> to.  If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

If that's what is wanted, then the async_submit() API can detect the 
synchronous completion early, and drop a result inside the result-queue 
immediately. It means that an immediately following async_wait() will find 
some completions right away. Or:

struct async_submit {
        void *cookie;
        int sysc_nbr;
        int nargs;
        long args[ASYNC_MAX_ARGS];
};
struct async_result {
        void *cookie;
        long result;
};

int async_submit(struct async_submit *a, struct async_result *r, int n);

Where "r" will store the ones that completed synchronously. I mean, there 
are really many ways to do this.
I think ATM the core kernel implementation should be the focus, because 
IMO we just scratched the surface of the potential problems that something 
like this can raise (scheduling, signaling, cleanup, cancel - just to 
name a few).
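One way to picture the two-array variant is a user-space mock, sketched here in C. This is not the proposed kernel code: mock_async_submit(), the try_sync callback, and toy_try_sync() are hypothetical, the ASYNC_MAX_ARGS value is assumed, and the convention that the call returns the number of results stored in "r" is a guess, since the proposal does not say what the return value means.

```c
#include <assert.h>
#include <stddef.h>

#define ASYNC_MAX_ARGS 6	/* assumed value, not specified in the proposal */

struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
};

struct async_result {
	void *cookie;
	long result;
};

/* Hypothetical hook: returns 1 and sets *result if the call can finish
 * without blocking, 0 if it would have to go async. */
typedef int (*try_sync_fn)(const struct async_submit *s, long *result);

/* Toy policy for the demo: args[0] != 0 means "data is cached",
 * and args[1] is used as the pretend return value. */
static int toy_try_sync(const struct async_submit *s, long *result)
{
	if (s->args[0] == 0)
		return 0;
	*result = s->args[1];
	return 1;
}

/* Mock submit: packs only the synchronous completions into r[] and
 * returns how many were stored (assumed convention). */
static int mock_async_submit(const struct async_submit *a,
			     struct async_result *r, int n,
			     try_sync_fn try_sync)
{
	int nsync = 0;

	assert(a != NULL && r != NULL && n >= 0);
	for (int i = 0; i < n; i++) {
		long res;

		if (try_sync(&a[i], &res)) {
			r[nsync].cookie = a[i].cookie;
			r[nsync].result = res;
			nsync++;
		}
		/* Otherwise the call would be handed off and its result
		 * would show up later through the wait interface. */
	}
	return nsync;
}
```

With this shape the caller only iterates over the entries that actually completed inline, instead of scanning the whole submission vector for status codes.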



- Davide




* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-06 23:04                                       ` Linus Torvalds
@ 2007-02-07  1:22                                         ` Kent Overstreet
  0 siblings, 0 replies; 151+ messages in thread
From: Kent Overstreet @ 2007-02-07  1:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davide Libenzi, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On 2/6/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 6 Feb 2007, Kent Overstreet wrote:
> >
> > The "struct aiocb" isn't something you have to or necessarily want to
> > keep around.
>
> Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be
> backwards compatibility. The point is, we'd need to keep that
> compatibility to be useful - otherwise we just end up having to duplicate
> the work (do _both_ fibrils _and_ the in-kernel AIO).

Bah, I was unclear here, sorry. I was talking about the userspace interface.

Right now, with the aio interface, io_submit passes in an array of
pointers to struct iocb; there's nothing that says the kernel will be
done with the structs when io_submit returns, so while userspace is
free to reuse the array of pointers, it can't free the actual iocbs
until they complete.

This is slightly stupid, for a couple reasons, and if we're making a
new pair of syscalls it'd be better to do it slightly differently.

What you want is for the async_submit syscall (or whatever it's
called) to pass in an array of structs, and for the kernel to not
reference them after async_submit returns. This is easy; after
async_submit returns, each syscall in the array is either completed
(if it could be without blocking), or in progress, and there's no
reason to need the arguments again.

It also means that the kernel has to copy in only a single userspace
buffer, instead of one buffer per syscall; as Joel mentions, there are
plenty of apps that will be doing 1000s of syscalls at once. From a
userspace perspective it's awesome, it simplifies coding for it and
means you have to hit the heap that much less.
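
To illustrate the ownership rule described above (one bulk copy at submit
time, after which the caller's array is never referenced again), here is a
small userspace model. The names call_req, inflight and submit_snapshot are
hypothetical, and memcpy() merely stands in for a single copy_from_user().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct call_req {               /* hypothetical per-call submission record */
        void *cookie;
        int sysc_nbr;
        long args[6];
};

struct inflight {               /* the submitter's private snapshot */
        struct call_req *reqs;
        int n;
};

/* One bulk copy of the whole array; returns 0 on success, -1 if the
 * snapshot could not be allocated. */
static int submit_snapshot(struct inflight *q,
                           const struct call_req *user, int n)
{
        assert(n > 0);
        q->reqs = malloc((size_t)n * sizeof(*q->reqs));
        if (!q->reqs)
                return -1;
        memcpy(q->reqs, user, (size_t)n * sizeof(*q->reqs));
        q->n = n;
        return 0;
}

/* Release the snapshot once every call in it has completed. */
static void inflight_free(struct inflight *q)
{
        free(q->reqs);
        q->reqs = NULL;
        q->n = 0;
}
```

Once submit_snapshot() returns, the caller can reuse or free its array
immediately, so the batch can live on the stack instead of the heap.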

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  1:15                                                   ` Davide Libenzi
@ 2007-02-07  1:24                                                     ` Kent Overstreet
  2007-02-07  1:30                                                     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Kent Overstreet @ 2007-02-07  1:24 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Joel Becker, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

> If that's what is wanted, then the async_submit() API can detect the
> synchronous completion soon, and drop a result inside the result-queue
> immediately. It means that an immediately following async_wait() will find
> some completions soon. Or:
>
> struct async_submit {
>         void *cookie;
>         int sysc_nbr;
>         int nargs;
>         long args[ASYNC_MAX_ARGS];
> };
> struct async_result {
>         void *cookie;
>         long result;
> };
>
> int async_submit(struct async_submit *a, struct async_result *r, int n);
>
> Where "r" will store the ones that completed synchronously. I mean, there
> are really many ways to do this.

That interface (modifying async_submit to pass in the size of the
result array) would work great.

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  1:15                                                   ` Davide Libenzi
  2007-02-07  1:24                                                     ` Kent Overstreet
@ 2007-02-07  1:30                                                     ` Joel Becker
  1 sibling, 0 replies; 151+ messages in thread
From: Joel Becker @ 2007-02-07  1:30 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Kent Overstreet, Linus Torvalds, Zach Brown, Ingo Molnar,
	Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise

On Tue, Feb 06, 2007 at 05:15:02PM -0800, Davide Libenzi wrote:
> I think ATM the core kernel implementation should be the focus, because 

	Yeah, I was thinking the same thing.  I originally posted just
to make the point :-)

Joel

-- 

Life's Little Instruction Book #99

	"Think big thoughts, but relish small pleasures."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  0:44                                                 ` Joel Becker
  2007-02-07  1:15                                                   ` Davide Libenzi
@ 2007-02-07  6:16                                                   ` Michael K. Edwards
  2007-02-07  9:17                                                     ` Michael K. Edwards
  1 sibling, 1 reply; 151+ messages in thread
From: Michael K. Edwards @ 2007-02-07  6:16 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

On 2/6/07, Joel Becker <Joel.Becker@oracle.com> wrote:
>         Not everything is in-cache.  Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block.  Why should they
> have to iterate this list every time?  If this is the API, they *have*
> to.  If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

It's usually efficient, especially in terms of programmer effort, for
the immediate path to resemble as nearly as possible what you would
have done with the synchronous equivalent.  (If there's some value in
parallelizing the query across multiple CPUs, you probably don't want
the kernel guessing how to partition it.)  But what's efficient for
the delayed path is to be tightly bound to the arrival of the AIO
result, and to do little more than schedule it into the appropriate
event queue or drop it if it is stale.  The immediate and delayed
paths will often share part, but not all, of their implementation, and
most of the shared part is probably data structure setup that can
precede the call itself.  The rest of the delayed path is where the
design effort should go, because it's the part that has the sort of
complex impact on system performance that is hard for application
programmers to think clearly about.

Oracle isn't the only potential userspace user of massively concurrent
AIO with a significant, but not dominant, fraction of cache hits.  I'm
familiar with a similar use case in network monitoring, in which one
would like to implement the attribute tree and query translation more
or less as a userspace filesystem, while leaving both the front-end
caching and the back-end throttling, retries, etc. to in-kernel state
machines.  When 90% of the data requested by the front end (say, a
Python+WxWidgets GUI) is available from the VFS cache, only the other
10% should actually carry the AIO overhead.

Let's look at that immediately available fraction from the GUI
programmer's perspective.  He wants to look up some attributes from a
whole batch of systems, and wants to present all immediately available
results to the user, with the rest grayed out or something.  Each
request for data that is available from cache should result
immediately in a call to his (perhaps bytecode-language) callback,
which fills in a slot in the data structure that he's going to present
wholesale.  There's no reason why the immediate version of the
callback should be unable to allocate memory, poke at thread-local
structures, etc.; and in practice there's little to be gained by
parallelizing this fraction (or even aggressively delivering AIOs that
complete quickly) because you'd need to thread-safe that data
structure, which probably isn't worth it in performance and certainly
isn't in programmer effort and likelihood of Heisenbugs.

Delayed results, on the other hand, probably have to use the GUI's
event posting mechanism to queue the delivered data (probably in a
massaged form) into a GUI update thread.  Hence the delayed callback
can be delivered in some totally other context if it's VM- and
scheduler-efficient to do so; it's probably just doing a couple of
memcpys and a sem_post or some such.  The only reason it isn't a
totally separate chunk of code is that it uses the same context layout
as the immediate path, and may have to poke at some of the same
pre-allocated places to update completion statistics, etc.

(I implemented something similar to this in userspace using Python
generators for the closure-style callbacks, in the course of rewriting
a GUI that had a separate protocol translator process in place of the
userspace filesystem.  The thread pool that serviced responses from
the protocol translator operated much like Zach's fibrils, and used a
sort of lookup by request cookie to retrieve the closure and feed it
the result, which had the side effect of posting the appropriate
event.  It worked, fast, and it was more or less comprehensible to
later maintainers despite the use of Python's functional features,
because the AIO delivery was kept separate from both the plain-vanilla
immediate-path code and the GUI-idiom event queue processing.)

The broader issue here is that only the application developer really
knows how the AIO results ought to be funneled into the code that
wants them, which could be a database query engine or a GUI update
queue or Murphy knows what.  This "application AIO closure" step is
distinct from the application-neutral closure that needs to run in a
kernel "fibril" (extracting stat() results from an NFS response, or
whatever).  So it seems to me that applications ought to be able to
specify a userspace closure to be executed on async I/O completion (or
timeout, error, etc.), and this closure should be scheduled
efficiently on completion of the kernel bit.

The delayed path through the userspace closure would partly resemble a
signal handler in that it shouldn't touch thread or heap context, just
poke at pre-allocated process-global memory locations and/or
synchronization primitives.  (A closer parallel, for those familiar
with it, would be the "event handlers" of O/S's with cooperative
multitasking and a single foreground application; MacOS 6.x with
MultiFinder and PalmOS 4.x come to mind.)

What if we share a context+stack page between kernel and userspace to
be used by both the kernel "I/O completion" closure and the userspace
"event handler" closure?  After all, these are the pieces that
cooperatively multitask with one another.  Pop the kernel AIO closure
scheduler into the tasklet queue right after the softirq tasklet --
surely 99% of "fibrils" would become runnable due to something that
happens in a softirq, and it would make at least as much sense to run
there as in the task's schedule() path.  The event handler would be
scheduled in most respects like a signal handler in a POSIX threaded
process -- running largely in the context of some application thread
(on syscall exit or by preemption), and limited in the set of APIs it
can call.

In this picture, the ideal peristalsis would usually be ISR exit path
-> softirq -> kernel closure (possibly not thread-like at all, just a
completion scheduled from a tasklet) -> userspace closure ->
application thread.  The kernel and userspace closures could actually
share a stack page which also contains the completion context for
both.  Linus's async_stat() example is a good one, I think.  Here is
somewhat fuller userspace code, without the syntactic sugar that could
easily be used to make the callbacks more closure-ish:

/* linux/aeiou.h */
       struct aeiou_stat;
       typedef void (*aeiou_stat_cb_t) (int, struct aeiou_stat *);

       struct aeiou_stat __ALIGN_ME_PROPERLY__ {
               aeiou_stat_cb_t cb;        /* userspace completion hook */
               struct stat stat_buf;
               union {
                       int filedes;
                       char name[NAME_MAX+1];
               } u;
#ifdef __KERNEL__
               ... completion context for the kernel AIO closure ...
#endif
       };

       /* The returned pointer is the cookie for all */
       /* subsequent aeiou calls in this request group. */
       void *__aeiou_alloc_aeiou_stat(size_t uctx_bytes);

       #define aeiou_begin(ktype, utype, field) \
               ((utype *)__aeiou_alloc_##ktype(offsetof(utype, field)))

/* foo.c */
       struct one_entry {
               ... closure context for the userspace event handler ...
               struct aeiou_stat s;
       };

       static void my_cb(int is_delayed, struct aeiou_stat *as) {
               struct one_entry *my_context =
                       container_of(as, struct one_entry, s);
               ... code that runs in userspace "event handler" context ...
       }

...

       struct one_entry *entry = aeiou_begin(aeiou_stat, struct one_entry, s);
       struct dirent *de;

       entry->s.cb = my_cb;
       /* set up some process-global data structure to hold */
       /* the results of this burst of async_stat calls */

       while ((de = readdir(dir)) != NULL) {
               strcpy(entry->s.u.name, de->d_name);
               /* set up any additional application context */
               /* in *entry for this individual async_stat call */

               aeiou_stat(entry);
       }
       /* application tracks outstanding AIOs using data structure */
       /* there could also be an aeiou_checkprogress(entry) */
       ...
       aeiou_end(entry);

(The use of "aeiou_stat" rather than a more general class of async I/O
calls is for illustration purposes.)

If the stat data is immediately available when aeiou_stat() is called,
the struct stat gets filled in and the callback is run immediately in
the current stack context.  If not, the contents of *entry are copied
to a new page (possibly using COW VM magic), and the syscall returns.
On the next trip through the scheduler (or when a large enough batch
of AIOs have been queued to be worth initiating them at the cost of
shoving the userspace code out of cache), the kernel closures are set
up in the opaque trailer to aeiou_stat in the copies, and the AIOs are
initiated.

The signature of aeiou_stat is deliberately limited to a single
pointer, since all of its arguments are likely to be interesting to
one or both closures.  There is no need to pass the offset to the
kernel parameter sub-struct into calls after the initial aeiou_begin;
the kernel has to check the validity of the "entry" pointer/cookie
anyway, so it had best keep track of the enclosing allocation bounds,
offset to the syscall parameter structure, etc. in a place where
userspace can't alter it.  Both kernel and userspace closures
eventually run with their stack in the shared page, after the closure
context area.  The userspace closure has to respect
signal-handler-like limitations on its powers if is_delayed is true;
it will run in the right process context but has no particular thread
context and can't call anything that could block or allocate memory.
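
The container_of() step is what lets a single pointer argument carry both
the kernel parameters and the application context. A compilable userspace
mock of the immediate path only is sketched below. It makes several
assumptions: aeiou_stat_mock() is hypothetical and treats every name as
"in cache", a plain long replaces struct stat, and the hits/total fields
are invented application context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct aeiou_stat;
typedef void (*aeiou_stat_cb_t)(int is_delayed, struct aeiou_stat *as);

struct aeiou_stat {
        aeiou_stat_cb_t cb;     /* userspace completion hook */
        long size;              /* stand-in for struct stat stat_buf */
        char name[256];
};

struct one_entry {
        int hits;               /* invented application context */
        long total;
        struct aeiou_stat s;
};

/* Recover the enclosing entry from the embedded aeiou_stat. */
static void my_cb(int is_delayed, struct aeiou_stat *as)
{
        struct one_entry *ctx = container_of(as, struct one_entry, s);

        if (!is_delayed)
                ctx->hits++;
        ctx->total += as->size;
}

/* Mock of the immediate path: fill in the result and run the callback
 * on the caller's own stack before returning. */
static int aeiou_stat_mock(struct aeiou_stat *as)
{
        assert(as->cb != NULL);
        as->size = (long)strlen(as->name);
        as->cb(0, as);
        return 0;
}
```

Because the callback runs with is_delayed == 0 on the caller's stack, it
is free to touch ordinary data structures, which is the property the
immediate path is meant to preserve.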

I think this sort of interface might work well for both GUI event
frameworks and real-time streaming media playback/mixing, which are
two common ways for AIO to enter the mere userspace programmer's
sphere of concern (and also happen to be areas where I have some
expertise).  Would it work for the Oracle use case?

Cheers,
- Michael

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  6:16                                                   ` Michael K. Edwards
@ 2007-02-07  9:17                                                     ` Michael K. Edwards
  2007-02-07  9:37                                                       ` Michael K. Edwards
  0 siblings, 1 reply; 151+ messages in thread
From: Michael K. Edwards @ 2007-02-07  9:17 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

Man, I should have edited that down before sending it.  Hopefully this
is clearer:

    - The usual programming model for AIO completion in GUIs, media
engines, and the like is an application callback.  Data that is
available immediately may be handled quite differently from data that
arrives after a delay, and usually the only reason for both code paths
to be in the same callback is shared code to maintain counters, etc.
associated with the AIO batch.  These shared operations, and the other
things one might want to do in the delayed path, needn't be able to
block or allocate memory.

    - AIO requests that are serviced from cache ought to immediately
invoke the callback, in the same thread context as the caller, fixing
up the stack so that the callback returns to the instruction following
the syscall.  That way the "immediate completion" path through the
callback can manipulate data structures, allocate memory, etc. just as
if it had followed a synchronous call.

    - AIO requests that need data not in cache should probably be
batched in order to avoid evicting the userspace AIO submission loop,
the immediate completion branch of the callback, and their data
structures from cache on every miss.  If you can use VM copy-on-write
tricks to punt a page of AIO request parameters and closure context
out to another CPU for immediate processing without stomping on your
local caches, great.

    - There's not much point in delivering AIO responses all the way
to userspace until the AIO submission loop is done, because they're
probably going to be handled through some completely different event
queue mechanism in the delayed path through the callback.  Trying to
squeeze a few AIO responses into the same data structure as if they
had been in cache is likely to create race conditions or impose
needless locking overhead on the otherwise serialized immediate
completion branch.

    - The result of the external AIO may arrive on a different CPU
with something else entirely in the foreground; but in real use cases
it's probably a different thread of the same process.  If you can use
the closure context page as the stack page for the kernel bit of the
AIO completion, and then use it again from userspace as the stack page
for the application bit, then the whole ISR -> softirq -> kernel
closure -> application closure path has minimal system impact.

    - The delayed path through the application callback can't block
and can't touch data structures that are thread-local or may be in an
incoherent state at this juncture (called during a more or less
arbitrary ISR exit path, a bit like a signal handler).  That's OK,
because it's probably just massaging the AIO response into fields of a
preallocated object dangling off of a global data structure and doing
a sem_post or some such.  (It might even just drop it if it's stale.)

    - As far as I can tell (knowing little about the scheduler per
se), these kernel closures aren't much like Zach's "fibrils"; they'd
be invoked from a tasklet chained more or less immediately after the
softirq dispatch tasklet.  I have no idea whether the cost of finding
the appropriate kernel closure(s) associated with the data that
arrived in the course of a softirq, pulling them over to the CPU where
the softirq just ran, and popping out to userspace to run the
application closure is exorbitant, or if it's even possible to force a
process switch from inside a tasklet that way.
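
The restricted delayed path from the list above (copy into a preallocated
global slot, drop stale results, signal a consumer) might look roughly like
this in userspace. It is a sketch under assumptions: the slot layout and the
generation counter are invented for illustration, and a C11 atomic counter
stands in for the sem_post() mentioned in the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define MAX_SLOTS 8

struct result_slot {
        atomic_int ready;       /* 0 = empty, 1 = result published */
        unsigned generation;    /* request batch this slot belongs to */
        char data[64];
};

/* Preallocated, process-global state: nothing is allocated later. */
static struct result_slot slots[MAX_SLOTS];
static atomic_int pending_events;       /* stand-in for sem_post() */

/* Delayed-path handler: no allocation, no blocking.  Returns 1 if the
 * result was delivered, 0 if it was dropped as stale or malformed. */
static int deliver_delayed(unsigned slot, unsigned generation,
                           const void *buf, size_t len)
{
        struct result_slot *s;

        if (slot >= MAX_SLOTS || len > sizeof(slots[0].data))
                return 0;
        s = &slots[slot];
        if (s->generation != generation)
                return 0;       /* stale: the batch has moved on */

        memcpy(s->data, buf, len);
        atomic_store_explicit(&s->ready, 1, memory_order_release);
        atomic_fetch_add(&pending_events, 1);   /* notify the consumer */
        return 1;
}
```

The consumer (a GUI update thread, say) would pair the release store with
an acquire load of ready before reading data.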

Hope this helps, and sorry for the noise,
- Michael

* Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
  2007-02-07  9:17                                                     ` Michael K. Edwards
@ 2007-02-07  9:37                                                       ` Michael K. Edwards
  0 siblings, 0 replies; 151+ messages in thread
From: Michael K. Edwards @ 2007-02-07  9:37 UTC (permalink / raw)
  To: Davide Libenzi, Kent Overstreet, Linus Torvalds, Zach Brown,
	Ingo Molnar, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise

An idiot using my keyboard wrote:
>     - AIO requests that are serviced from cache ought to immediately
> invoke the callback, in the same thread context as the caller, fixing
> up the stack so that the callback returns to the instruction following
> the syscall.  That way the "immediate completion" path through the
> callback can manipulate data structures, allocate memory, etc. just as
> if it had followed a synchronous call.

Or, of course:
    if (async_stat(entry) == 0) {
        ... immediate completion code path ...
    }

Ugh.  But I think the discussion about the delayed path still holds.

- Michael

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-01-30 20:39 [PATCH 0 of 4] Generic AIO by scheduling stacks Zach Brown
                   ` (6 preceding siblings ...)
  2007-02-04  5:13 ` Davide Libenzi
@ 2007-02-09 22:33 ` Linus Torvalds
  2007-02-09 23:11   ` Davide Libenzi
                     ` (3 more replies)
  7 siblings, 4 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-09 22:33 UTC (permalink / raw)
  To: Zach Brown
  Cc: Linux Kernel Mailing List, linux-aio, Suparna Bhattacharya,
	Benjamin LaHaise, Ingo Molnar


Ok, here's another entry in this discussion.

This is a *really* small patch. Yes, it adds 174 lines, and yes it's 
actually x86 (32-bit) only, but about half of it is totally generic, and 
*all* of it is almost ludicrously simple.

There's no new assembly language. The one-liner addition to 
"syscall_table.S" is just adding the system call entry stub. It's all in 
C, and none of it is even hard to understand.

It's architecture-specific, because different architectures do the whole 
"fork()" entrypath differently, and this is basically a "delayed fork()", 
not really an AIO thing at all.

So what this does, very simply is:

 - on system call entry just save away the pt_regs pointer needed to do a 
   fork (on some architectures, this means that you need to use a longer 
   system call entry that saves off all registers - on x86 even that isn't 
   an issue)

 - save that away as a magic cookie in the task structure

 - do the system call

 - IF the system call blocks, we call the architecture-specific 
   "schedule_async()" function before we even get any scheduler locks, and 
   it can just do a fork() at that time, and let the *child* return to the 
   original user space. The process that already started doing the system 
   call will just continue to do the system call.

 - when the system call is done, we check to see if it was done totally 
   synchronously or not. If we ended up doing the clone(), we just exit 
   the new thread.
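
The control flow in the steps above can be modelled in a few lines of plain
C. This is only a toy state machine: nothing here forks or schedules, and
task_model, fake_syscall() and the *_model functions are hypothetical names
that mirror async_cookie, schedule_async() and do_async() from the patch
below.

```c
#include <assert.h>
#include <stddef.h>

enum outcome { RETURNED_SYNC, WORKER_EXITED };

struct task_model {
        void *async_cookie;     /* non-NULL while the call may still fork */
        int forks;              /* how many times the model "forked" */
};

/* Called whenever the modelled system call is about to block. */
static void schedule_async_model(struct task_model *t)
{
        if (!t->async_cookie)
                return;         /* already forked, or not an async call */
        t->async_cookie = NULL; /* the "child" now returns to user space */
        t->forks++;
}

/* A modelled system call that blocks `blocks` times before finishing. */
static long fake_syscall(struct task_model *t, int blocks)
{
        while (blocks-- > 0)
                schedule_async_model(t);
        return 0;
}

static enum outcome do_async_model(struct task_model *t, void *regs,
                                   int blocks)
{
        assert(t->async_cookie == NULL);        /* no nesting */
        t->async_cookie = regs;
        fake_syscall(t, blocks);

        if (!t->async_cookie)
                return WORKER_EXITED;   /* we forked: this is the worker */

        t->async_cookie = NULL;
        return RETURNED_SYNC;           /* cached case: no fork happened */
}
```

A call that never blocks leaves forks at zero, which is the "zero overhead
in the cached case" property; a call that blocks several times still forks
only once, because the cookie is cleared on the first block.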

Now, I agree that this is a bit ugly in some of the details: in 
particular, it means that if the system call blocks, we will literally 
return as a *different* thread to user space. If you care, you shouldn't 
use this interface, or come up with some way to make it work nicely (doing 
it this way meant that I could just re-use all the clone/fork code as-is).

Also, it actually does take the hit of creating a full new thread. We 
could optimize that a bit. But at least the cached case has basically 
*zero* overhead: we literally end up doing just a few extra CPU 
instructions to copy the arguments around etc, but no locked cycles, no 
memory allocations, no *nothing*.

So I actually like this, because it means that while we slow down real IO, 
we don't slow down the cached cases at all.

Final warning: I didn't do any cancel/wait crud. It doesn't even return 
the thread ID as it is now. And I only hooked up "stat64()" as an example. 
So this really is just a total toy. But it's kind of funny how simple it 
was, once I started thinking about how I could do this in some clever way.

I even added comments, so a lot of the few new added lines aren't even 
code!

		Linus

---

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index c641056..0909724 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -698,6 +698,71 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas
 	return prev_p;
 }
 
+/*
+ * This gets called when an async event causes a schedule.
+ * We should try to
+ *
+ *  (a) create a new thread
+ *  (b) within that new thread, return to the original
+ *      user mode call-site.
+ *  (c) clear the async event flag, since it is now no
+ *      longer relevant.
+ *
+ * If anything fails (a resource issue etc), we just do
+ * the async system call as a normal synchronous event!
+ */
+#define CLONE_ALL (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT | CLONE_THREAD)
+#define FAILED_CLONE ((struct pt_regs *)1)
+void schedule_async(void)
+{
+	struct pt_regs *regs = current->async_cookie;
+	int retval;
+
+	if (regs == FAILED_CLONE)
+		return;
+
+	current->async_cookie = NULL;
+	/*
+	 * This is magic. The child will return through "ret_from_fork()" to
+	 * where the original thread started it all. It's not the same thread
+	 * any more, and we don't much care. The "real thread" has now become
+	 * the async worker thread, and will exit once the async work is done.
+	 */
+	retval = do_fork(CLONE_ALL, regs->esp, regs, 0, NULL, NULL);
+
+	/*
+	 * If it failed, we could just restore the async_cookie and try again
+	 * on the next scheduling event. 
+	 *
+	 * But it's just better to set it to some magic value to indicate
+	 * "do not try this again". If it failed once, we shouldn't waste 
+	 * time trying it over and over again.
+	 *
+	 * Any non-NULL value will tell "do_async()" at the end that it was
+	 * done "synchronously".
+	 */
+	if (retval < 0)
+		current->async_cookie = FAILED_CLONE;
+}
+
+asmlinkage int sys_async(struct pt_regs regs)
+{
+	void *async_cookie;
+	unsigned long syscall, flags;
+	int __user *status;
+	unsigned long __user *user_args;
+
+	/* Pick out the do_async() arguments.. */
+	async_cookie = &regs;
+	syscall = regs.ebx;
+	flags = regs.ecx;
+	status = (int __user *) regs.edx;
+	user_args = (unsigned long __user *) regs.esi;
+
+	/* ..and call the generic helper routine */
+	return do_async(async_cookie, syscall, flags, status, user_args);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
 	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..647193c 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_async			/* 320 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e14b11b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,6 +844,13 @@ struct task_struct {
 
 	struct mm_struct *mm, *active_mm;
 
+	/*
+	 * The scheduler uses this to determine if the current call is a
+	 * standalone thread or just an async system call that hasn't
+	 * had its real thread created yet.
+	 */
+	void *async_cookie;
+
 /* task state */
 	struct linux_binfmt *binfmt;
 	long exit_state;
@@ -1649,6 +1656,12 @@ extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);
 
 extern void normalize_rt_tasks(void);
 
+/* Async system call support */
+extern long do_async(void *async_cookie, unsigned int syscall, unsigned long flags,
+	 int __user *status, unsigned long __user *user_args);
+extern void schedule_async(void);
+                                        
+
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 14f4d45..13bda9f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y     = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
-	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o async.o
 
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
 obj-y += time/
diff --git a/kernel/async.c b/kernel/async.c
new file mode 100644
index 0000000..29b14f3
--- /dev/null
+++ b/kernel/async.c
@@ -0,0 +1,71 @@
+/*
+ * kernel/async.c
+ *
+ * Create a light-weight kernel-level thread.
+ */
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+/* Fake "generic" system call pointer type */
+typedef asmlinkage long (*master_syscall_t)(unsigned long arg, ...);
+
+#define ASYNC_SYSCALL(syscall, param) \
+	{ (master_syscall_t) (syscall), (param) }
+
+static struct async_call {
+	master_syscall_t fn;
+	int args;
+} call_descriptor[] = {
+	ASYNC_SYSCALL(sys_stat64, 2),
+};
+
+long do_async(
+	void *async_cookie,
+	unsigned int syscall,
+	unsigned long flags,
+	int __user *status,
+	unsigned long __user *user_args)
+{
+	int ret, size;
+	struct async_call *desc;
+	unsigned long args[6];
+
+	if (syscall >= ARRAY_SIZE(call_descriptor))
+		return -EINVAL;
+
+	desc = call_descriptor + syscall;
+	if (!desc->fn)
+		return -EINVAL;
+
+	if (desc->args > ARRAY_SIZE(args))
+		return -EINVAL;
+
+	size = sizeof(unsigned long)*desc->args;
+	if (copy_from_user(args, user_args, size))
+		return -EFAULT;
+
+	/* We don't nest async calls! */
+	if (current->async_cookie)
+		return -EINVAL;
+	current->async_cookie = async_cookie;
+
+	ret = desc->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
+	put_user(ret, status);
+
+	/*
+	 * Did we end up doing part of the work in a separate thread?
+	 *
+	 * If so, the async thread-creation already returned in the
+	 * original parent, and cleared out the async_cookie. We're
+	 * now just in the worker thread, and should just exit. Our
+	 * job here is done.
+	 */
+	if (!current->async_cookie)
+		do_exit(0);
+
+	/* We did it synchronously - return 0 */
+	current->async_cookie = 0;
+	return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d57118d..6f38c46 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,18 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+/*
+ * Architectures that don't have async support get this
+ * dummy async thread scheduler callback.
+ *
+ * They had better not set task->async_cookie in the
+ * first place, so this should never get called!
+ */
+void __attribute__ ((weak)) schedule_async(void)
+{
+	BUG();
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index cca93cc..cc73dee 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3436,6 +3436,17 @@ asmlinkage void __sched schedule(void)
 	}
 	profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 
+	/* Are we running within an async system call? */
+	if (unlikely(current->async_cookie)) {
+		/*
+		 * If so, we now try to start a new thread for it, but
+		 * not for a preemption event or a scheduler timeout
+		 * triggering!
+		 */
+		if (!(preempt_count() & PREEMPT_ACTIVE) && current->state != TASK_RUNNING)
+			schedule_async();
+	}
+
 need_resched:
 	preempt_disable();
 	prev = current;

* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 22:33 ` Linus Torvalds
@ 2007-02-09 23:11   ` Davide Libenzi
  2007-02-09 23:35     ` Linus Torvalds
  2007-02-10  0:04   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-09 23:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Fri, 9 Feb 2007, Linus Torvalds wrote:

> 
> Ok, here's another entry in this discussion.

That's another way to do it. But you end up creating/destroying a new 
thread for every request. May be performing just fine.
Another, even simpler way IMO, is to just have a plain per-task kthread
pool, and a queue. An async_submit() drops a request in the queue, and
wakes the requests queue-head where the kthreads are sleeping. One kthread
picks up the request, services it, drops a result in the result queue, and
wakes the results queue-head (where async_fetch() callers are sleeping).
Cancellation is not a problem here (by means of sending a signal to the
service kthread). Also, no problem with arch-dependent code. This is a 1:1
match of what my userspace implementation does.
Of course, no hot-path optimizations are performed here, and you need a few
more context switches than necessary.
Let's have Zach (Ingo's support to Zach would be great) play with the
optimized version, and then we can maybe bench the three to see if the
more complex code that the optimized version requires gets a payback on
the performance side.

/me thinks it likely will
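
A minimal sketch of the two-queue handoff described above, under loud assumptions: it is single-threaded, the worker step (service_one()) is called directly instead of from a pool thread, and every name apart from async_submit() and async_fetch() is hypothetical. It only illustrates the protocol (submit enqueues and never executes, a worker services, fetch collects); it is not the userspace implementation mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request/result records; a plain function stands in for a syscall. */
typedef long (*async_fn)(long arg);

struct async_req { void *cookie; async_fn fn; long arg; };
struct async_res { void *cookie; long result; };

#define QLEN 16

/* Fixed-size FIFOs; a real pool would protect these with a lock and wait queues. */
struct req_queue { struct async_req slot[QLEN]; size_t head, tail; };
struct res_queue { struct async_res slot[QLEN]; size_t head, tail; };

static struct req_queue requests;
static struct res_queue results;

/* Submitter side: enqueue only, never execute. Returns 0, or -1 if the queue is full. */
int async_submit(void *cookie, async_fn fn, long arg)
{
	if (requests.tail - requests.head == QLEN)
		return -1;
	requests.slot[requests.tail++ % QLEN] = (struct async_req){ cookie, fn, arg };
	return 0;	/* a real version would wake a sleeping worker here */
}

/* Worker side: take one request, run it, post the result. Returns 1 if it did work. */
int service_one(void)
{
	struct async_req r;

	if (requests.head == requests.tail)
		return 0;	/* a real worker would sleep on the request queue */
	assert(results.tail - results.head < QLEN);
	r = requests.slot[requests.head++ % QLEN];
	results.slot[results.tail++ % QLEN] = (struct async_res){ r.cookie, r.fn(r.arg) };
	return 1;	/* a real version would wake async_fetch() sleepers here */
}

/* Fetch side: copy out up to max completed results; returns how many were copied. */
int async_fetch(struct async_res *out, int max)
{
	int n = 0;

	while (n < max && results.head != results.tail)
		out[n++] = results.slot[results.head++ % QLEN];
	return n;
}

/* Stand-in for a system call body, used only to exercise the queues. */
long demo_double(long x)
{
	return 2 * x;
}
```

Because the submitter never runs the request itself, a request that would have completed from cache still pays for the queue round trip, which is the missing hot-path optimization noted above.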



- Davide




* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 23:11   ` Davide Libenzi
@ 2007-02-09 23:35     ` Linus Torvalds
  2007-02-10 18:45       ` Davide Libenzi
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-09 23:35 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Fri, 9 Feb 2007, Davide Libenzi wrote:
> 
> That's another way to do it. But you end up creating/destroying a new 
> thread for every request. May be performing just fine.

Well, I actually wanted to add a special CLONE_ASYNC flag, because I
think we could do it better if we know it's a particularly limited special 
case. But that's really just a "small implementation detail", and I don't 
know how big a deal it is. I didn't want to obscure the basic idea with 
anything bigger.

I agree that the create/destroy is a big overhead, but at least it's now 
only done when we actually end up doing some IO (and _after_ we've started 
the IO, of course - that's when we block), so compared to doing it up 
front, I'm hoping that it's not actually that horrid.

The "fork-like" approach also means that it's very flexible. It's not 
really even limited to doing simple system calls any more: you *could*, 
for example, decide that since you already have the thread, and now that 
it's asynchronous, you'd actually return to user space (to let user space 
"complete" whatever asynchronous action it wanted to complete).

> Another, even simpler way IMO, is to just have a plain per-task kthread 
> pool, and a queue.

Yes, that is actually quite doable with basically the same interface. It's 
literally a "small decision" inside of "schedule_async()" on how it 
actually would want to handle the case of "hey, we now have concurrent 
work to be done".

But I actually don't think a per-task kthread pool is necessarily a good 
idea. If a thread pool works for this, then it should have worked for 
regular thread create/destroy loads too - ie there really is little reason 
to special-case the "async system call" case.

NOTE! I'm also not at all sure that we actually want to waste real threads 
on this. My patch is in no way meant to be an "exclusive alternative" to 
fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils, 
but I didn't like how Zach did the overhead of creating them up-front, 
because I really would like the cached case to be totally *synchronous*.

So I wrote my patch with a "schedule_async()" implementation that just 
creates a full-sized thread, but I actually wanted very much to try to 
make it use fibrils that are allocated on-demand too. I was just too lazy.

So the patch is really meant as a "ok, this is how easy it is to make the 
thread allocation be 'on-demand' instead of 'up-front'". The actual 
_policy_ on how thread allocation is done isn't even interesting to me, to 
some degree. I think Zach's fibrils would work fine, a thread pool would 
work fine, and just the silly outright "new thread for everything" that 
the example patch actually used may also possibly work well enough.

It's one reason I liked my patch. It was not only small and simple, it 
really is very flexible, I think. It's also totally independent of how 
you actually end up _executing_ the async requests.

(In fact, you could easily make it a config option whether you support any 
asynchronous behaviour AT ALL. The "async()" system call might still be 
there, but it would just return "0" all the time, and do the actual work 
synchronously).

			Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 22:33 ` Linus Torvalds
  2007-02-09 23:11   ` Davide Libenzi
@ 2007-02-10  0:04   ` Eric Dumazet
  2007-02-10  0:12     ` Linus Torvalds
  2007-02-10 10:47   ` bert hubert
  2007-02-11  0:56   ` David Miller
  3 siblings, 1 reply; 151+ messages in thread
From: Eric Dumazet @ 2007-02-10  0:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

Linus Torvalds a écrit :
> Ok, here's another entry in this discussion.

> 
>  - IF the system call blocks, we call the architecture-specific 
>    "schedule_async()" function before we even get any scheduler locks, and 
>    it can just do a fork() at that time, and let the *child* return to the 
>    original user space. The process that already started doing the system 
>    call will just continue to do the system call.


Well, I guess if the original program was mono-threaded, and the syscall used
fget_light(), we might have a problem here if the child tries a close(). So you
may have to disable the fget_light() magic if an async call is the originator
of the syscall.

Eric


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10  0:04   ` Eric Dumazet
@ 2007-02-10  0:12     ` Linus Torvalds
  2007-02-10  0:34       ` Alan
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-10  0:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Sat, 10 Feb 2007, Eric Dumazet wrote:
> 
> Well, I guess if the original program was mono-threaded, and syscall used
> fget_light(), we might have a problem here if the child try a close(). So you
> may have to disable fget_light() magic if async call is the originator of the
> syscall.

Yes. All the issues that I already brought up with Zach's patches are 
still there. This doesn't really change any of them. Any optimization that 
checks for "am I single-threaded" will need to be aware of pending and 
running async things.

With my patch, any _running_ async things will always be seen as normal 
clones, but the pending ones won't. So you'd need to effectively change 
anything that looks like

	if (atomic_read(&current->mm->count) == 1)
		.. do some simplified version ..

into

	if (!current->async_cookie && atomic_read(..) == 1)
		.. do the simplified thing ..

to make it safe.
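
A small userspace model of that check, as a sketch only: struct task_model, struct mm_model and the "users" field are stand-ins invented for illustration and do not match the real kernel structures; only the async_cookie name comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; field names are assumptions. */
struct mm_model   { int users; };	/* number of tasks sharing this mm */
struct task_model { struct mm_model *mm; void *async_cookie; };

/*
 * True only if no other task shares the mm AND no async call is pending.
 * A pending async call may still turn into a clone inside schedule(), so
 * the task has to be treated as multi-threaded until the call completes.
 */
bool effectively_single_threaded(const struct task_model *t)
{
	assert(t != NULL && t->mm != NULL);
	return t->async_cookie == NULL && t->mm->users == 1;
}
```

Putting the test in one helper means each "am I single-threaded" fast path changes in a single place rather than open-coding the cookie check.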

I think we only do it for fget_light and some VM TLB simplification, so it 
shouldn't be a big burden to check.

Side note: the real issues still remain. The interfaces, and the 
performance testing.

		Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10  0:12     ` Linus Torvalds
@ 2007-02-10  0:34       ` Alan
  0 siblings, 0 replies; 151+ messages in thread
From: Alan @ 2007-02-10  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Dumazet, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

> I think we only do it for fget_light and some VM TLB simplification, so it 
> shouldn't be a big burden to check.

And all the permission management stuff that relies on one thread not
being able to manipulate the uid/gid of another to race security checks.

Alan


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 22:33 ` Linus Torvalds
  2007-02-09 23:11   ` Davide Libenzi
  2007-02-10  0:04   ` Eric Dumazet
@ 2007-02-10 10:47   ` bert hubert
  2007-02-10 18:19     ` Davide Libenzi
  2007-02-11  0:56   ` David Miller
  3 siblings, 1 reply; 151+ messages in thread
From: bert hubert @ 2007-02-10 10:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:

>  - IF the system call blocks, we call the architecture-specific 
>    "schedule_async()" function before we even get any scheduler locks, and 
>    it can just do a fork() at that time, and let the *child* return to the 
>    original user space. The process that already started doing the system 
>    call will just continue to do the system call.

Ah - cool. The average time we have to wait is probably far greater than the
fork overhead, microseconds versus milliseconds. 

However, and there probably is a good reason for this, why isn't it possible
to do it the other way around, and have the *child* do the work and the
original return to userspace?

Would confuse me a lot less in any case.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software 
http://netherlabs.nl              Open and Closed source services


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10 10:47   ` bert hubert
@ 2007-02-10 18:19     ` Davide Libenzi
  0 siblings, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-10 18:19 UTC (permalink / raw)
  To: bert hubert
  Cc: Linus Torvalds, Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Sat, 10 Feb 2007, bert hubert wrote:

> On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:
> 
> >  - IF the system call blocks, we call the architecture-specific 
> >    "schedule_async()" function before we even get any scheduler locks, and 
> >    it can just do a fork() at that time, and let the *child* return to the 
> >    original user space. The process that already started doing the system 
> >    call will just continue to do the system call.
> 
> Ah - cool. The average time we have to wait is probably far greater than the
> fork overhead, microseconds versus milliseconds. 
> 
> However, and there probably is a good reason for this, why isn't it possible
> to do it the other way around, and have the *child* do the work and the
> original return to userspace?

If the parent is going to schedule(), someone above has already dropped 
the parent's task_struct inside a wait queue, so the *parent* will be the 
wakeup target [1].
Linus' take on generic AIO is a neat one, but IMO continuous fork/exits 
are going to be expensive. Even if the task is going to sleep, that does 
not mean that the parent (well, in Linus' case, the child actually) does 
not have more stuff to feed to async(). IMO the frequency of AIO 
submission and retrieval can get pretty high (hence the frequency of 
fork/exit), and there might be a price to pay for it in the end.
IMO one solution, following the non-fibril way, may be:

- Keep a pool of per-process threads (a per-process pool already has stuff 
  like "files" correctly set up, just for example - no need to teach the 
  rest of the kernel about the "async" special case)

- When a schedule happens on the submission thread, we get a thread 
  (task_struct really) from the available pool

- We set up the return IP of the submission thread (now going to sleep) to 
  point to an async_complete (or whatever name) stub. This will drop a 
  result in a queue, and wake the async_wait (or whatever name) wait queue 
  head

- We may want to swap at least the PID (signals, ...?) between the two, so 
  even if we're re-emerging with a new task_struct, the TID will be the same

- We make the "returning" thread come back to userspace through some 
  special helper ala ret_from_fork (ret_from_async ?)

- We also want to keep a record (hash?) of userspace cookies and the 
  threads currently servicing them, so that we can implement cancel (send 
  signal)

Open issues:

- What if the pool becomes empty since all threads are stuck under schedule?
  o Grow the pool (and delay-shrink at quieter times)?
  o Make the caller really sleep?
  o Fall back to queue-request mode?

- Look at the Devil hiding in the details and showing up many times during 
  the process

Yup, I can see Zach having a lot of fun with it ;)



[1] Well, you could add a list_head to the task_struct, and teach the 
    add-to-waitqueue to drop a reference to all the wait queue entries 
    hosting the task_struct. Then walk&fix (likely only one entry) when 
    you swap the submission thread context (thread_info, per_call stuff, ...) 
    over a service thread task_struct. At that point you can re-emerge 
    with the same task_struct. Pretty nasty though.


- Davide




* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 23:35     ` Linus Torvalds
@ 2007-02-10 18:45       ` Davide Libenzi
  2007-02-10 19:01         ` Linus Torvalds
  0 siblings, 1 reply; 151+ messages in thread
From: Davide Libenzi @ 2007-02-10 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Fri, 9 Feb 2007, Linus Torvalds wrote:

> > Another, even simpler way IMO, is to just have a plain per-task kthread 
> > pool, and a queue.
> 
> Yes, that is actually quite doable with basically the same interface. It's 
> literally a "small decision" inside of "schedule_async()" on how it 
> actually would want to handle the case of "hey, we now have concurrent 
> work to be done".

For the queue approach, I meant for async_submit() to simply add the 
request (cookie, syscall number and params) to the queue, without trying 
to execute the syscall. Once you're inside schedule, "stuff" has already 
partially happened, and you cannot have the same request re-initiated by a 
different thread.



> But I actually don't think a per-task kthread pool is necessarily a good 
> idea. If a thread pool works for this, then it should have worked for 
> regular thread create/destroy loads too - ie there really is little reason 
> to special-case the "async system call" case.

A per-process thread pool already has things correctly inherited, so we 
don't need to add special "adopt" routines for things like "files" and such.



> NOTE! I'm also not at all sure that we actually want to waste real threads 
> on this. My patch is in no way meant to be an "exclusive alternative" to 
> fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils, 
> but I didn't like how Zach did the overhead of creating them up-front, 
> because I really would like the cached case to be totally *synchronous*.

I'm not advocating threads against fibrils. The use of threads may make 
things easier under certain POVs (fewer ad-hoc changes to mainline). The 
ideal would be to have a look at both and see the pros and cons under 
different POVs (performance, code impact, etc.).



- Davide




* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10 18:45       ` Davide Libenzi
@ 2007-02-10 19:01         ` Linus Torvalds
  2007-02-10 19:35           ` Linus Torvalds
  2007-02-10 20:59           ` Davide Libenzi
  0 siblings, 2 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-10 19:01 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Sat, 10 Feb 2007, Davide Libenzi wrote:
> 
> For the queue approach, I meant the async_submit() to simply add the 
> request (cookie, syscall number and params) inside queue, and not trying 
> to execute the syscall. Once you're inside schedule, "stuff" has already 
> partially happened, and you cannot have the same request re-initiated by a 
> different thread.

But that makes it impossible to do things synchronously, which I think is 
a *major* mistake.

The whole (and really _only_) point of my patch was really the whole 
"synchronous call" part. I'm personally of the opinion that if you cannot 
handle the cached case as fast as just doing the system call directly, 
then the whole thing is almost pointless.

Things that take a long time we already have good methods for. "epoll" and 
"kevent" are always going to be the best way to handle the "we have ten 
thousand events outstanding". There simply isn't any question about it. 
You can *never* handle ten thousand long-running events efficiently with 
threads - even if you ignore all the CPU overhead, you're going to have a 
much bigger memory (and thus *cache*) footprint.

So anybody who wants to use AIO to do those kinds of long-running async 
things is out to lunch. It's not the point. 

You use the AIO stuff for things that you *expect* to be almost 
instantaneous. Even if you actually start ten thousand IO's in one go, and 
they all do IO, you would hopefully expect that the first ones start 
completing before you've even submitted them all. If that's not true, 
then you'd just be better off using epoll.

Also, if you can make the cached case as fast as just doing the direct 
system call itself, that just changes the whole equation for using it. You 
suddenly don't have any downsides. You can start using the async 
interfaces in places you simply couldn't otherwise, or in places where 
you'd have to do a lot of performance tuning ("it makes sense under this 
particular load because I actually need to get 100 IO's going at the same 
time to saturate the disk").

So the "do cached things synchronously" really is important. Just because 
that makes a whole complicated optimization question go away: you 
basically *always* win for normal stuff.

		Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10 19:01         ` Linus Torvalds
@ 2007-02-10 19:35           ` Linus Torvalds
  2007-02-10 20:59           ` Davide Libenzi
  1 sibling, 0 replies; 151+ messages in thread
From: Linus Torvalds @ 2007-02-10 19:35 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar



On Sat, 10 Feb 2007, Linus Torvalds wrote:
> 
> But that makes it impossible to do things synchronously, which I think is 
> a *major* mistake.
> 
> The whole (and really _only_) point of my patch was really the whole 
> "synchronous call" part. I'm personally of the opinion that if you cannot 
> handle the cached case as fast as just doing the system call directly, 
> then the whole thing is almost pointless.

Side note: one of the nice things with "do it synchronously if you can" is 
that it also likely would allow us to do a reasonable job at "self-tuning" 
things in the kernel. With my async approach, we get notified only when we 
block, so it's easy (for example) to have a simple counter that 
automatically adapts to the number of outstanding IO's, in a way that it's 
_not_ if we do things at submit time when we won't even know whether it 
will block or not.

As a trivial example: we actually see what *kind* of blocking it is. Is it 
blocking interruptibly ("long wait") or uninterruptibly ("disk wait")? So 
by the time schedule_async() is called, we actually have some more 
information about the situation, and we can even do different things 
(possibly based on just hints that the user and/or system maintainer gives 
us; ie you can tune the behaviour from _outside_ by setting different 
limits, for example).
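
One way such a self-tuning decision could look, as a hedged userspace sketch: the enum, both structs and the function names are hypothetical, and the two limits stand in for whatever hints the user or administrator might set. It only shows the shape of the idea, namely that the decision is taken at block time, when the kind of wait is already known.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the two kinds of sleep mentioned above; the names are illustrative. */
enum block_kind { BLOCK_INTERRUPTIBLE, BLOCK_UNINTERRUPTIBLE };

/* Hypothetical tunables, one limit per kind of wait. */
struct async_limits { int max_long_wait; int max_disk_wait; };

/* Per-task count of async calls currently in flight, by kind. */
struct async_state { int long_wait; int disk_wait; };

/*
 * Called at the point where the task is about to block. Returns true if
 * the call should be split off and run asynchronously, false if the
 * caller should simply block because the limit for this kind of wait has
 * been reached. The call is counted when it is split off.
 */
bool should_go_async(struct async_state *s, const struct async_limits *lim,
		     enum block_kind kind)
{
	int *count = (kind == BLOCK_INTERRUPTIBLE) ? &s->long_wait : &s->disk_wait;
	int limit  = (kind == BLOCK_INTERRUPTIBLE) ? lim->max_long_wait
						   : lim->max_disk_wait;

	assert(*count >= 0);
	if (*count >= limit)
		return false;
	(*count)++;
	return true;
}

/* Called when an async call of the given kind completes. */
void async_done(struct async_state *s, enum block_kind kind)
{
	int *count = (kind == BLOCK_INTERRUPTIBLE) ? &s->long_wait : &s->disk_wait;

	assert(*count > 0);
	(*count)--;
}
```

A submit-time scheme could not keep these counters accurate, since at submit time it is not yet known whether a call will block at all, let alone how.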

			Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-10 19:01         ` Linus Torvalds
  2007-02-10 19:35           ` Linus Torvalds
@ 2007-02-10 20:59           ` Davide Libenzi
  1 sibling, 0 replies; 151+ messages in thread
From: Davide Libenzi @ 2007-02-10 20:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zach Brown, Linux Kernel Mailing List, linux-aio,
	Suparna Bhattacharya, Benjamin LaHaise, Ingo Molnar

On Sat, 10 Feb 2007, Linus Torvalds wrote:

> On Sat, 10 Feb 2007, Davide Libenzi wrote:
> > 
> > For the queue approach, I meant the async_submit() to simply add the 
> > request (cookie, syscall number and params) inside queue, and not trying 
> > to execute the syscall. Once you're inside schedule, "stuff" has already 
> > partially happened, and you cannot have the same request re-initiated by a 
> > different thread.
> 
> But that makes it impossible to do things synchronously, which I think is 
> a *major* mistake.

Yes! That's what I said when I described the method. No synco fast-paths. 
At that point you could implement the full-queued method in userspace.



> The whole (and really _only_) point of my patch was really the whole 
> "synchronous call" part. I'm personally of the opinion that if you cannot 
> handle the cached case as fast as just doing the system call directly, 
> then the whole thing is almost pointless.
> 
> Things that take a long time we already have good methods for. "epoll" and 
> "kevent" are always going to be the best way to handle the "we have ten 
> thousand events outstanding". There simply isn't any question about it. 
> You can *never* handle ten thousand long-running events efficiently with 
> threads - even if you ignore all the CPU overhead, you're going to have a 
> much bigger memory (and thus *cache*) footprint.

Think about the old-fashioned web server, using epoll to handle thousands 
of connections. You'll be hosting an epoll_wait() over an async 
thread/fibril. Now a burst of 500 connections suddenly becomes "hot", and 
you start looping through those 500 hot connections trying to handle them. 
You'll need to stat/open/read (let's assume a trivial, non-cached HTTP 
server) the file pointed to by the URL's doc, and those calls had better 
be handled in async fashion, otherwise you'll starve the other connections 
and pay heavily in performance. You can multiplex using a state machine or 
coroutines, for example. Using coroutines, your epoll dispatching loop ends 
up doing something like:

	struct conn {
		coroutine_t co;
		int res;
		int skfd;
		...
	};

	void events_dispatch(struct epoll_event *events, int n) {
		int i;

		for (i = 0; i < n; i++) {
			/* data is a union; assumes the conn pointer was registered in data.ptr */
			struct conn *c = (struct conn *) events[i].data.ptr;
			co_call(c->co);
		}
	}

Note that co_call() will make the coroutine re-emerge from the last 
co_resume() it issued.
Your code does not need to be coroutine/async aware once you wrap the 
possibly-blocking calls like this:

	int my_stat(struct conn *c, const char *path, struct stat *buf) {
		/* "c" is the cookie */
		if ((c->res = async_submit(c, __NR_stat, path, buf)) == EASYNC)
			/* co_resume() will bounce back to the scheduler loop */
			co_resume();
		return c->res;
	}

Now, the *main* loop will be the async_wait() driven one:

	struct async_result {
		void *cookie;
		long result;
	};

	n = async_wait(ares, nares);
	for (i = 0; i < n; i++) {
		if (ares[i].cookie == epoll_special_cookie)
			events_dispatch(...);
		else {
			struct conn *c = (struct conn *) ares[i].cookie;
			c->res = ares[i].result;
			co_call(c->co);
		}
	}

Many of the async submissions will complete in a synco way, but many of 
them will require a reschedule and service-thread attention. Of this burst 
of 500, you can expect 100 or more to require async service. Right away, 
not sometime later. According to Oracle, 1000 or more requests, 90% of 
which can be expected to block, can be fired at a time.
It is true that we need to have a fast synco path, but it is also true 
that we must not suck in the non-synco path, because the submitting thread 
has something else to do than simply issue one async request (it likely 
has *many* of them to push).
There we go, I broke my "No more than 20 lines" rule again :)
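
For readers unfamiliar with the co_call()/co_resume() pairing used above, here is a minimal sketch of those two primitives built on the ucontext functions that glibc provides (getcontext, makecontext, swapcontext). This is an assumption made for illustration, not necessarily how the coroutine library behind the example works; struct coroutine, co_create(), co_destroy() and demo_body() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define CO_STACK_SIZE (64 * 1024)

struct coroutine {
	ucontext_t ctx;		/* saved state of the coroutine */
	ucontext_t caller;	/* saved state of whoever last called co_call() */
	void (*fn)(void *);
	void *arg;
	char *stack;
	int done;
};

static struct coroutine *co_current;	/* coroutine running right now, if any */

/* Entry trampoline: run the body, then mark the coroutine finished. */
static void co_entry(void)
{
	struct coroutine *co = co_current;

	co->fn(co->arg);
	co->done = 1;
	/* returning here resumes uc_link, i.e. the most recent co_call() caller */
}

struct coroutine *co_create(void (*fn)(void *), void *arg)
{
	struct coroutine *co = calloc(1, sizeof(*co));

	assert(co != NULL);
	co->stack = malloc(CO_STACK_SIZE);
	assert(co->stack != NULL);
	co->fn = fn;
	co->arg = arg;
	getcontext(&co->ctx);
	co->ctx.uc_stack.ss_sp = co->stack;
	co->ctx.uc_stack.ss_size = CO_STACK_SIZE;
	co->ctx.uc_link = &co->caller;
	makecontext(&co->ctx, co_entry, 0);
	return co;
}

void co_destroy(struct coroutine *co)
{
	free(co->stack);
	free(co);
}

/* Switch into the coroutine; returns when it yields or finishes. */
void co_call(struct coroutine *co)
{
	struct coroutine *prev = co_current;

	assert(!co->done);
	co_current = co;
	swapcontext(&co->caller, &co->ctx);
	co_current = prev;
}

/* Yield from inside a coroutine back to whoever called co_call(). */
void co_resume(void)
{
	struct coroutine *co = co_current;

	assert(co != NULL);
	swapcontext(&co->ctx, &co->caller);
}

/* Demo body: records progress and yields once, as my_stat() would on EASYNC. */
void demo_body(void *arg)
{
	int *step = arg;

	*step = 1;
	co_resume();	/* bounce back to the dispatch loop */
	*step = 2;	/* reached only after the next co_call() */
}
```

Each connection pays for a private stack here (64 KiB in this sketch), which is the memory-footprint cost that the state-machine alternative avoids.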




- Davide




* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-09 22:33 ` Linus Torvalds
                     ` (2 preceding siblings ...)
  2007-02-10 10:47   ` bert hubert
@ 2007-02-11  0:56   ` David Miller
  2007-02-11  2:49     ` Linus Torvalds
  3 siblings, 1 reply; 151+ messages in thread
From: David Miller @ 2007-02-11  0:56 UTC (permalink / raw)
  To: torvalds; +Cc: zach.brown, linux-kernel, linux-aio, suparna, bcrl, mingo

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 9 Feb 2007 14:33:01 -0800 (PST)

> So I actually like this, because it means that while we slow down
> real IO, we don't slow down the cached cases at all.

Even if you have everything, every page, every log file, in the page
cache, everything talking over the network wants to block.

Will you create a thread every time tcp_sendmsg() hits the send queue
limits?

Add some logging to tcp_sendmsg() on a busy web server if you do not
believe me :-)

The idea is probably excellent for operations on real files, but it's
going to stink badly for networking stuff.


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-11  0:56   ` David Miller
@ 2007-02-11  2:49     ` Linus Torvalds
  2007-02-14 16:42       ` James Antill
  0 siblings, 1 reply; 151+ messages in thread
From: Linus Torvalds @ 2007-02-11  2:49 UTC (permalink / raw)
  To: David Miller; +Cc: zach.brown, linux-kernel, linux-aio, suparna, bcrl, mingo



On Sat, 10 Feb 2007, David Miller wrote:
> 
> Even if you have everything, every page, every log file, in the page
> cache, everything talking over the network wants to block.
> 
> Will you create a thread every time tcp_sendmsg() hits the send queue
> limits?

No. You use epoll() for those. 

> The idea is probably excellent for operations on real files, but it's
> going to stink badly for networking stuff.

And I actually talked about that in one of the emails already. There is no 
way you can beat an event-based thing for things that _are_ event-based. 
That means mainly networking.

For things that aren't event-based, but based on real IO (ie filesystems 
etc), event models *suck*. They suck because the code isn't amenable to it 
in the first place (ie anybody who thinks that a filesystem is like a 
network stack and can be done as a state machine with packets is just 
crazy!).

So you would be crazy to make a web server that uses this to handle _all_ 
outstanding IO. Network connections are often slow, and you can have tens 
of thousands outstanding (and some may be outstanding for hours until they 
time out, if ever). But that's the whole point: you can easily mix the 
two, as given in several examples already (ie you can easily make the main 
loop itself basically do just

	for (;;) {
		async(epoll);	/* wait for networking events */
		async_wait();	/* wait for epoll _or_ any of the outstanding file IO events */
		handle_completed_events();
	}

and it's actually a lot better than an event model, exactly because now 
you can handle events _and_ non-events well (a pure event model requires 
that _everything_ be an event, which works fine for some things, but works 
really badly for other things).

There's a reason why a lot of UNIX system calls are blocking: they just 
don't make sense as event models, because there is no sensible half-way 
point that you can keep track of (filename lookup is the most common 
example).

		Linus


* Re: [PATCH 0 of 4] Generic AIO by scheduling stacks
  2007-02-11  2:49     ` Linus Torvalds
@ 2007-02-14 16:42       ` James Antill
  0 siblings, 0 replies; 151+ messages in thread
From: James Antill @ 2007-02-14 16:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-aio

On Sat, 10 Feb 2007 18:49:56 -0800, Linus Torvalds wrote:

> And I actually talked about that in one of the emails already. There is no 
> way you can beat an event-based thing for things that _are_ event-based. 
> That means mainly networking.
> 
> For things that aren't event-based, but based on real IO (ie filesystems 
> etc), event models *suck*. They suck because the code isn't amenable to it 
> in the first place (ie anybody who thinks that a filesystem is like a 
> network stack and can be done as a state machine with packets is just 
> crazy!).
> 
> So you would be crazy to make a web server that uses this to handle _all_ 
> outstanding IO. Network connections are often slow, and you can have tens 
> of thousands outstanding (and some may be outstanding for hours until they 
> time out, if ever). But that's the whole point: you can easily mix the 
> two, as given in several examples already (ie you can easily make the main 
> loop itself basically do just

 I don't see any replies to this, so here's my 2¢. The simple model of
what a webserver does when sending static data is:

1. local_disk_fd = open()
2. fstat(local_disk_fd)
3. TCP_CORK on
4. send_headers();
5. LOOP
5a. sendfile(network_con_fd, local_disk_fd)
5b. epoll(network_con_fd)
6. TCP_CORK off

...and here's my personal plan (again, somewhat simplified), which I
think will be "better":

7. helper_proc_pipe_fd = DO open() + fstat()
8. read_stat_event_data(helper_proc_pipe_fd)
9. TCP_CORK on network_con_fd
10. send_headers(network_con_fd);
11. LOOP
11a. splice(helper_proc_pipe_fd, network_con_fd)
11b. epoll(network_con_fd && helper_proc_pipe_fd)
12. TCP_CORK off network_con_fd

...where the "helper proc" is doing splice() from disk to the pipe, on the
other end. This, at least in theory, gives you an async webserver and zero
copy disk to network[1]. My assumption is that Evgeniy's aio_sendfile()
could fit into that model pretty easily, and would be faster.
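
The simple model (steps 1 to 6) can be sketched with the real Linux calls open(), fstat(), setsockopt(TCP_CORK) and sendfile(). The function name send_static_file(), the HTTP header text and the use of a blocking socket are assumptions made to keep the sketch short; with a non-blocking socket, step 5b would go where sendfile() returns EAGAIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Best-effort TCP_CORK toggle; the error is ignored on non-TCP sockets. */
static void set_cork(int sock, int on)
{
	(void)setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/*
 * Steps 1-6 of the simple model on a blocking socket. Returns the number
 * of body bytes sent, or -1 on error.
 */
long send_static_file(int sock, const char *path)
{
	struct stat st;
	char hdr[128];
	off_t off = 0;
	long ret = -1;
	int fd, len;

	assert(path != NULL);
	fd = open(path, O_RDONLY);			/* step 1 */
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0)				/* step 2 */
		goto out;
	set_cork(sock, 1);				/* step 3 */
	len = snprintf(hdr, sizeof(hdr),		/* step 4 */
		       "HTTP/1.0 200 OK\r\nContent-Length: %lld\r\n\r\n",
		       (long long)st.st_size);
	if (write(sock, hdr, (size_t)len) != len)
		goto uncork;
	while (off < st.st_size) {			/* step 5 */
		/* sendfile() advances off by the number of bytes it sent */
		ssize_t n = sendfile(sock, fd, &off, (size_t)(st.st_size - off));

		if (n <= 0)
			goto uncork;
	}
	ret = (long)off;
uncork:
	set_cork(sock, 0);				/* step 6 */
out:
	close(fd);
	return ret;
}
```

Steps 1 and 2 are the only ones in this sketch that touch the filesystem namespace; step 5 is where the disk reads for uncached file data actually happen, inside sendfile().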

 However, from what you've said above, you're only trying to help #1 and #2
(which are likely to be cached in the app anyway), while apps
that want to sendfile() to the network either do horrible hacks like
lighttpd's "AIO"[2], do a read+write copy loop with AIO, or don't use AIO.


[1] And allows things like IO limiting, which aio_sendfile() won't.

[2] http://illiterat.livejournal.com/2989.html

-- 
James Antill -- james@and.org
http://www.and.org/and-httpd/ -- $2,000 security guarantee
http://www.and.org/vstr/


2007-02-07  9:17                                                     ` Michael K. Edwards
2007-02-07  9:37                                                       ` Michael K. Edwards
2007-02-06  0:32                                 ` Davide Libenzi
2007-02-05 21:21                               ` Zach Brown
2007-02-02 23:37             ` Davide Libenzi
2007-02-03  0:02               ` Davide Libenzi
2007-02-05 17:12               ` Zach Brown
2007-02-05 18:24                 ` Davide Libenzi
2007-02-05 21:44                   ` David Miller
2007-02-06  0:15                     ` Davide Libenzi
2007-02-05 21:36               ` bert hubert
2007-02-05 21:57                 ` Linus Torvalds
2007-02-05 22:07                   ` bert hubert
2007-02-05 22:15                     ` Zach Brown
2007-02-05 22:34                   ` Davide Libenzi
2007-02-06  0:27                   ` Scot McKinley
2007-02-06  0:48                     ` David Miller
2007-02-06  0:48                     ` Joel Becker
2007-02-05 17:02             ` Zach Brown
2007-02-05 18:52               ` Davide Libenzi
2007-02-05 19:20                 ` Zach Brown
2007-02-05 19:38                   ` Davide Libenzi
2007-02-04  5:12   ` Davide Libenzi
2007-02-05 17:54     ` Zach Brown
2007-01-30 20:39 ` [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct Zach Brown
2007-01-30 20:39 ` [PATCH 4 of 4] Introduce aio system call submission and completion system calls Zach Brown
2007-01-31  8:58   ` Andi Kleen
2007-01-31 17:15     ` Zach Brown
2007-01-31 17:21       ` Andi Kleen
2007-01-31 19:23         ` Zach Brown
2007-02-01 11:13           ` Suparna Bhattacharya
2007-02-01 19:50             ` Trond Myklebust
2007-02-02  7:19               ` Suparna Bhattacharya
2007-02-02  7:45                 ` Andi Kleen
2007-02-01 22:18             ` Zach Brown
2007-02-02  3:35               ` Suparna Bhattacharya
2007-02-01 20:26   ` bert hubert
2007-02-01 21:29     ` Zach Brown
2007-02-02  7:12       ` bert hubert
2007-02-04  5:12   ` Davide Libenzi
2007-01-30 21:58 ` [PATCH 0 of 4] Generic AIO by scheduling stacks Linus Torvalds
2007-01-30 22:23   ` Linus Torvalds
2007-01-30 22:53     ` Zach Brown
2007-01-30 22:40   ` Zach Brown
2007-01-30 22:53     ` Linus Torvalds
2007-01-30 23:45       ` Zach Brown
2007-01-31  2:07         ` Benjamin Herrenschmidt
2007-01-31  2:04 ` Benjamin Herrenschmidt
2007-01-31  2:46   ` Linus Torvalds
2007-01-31  3:02     ` Linus Torvalds
2007-01-31 10:50       ` Xavier Bestel
2007-01-31 19:28         ` Zach Brown
2007-01-31 17:59       ` Zach Brown
2007-01-31  5:16     ` Benjamin Herrenschmidt
2007-01-31  5:36     ` Nick Piggin
2007-01-31  5:51       ` Nick Piggin
2007-01-31  6:06       ` Linus Torvalds
2007-01-31  8:43         ` Ingo Molnar
2007-01-31 20:13         ` Joel Becker
2007-01-31 18:20       ` Zach Brown
2007-01-31 17:47     ` Zach Brown
2007-01-31 17:38   ` Zach Brown
2007-01-31 17:51     ` Benjamin LaHaise
2007-01-31 19:25       ` Zach Brown
2007-01-31 20:05         ` Benjamin LaHaise
2007-01-31 20:41           ` Zach Brown
2007-02-04  5:13 ` Davide Libenzi
2007-02-04 20:00   ` Davide Libenzi
2007-02-09 22:33 ` Linus Torvalds
2007-02-09 23:11   ` Davide Libenzi
2007-02-09 23:35     ` Linus Torvalds
2007-02-10 18:45       ` Davide Libenzi
2007-02-10 19:01         ` Linus Torvalds
2007-02-10 19:35           ` Linus Torvalds
2007-02-10 20:59           ` Davide Libenzi
2007-02-10  0:04   ` Eric Dumazet
2007-02-10  0:12     ` Linus Torvalds
2007-02-10  0:34       ` Alan
2007-02-10 10:47   ` bert hubert
2007-02-10 18:19     ` Davide Libenzi
2007-02-11  0:56   ` David Miller
2007-02-11  2:49     ` Linus Torvalds
2007-02-14 16:42       ` James Antill
